The Innerworkings of a Geek

Friday, February 27, 2009

LTSP vs Hybrid (Diskless node) clients

This week, we rolled out our new deployment of Linux workstations. These systems are running SLED 11 RC4, with Firefox, OpenOffice, etc. Feature-wise they are the same as our openSUSE 11 LTSP machines. However, they're now hybrid (diskless node) clients.

Some history...
When I got into deploying LTSP machines, I thought thin-client computing would be pretty straightforward. At the beginning it was. We had only a few users, and they were not power users (only Firefox and OpenOffice). However, as the number of users grew, more people wanted to use the machines as normal workstations. People in the LTSP channels boasted about how easy thin clients were to administer, the reliability of having only servers to maintain, and so on. I was excited when I got my first crack at a deployment. But would it match up to a normal workstation? In short, the answer was no.

Why did we choose LTSP?

1) Reduce Costs
By having two servers operating for 40 users, we should be able to have 40 thin clients priced below $300 per box. When I approached Dell for a thin-client terminal, they wanted $625 for just the box! I said no thanks, and instead went the custom route. We spent between $225 and $275 per box, plus ~$200 each for Samsung 19" LCDs. Prices have come down on LCDs since then.
2) Reduce management
Two servers are much easier to update than 40 clients.
3) Workstation security
Data is always stored on redundant, backed up servers; not the client. The client also doesn't have local root and cannot install local apps, so the machine won't be infected with viruses.
4) Reliability
If the local machine crashes, it can easily be swapped out for another machine. Since the server is the only system being maintained, it should be more reliable.

What went wrong with thin-client computing for us? I wrote about it more here: http://www.fcdnet.org/japerry/2008/12/good-bye-ltsp.html

Why it failed:
1) They're slow... painfully slow. You'd think that with two 8-core, 8GB, RAID 0 servers you'd have fast clients. When compiling, yeah, they're fast. Otherwise, they are really slow. Monika mentions that her work also uses thin clients, only the Windows variant. They too are slow and have reliability problems.
2) Unreliable. If the server has issues, all users are affected. We did everything possible to keep the servers running, but we hit a major unexpected roadblock: many apps are just not tested or designed to run in a multi-user environment, especially file and network protocols. Novfs, NFS, NSS, and SMB all had one issue or another with multiple users on one machine. And with other apps we found 'odd' issues that just don't occur on single-user machines.

So what about a diskless node solution? Keep the drive remote and treat the client like a live CD...
The KIWI-LTSP project is much more than LTSP. It also handles image automation. Although still in beta, it's used in many production processes. SUSE builds its live and install CDs/DVDs with the KIWI imaging system. SUSE also just announced SUSE Studio, a web-based service for creating custom disk and VMware images, which is essentially a front end to the KIWI imaging system. But there are other things you can do with KIWI as well, including bootstrapping a squashfs system over NFS, NBD, or AoE.

Our Solution: AoE diskless nodes, using KIWI.
Cyberorg recommended I take a look at AoE (ATA over Ethernet) support in KIWI. Work was done recently to allow AoE exports to be mounted in KIWI. The nice thing about AoE is that its performance is almost that of iSCSI. The disadvantage (some would say) is that you can only export it on the local LAN, since it's not routable. For our situation this is fine, and even good, because it provides an extra layer of security. Our hybrid solution sits behind its own VLAN.

The other major advantage of AoE and KIWI is ease of image updates. Once the KIWI netboot initrd is set up, you can change your images without needing to rebuild KIWI. For our deployment, I have a dedicated hard drive for images. This allows us to fully test (i.e. reboot into) a snapshot of our image. Once testing is done, I reboot into my deployment machine, run a few custom scripts to sanitize /tmp, /var/log, and /etc (get rid of udev rules, fstab entries, etc.), and export the whole drive as a squashfs.

 mksquashfs /mnt/staging /srv/kiwi-images/`cat /mnt/staging/etc/suse-release`-`cat /mnt/staging/etc/kiwi-label`-`cat /srv/kiwi-images/version-incr`.img 
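The image name in the command above is assembled from three small files. A minimal sketch of that naming step as a reusable function follows; it assumes each file holds a single short token with no spaces, and the function name image_path and its two parameters are hypothetical additions, not part of the original scripts.

```shell
#!/bin/sh
# Build the image file name from the three version files used above.
# staging_root and image_dir are parameters so the function can be tried
# against a scratch directory; the defaults mirror the paths in this post.
# Assumption: each file contains one short token (no spaces or newlines
# in the middle), otherwise the resulting file name would be awkward.
image_path() {
    staging_root="${1:-/mnt/staging}"
    image_dir="${2:-/srv/kiwi-images}"
    release=$(cat "$staging_root/etc/suse-release")
    label=$(cat "$staging_root/etc/kiwi-label")
    version=$(cat "$image_dir/version-incr")
    printf '%s/%s-%s-%s.img\n' "$image_dir" "$release" "$label" "$version"
}
```

With this in place the build step could be written as mksquashfs /mnt/staging "$(image_path)", and the same function can supply the file name for the losetup step so the two never drift apart.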

Once this is done, I can deploy the new image:
losetup -a   # list the loop devices currently in use

losetup -d /dev/loopN   # where N is a loop device that is already attached

losetup -v -f /srv/kiwi-images/`cat /mnt/staging/etc/suse-release`-`cat /mnt/staging/etc/kiwi-label`-`cat /srv/kiwi-images/version-incr`.img

losetup -a   # verify which loop device the image is attached to
vbladed 0 1 eth1 /dev/loopN
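The "verify which loop device" step above is done by eye. A rough sketch of automating it is shown here: the function reads losetup -a style lines on standard input and prints the device backing a given image. The function name loop_for_image is hypothetical, and the output format in the comment is an assumption; some losetup versions truncate long file names in this listing, in which case the match would fail rather than return a wrong device.

```shell
#!/bin/sh
# Print the loop device backing a given image file.
# Reads `losetup -a` style lines on stdin, assumed to look like:
#   /dev/loop0: [0803]:1234 (/srv/kiwi-images/example.img)
# Returns non-zero if no line mentions the image.
loop_for_image() {
    image="$1"
    while IFS= read -r line; do
        case "$line" in
            *"($image)"*)
                # Everything before the first colon is the device name.
                printf '%s\n' "${line%%:*}"
                return 0
                ;;
        esac
    done
    return 1
}
```

It could be used as dev=$(losetup -a | loop_for_image "$img") && vbladed 0 1 eth1 "$dev", so that the export only happens when the image was actually found.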


One drawback to the process above is that everyone needs to shut down their systems before the image can be deployed. There is a workaround. If you don't delete the previous image, you can change the pxeboot.cfg file to indicate the next loop device (probably /dev/loop1). Then, whenever someone reboots, they'll get the newest image, but currently logged-in users aren't affected. Depending on your user base, you may want to add a cron job that forcibly shuts down or reboots the machine (for example, Saturday morning at 3am). Since these are Linux machines, some of our LTSP clients have had an uptime near 60 days. If you deploy an image and then delete the old one a month later, you may find two or three users calling the helpdesk wondering why they get 'cannot execute file: input/output error' and cannot reboot their machine without a hard power-off.
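One way to soften the forced reboot is to restart only the clients that have been up long enough to be running an old image. A small sketch of the uptime check such a cron job would need is below; the function name uptime_days is hypothetical, and it assumes the Linux /proc/uptime format, where the first field is seconds since boot.

```shell
#!/bin/sh
# Whole days of uptime, read from a /proc/uptime-style file.
# The first field is seconds since boot, e.g. "5184000.42 1234.56".
# The optional argument exists only so the function can be tried
# against a sample file instead of the live /proc/uptime.
uptime_days() {
    read -r seconds _ < "${1:-/proc/uptime}"
    # Drop the fractional part, then convert seconds to days.
    echo $(( ${seconds%%.*} / 86400 ))
}
```

A client script run from cron on Saturday at 3am could compare the result against a threshold, for example [ "$(uptime_days)" -ge 30 ] && /sbin/shutdown -r now, where the 30-day limit is only an illustrative value that should be shorter than the time old images are kept around.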

Another solution is to create one base image and perform updates entirely from scripts. For servers I could see this being useful, but for clients it's a little more sketchy. Right now the image deployment allows a user to be at the login screen exactly one minute after they press the power button (and 20 seconds of that is before pxeboot even starts!). If we ran zypper to install custom packages at boot, the boot time would increase considerably. So we install all needed packages on the image, and use configuration scripts to decide whether something should start on boot.

Custom configurations
In LTSP, we could deploy the lts.conf file, which would set up sound, printers, and other special devices an individual user may have. Since we're no longer using LTSP, I needed to come up with a different solution. I created a script, called on boot, that copies or appends files based on the hostname. This allowed me to have custom printer.conf files for users, which are appended to the company-wide printers already installed on the image.
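The boot script itself isn't shown here, but the copy-or-append idea can be sketched roughly as follows. The directory layout (one directory per hostname with copy and append subtrees) and the function name apply_host_config are assumptions made for illustration, not the layout actually used in this deployment.

```shell
#!/bin/sh
# Apply per-host overrides on boot.
# Hypothetical layout:
#   $conf_root/<hostname>/copy/...    files that replace the target file
#   $conf_root/<hostname>/append/...  files appended to the target file
# Paths below copy/ and append/ mirror the paths under $target_root.
apply_host_config() {
    conf_root="$1"
    target_root="$2"
    host="${3:-$(hostname)}"
    for mode in copy append; do
        src_dir="$conf_root/$host/$mode"
        # Hosts without overrides are simply skipped.
        [ -d "$src_dir" ] || continue
        ( cd "$src_dir" && find . -type f ) | while IFS= read -r rel; do
            rel="${rel#./}"
            mkdir -p "$target_root/$(dirname "$rel")"
            if [ "$mode" = copy ]; then
                cp "$src_dir/$rel" "$target_root/$rel"
            else
                cat "$src_dir/$rel" >> "$target_root/$rel"
            fi
        done
    done
}
```

Called as apply_host_config /some/config/share / early in boot, a host with an append entry for its printer file would end up with the company-wide printers followed by its own, while hosts with no directory of their own are left untouched.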

Home directories, stateful data storage
We use NFS to mount the /home directory from what used to be one of our LTSP servers. This allows users to keep all their files as they were. We decided against using NSS or Novell NCP for home directories because of input/output errors. /tmp is currently stored on the squashfs image, as are the logs. However, I think this could pose a problem, since it could fill up RAM. I'll probably make a new NFS share for /tmp.

Thin-Client specs:
For our new deployment, the thin clients remain mostly the same. They still do not have local storage. When we went with LTSP clients, I wanted to go overboard on the specs. Many boast that 'Pentium II' CPUs with 32MB of RAM and 8MB of video memory work great with LTSP. Since crummy small LTSP clients sold for the same price as my desktops, I went with the desktop option. What do they have in them?
AMD 3800 or AMDx2 4200 CPU
256MB - 1GB RAM
CD/DVD ROM/RW drive
NVIDIA 6050/7150 onboard video -or- 9400GT
GigE ethernet
Turning the thin clients into hybrid clients only required more RAM. Since RAM is a dime a dozen, this upgrade was very inexpensive. We found that systems with 256MB of RAM could not display GDM, those with 512MB could run only one app at a time, and 1GB works OK. We're doing some exhaustive testing this week to see if 1GB is enough. If not, we will upgrade the machines to 2GB. Guess how much it costs per machine to go to 2GB of RAM? For 38 machines (after re-purposing the RAM we have), it cost $360 from Newegg. That's right, $20/ea.

Performance improvements:
The first thing we looked at was video performance. We thought at first that the NVIDIA 6050 cards were just crap and no better than Intel or SiS onboard graphics. Apparently LTSP was the problem. Tests with simple glxgears show a 200% improvement in frame rate, and the 9400GT got a 250% boost.
Thunderbird used to render calendars poorly. So poorly that our power users had to use Windows until we got this system up and running. Now? Thunderbird displays everything instantly.
QtiPlot graph rendering used to bring the LTSP clients to the brink of locking up (and many times they did lock up). Now? QtiPlot renders perfectly.

In summary
LTSP was a great idea when the servers cost less than the price difference between thin-client machines and thick clients. All of its other benefits are also provided by the hybrid-client solution. And since commodity computer parts are as cheap as thin clients, there is no benefit to using LTSP.
If you want the benefits of LTSP without the drawbacks, take a look at the hybrid-client solution. It's cheaper, faster, and more reliable than thin-client computing.
