Tenets of Good Admins

There is no shortage of folks running around the world trying to create the next Facebook. They crop up regularly on Serverfault asking what it takes to handle 10,000 connections per second for their next big idea website. I don’t have any good examples to reference because they’re bad questions; bad questions are closed and deleted. What those questions would show is a common problem in the world of system administration: the basic lesson unlearned.

System administration, programming, or any highly skilled task has a lot in common with driving a car (who doesn’t love car analogies?). Getting through your driver’s education course does not make you capable of power-sliding in the snow, doing well in a Top Gear Power Lap, or even doing a decent job of parallel parking. Ideas you have about how a car handles may even be flat-out wrong. That’s the effect of being new, or of being an old hand who never got the right lessons.

So, after many years of working with systems and a few of toying with questions on StackExchange websites, I offer some proposals for the tenets of good system administration:

Always Measure

There is an ample supply of questions on Serverfault consisting of, “Is this faster?” or “Why is that slower?” The answer to those questions in most cases is: we don’t know your hardware; measure it and find out. A computer consists of multiple finite resources, and in most cases it bottlenecks on one of them. Most of the time that limiting resource is available memory, processor speed, or disk speed. There are others besides the big three, but the answer is always to measure, locate, and alleviate in a cost-effective way.

Thus, when it comes to adding services, don’t tell me SSL is too slow; tell me what it limits your throughput to. A good example I came across this past week was figuring out the cost of, and most cost-effective way to provide, SSL on EC2. I was impressed that the asker had done a lot of homework beforehand and looked at comparable benchmarks to determine that things “didn’t feel right.” Many questions show no sign of that work. So, whenever you start, consider whether $whatever is worth the cost, and do it with “de minimis” in mind: don’t calculate the small stuff, but if a few thousand in hardware or time is on the line, think it out. Sometimes cost isn’t the only factor, but be aware of it.

Finally, start with the defaults. What works 90% of the time for 90% of folks will probably work for you, or at least be a good starting point to compare against. Optimizing for a target before you’ve been able to measure it is, thanks to the laws of thermodynamics, comparable in efficiency to lighting money on fire. Both will generate heat, but I wouldn’t count on much more besides some ash and a waste of resources.

Build Scalable Designs

What you do on one server, you can do on 100. The tools for this today are better than those of the past. I previously wrote about the benefits of using Puppet even when you have just one server to work on, and it certainly scales well beyond that. Separate your system into programs, configurations, and data. If it’s an installed piece of software, it should be installed by a package manager, even if it is a custom build (learn RPM or DPKG). If you’re altering a configuration file in /etc, use some sort of configuration management tool (I advocate Puppet for deployment and git for revision control of the configuration). Data is anything that isn’t reproducible or is expensive to recreate, and that’s basically everything left over. Databases, uploaded files, user home directories… these should all be backed up, and the backups tested.
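To make that split concrete, here is a minimal Puppet sketch of how it might look. The class name, package, and template path are all made up for illustration, and the data layer is deliberately left to your backup tooling:

class webapp {
  # Programs: installed through the package manager, even for a custom build.
  package { "nginx":
    ensure => installed,
  }

  # Configuration: a file in /etc, rendered from a template that lives in git.
  file { "/etc/nginx/nginx.conf":
    ensure  => file,
    content => template("webapp/nginx.conf.erb"),
    require => Package["nginx"],
    notify  => Service["nginx"],
  }

  service { "nginx":
    ensure => running,
    enable => true,
  }

  # Data: databases, uploads, home directories. Puppet doesn't recreate
  # these; they come back from tested backups.
}

Nothing in the class is precious: the programs come from the package manager, the configuration comes from a template tracked in git, and the data comes back from a backup you have actually tested.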

Speaking of testing, do it regularly. If you expect something to fail over gracefully in production, you had better test it. If you want to restore a server, the same applies. Grab a spare computer that’s floating around; bootstrapping it from no extra configuration or packages at all to just enough to talk to Puppet should let you install all your software and configurations, then restore your data. Does it take you five minutes of work or a day? If you don’t have a spare computer, rent an EC2 instance for a few pennies an hour.
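A restore drill doesn’t need much more than a throwaway node definition pointed at the same classes. Something like this (the node name is made up, and webapp is the hypothetical class sketched above) is enough to find out whether you’re looking at five minutes or a day:

# A scratch box or an EC2 instance rented just for the drill.
node "restore-test" {
  include webapp    # whatever roles the real server carries
}

Let Puppet install everything, restore the data from backup, and check the clock.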

Protect Against the Uncommon

If it blows up, what will you do? Notice how scaling effectively and recovering from a fire have a really big overlap? Can you recover? How fast do you need to recover? What if your system is compromised? All the things that allow you to bring an extra machine online fast are the same ones that allow you to scale up extra machines quickly. All the things that allow fast scaling of new machines also allow quick replacement when something needs to go offline because of damage or intrusion. In short, everything from the scaling tenet applies here, plus considerations for using different locations.

Use Secure Practices

Security is sometimes described as providing confidentiality, availability, and integrity. I’ve covered a lot about availability thus far, so for that category, refer to the sections above. Also, I’m going to skip any talk about authentication in this entry, but that doesn’t mean it isn’t very important. Rather, it’s a dead horse that needs its own article.

Integrity means knowing when things change. It is centered around detection and verification. Compromises are never part of the design, so you must design your system to handle the unexpected. Logs should be centralized, verifiable, and time-coordinated (use NTP!). Files that don’t match the package manager’s checksums indicate failing drives or mischief. Detection tools like Tripwire can help as well, if you use them properly. Consider well-written AppArmor profiles for your applications, with carefully watched alerts for access violations, because those violations indicate something that shouldn’t happen. Integrity is in many ways about having another way to check things. Think of alternate metrics and use them.
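To keep those checks from being one-off hand installs, they can ride along in the configuration too. A hedged Puppet sketch of what that might look like, with a hypothetical AppArmor profile shipped from the module (swap in whatever detection tooling you actually run):

class integrity {
  # debsums re-verifies installed files against the package manager's checksums.
  package { ["debsums", "apparmor"]:
    ensure => installed,
  }

  # A hand-written AppArmor profile for the application, shipped from the
  # module's files/ directory; access violations then show up in the logs.
  file { "/etc/apparmor.d/usr.sbin.nginx":
    ensure  => file,
    source  => "puppet:///modules/integrity/usr.sbin.nginx",
    require => Package["apparmor"],
    notify  => Service["apparmor"],
  }

  service { "apparmor":
    ensure  => running,
    enable  => true,
    require => Package["apparmor"],
  }
}

The access-violation alerts still have to land in those centralized logs; Puppet only makes sure the profile is the same everywhere.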

Confidentiality builds upon integrity in the sense that much of the groundwork for ensuring integrity also serves confidentiality. When the integrity of the controls that limit access is compromised, confidentiality can’t be ensured. Make sure you know what is changing on your system and that you can detect the unexpected. Protecting confidentiality is much the same as protecting integrity, except that you’re detecting reads. Consider “canary” entries in your database that are watched by an IDS, and watch for extremes of normal access as well. A user visiting the account page on your system is normal and rarely worth alerting on. Users suddenly visiting their account pages at five times the normal maximum rate for that time of day ought to trigger an alarm. Is everything you serve up under one megabyte? If you see a netflow larger than that, should it alarm you?

Keep Your Knowledge Fresh

Things change with time, but if you’re not talking with people around you, you might be left configuring your system in 2012 using information from 1998. When that happens, people create 128GB swap partitions on machines with 64GB of RAM (…that are only using 5GB of RAM anyway, probably because they didn’t understand what they tried to measure. That said, the asker shows promise for looking, asking, and having found a pattern to follow!). At the same time, some institutional memory just doesn’t make it because people don’t end up in the right circles to be exposed to it. That’s why SQL injection is still one of the most common causes of system compromise despite over a decade of background knowledge on the issue and solid practices such as binding variables. I originally started down the path of enumerating a few examples, but the OWASP Top 10 really gets the point across for issues among programmers. For sysadmins, out-of-date software is still at the top of the list.

When it comes to staying current, the more face time you get outside your office, the better off you are. The more stories that are shared, the better off you and the entire community are. You get a quicker start and a better body of knowledge. You have a default for situations that you haven’t encountered before, and that’s critical for not reinventing the wheel… again.

Spend time on professional forums. Read questions, ask questions, answer questions. Go to conferences. Attend local meetups. Be a part of your field.

Never Work on a Machine, Apply a Configuration

If you’ve worked with Linux as a systems administrator, or even for your own services, you’ve almost certainly fiddled with a few config files on whatever machines you’re in charge of. You’ve also probably configured the same thing many times over on every new system you get. On every machine I’ve ever owned, I’ve made a user account for myself. They also all have ntp and ssh running. Every machine I work with has good reason to have my public ssh key, so I have to copy that. If I keep going for a while, I’ve suddenly got a list of things to take care of.

When I first got started playing with Linux over a decade ago, I evolved through a few systems until I started playing with Linux From Scratch as a distribution sometime in late 2000 or early 2001 (I’m registered user #23… and 553 because I forgot I registered the first time). After I built my box a few times over, including with KDE and kernel compiles that I remember taking 24 hours, I started getting really excited about the idea of packaging my Linux From Scratch work. Next time I wanted to get all the new versions, why would I want to go back and try to remember all the flags I set? I never had more than one system to deal with at a time, but I knew I’d want to simplify the work because it would come up again.

Since then I’ve mostly lived in a Debian and Ubuntu world because I don’t want to go back and figure out those dependencies, wait 24 hours for KDE to build, or focus on all the software options. Yet even when I’ve whittled myself down (at times) to just one server, the package management system still leaves me needing to do mostly repetitive configuration work. Further, if I want to be good, I’ll keep my iptables configuration up to date. Scripting that is a nuisance, and it’s unfortunate when things are inconsistent.

Putting all my configurations into Puppet is quite a bit of work. The answer to that is to just be incremental. Wikimedia, like Rome, didn’t build its extensive implementation in a day. The suggestion I’ve got is to just do it with the next change you make. Install Puppet on that one machine and use it to deploy that one configuration file. Drop it in the templates directory and don’t sweat the idea of variables. It won’t cost you more than a few minutes of extra time. Don’t worry about making everything, or anything, a variable until you need it.

class base {
  package { "ntp":
    ensure => latest,
  }

  service { $operatingsystem ? {
      "Debian" => "ntp",
      default  => "ntpd",
    }:
    ensure => running,
  }
}

I’ll never again install ntp from the command line, or worry about whether it actually runs. If I create a config file, I can repoint all my servers to different time masters at once. If I don’t want to worry about that, I can just leave it at the defaults that the package uses.
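If I did decide to manage that file, the addition is small. A sketch that would sit inside the base class above, assuming a hypothetical ntp.conf.erb template in the module:

# ntp.conf.erb is a made-up template name; the server list it renders can
# change in one place and roll out everywhere.
$ntp_service = $operatingsystem ? {
  "Debian" => "ntp",
  default  => "ntpd",
}

file { "/etc/ntp.conf":
  ensure  => file,
  content => template("base/ntp.conf.erb"),
  require => Package["ntp"],
  notify  => Service[$ntp_service],
}

The notify means a change to the template restarts the right service name on either OS family.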

It’s going to pay off the next time you upgrade or replace that server: you won’t have to select all the different packages that need to be there, and you won’t be fiddling with extra options or uninstalling excess software, because a bare-minimum install plus a Puppet client will put all the packages you need in place and drop in your configs. Next time I want to test something, I can bring up a dev server on EC2 in minutes.

You get security because you get consistency. That Tripwire, AppArmor, or iptables setting you learned years ago and didn’t carry over to your next system because it took a long time to get right can now be copied everywhere without thought. You get time back without an upfront cost as you slowly roll every change in, even with just a few machines. You become a better systems admin, even if you’re just working on your one server.
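The same goes for those firewall rules. A Debian-flavored sketch (the module path and rules file are hypothetical) that carries an iptables ruleset to every machine:

class firewall {
  # iptables-persistent loads /etc/iptables/rules.v4 at boot on Debian/Ubuntu.
  package { "iptables-persistent":
    ensure => installed,
  }

  file { "/etc/iptables/rules.v4":
    ensure  => file,
    source  => "puppet:///modules/firewall/rules.v4",
    require => Package["iptables-persistent"],
    notify  => Exec["reload-iptables"],
  }

  exec { "reload-iptables":
    command     => "/bin/sh -c '/sbin/iptables-restore < /etc/iptables/rules.v4'",
    refreshonly => true,
  }
}

Get the ruleset right once, and every new machine picks it up on its first run.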

At some point, you’ll find that putting a new machine in place with security, backups, user accounts, monitoring, dns, and the software you need at the start will be nothing more than a case of

node "newhost" {
  include base
}

Now, next time you upgrade that PHP script that’s in a .tar.gz for the 14th time this year and do the WordPress dance of “keep this directory, but copy all new files”, ask yourself: might it be worth it if I write a packaging definition file and just run that against the .tar.gz file next time?

8 Hours in an Emergency Room — Thoughts of Queueing

Yesterday morning I awoke to my girlfriend calling my name from three rooms away – “Jeff, Jeff… wake up Jeff. Jeff, I’m hurt!” Skip a lot of fast-moving details and we find ourselves in a room at the ER with the kind of real problem that gets you through triage in minutes… and in this case leaves you waiting unnecessary hours for your discharge. That we spent so much time in the room was a problem for the entire department as they were filling up the hallways with patients.

Emergency rooms are known to those outside the medical profession for their wait times. This is connected to the fact that the person who needs stitches or is running a high fever is serviced with the same resources as the person who came in from a helicopter on the roof with their arm falling off. That’s not enough of an explanation, however, when the average wait just to be seen is 3 hours and 42 minutes.

Resource utilization and compartmentalized responsibility may be the biggest factors to address. In any system with a single lead, you have a clear setup. When multiple people are responsible for a process at multiple stages, many of them with similar authority, the system becomes too complex for the parts to work together on their own. If nobody has an overview of the process, it can feel like trying to drive down a road of timed stoplights in the middle of the night. Failing to coordinate such a system can result in huge backups once there is traffic, even though the number of vehicles per hour doesn’t change.

In a bustling emergency room with multiple units, it may be that nobody is following the patient’s status throughout delivery of treatment, to the same effect as when nobody coordinates the stoplights. Triage bins patients according to priority and ensures that urgent cases are attended to first. Once they leave the waiting area, triage is uninvolved. Nurses prep patients, doctors attend to them, nurses carry out medication orders, and techs handle the more routine issues such as monitoring blood pressure. Eventually the doctor will sign the patient over to another department or discharge them, but their attention is focused on medical issues, not on flow.

Consider the flow of customers and orders within a restaurant. The host is aware of capacity and utilization and seats people according to reservations and group size (triage). Wait staff tend to guests and take their food orders (nurses). Some staff roam the restaurant filling water glasses as needed (techs). Chefs prepare the main dishes (doctors), and in a bustling operation many of them may work on a single dish (specialists). The trick is that all of a table’s food must come out quickly and at the same time. Wait staff can’t coordinate that because they’re tending to tables. Chefs need to focus on the food itself (diagnosis / patients / complex interactions). The missing link is the expeditor, who manages all the kitchen queues and fires orders in a way that ensures everything is addressed at the proper time.

Hospitals generally lack a position that’s analogous to this. A non-medical individual who never interacts with patients yet is in charge of movement probably boggles a few minds in the field. I was a bit boggled myself when I did some searching and realized that the first good example of the idea forming in my head had been implemented in an ER by a facilities cleaning company.

Queueing problems are not unique to hospitals in any way, and some great solutions for them come from outside. Besides the restaurant view, there are many other instances where an individual with an overview of the whole situation is a benefit. Being able to view an entire supply chain of parts can produce great increases in efficiency and reliability. Safety officers on fire grounds and hazmat scenes are an example: everybody is focused on safety, yet an individual not involved in fighting the fire provides the overview that keeps everybody safe, because the interaction of several companies at once is complex. A growing number of police departments have civilians who determine resource allocation based on information from many officers and departments. Even the railroad I volunteer at benefits greatly from an outsider asking why all the rail isn’t pulled up at once and the idle backhoe used to grade everything, instead of doing one section at a time.

Queueing is also worth considering for a patient just sitting in a room. Whenever we needed attention, there was one button to push. Whether it was a trivial question that could wait 10 minutes or a spurting vein, there was only one button to push. While we always want that big red button to be easily accessible to anybody, a patient who can press smaller 5 or 10 minute request buttons, or even type up their requests, will appreciate not feeling demanding, and the staff will benefit from being able to service those requests more efficiently. A patient might ask for medication when the pain starts to increase, rather than waiting for somebody to stop in incidentally or alerting only when it becomes seriously uncomfortable and requires immediate attention.

Having an overall view is needed to efficiently coordinate any complex system with multiple pieces that work together, whether it’s the traffic lights on the evening commute, the different military units storming Normandy, or the person in your ER who reserves a CT scan ahead of time so that the contrast agent can be consumed just before an ultrasound, leaving the patient absorbing their dye during an already needed test instead of waiting in a bed between tests.

The First One

The history of the world tells me that I first registered this domain in 2003. I filled it with random writings and a bit too much angst… but I did write my own SQL backend for it. The problem was the same one most blogs suffer from: a limited audience, limited potential, and temporal relevance. If I didn’t write regularly, it was valueless. The stuff I wrote previously also lost value. While it’s amusing to look back on, I want something of lasting value.

These past years I’ve been forced to learn to adapt in different ways. I’ve come to understand that there’s a useful limit to technical knowledge in the world. At a certain point, it matters less what you know and more how you can interact. Put another way, you eventually have to start trading on your name. When it comes to technology, I live in a pretty small city for that.

I’d say around June I decided to start fixing that. Stackoverflow.com was a useful site to me, mostly as a concept. While I’ve made my money in part by programming, I never felt it was my strongest calling (even if I did start enthusiastically looking at The Art of Computer Programming this week). A spinoff site, security.stackexchange.com, really did do it for me, though. I like the feeling that pointing to my user account there is a way to say, “This is who I am, and I do know my stuff.”

I also found that writing outside that format was useful. I’ve been a contributor to the Security.SE blog for some time now, and plan to continue that. Yet, I think the things I learn and the traits I want to present extend beyond that. So, for an open forum where I can choose the topic, I give myself this goal: Write things with a technical focus that will individually and collectively be a useful reference to those who may read them.

Thus: for a reason to write, a place for my thoughts, and some good ol’ shameless self-promotion, I have a real website.