There is no shortage of folks running around the world trying to create the next Facebook. They crop up regularly on Serverfault asking what it takes to handle 10,000 connections per second for their next big idea website. I don’t have any good examples to reference because they’re bad questions; bad questions are closed and deleted. What those questions would show is a common problem in the world of system administration: the basic lesson unlearned.
System administration, programming, or any highly skilled task has many aspects in common with driving a car (who doesn’t love car analogies?). Getting through your driver’s education course does not make you capable of power-sliding in the snow, doing well in a Top Gear Power Lap, or even doing a decent job of parallel parking. Ideas you have about how a car handles may even be flat-out wrong. That’s the effect of being new, or of being an old hand who never experienced the right lessons.
So, after many years of working with systems and a few of toying with questions on StackExchange websites, I offer some proposals for the tenets of good system administration:
There is an ample supply of questions on Serverfault along the lines of, “Is this faster?” or “Why is that slower?” The answer to those questions in most cases is, “We don’t know your hardware; measure it and find out.” A computer consists of multiple finite resources and in most cases bottlenecks on one of those resources. Most of the time that resource limitation is available memory, processor speed, or disk speed. There are others besides the big three, but the answer is always to measure, locate, and alleviate in a cost-effective way.
Thus, when it comes to adding services, don’t tell me SSL is too slow; tell me what speed it limits you to. A good example I came across this past week was a question about the cost of providing SSL on EC2 and the most cost-effective way to do it. I was impressed that the asker had done a lot of homework beforehand and had looked at comparable benchmarks to determine that things “didn’t feel right.” I find many questions where this task isn’t completed. So, whenever you start, consider whether $whatever is worth the cost, and do it with a “de minimis” threshold in mind: don’t calculate the small stuff, but if a few thousand in hardware or time is on the line, think it out. Sometimes cost isn’t the only factor, but be aware of it.
Finally, do the default to start. What works 90% of the time for 90% of folks will probably work for you, or at least be a good starting point to compare against. Optimizing for a target before you’ve been able to measure it is, thanks to the laws of thermodynamics, comparable in efficiency to lighting money on fire. Both will generate heat, but I wouldn’t count on much more besides some ash and a waste of resources.
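To make “measure it and find out” a little more concrete, here is a minimal sketch in Python of timing two candidate approaches before committing to either. The helper names `measure` and `faster_of` are made up for this illustration, and wall-clock timing of a callable is only one kind of measurement; a real investigation would also look at memory, disk, and network.

```python
import statistics
import time


def measure(func, repeats=5):
    """Return the median wall-clock time, in seconds, of calling func()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to a single noisy run than the mean.
    return statistics.median(samples)


def faster_of(candidates, repeats=5):
    """Time each named callable and return (fastest_name, all_timings)."""
    timings = {name: measure(func, repeats) for name, func in candidates.items()}
    fastest = min(timings, key=timings.get)
    return fastest, timings
```

Calling `faster_of({"default": run_default, "tuned": run_tuned})` with your own two callables (hypothetical names here) gives you numbers to argue about instead of a feeling, and the default configuration becomes the baseline to compare against.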
Build Scalable Designs
What you do on one server, you can do on 100. The tools for this today are better than those of the past. I previously wrote about the benefits of using Puppet even when you have just one server to work on, and it certainly scales well beyond that. Separate your system into programs, configurations, and data. If it’s an installed piece of software, it should be installed by a package manager even if it is a custom build (learn RPM or DPKG). If you’re altering a configuration file in /etc, use some sort of configuration management tool (I advocate Puppet for deployment and git for configuration management). Data would be anything that isn’t reproducible or is expensive to recreate, and that’s basically everything left over. Databases, uploaded files, user home directories… these should all be backed up, and the backups tested.
Speaking of testing, do it regularly. If you expect something to gracefully hot-failover in production, you had better test it. If you want to restore a server, the same applies. Grab a spare computer that’s floating around and bootstrap it: starting with no extra configuration or packages at all, installing just enough to talk to Puppet should allow you to install all software and configurations, then restore your data. Does it take you five minutes of work or a day? If you don’t have a spare computer, rent an EC2 instance for a few pennies an hour.
Protect Against the Uncommon
If it blows up, what will you do? Can you recover? How fast do you need to recover? What if your system is compromised? Notice how scaling effectively and recovering from a fire have a really big overlap? All the things that allow you to bring an extra machine online fast are the same ones that allow you to scale up extra machines quickly. All the things that allow fast scaling of new machines allow quick replacement when something needs to go offline because of damage or intrusion. In short, everything in the scaling entry above applies here too, plus considerations for using different locations.
Use Secure Practices
Security is sometimes described as providing confidentiality, availability, and integrity. I’ve covered a lot about availability thus far, so refer to the entries above for that category. Also, I’m going to skip any talk about authentication in this entry, but that doesn’t mean it isn’t a very important thing. Rather, it’s a dead horse that needs its own article.
Integrity means knowing when things change. It is centered around detection and verification. Since compromises to systems are never part of the design, you must design your system to handle the unexpected. Logs should be centralized, verifiable, and time-coordinated (use NTP!). Files that don’t match the package manager’s checksum indicate failing drives or mischief. Detection tools like Tripwire can help as well if you use them well. Consider well-written AppArmor profiles for your applications, with carefully watched alerts for access violations, because those indicate something that shouldn’t happen. Integrity is in many ways about having another way to check things. Think of alternate metrics and use them.
Confidentiality builds upon integrity in the sense that much of the groundwork for ensuring integrity also applies to ensuring confidentiality. When the integrity of the system controls that limit access is compromised, confidentiality can’t be ensured. Make sure you know what is changing on your system and that you can detect the unexpected. Protecting confidentiality is much the same as protecting integrity, except that you are detecting reads. Consider “canary” entries in your database that are watched by an IDS, and watch for extremes of normal access as well. A user visiting an account webpage on your system is normal and rarely worth alerting on. Users suddenly visiting their account pages at five times the normal maximum rate for that time of day ought to be an alarm. Is everything you serve up under one megabyte? If you see a netflow that is larger than that, should it alarm you?
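The checksum idea can be sketched in a few lines of Python. This is a simplified illustration only, not how Tripwire or any package manager actually works: the `baseline` mapping of path to expected SHA-256 digest and the function names are assumptions made for the example, and a real baseline would have to be stored somewhere an intruder cannot rewrite it.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed(baseline):
    """Return the sorted paths whose content no longer matches the baseline.

    baseline maps a file path to the digest recorded when the file was
    known to be good. A missing file counts as changed.
    """
    changed = []
    for path, expected in baseline.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            changed.append(path)
    return sorted(changed)
```

An empty result means nothing in the baseline moved; anything else is either a change you made and forgot to record, a failing drive, or mischief, and each of those is worth knowing about.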
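The two alarms described above, the access rate and the oversized flow, come down to simple threshold checks once you have the numbers. Here is a rough Python sketch; the function names, the `(flow_id, byte_count)` tuple shape, and treating one megabyte as 1,000,000 bytes are all assumptions for illustration, and gathering the rates and flow sizes is the part your monitoring has to supply.

```python
# Assumption: "one megabyte" is taken as 1,000,000 bytes for this sketch.
MEGABYTE = 1_000_000


def rate_alarm(current_rate, normal_max_rate, factor=5.0):
    """True when the current rate reaches factor times the normal maximum.

    normal_max_rate should be the usual peak for this time of day.
    """
    return current_rate >= factor * normal_max_rate


def oversized_flows(flows, limit_bytes=MEGABYTE):
    """Return the ids of flows that moved more bytes than limit_bytes.

    flows is an iterable of (flow_id, byte_count) pairs.
    """
    return [flow_id for flow_id, size in flows if size > limit_bytes]
```

For example, `rate_alarm(600, 100)` is true and `rate_alarm(300, 100)` is not, while `oversized_flows([("a", 2_000), ("b", 5_000_000)])` returns `["b"]`. The interesting work is choosing the baseline, not writing the comparison.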
Keep Your Knowledge Fresh
Things change with time, but if you’re not talking with people around you, you might be left configuring your system in 2012 using information from 1998. When that happens, people create 128GB swap partitions on 64GB RAM machines (…that are only using 5GB of RAM anyway, probably because they didn’t understand what they tried to measure. That said, the asker shows promise for looking, asking, and having found a pattern to follow!). At the same time, some institutional memory just doesn’t make it because people don’t end up in the right circles to be exposed to it. That’s why SQL injection is still one of the most common causes of system compromise despite over a decade of background knowledge on the issue and solid practices such as binding variables. I originally started down the path of enumerating a few examples, but the OWASP Top 10 really gets the point across for issues among programmers. For sysadmins, out-of-date software is still at the top of the list.
When it comes to staying current, the more face time you get outside your office, the better off you are. The more stories that are shared, the better off you and the entire community are. You get a quicker start and a better body of knowledge. You have a default for situations that you haven’t encountered before, and that’s critical for not reinventing the wheel… again.
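For anyone who hasn’t seen variable binding in practice, here is a small sketch using Python’s standard `sqlite3` module. The `users` table and the `find_user` helper are made up for the example; the point is only that the `?` placeholder passes the value to the database as data rather than splicing it into the SQL text.

```python
import sqlite3


def find_user(conn, name):
    """Look up users by name using a bound parameter, not string formatting."""
    # The ? placeholder binds `name` as a value; it is never parsed as SQL.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


# Hypothetical in-memory table, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
conn.commit()
```

With this in place, a classic injection attempt such as `find_user(conn, "' OR '1'='1")` is simply treated as an odd name that matches nobody, where the same string concatenated into the query would have matched every row.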
Spend time on professional forums. Read questions, ask questions, answer questions. Go to conferences. Attend local meetups. Be a part of your field.