Tales of Shmoocon 2012: Joining Labs and Building the Network

I have been going to DEFCON for three years now with a group of fellow hackers from my town. Some of them went to Shmoocon last year and reported that I should definitely come down for 2012. So, with all possible insanity in mind, four of us jumped in a car and drove 10 hours through the night to the con.

Because one of my group had signed up for this thing called “Labs,” we arrived at 6am and a day early. I found out that morning that Labs was a pre-registration event I hadn’t heard about in time, and then I napped until just after 9.

Now, I’ve been to DC many times. Given the choice between wandering the city again and doing something exciting surrounding the con, I went for the con. At DEFCON I’d made a habit of getting into guest-listed parties I didn’t know about until it was too late, and I’ve made it a personal point of pride to keep up the habit every year. So, in a move that I’m sure you won’t be able to pull off next year after this is published, I strolled into the labs area at 9:30, sat down at the management network / core table, and became part of the team… which happened to include my friend and another local I didn’t expect. Maine was very well represented.

Labs Begins

Labs is split up into several teams and I’m sure there was some consensus as to how this should be apportioned before I arrived. Each team had their own table around the edge of the room with the routing, switching, and firewall (“network”) group at a table in the middle.

At the start of the day, my new group talked about the services we wanted to roll out. We talked about monitoring, so I staked a claim on deploying Nagios. We went around the table calling off IP addresses to assign to our machines and started hacking away.

Immediately I hit the same snag as everybody else: we didn’t have internet on our little network. We couldn’t look anything up and, even more critically, we couldn’t install any software packages that weren’t already on a disc. Everything had to wait until we could get online, so I grabbed a long cable and some gaffer’s tape and walked over to the network group to ask for a VLAN to the Internet. I gave it a quick test before we plugged into the switch, and suddenly my laptop was pulling 100mbit to the whole world. We quickly throttled that down to the level we had paid for, and with that I plugged in the management network. It wasn’t the final solution; I knew we would later have to change to a real switch and establish a proper presence on the network, but it was the right-now solution that got us to work.

Building it Proper

Having spent some time preaching the ideas behind Puppet and loathing the Nagios config files, I made the decision to do all my Nagios setup with Puppet. The idea paid off very well. Machines were set up as Puppet clients and then I basically never touched them. Working only from the master configuration, I typed out a basic setup that pushed my favorite utility packages (ntp, vim, etc.) and registered each system with the Nagios server. I ate my own dog food and I loved it.

Where we originally started testing all our services on one server, by the end of the day we had deployed them to their own respective machines. With a VM server established after lunch, machines were appropriated for different tasks. As our needs for machines changed, we found it easiest to just keep the same original starting image and use Puppet to push to them. First up was the LDAP server coming online. I dumped the configuration and certificate files from the LDAP system into Puppet and spent a few iterations testing it out, then moved it from the test machine’s config to the general template and it was live. When it caused problems on a machine that was cut off for a while by firewall rules, I was able to carve out an if/else statement in the configuration template and exempt the system. When the internal DNS server came up, I pushed the new resolv.conf to everything.

Things broke too. They broke a lot, and sometimes we knew they’d break even before they did. It wasn’t a surprise when our team lost internet for a while as we moved to our “real” switch with the proper trunk configs. Things broke when Splunk chewed up the resources of the VM it was on and had to be moved somewhere with more power. They broke when the firewall started destroying every SSL connection after a few packets… and I mean every one, on every port, across every VLAN and the Internet, and only SSL, and only after negotiation. We scratched our heads when one VLAN on our DHCP server went stupid and couldn’t hear packets sent to it, and then we scratched our heads more when changing that machine’s routing table turned out to be a ghost fix for the problem. That was a real stumper, because the problem was with inbound packets, but that change was what made it work.

Almost Perfect

On the whole, things were fantastic. Everybody broke off their own bit of the work and started in, everybody helped with other people’s problems, and everybody lived the spirit of things. We could have done a better job of knocking out dependencies like the early internet connection, or of getting everybody using a ticketing system right from the start to handle requests between the different pieces of software. I’m surprised the firewall crew didn’t want blood by the end of the day, given the number of adjustments they had to make for all the different teams and problems they encountered; we did everything with them verbally.

Yet, I think Labs shouldn’t be “quite there” in any year, and for good reason. Labs is about experimenting and pushing the edge. Sure, we may tackle those tidbits next year, but it was about more than just a network. It was seeing 50 people in a room, with only the basic managerial structure of “you’re in this group,” do something big in a day. It was about experimenting. Our central logging was good practice, but it was also a learning exercise for the at least six different folks we pushed our logs out to for analysis. It was about cracking jokes over an AirTunes box that was accidentally broadcasting the management network over open Wifi, and being impressed by the wireless folks who were running the tools to detect those things.

I’ve done consulting for a while and I’ve seen a lot of networks over the years. In a day’s time, we shined and we built it right. I heard stories, I shared stories, I saw some impressive setups built from garage leftovers and vendor diamonds, and I hacked until I was too tired to keep going.

… But Perfect!

I’ve built my own networks to play on, I’ve worked in lab setups with their own dedicated 100-rack server rooms, and I’ve built a lab network with over 125 machines, but I’ve never had as much fun, or felt as much a part of a smooth team, as I did with these 50 folks who worked together for only a weekend. So, if they don’t give me grief for gaming the system to get in this year, I’ll be applying and attending next time.

Tenets of Good Admins

There is no shortage of folks running around the world trying to create the next Facebook. They crop up regularly on Serverfault asking what it takes to handle 10,000 connections per second for their next big idea website. I don’t have any good examples to reference because they’re bad questions; bad questions are closed and deleted. What those questions would show is a common problem in the world of system administration: the basic lesson unlearned.

System administration, programming, or any highly skilled task has a lot in common with driving a car (who doesn’t love car analogies?). Getting through your driver’s education course does not make you capable of power-sliding in the snow, doing well on a Top Gear Power Lap, or even doing a decent job of parallel parking. Ideas you have about how a car handles may even be flat-out wrong. That’s the effect of being new, or of being old-hat but never having experienced the right lessons.

So, after many years of working with systems and a few of toying with questions on StackExchange websites, I offer some proposals for the tenets of good system administration:

Always Measure

There is an ample supply of questions on Serverfault of the form, “Is this faster?” or “Why is that slower?” The answer to those questions in most cases is, “we don’t know your hardware; measure it and find out.” A computer consists of multiple finite resources and in most cases bottlenecks on one of them. Most of the time that limiting resource is available memory, processor speed, or disk speed. There are others besides the big three, but the answer is always to measure, locate, and alleviate in a cost-effective way.
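
As a concrete illustration of “measure and locate,” here is a minimal sketch that samples the big three for about a minute; it assumes the third-party psutil package is installed, and the five-second interval and twelve samples are arbitrary choices:

```python
# Sample CPU, memory, and disk activity to see which resource is pegged.
# Assumes the third-party psutil package (pip install psutil).
import psutil

prev = psutil.disk_io_counters()
for _ in range(12):                          # roughly one minute of samples
    cpu = psutil.cpu_percent(interval=5)     # blocks for the 5-second interval
    mem = psutil.virtual_memory().percent
    now = psutil.disk_io_counters()
    read_mb = (now.read_bytes - prev.read_bytes) / 1e6
    write_mb = (now.write_bytes - prev.write_bytes) / 1e6
    prev = now
    print(f"cpu {cpu:5.1f}%  mem {mem:5.1f}%  "
          f"disk {read_mb:7.1f} MB read / {write_mb:7.1f} MB written")
```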

Thus, when it comes to adding services, don’t tell me SSL is too slow; tell me what it limits your speed to. A good example I came across this past week was figuring out the cost of, and most cost-effective way to provide, SSL on EC2. I was impressed that the asker had done a lot of homework beforehand and looked at comparable benchmarks to determine that things “didn’t feel right.” I find many questions where that homework isn’t done. So, whenever you start, consider whether $whatever is worth the cost, and do it with “de minimis” in mind: don’t calculate the small stuff, but if a few thousand in hardware or time is on the line, think it out. Sometimes cost isn’t the only factor, but be aware of it.
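
In the same spirit, if SSL is the suspect, time what it actually costs you. Here is a rough standard-library sketch; the host, port, and sample count are placeholders, and it measures the TCP connect together with the TLS handshake:

```python
# Time TCP connect + TLS handshake to put a number on "SSL is slow".
import socket
import ssl
import time

HOST, PORT, SAMPLES = "www.example.com", 443, 20   # placeholders

def handshake_seconds(host, port):
    ctx = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass                                   # handshake happens during wrap
    return time.perf_counter() - start

times = sorted(handshake_seconds(HOST, PORT) for _ in range(SAMPLES))
print(f"median: {times[len(times) // 2] * 1000:.1f} ms, "
      f"worst: {times[-1] * 1000:.1f} ms")
```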

Finally, start with the defaults. What works 90% of the time for 90% of folks will probably work for you, or at least be a good starting point to compare against. Optimizing for a target before you’ve been able to measure it is, thanks to the laws of thermodynamics, comparable in efficiency to lighting money on fire. Both will generate heat, but I wouldn’t count on much more besides some ash and a waste of resources.

Build Scalable Designs

What you do on one server, you can do on 100. The tools for this today are better than those of the past. I previously wrote about the benefits of using Puppet even when you have just one server to work on, and it certainly scales well beyond that. Separate your system into programs, configurations, and data. If it’s an installed piece of software, it should be installed by a package manager, even if it is a custom build (learn RPM or DPKG). If you’re altering a configuration file in /etc, use some sort of configuration management tool (I advocate Puppet for deployment and git for configuration management). Data is anything that isn’t reproducible or is expensive to recreate, and that’s basically everything left over. Databases, uploaded files, user home directories… these should all be backed up and tested.
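
As a small sanity check on the configuration piece, the sketch below assumes /etc is kept in a git repository (etckeeper-style) and lists anything that has drifted from what is committed; every line it prints is a change that never made it into version control:

```python
# List files under /etc that differ from, or were never added to, the git repo there.
# Assumes /etc is already a git repository (for example via etckeeper).
import subprocess

def drifted(repo="/etc"):
    out = subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    # porcelain format: two status characters, a space, then the path
    return [line[3:] for line in out.splitlines()]

for path in drifted():
    print("unmanaged change:", path)
```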

Speaking of testing, do it regularly. If you expect something to hot-failover gracefully in production, you had better test it. If you want to be able to restore a server, the same applies. Grab a spare computer that’s floating around and bootstrap it: going from no extra configuration or packages at all to just enough to talk to Puppet should let you install all your software and configurations, then restore your data. Does it take you five minutes of work or a day? If you don’t have a spare computer, rent an EC2 instance for a few pennies an hour.
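
One way to prove the data actually came back intact after such a drill is to re-hash it against a manifest taken at backup time; the manifest name and “sha256  path” format below are assumptions made for the sake of the sketch:

```python
# After a test restore, re-hash every file and compare against the checksum
# manifest written when the backup was taken (assumed format: "sha256  path").
import hashlib
import sys

def sha256(path, chunk=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while block := handle.read(chunk):
            digest.update(block)
    return digest.hexdigest()

failures = 0
with open("backup.manifest") as manifest:          # hypothetical manifest file
    for line in manifest:
        expected, path = line.split(maxsplit=1)
        path = path.strip()
        if sha256(path) != expected:
            failures += 1
            print("mismatch after restore:", path)

sys.exit(1 if failures else 0)
```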

Protect Against the Uncommon

If it blows up, what will you do? Notice how scaling effectively and recovering from a fire have a really big overlap? Can you recover? How fast do you need to recover? What if your system is compromised? All the things that allow you to bring an extra machine online fast are the same ones that allow you to scale up extra machines quickly, and all the things that allow fast scaling of new machines allow quick replacement when something needs to go offline because of damage or intrusion. In short, everything from the scaling section applies here, plus considerations for using different locations.

Use Secure Practices

Security is sometimes described as providing confidentiality, integrity, and availability. I’ve covered a lot about availability thus far, so for that category, refer back to the sections above. I’m also going to skip any talk about authentication in this entry, but that doesn’t mean it isn’t very important. Rather, it’s a dead horse that needs its own article.

Integrity means knowing when things change. It is centered on detection and verification. Since compromises are never part of the design, you must design your system to handle the unexpected. Logs should be centralized, verifiable, and time-coordinated (use NTP!). Files that don’t match the package manager’s checksums indicate failing drives or mischief. Detection tools like Tripwire can help as well if used properly. Consider well-written AppArmor profiles for your applications, with carefully watched alerts for access violations, because a violation indicates something that shouldn’t be happening. Integrity is in many ways about having another way to check things. Think of alternate metrics and use them.
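
On an RPM-based system, one concrete version of the checksum check is to ask the package manager itself which files no longer match what it installed; in rpm’s verify output the third column (“5”) flags a digest mismatch. A minimal sketch:

```python
# Flag files whose on-disk digest no longer matches the RPM database.
# Config files you manage on purpose will show up too; the interesting
# entries are binaries and libraries you never meant to change.
import subprocess

def digest_mismatches():
    # rpm exits non-zero when anything differs, so don't use check=True
    out = subprocess.run(["rpm", "-Va"], capture_output=True, text=True).stdout
    changed = []
    for line in out.splitlines():
        # verify lines start with a 9-character attribute string; index 2 is the digest flag
        if len(line) > 9 and line[2] == "5" and "/" in line:
            changed.append(line[line.index("/"):])
    return changed

for path in digest_mismatches():
    print("digest changed since install:", path)
```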

Confidentiality builds upon integrity, in the sense that much of the groundwork for ensuring integrity also serves confidentiality. When the integrity of the controls that limit access is compromised, confidentiality can’t be ensured. Make sure you know what is changing on your system and that you can detect the unexpected. Protecting confidentiality is much the same as protecting integrity, but focused on detecting reads. Consider “canary” entries in your database that are watched by an IDS, and watch for extremes of normal access as well. A user visiting their account page on your system is normal and rarely worth alerting on; users suddenly visiting their account pages at five times the normal maximum rate for that time of day ought to be an alarm. Is everything you serve up under one megabyte? If you see a netflow larger than that, should it alarm you?
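
To make the “five times the normal maximum” idea concrete, here is a rough sketch; the log path, URL, baseline, and common-log-format timestamp handling are all assumptions for illustration:

```python
# Count account-page hits per minute in a web access log and alarm when the
# rate blows past a multiple of the known-normal baseline for that time of day.
from collections import Counter

BASELINE_PER_MINUTE = 40      # assumed "normal maximum" for this time of day
ALERT_MULTIPLIER = 5

hits = Counter()
with open("/var/log/nginx/access.log") as log:     # hypothetical log path
    for line in log:
        if '"GET /account' not in line or "[" not in line:
            continue
        # common log format: [10/Feb/2012:14:03:05 -0500]; keep only to the minute
        minute = line.split("[", 1)[1].split("]", 1)[0][:17]
        hits[minute] += 1

for minute, count in sorted(hits.items()):
    if count > BASELINE_PER_MINUTE * ALERT_MULTIPLIER:
        print(f"ALARM: {count} account-page hits during {minute}")
```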

Keep Your Knowledge Fresh

Things change with time, but if you’re not talking with the people around you, you might be left configuring your system in 2012 using information from 1998. When that happens, people create 128GB swap partitions on 64GB RAM machines (…that are only using 5GB of RAM anyway, probably because they didn’t understand what they were trying to measure. That said, the asker shows promise for looking, asking, and having found a pattern to follow!). At the same time, some institutional memory just doesn’t make it through because people don’t end up in the right circles to be exposed to it. That’s why SQL injection is still one of the most common causes of system compromise despite over a decade of background knowledge on the issue and solid practices such as binding variables. I originally started down the path of enumerating a few examples, but the OWASP Top 10 really gets the point across for issues among programmers. For sysadmins, out-of-date software is still at the top of the list.
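
Since binding variables came up, here is a small self-contained illustration (stdlib sqlite3, made-up table and values) of why a bound parameter closes the injection hole that string-built SQL leaves open:

```python
# Contrast string-built SQL (injectable) with a bound parameter (safe).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

name = "nobody' OR '1'='1"    # hostile input

# Vulnerable: the input is spliced into the SQL text, so the OR clause runs
# and every row's secret comes back.
leaked = conn.execute(
    "SELECT secret FROM users WHERE name = '%s'" % name
).fetchall()
print("concatenated query leaked:", leaked)

# Safe: the driver binds the value, so it is compared as data, never parsed as SQL.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (name,)
).fetchall()
print("bound parameter returned:", safe)
```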

When it comes to staying current, the more face time you get outside your office, the better off you are. The more stories that are shared, the better off you and the entire community are. You get a quicker start and a better body of knowledge. You have a default for situations you haven’t encountered before, and that’s critical for not reinventing the wheel… again.

Spend time on professional forums. Read questions, ask questions, answer questions. Go to conferences. Attend local meetups. Be a part of your field.