On storing passwords: Don’t go all Ashley Madison; use write-only access.

Let’s start with the basics: you’re supposed to hash your passwords. Plain, fast cryptographic hashes are a long-outdated recommendation here: you should be using PBKDF2, bcrypt, or scrypt. That’s the standard we call good, but there’s more you should do beyond how the hashes are written to disk.

Now that you’ve stored them securely, make sure you don’t go Ashley Madison: don’t use passwords as a basis for anything! In fact, you should never read a password hash for any purpose other than to compare it to a hash. It should be absolutely impossible for your webservers to read the password hash. Webservers should only have access to meta information about the password hash: things like the salt and the number of rounds. The actual comparison to check if the password is valid for a user should be a function call that returns true or false. You can implement this as a stored procedure in your database engine or provide a dedicated authentication service. It’s fine if your webserver can insert or update a password, but it should not have the permission to select it.
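
The shape of that boundary can be sketched in a few lines. This is a minimal, made-up sketch (not any particular site’s schema): the only read path is a function that answers yes or no, and the salt and round count are the only metadata a caller could ever inspect.

```python
import hashlib
import hmac
import os

# Toy in-memory "table"; in production this sits behind a stored procedure
# or a dedicated auth service, and the webserver has no SELECT on the hash.
_CREDENTIALS = {}

def store_password(user, password, rounds=100_000):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    _CREDENTIALS[user] = (salt, rounds, digest)

def check_password(user, password):
    """The only read path: answers True or False, never returns the hash."""
    if user not in _CREDENTIALS:
        return False
    salt, rounds, digest = _CREDENTIALS[user]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    return hmac.compare_digest(candidate, digest)
```

In a SQL backend the equivalent is a stored procedure doing the comparison, with the webserver granted EXECUTE (plus INSERT/UPDATE) but no SELECT on the password column.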

Which brings up the next part: if you can’t read a password, you certainly shouldn’t have cause to display it; a site that can show you your password has so many things wrong with it.

Finally, for good measure, I give this admonition to anybody creating session tokens for a browser: create them with random values and store them in their own table so they can be cleared out server-side. It should be possible for you as an administrator to terminate some or all logged-in sessions without altering anything else. Each session should have its own random cookie. Facebook, for example, provides this functionality very cleanly:

Facebook allows you to end individual login sessions

But why is my music offline? BGP tracing post-Hurricane Sandy

At the time, I couldn’t reach them from California either. That presented a bit of curiosity, so I started with a traceroute to figure out what might have happened.

$ traceroute di.fm
 traceroute to di.fm (, 30 hops max, 60 byte packets
 5 xe-10-1-0.edge1.SanJose1.Level3.net ( 5.906 ms 5.902 ms 6.568 ms
 6 vlan90.csw4.SanJose1.Level3.net ( 6.524 ms 6.278 ms vlan80.csw3.SanJose1.Level3.net ( 6.688 ms
 7 ae-71-71.ebr1.SanJose1.Level3.net ( 6.478 ms 4.454 ms ae-61-61.ebr1.SanJose1.Level3.net ( 4.625 ms
 8 ae-2-2.ebr2.NewYork1.Level3.net ( 77.596 ms 77.703 ms 77.687 ms
 9 ae-92-92.csw4.NewYork1.Level3.net ( 78.191 ms ae-62-62.csw1.NewYork1.Level3.net ( 93.865 ms 93.851 ms
 10 ae-81-81.ebr1.NewYork1.Level3.net ( 77.554 ms 77.763 ms ae-71-71.ebr1.NewYork1.Level3.net ( 77.761 ms
 11 ae-2-2.ebr1.Newark1.Level3.net ( 89.285 ms 78.158 ms 88.893 ms
 12 ae-1-51.edge3.Newark1.Level3.net ( 77.901 ms 77.647 ms 78.381 ms
 13 * * *

It looks like I can get to Level 3’s network, but something is breaking either within their system or at the hand-off to another network. Often the first step is to check from a few other locations. I tend to log into my EC2 instance in Virginia when I want a view from somewhere else, but I’m already confident the problem isn’t on my end, since I’m seeing reports on Twitter too. So, we’ll move to the next step: the looking glass.

Looking glasses are machines on provider networks that provide information to the public and other providers to help with diagnosing problems. BGP4.as provides a good list of looking glass systems.

I first started with the Hurricane Electric looking glass. Their system won’t let me link directly, but typing in the address and selecting “BGP Route” gave me the information I was looking for. The key point is that the address I was looking for was advertised by AS29791. Voxel.net is now part of Agile Hosting Solutions, which is owned by Internap, and indeed visiting Voxel.net will bring you to Internap.com.

We can learn lots about an Autonomous System through its Peering Database entry as well. Between the traceroute and the peering database entry, we can presume that their system went offline in the 165 Halsey Street “meet-me” room in Newark, NJ.

Knowing this, we can try to find other peering points that might still work, which would tell us whether the network is down only for people entering via Level 3. Searching for networks Voxel peers with, or for other exchange points, can be helpful here. A less involved method is to cue up a looking glass site and select many different locations to ping from at once.

I tried to paste some text, but WordPress decided to use the “Format Like Hell” option and manually pasting in the CSS was… unpleasant. Regardless, from the screenshot we can see that the site is now online, but still hobbling from Fremont. Every time I run traceroute, the mci1.he.net route (Level3 Kansas City) is ok, but the fmt1.he.net (Hurricane Electric Fremont 1) link shows packet loss in the Voxel network.

As a network provider, this would be a time to consider killing the route. As an end-user, if it’s bad enough, I can try to call any of the providers in my upstream path as seen from my traceroute and convince them to change the peering. If I can’t get any help, I can also tunnel my connections through a machine that does have a working path. Most people shouldn’t be trying to get BGP routes updated, but network administrators should be familiar with the process and know whom to call from a network operations center contact list. See also the NANOG mailing list.

With that, I can now find a better way to stream my online music and know what part of the Internet is broken.

LinkedIn hash leak analysis

Today it was announced that LinkedIn was compromised at some point and 6.5 million unsalted SHA1 hashes were posted. LinkedIn has since confirmed that the hashes correspond to accounts on their site. Before the official announcement, though, I was curious.

Trying to confirm

The first question was, “Is this real?” Since I’ve had an account on LinkedIn for years and I assume this password file would be highly linked to, I guessed that it would probably be indexed by Google in minutes if it were posted in plaintext. I searched for the SHA1 value of some candidate passwords, but they didn’t get any results. That’s probably a good thing for me.

Without this lead, though, it took some searching through news stories to find and download a copy of the full list. Given that, I searched the file for possible passwords that could be associated with my account, and again found no matches.

Outside sources

This is where the human networking aspect of the profession comes into play. I heard from a few people that I respect and consider to be likely good sources of information after I linked them to the file, and while some folks like me didn’t find their password there, others did. The one that I really latched onto was an individual who found the SHA1 hash of a 30 character random password in the list.

It’s real: now what?

First off, shame on LinkedIn. They failed to take simple steps to protect passwords in the event of a leak, and because of that it’s relatively easy to attempt to crack them.

Why is the list not comprehensive? Initially, when I scrolled through the file, I saw an odd pattern: the characters in columns 7 and 8 stayed the same while everything else changed. They were either “a8” or “a9”. It would be very strange to sort a file by those columns and include only a portion. To check whether what I was seeing at the top and bottom of the file was the case throughout, I ran it through a quick series of pipes:

cut -c 7,8 hashes | sort | uniq -c

It turned out I was wrong and the distribution was rather uniform. I haven’t come up with a solid explanation for why the file seemed clustered around those offsets, but I’m guessing it relates to the fact that already-compromised passwords have had the first characters of the hash replaced with a string of zeroes.
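
Both checks are easy to script. The hashes below are short stand-ins I made up (a real SHA1 hash is 40 hex characters), and the five-zero run length is an assumption based on how cracked entries were reportedly marked; the functions just mirror the cut pipeline and the zeroed-prefix hypothesis.

```python
from collections import Counter

def column_distribution(hashes, start, end):
    """Equivalent of `cut -c start,end | sort | uniq -c` (1-indexed, inclusive)."""
    return Counter(h[start - 1:end] for h in hashes)

def zeroed_count(hashes, run=5):
    """How many hashes begin with a run of zeroes, as cracked entries did."""
    return sum(1 for h in hashes if h.startswith("0" * run))

# Stand-in data, deliberately short; not entries from the actual leak.
sample = ["00000d1e2f3a", "1a2b3ca8d4e5", "9f8e7da9c6b5"]
```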

Why is it just hashes, with no usernames or email addresses paired? Whoever compromised the accounts is holding onto something valuable. Because people tend to use the same usernames and passwords across several different sites, and email addresses are often a linked part of that, whoever compromised the accounts has information that can be sold on the black market. Releasing the hashes offloads the work of breaking them, but the value is in tying them back to an account. It would be bad to have any financial passwords tied to the LinkedIn email / password combination. It would be devastating if someone used that password for their email account, as access to an email inbox often provides the ability to reset passwords.

The simple reminder

Change your passwords if you used the site. Change any passwords that are the same. Even if you use a simple variant of the same base password, try to have some variety between sites. Make certain that you use a unique password for your email. Consider using unique passwords for other high-value accounts. Ideally, every account would have its own unique complex password, but the bounds of human memory are often a challenge to that.

In summary, duplicate passwords might cost you by providing access to an important site if it uses the same password as a low-value site. With vendors adding features and once-free sites making more use of financial data including credit card information, things that used to be low-value to you may now be higher risk.


Tell me what I need to know RIGHT HERE AT THE TOP

A user asks a question in 3 parts. The title, then two paragraphs.

Title: Run compiled c program without ./

At this point I’ve already formulated an answer: “You can put “.” at the end of your path. There are risks to that, but it will solve your problem.”

The user clarifies in the first paragraph:

Quick question. After compiling a program, I always have to explicitly tell it to run from the current directory by prepending ./ to the name. For example $ ./testprog -argument1

The user has again demonstrated that what I’m answering is correct. Perfect! At this point, I’m done thinking. The user, however, continues:

Is there a way to set up some kind of alias which would allow me to call it simply by saying $ testprog -argument1 , essentially allowing me to run the program from any location?

I’ve already registered everything and the last part doesn’t contradict what I’ve got in my head, so I’m done. Now, what are the 3 most important words in that question? They are from any location. Reiterating the most important part of a topic in a long piece of writing is good, but introducing the most important part of your question at the end of it is very bad.

Let’s beat this horse to death:

Imagine you’re calling 911 (or 000, 100, 108, 111, 112, 119, or 999 depending on where you live… wow). What discussion are you going to have? “I was watering my flowers and I saw a little bug, so I kneeled down, blah, blah… part of a wall fell down, my leg is broken and squirting blood.” You’re either going to tell me something like, “I have an exposed broken bone and arterial bleeding from the leg. I’m at 123 Any Street, Middle of Nowhere, South Pole,” or you’re going to end up passing out before you finish the call. You’ve told me what you need, how urgent it is, and where to find you. Do that with your questions, and do it in that order.

Why would 911 ask for the nature of the emergency before the location? It allows them to prioritize your call if there’s an overflow (or know that this is the 10th call about that same car accident in 3 minutes), to select the resources that need to be sent, and only then where they need to go, because there’s a lag of at least 30 seconds between when you tone out a fire truck or ambulance and when the vehicle makes it out the door.

Here’s another life experience: “Jeff. Jeff, wake up. Jeff? Jeff, help I’m hurt.” Want me to move in 3 words? Start with help I’m hurt and I’ll be on my feet before the 5th word drops. Try anything else and it might be a minute before I’m moving.

So, with all this blabbering, here’s the “how to” lesson:

How do I determine the important part?

Write whatever you feel you want to ask. Remove a sentence. Remove another. Keep pulling out sentences until you’re down to the last one. Now put them back in the reverse order of removal and smooth that into something that sounds normal. Alternately, try to build a tree structure where each level is more specific than the last. Save my soul for this one, but if it’s your thing, consider the Twitter Rule of Importance: can you express your thought in 140 characters? Start with that converted to normal English.

Whichever way you look at it, reducing your thoughts to a bare minimum will help with your ordering. Provide all the details, but tell me the important part first, especially if it defines the scope of your problem.

When working with a non-professional, the professional has to spend the time to ask questions helping them to determine the importance of things. When you’re a professional speaking to another, save their time because you know enough to understand which parts should be important.

When scripting is too damn slow

Israel Torres wrote a nice write-up on Google hacking to dig up some Wolfram Alpha API tags. This isn’t about that, but I thought the article was an interesting read. What this is about is something I noticed in his API generation script: it was damnably slow.

Now, before I go into my bit about this, let me emphasize: This is used only as an example to learn from; it is not a personal criticism of Israel in any way. Why put that in big bold letters? Because I’ve run into too many smug people who think that writing something like this is a way to prove personal superiority over somebody else. That’s not what this is about.

What is important is recognizing places where tight loops might need something a little lower level or faster. Israel’s code suffered from a few things that can easily bite somebody: poor randomness from the shell’s $RANDOM and starting a ton of processes in a tight loop. In fact, the example script starts a new process (/usr/bin/printf) for every two bytes printed.
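
For contrast, here is a single-process sketch of the same job (in Python rather than the original shell or my later C). The ID format is my assumption, just random hex; the point is that the shell version’s cost is fork and exec, not the bytes themselves.

```python
import os

def appid(nbytes=8):
    """One random hex ID.  The real Wolfram Alpha AppID format may differ;
    this is simply 2*nbytes hex characters from the OS entropy pool."""
    return os.urandom(nbytes).hex()

def generate(n, nbytes=8):
    # The shell loop pays a fork+exec of /usr/bin/printf for every two
    # characters of output; here the entire batch is one process.
    return [appid(nbytes) for _ in range(n)]
```

A million IDs this way is a matter of seconds, and $RANDOM never enters the picture.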

From his article:

Generating a million AppIDs takes under an hour on a modern system and validating them takes even longer (about 6 times longer). Interestingly, out of 1 million generated AppIDs only about 100K are unique; generating a true 1M unique IDs would take even longer! (See Figure 13 below)

Well, let’s check out Figure 15 from there instead as that’s the code:

That didn’t feel right to me on sight, so I tried a quick bit of coding:

The results were much more favorable when using C, and on a 3 year old MacBook laptop at that:

$ time ./a.out > /dev/null

real 0m1.337s
user 0m1.323s
sys 0m0.005s

For another 30 seconds of CPU work, I could pipe it to “sort | uniq | wc -l” and verify that I had a million unique entries.

In a first for me, I actually wrote a code snippet in C first… then went back to my default go-to language for a speed comparison. Quickly porting it to Perl, the code runs in 10 seconds.

In short, think about when your scripts might be slow on a production system, or when you’re generating a lot of data in a tight loop. It’s often best to write in whatever gets the job done fastest overall. For the most part, whatever language you can write quickly in works just fine, but sometimes remember performance, especially if what you wrote is going to spawn a lot of processes.

While I’m talking about “gotchas,” give yourself a pat on the back if you noticed what would be a huge mistake in anything involving cryptography: I didn’t seed my random number generator. The C code will generate the same list every time.

Valuing your own time

This is not a post about performance, or things that are fast in machine time. Without doubt, I dislike inefficient systems and software. Yet, I don’t always focus on that.

A couple weeks ago, somebody asked, “What’s with all the infosec people using Macs?” There’s a longer answer of mine there, but if I’m going to reduce my thoughts to a few sentences, here they are:

  • My maintenance and setup vs. use ratio is low
  • It fits into my command-line world
  • If it breaks, I can probably get it fixed by somebody else in 24 hours.

Sure, I built Linux From Scratch those years ago like I talked about in an earlier post. I watched KDE take 24 hours to compile and I did it a few times over. It’s a great thing to do, and I suggest everybody who really wants to know Linux learn what

"make menuconfig"

is all about. But, if I interview you, you had better not plan on deploying that on my network. Your primary machine should be running something else. Sure, it’s a great hobby and the knowledge is fantastic, but keeping on top of every package’s software updates and security issues isn’t the right way to spend your time.

Cars and motorcycles… I like to work on each of them, but I didn’t buy any of them with the focus of working on them. In fact, I want my ratio of time spent maintaining a vehicle versus using it to be as low as possible. I don’t do an oil change just because I like to change oil. I take pride in the fact that I do my own work, but I don’t do my own work for the pride of it. So, in the world of motorcycles, where I have to open the top of the engine and adjust my valve clearances to thousandths of an inch… I look at how often I have to do that as a purchasing consideration.

My computer is the same way. I didn’t buy my primary computer to spend my time tinkering on it. Sure, I tinker with it and I launch VMs inside and build virtual networks… but I do that when I want to accomplish something, even if that’s learning.

The point is: value your time. In fact, up until now, every post I’ve made has related to this concept: value your time. Find a way to keep yourself focused on the things you can’t automate. Move closer to the office. Remember that it costs more than $3 to get something that’s across town and the shipping price might be worth it.

  • I once wrote my own blogging software in 2003 using PHP. Now I use WordPress.
  • When I ordered my latest desktop (I haven’t had one hooked to a monitor in over five years), it was cheaper to get the same parts through Dell than to assemble a beige box myself. Hello Dell.
  • I feel like a total badass when I write something in C. Usually I’m in Perl, though, because it takes about 15 lines of C to do one line of Perl and I don’t have to think about buffers.

Tales of Shmoocon 2012: Joining Labs and Building the Network

I have been going to DEFCON for three years now with a group of fellow hackers from my town. Some of them went to Shmoocon last year and reported that I should definitely come down for 2012. So, with all possible insanity in mind, four of us jumped in a car and drove 10 hours through the night to the con.

Because one of my group had signed up for this thing called “Labs,” we arrived at 6am and a day early. I found out the details that morning about how labs was a pre-registered thing that I hadn’t heard about in time and then I napped until just after 9.

Now, I’ve been to DC many times. Given the choice between wandering the city again and doing something exciting surrounding the con, I went for the con. I had made it a habit to get into guest-listed parties I didn’t know about until too late while at DEFCON and have made it a personal point of pride to keep up the habit every year. So, in a move that I’m sure you won’t be able to do next year after this is published, I strolled into the labs area at 9:30, sat down at the management network / core table, and became part of the team… which happened to include my friend and another local who I didn’t expect. Maine was very well represented.

Labs Begins

Labs is split up into several teams and I’m sure there was some consensus as to how this should be apportioned before I arrived. Each team had their own table around the edge of the room with the routing, switching, and firewall (“network”) group at a table in the middle.

At the start of the day, my new group talked about the services we wanted to roll out. We talked about monitoring, so I took a stake in deploying Nagios. We went around the table calling off IP addresses in order to assign them to our machines, and started hacking away.

Immediately I hit the same snag as everybody else. We didn’t have internet on our little network. We couldn’t look anything up, and even more critically, we couldn’t install any software packages that weren’t already on a disc. Everything had to wait until we could get online, so I grabbed a long cable and some gaffer’s tape and walked to the network group to ask for a VLAN to the Internet. I gave it a quick test before we plugged into the switch, and my laptop was suddenly pulling 100mbit to the whole world. We quickly throttled that down to the level we had paid for, and with that I plugged in the management network. It wasn’t the final solution; I knew we had to change to a real switch later and establish a proper presence on the network, but it was the right-now solution that got us to work.

Building it Proper

Having spent some time preaching the ideas behind Puppet and loathing the Nagios config files, I made the decision to do all my Nagios setup with Puppet. The idea paid off very well. Machines were set up as Puppet clients and then I basically never touched them directly. From the master configuration, I typed out a basic setup that pushed my favorite utility packages (ntp, vim, etc.) and registered each system with the Nagios server. I ate my own dog food and I loved it.

Where we originally started hosting all our services on one server, by the end of the day we had deployed them to several respective machines. With a VM server established after lunch, machines were appropriated for different tasks. As our needs for machines changed, we found it easiest to just keep the same original starting image and use Puppet to push to them. First up was the LDAP server coming online. I dumped the configuration and certificate files from the LDAP system into Puppet and spent a few iterations testing it out, then moved it from the test machine’s config to the general template and it was live. When it caused problems on a machine that was cut off for a while by firewall rules, I was able to carve out an if/else statement in the configuration template and exempt the system. When the internal DNS server came up, I pushed the new resolv.conf to everything.

Things broke too. They broke a lot, and sometimes we knew they’d break even before they did. It wasn’t a surprise when our team lost internet for a while when we moved to our “real” switch with the proper trunk configs. Things broke when Splunk chewed up the resources of the VM it was on and had to be moved somewhere with more power. They broke when the firewall started destroying every SSL connection after a few packets… and I mean every one on every port across every VLAN and the Internet, and only SSL — and only after negotiation. We scratched our heads when one VLAN on our DHCP server went stupid and couldn’t hear packets that were sent to it, and then we scratched our heads more when changing that machine’s routing table ended up working as a ghost fix for the problem. That was a real stumper because it was affecting inbound packets, but it was the answer that made it happen.

Almost Perfect

On the whole, things were fantastic. We broke off our own bits of the work and started in. Everybody picked off a section of work, everybody worked together helping with other people’s problems, and everybody lived the spirit of things. We could have done a better job of knocking out dependencies like the early internet connection, or getting everybody using a ticketing system right from the start to handle requests between different pieces of software. I’m surprised the firewall crew didn’t want blood by the end of the day with the number of adjustments they had to make between all the different teams and problems they encountered. We did everything verbally with them.

Yet, I think Labs shouldn’t be “quite there” any year, for good reason. Labs is about experimenting and pushing the edge. Sure, we may tackle those tidbits next year, but it was about more than just a network. It was seeing 50 people in a room, with only the basic managerial structure of “You’re in this group,” do something big in a day. It was about experimenting. Our central logging was good practice, but it was also a learning exercise for the at least six different folks we pushed our logs out to for analysis. It was about cracking jokes over an AirTunes box that was accidentally broadcasting the management network over open Wifi, and being impressed at the wireless folks who were running the tools to detect those things.

I’ve done consulting for a while and I’ve seen a lot of networks over the years. In a day’s time, we shined and we built it right. I heard stories, I shared stories, I saw some impressive setups from garage leftovers and vendor diamonds, and I hacked until I was too tired to keep going.

… But Perfect!

I’ve built my own networks to play on, I’ve worked in lab setups that have their own dedicated 100 rack server rooms, I’ve built a lab network with over 125 machines, but I’ve never had as much fun or been as part of a smooth team as these 50 folks who work together only for a weekend. So, if they don’t give me grief for gaming the system to get in this year, I’ll be applying and attending next time.

Tenets of Good Admins

There is no shortage of folks running around the world trying to create the next Facebook. They crop up regularly on Serverfault asking what it takes to handle 10,000 connections per second for their next big idea website. I don’t have any good examples to reference because they’re bad questions; bad questions are closed and deleted. What those questions would show is a common problem in the world of system administration: the basic lesson unlearned.

System administration, programming, or any highly skilled task has many aspects in common with driving a car (who doesn’t love car analogies?).  It is the case that being able to get through your driver’s education course does not make you capable of power-sliding in the snow, doing well in a Top Gear Power Lap, or even doing a decent job of parallel parking. Ideas you have about how a car handles may even be flat-out wrong. That’s the effect of being new, or being old-hat but never experiencing the right lessons.

So, after many years of working with systems and a few of toying with questions on StackExchange websites, I offer some proposals for the tenets of good system administration:

Always Measure

There is an ample supply of questions on Serverfault of the form, “Is this faster, or why is that slower?” The answer to those questions in most cases is, “We don’t know your hardware; measure it and find out.” A computer consists of multiple finite resources and in most cases bottlenecks on one of them. Most of the time that limiting resource is available memory, processor speed, or disk speed. There are others besides the big three, but the answer is always to measure, locate, and alleviate in a cost-effective way.

Thus, when it comes to adding services, don’t tell me SSL is too slow; tell me what it limits your speed to. A good example I came across this past week was figuring out the most cost-effective way to provide SSL on EC2. I was impressed that the asker had done a lot of homework beforehand and looked at comparable benchmarks to determine that things “didn’t feel right.” I find many questions where this homework isn’t done. So, whenever you start, consider whether $whatever is worth the cost, and do it with “de minimis” in mind: don’t calculate the small stuff, but if a few thousand in hardware or time is on the line, think it out. Sometimes cost isn’t the only factor, but be aware of it.

Finally, do the default to start. What works 90% of the time for 90% of folks will probably work for you, or at least be a good starting point to compare against. Optimizing for a target before you’ve been able to measure it is, thanks to the laws of thermodynamics, comparable in efficiency to lighting money on fire. Both will generate heat, but I wouldn’t count on much more besides some ash and a waste of resources.

Build Scalable Designs

What you do on one server, you can do on 100. The tools for this today are better than those of the past. I previously wrote about the benefits of using Puppet even when you have just one server to work on, and it certainly scales well beyond that. Separate your system into programs, configurations, and data. If it’s an installed piece of software, it should be installed by a package manager even if it is a custom build (learn RPM or DPKG). If you’re altering a configuration file in /etc, use some sort of configuration management tool (I advocate Puppet for deployment and git for configuration management). Data is anything that isn’t reproducible or is expensive to recreate, which is basically everything left over. Databases, uploaded files, user home directories… these should all be backed up, and the backups tested.

Speaking of testing, do it regularly. If you expect something to gracefully hot-failover in production, you had better test it. If you want to be able to restore a server, the same applies. Grab a spare computer that’s floating around: bootstrapping it from no extra configuration or packages at all to just enough to talk to Puppet should allow you to install all software and configurations, then restore your data. Does it take you five minutes of work or a day? If you don’t have a spare computer, rent an EC2 instance for a few pennies an hour.

Protect Against the Uncommon

If it blows up, what will you do? Notice how scaling effectively and recovering from a fire have a really big overlap? Can you recover? How fast do you need to recover? What if your system is compromised? All the things that allow you to bring an extra machine online fast are the same ones that allow you to scale up extra machines quickly. All the things that allow fast scaling of new machines allow quick replacement when something needs to go offline because of damage or intrusion. In short, include the entire scaling entry here and add considerations for using different locations.

Use Secure Practices

Security is sometimes described as providing confidentiality, availability, and integrity. I’ve covered a lot about availability thus far, so another reference to the above goes here for that category. Also, I’m going to skip any talk about authentication in this entry, but that doesn’t mean it isn’t very important. Rather, it’s a dead horse that needs its own article.

Integrity means knowing when things change. It is centered around detection and verification. Realizing that compromises are never part of the design, you must design your system to handle the unexpected. Logs should be centralized, verifiable, and time-coordinated (use NTP!). Files that don’t match the package manager’s checksum indicate failing drives or mischief. Detection tools like Tripwire can help too, if you use them well. Consider well-written AppArmor profiles for your applications, with carefully watched alerts for access violations, because they indicate something that shouldn’t happen. Integrity is in many ways about having another way to check things. Think of alternate metrics and use them.
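
The package-checksum idea generalizes to any file you care about. As a minimal Tripwire-flavored sketch (not Tripwire’s actual mechanism): record a baseline of SHA-256 digests, then diff against it later.

```python
import hashlib

def fingerprint(paths):
    """Map each path to the SHA-256 digest of its contents (a tiny baseline)."""
    digests = {}
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        digests[path] = h.hexdigest()
    return digests

def changed(baseline, current):
    """Paths whose contents no longer match the recorded baseline."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

A real deployment signs and stores the baseline off-host, since an intruder who can alter files can alter a local manifest too.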

Confidentiality builds upon integrity in the sense that much of the groundwork for ensuring integrity also serves confidentiality. When the integrity of the controls that limit access is compromised, confidentiality can’t be ensured. Make sure you know what is changing on your system and can detect the unexpected. Protecting confidentiality is more of the same, but aimed at detecting reads. Consider “canary” entries in your database that are watched by an IDS, and watch for extremes of normal access as well. A user visiting their account webpage on your system is normal and rarely worth alerting on. Users suddenly visiting their account pages at five times the normal maximum rate for that time of day ought to be an alarm. Is everything you serve up under one megabyte? If you see a netflow larger than that, should it alarm you?

Keep Your Knowledge Fresh

Things change with time, but if you’re not talking with the people around you, you might be left configuring your system in 2012 using information from 1998. When that happens, people create 128GB swap partitions on machines with 64GB of RAM (…that are only using 5GB of RAM anyway, probably because they didn’t understand what they were trying to measure. That said, the asker shows promise for looking, asking, and having found a pattern to follow!). At the same time, some institutional memory just doesn’t get passed along because people don’t end up in the right circles to be exposed to it. That’s why SQL injection is still one of the most common causes of system compromise despite over a decade of background knowledge on the issue and solid practices such as binding variables. I originally started down the path of enumerating a few examples, but the OWASP Top 10 really gets the point across for issues among programmers. For sysadmins, out-of-date software is still at the top of the list.
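
For anyone who hasn’t been in the right circles: binding variables means the user’s input never becomes part of the SQL text. A minimal sketch with Python’s built-in sqlite3 (the table and data are invented for the example; the same placeholder idea applies to any driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user(conn, name):
    """Look up a user with a bound parameter.

    The ? placeholder keeps the value out of the SQL string entirely,
    so input like "alice' OR '1'='1" is matched as a literal name,
    never parsed as SQL.
    """
    cur = conn.execute("SELECT name, email FROM users WHERE name = ?", (name,))
    return cur.fetchall()
```

The classic injection string simply matches nothing, because there is no string concatenation for it to break out of.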

When it comes to staying current, the more face time you get outside your office, the better off you are. The more stories that are shared, the better off you and the entire community are. You get a quicker start and a better body of knowledge. You have a default for situations that you haven’t encountered before, and that’s critical for not reinventing the wheel… again.

Spend time on professional forums. Read questions, ask questions, answer questions. Go to conferences. Attend local meetups. Be a part of your field.

Never Work on a Machine, Apply a Configuration

If you’ve worked with Linux as a systems administrator, or even for your own services, you’ve almost certainly fiddled with a few config files on whatever machines you’re in charge of. You’ve also probably configured the same thing many times over on every new system you get. On every machine I’ve ever owned, I’ve made a user account for myself. They also all have ntp and ssh running. Every machine I work with has good reason to have my public ssh key, so I have to copy that. If I keep going for a while, I’ve suddenly got a list of things to take care of.

When I first got started playing with Linux over a decade ago, I evolved through a few systems until I started playing with Linux From Scratch as a distribution sometime in late 2000 or early 2001 (I’m registered user #23… and 553 because I forgot I registered the first time). After I built my box a few times over, including with KDE and kernel compiles that I remember taking 24 hours, I started getting really excited about the idea of packaging my Linux From Scratch work. The next time I wanted all the new versions, why would I want to go back and try to remember all the flags I had set? I never had more than one system to deal with at a time, but I knew I’d want to simplify the work because it would come up again.

Since then I’ve mostly lived in a Debian and Ubuntu world because I don’t want to go back and figure out those dependencies, wait 24 hours for KDE to build, or focus on all the software options. Yet even when I’ve whittled myself down (at times) to just one server, the package management system still leaves me doing mostly repetitive configuration work. Further, if I want to be good, I’ll keep my iptables configuration up to date. Scripting that is a nuisance, and it’s unfortunate when things are inconsistent.

Putting all my configurations into Puppet is quite a bit of work. The answer to that is to be incremental. Wikimedia, like Rome, didn’t build their intense implementation in a day. My suggestion is to just do it with the next change you make. Install Puppet on that one machine and use it to deploy that one configuration file. Drop it in the templates directory and don’t sweat the idea of variables. It won’t cost you more than a few minutes of extra time. Don’t worry about making everything, or anything, a variable until you need it.

class base {
  package { 'ntp':
    ensure => latest,
  }

  service { 'ntp':
    name => $operatingsystem ? {
      'Debian' => 'ntp',
      default  => 'ntpd',
    },
    ensure => running,
  }
}

I’ll never again install ntp from the command line, or worry about whether it actually runs. If I create a config file, I can repoint all my servers to different time masters at once. If I don’t want to worry about that, I can just leave it at the defaults that the package uses.
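
That config-file step looks something like the following file resource. This is a sketch: the template name, and the idea of restarting the service on change, are my additions rather than anything from the manifest above.

```puppet
# Hypothetical template; drop your own ntp.conf.erb in the module's
# templates/ directory and every server picks it up on the next run.
file { '/etc/ntp.conf':
  ensure  => file,
  content => template('base/ntp.conf.erb'),
  require => Package['ntp'],   # install the package before the config
  notify  => Service['ntp'],   # restart ntpd whenever the file changes
}
```

Changing the time masters for every server then becomes a one-line edit to the template, and Puppet handles the restarts.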

It’s going to pay off the next time you upgrade or replace that server and you don’t have to select all the different packages that need to be there. You won’t be fiddling with extra options or uninstalling excess software because the bare minimum and a puppet client will put all those packages you need in place and drop in your configs. Next time I want to test something, I can make a dev server on EC2 in minutes.

You get security because you get consistency. That tripwire, apparmor, or iptables setting you learned years ago and didn’t carry on to your next system because it takes a long time to get right can now be copied everywhere without thought. You get time back without an upfront cost as you slowly roll every change in, even with just a few machines. You become a better systems admin, even if you’re just working on your one server.

At some point, you’ll find that putting a new machine in place with security, backups, user accounts, monitoring, dns, and the software you need at the start will be nothing more than a case of

node "newhost" {
  include base
}
Now, next time you upgrade that PHP script that’s in a .tar.gz for the 14th time this year and do the WordPress dance of “keep this directory, but copy all new files”, ask yourself: might it be worth it if I write a packaging definition file and just run that against the .tar.gz file next time?

8 Hours in an Emergency Room — Thoughts of Queueing

Yesterday morning I awoke to my girlfriend calling my name from three rooms away – “Jeff, Jeff… wake up Jeff. Jeff, I’m hurt!” Skip a lot of fast-moving details and we find ourselves in a room at the ER with the kind of real problem that gets you through triage in minutes… and in this case leaves you waiting unnecessary hours for your discharge. That we spent so much time in the room was a problem for the entire department as they were filling up the hallways with patients.

Emergency rooms are known to those outside the medical profession for their wait times. This is connected to the fact that the person who needs stitches or is running a high fever is serviced with the same resources as the person who came in from a helicopter on the roof with their arm falling off. That’s not enough of an explanation, however, when the average wait just to be seen is 3 hours, 42 minutes.

Resource utilization and compartmentalized responsibility may be the biggest factors to address. In any system with a single lead, you have a clear setup. When multiple people are responsible for a process at multiple stages and with many who have similar authority involved, the system becomes too complex for the parts to work together on their own. If somebody doesn’t have an overview of the process, it can feel like trying to drive down a road with timed stoplights in the middle of the night. Failing to coordinate such a system can result in huge backups when there is traffic even though the number of vehicles per hour doesn’t change.

In a bustling emergency room with multiple units, it may be that nobody is following the patient’s status throughout the delivery of treatment, to the same effect as when nobody is coordinating stoplights. Triage bins patients according to priority and ensures that urgent cases are attended to first. Once they leave the waiting area, triage is uninvolved. Nurses prep patients, doctors attend to them, nurses carry out medication orders, and techs handle the more trivial issues such as monitoring blood pressure. Eventually the doctor will sign the patient over to another department or discharge them, but their attention is focused on medical issues, not on flow.

Consider the flow of customers and orders within a restaurant. The host is aware of capacity and utilization; hosts seat people according to reservations and group size (triage). Wait staff tend to guests and take their food orders (nurses). Some staff roam the restaurant filling water glasses as needed (techs). Chefs prepare the main dishes (doctors), and in a bustling operation many of them may work on a single dish (specialists). The trick is that all the food for a table must come out quickly and at the same time. Wait staff can’t coordinate that because they’re tending to tables. Chefs need to focus on the food itself (diagnosis / patients / complex interactions). The missing link is the expeditor: they manage all the kitchen queues and fire orders in a way that ensures everything is addressed at the proper time.

Hospitals generally lack a position analogous to this. A non-medical individual who never interacts with patients but is in charge of movement probably boggles a few minds in the field. I was a bit boggled myself when I did some searching and realized that the first good example of the idea forming in my head had been implemented in an ER by a facilities cleaning company.

Queuing problems are not unique to hospitals in any way, and some great solutions for them come from outside. Besides the restaurant view, there are many other instances where an individual with an overview of the holistic situation is a benefit. Being able to view an entire supply chain of parts can result in great increases in efficiency and improved reliability. Safety officers on fire grounds and hazmat scenes are an example: everybody is focused on safety, yet an individual not involved in fighting the fire provides the overview that keeps everybody safe, because the interaction of several companies at once is complex. A growing number of police departments have civilians who determine resource allocation based on information from many officers and departments. Even the railroad I volunteer at benefits greatly from an outsider asking why all the rail isn’t pulled up at once so the idle backhoe can grade everything instead of one section at a time.

Queuing is also a matter of consideration for a patient just sitting in a room. Whenever we needed attention, there was one button to push; whether it was a trivial question that could wait 10 minutes or a spurting vein, there was only that one button. While we always want the big red button to be easily accessible to anybody, a patient who is capable of pressing smaller 5- or 10-minute request buttons, or even typing up their requests, will appreciate not feeling demanding and will benefit the staff by allowing more efficient servicing of those requests. A patient might alert us for medication when pain starts to increase rather than waiting for somebody to incidentally stop in, or alerting only when it becomes seriously uncomfortable and requires immediate attention.
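
The tiered-button idea is, at bottom, a priority queue: urgent requests are served first, and equally urgent requests are served in arrival order. A minimal sketch (the priority levels and request strings are invented for illustration):

```python
import heapq
import itertools

# Invented priority tiers: lower number = more urgent.
URGENT, SOON, ROUTINE = 0, 1, 2

class RequestQueue:
    """Serve requests most-urgent-first, FIFO within a tier."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tiebreaker preserves arrival order

    def push(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = RequestQueue()
q.push(ROUTINE, "extra blanket")
q.push(URGENT, "bleeding restarted")
q.push(SOON, "pain medication")
```

Popping the queue yields "bleeding restarted" first, then "pain medication", then "extra blanket", regardless of the order the buttons were pressed.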

Having an overall view is needed to efficiently coordinate any complex system with multiple pieces that work together, whether it’s the traffic lights on the evening commute, the different military units storming Normandy, or the person in your ER who reserves a CT scan ahead of time so that the contrast agent can be consumed just before an ultrasound, leaving the patient absorbing their dye during an already-needed test instead of waiting in a bed between tests.