Thursday, February 13, 2014

People, the secret to uptime.

To achieve high uptime, we sysadmins have to collaborate. High uptime is difficult, if not impossible, unless more than one person is working on it.

To explain this, let's first think about building a simplified, highly reliable web site. We employ RAID of some sort in the disks. We have multiple physical servers. They may be in a cluster behind a cluster of load balancers. There will be multiple links to the Internet via multiple providers. We will have primary and redundant DNS servers. There will be multiple power supplies, UPSes, and generators. I could go on, but the pattern is obvious. The point is that we aggressively engineer around every single point of failure (SPOF) to maintain uptime. We assume failure will happen and build our systems to treat failure as “normal” rather than surprising. High Availability (HA) is just this.
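To put rough numbers on why redundancy pays off, here is a minimal sketch of the usual availability arithmetic. It assumes components fail independently, which real systems only approximate, and the 99% figure is an illustrative assumption rather than a measurement. The function names are my own.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def redundant_availability(availabilities):
    """Availability of a redundant group where any one working member is enough.

    Assumes independent failures: the group is down only when every member is down.
    """
    return 1 - prod(1 - a for a in availabilities)


def chained_availability(availabilities):
    """Availability of components that must all work at once (a chain of SPOFs)."""
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one server
    pair = redundant_availability([single, single])
    print(f"one server:  {single:.4f} ({downtime_hours_per_year(single):.1f} h/year down)")
    print(f"two servers: {pair:.4f} ({downtime_hours_per_year(pair):.1f} h/year down)")
```

Under these assumptions a second server takes expected downtime from about 88 hours a year to under one. The chained case runs the other way: every layer left without a spare multiplies the overall availability down.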

What about our people? They are just as crucial a part of the system: caring for it, patching, monitoring, and so on. Just like servers, people are fallible. We get sick, need to sleep, and, as a favorite professor put it, get hit by the beer truck on occasion. To achieve reliability we need to eliminate SPOFs in people as well. So long the hero, hello the team.

Sure, we want to be heroes. But as our vision and ambitions grow, we become limited by the reach of our own arms. With others we can tackle larger problems and do it better, faster, and more consistently. If I am asleep, I know my fellow team member is watching my systems. If my team member goes on vacation or is sick, I don't fear looking after their systems.

How do we become team players? Study people! Learn psychology and speech communication, improve your writing, and so on. These classes should be taught in college but seldom are; that does not make them any less important. By understanding what motivates people (yourself included), how to communicate clearly, and how to negotiate collaboratively, both the quality and quantity of your work improve.

With this knowledge you will also need to set aside a not insignificant amount of your day to work on the people, be it your team, your manager, or others within and outside your organization who make your work possible. Just as important as coming up with a good system design is spending time memorizing your peers' children's names, being supportive when they screw up, buying coffee for the team on occasion, interrupting your work to lend a helping hand, teaching what you do, documenting what you know, and so on. Of course, this needs to be kept in balance with your own needs. However, without giving yourself over to spending significant time (25% or more) supporting others, how can you expect the same in return? If you don't set up your junior team member with good documentation to read when they get woken up at 3 AM, they will be left with no option but to call you when an alert comes up.

Almost all large technical achievements are built on a foundation of people working together. What are you doing to make your foundation strong?