May 30, 2017
It’s as if every company were taking part in an insane Red Queen’s race:


“Well, in our country,” said Alice, still panting a little, “you’d generally get to somewhere else—if you run very fast for a long time, as we’ve been doing.”
“A slow sort of country!” said the Queen. “Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”
In our previous post, we presented one of the reasons most IT managers are dissatisfied: the enormous cost and effort needed just to “keep the lights on”, which most surveys estimate at around 75% of IT costs. That means three quarters of IT spending is non-productive.
The first step in getting out of this race is recognizing that “keeping the lights on” is made up of several distinct activities. They are:
- Business projects: what your CEO asks you to deliver, because it is directly tied to company performance. If it’s necessary to generate revenue, it’s a business project.
- Internal projects: what you need in place to deliver business projects. Installing new network devices, decommissioning a data centre or moving into a new one, etc.
- Operational change: patching, security upgrades, vendor software updates, problem resolution, user management; anything that modifies what business or internal projects have put in place.
- Unplanned work: also called recovery work or firefighting; whatever takes you away from meeting your goals.
Most of these activities are easy to automate, and software vendors have done a great job of simplifying things. What has been missing is a way to handle unplanned work: when a software update kills your server, when a ransomware attack has encrypted all your files, or worse.
The secret lies in neutering unplanned work: creating the conditions that let you return to a working state faster and, if possible, without user intervention. This has been explored in the past, in particular by a research project called “Recovery Oriented Computing”, from Stanford University and the University of California, Berkeley. Its main point was to accept that disruptions are inevitable, and that reducing MTTR (Mean Time To Recovery) matters much more than maximizing MTBF (Mean Time Between Failures). So, instead of adding complexity to keep disk failures from surfacing (as in RAID systems), it is much easier to accept that some disks will fail and to take that into account when designing your system.
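To make the MTTR-versus-MTBF trade-off concrete, here is a small sketch based on the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures used (one failure every 1,000 hours, 8 hours to recover) are illustrative assumptions, not measurements from the research project or from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime over a 365-day year."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * 365 * 24


# Illustrative numbers only (assumptions made for this example).
baseline = yearly_downtime_hours(mtbf_hours=1000, mttr_hours=8)
double_mtbf = yearly_downtime_hours(mtbf_hours=2000, mttr_hours=8)
halve_mttr = yearly_downtime_hours(mtbf_hours=1000, mttr_hours=4)
# Recovery automated down to a few minutes (0.1 hours).
automated = yearly_downtime_hours(mtbf_hours=1000, mttr_hours=0.1)

if __name__ == "__main__":
    scenarios = [
        ("baseline", baseline),
        ("MTBF doubled", double_mtbf),
        ("MTTR halved", halve_mttr),
        ("automated recovery", automated),
    ]
    for label, hours in scenarios:
        print(f"{label:<20}{hours:8.2f} h/year")
```

In this model, halving MTTR buys exactly the same availability as doubling MTBF. The case for focusing on recovery is that automation can often cut recovery time far more than that, while making hardware fail half as often tends to be expensive.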
NodeWeaver is designed on that foundation. It handles automatically most issues that would otherwise require human intervention (disk degradation, disk failures, network interruptions, packet loss or corruption, physical node failures) by embedding a complex runbook within each node. No central coordination is needed, and the operator only has to check that enough resources remain available for the system to operate properly. The same capabilities are open to the user as well: administrators can create their own scripts that read information from the internal configuration database and act on the system to change its status. If you reduce unplanned work to background noise that you no longer have to act on, all your productive time becomes free to focus on the things that matter.
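As a rough illustration of the idea of a runbook embedded in each node, the sketch below maps events to recovery actions that run locally, with a separate check for remaining resources. Every name in it (`NodeState`, `RUNBOOK`, `handle`, `has_enough_resources`, the event names) is hypothetical and invented for this example; it does not reflect NodeWeaver's actual implementation or scripting interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NodeState:
    """Hypothetical local view of one node (not a real NodeWeaver structure)."""
    healthy_disks: int
    min_disks: int  # fewest disks the node needs to keep operating properly
    log: List[str] = field(default_factory=list)


def on_disk_failure(state: NodeState) -> None:
    # Accept the failure: drop the disk and carry on with the remaining ones.
    state.healthy_disks -= 1
    state.log.append("disk failure: disk removed from pool")


def on_network_loss(state: NodeState) -> None:
    state.log.append("network loss: switching to alternate path")


# The "runbook": each known event maps to an action executed on the node
# itself, so no central coordinator has to be consulted.
RUNBOOK: Dict[str, Callable[[NodeState], None]] = {
    "disk_failure": on_disk_failure,
    "network_loss": on_network_loss,
}


def handle(event: str, state: NodeState) -> bool:
    """Run the local action for an event; return False if none is defined."""
    action = RUNBOOK.get(event)
    if action is None:
        state.log.append(f"no runbook entry for {event!r}; escalating to operator")
        return False
    action(state)
    return True


def has_enough_resources(state: NodeState) -> bool:
    """The one thing left for the operator to watch: remaining capacity."""
    return state.healthy_disks >= state.min_disks


if __name__ == "__main__":
    node = NodeState(healthy_disks=3, min_disks=2)
    handle("disk_failure", node)
    print(node.log[-1], "| enough resources:", has_enough_resources(node))
```

In this sketch, an administrator-defined script would simply be one more entry added to the `RUNBOOK` mapping, which is one plausible way to expose the same mechanism to users.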