This is a guest post by Stephen Nelson-Smith (@lordcope)
Puppet, Chef, Fabric, Capistrano, Ruby, Python, Github, EC2, Heroku – the list goes on. These are just some of the disruptive forces that have revolutionised what it means to manage systems in the twenty-first century. One of the most interesting and important trends to emerge from this paradigm shift is the desire and willingness to treat Infrastructure as Code. Although not synonymous with Devops, its methods and objectives are both supportive and complementary. I think it’s fair to say that if you’re doing Devops today, you’re probably treating your infrastructure as code.
Jesse Robbins describes the goal brilliantly:
“Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare metal resources”
In practice this means rethinking the way we design, build and manage our systems. We need to break infrastructure down into modular services, and tie them together in an automatable way. In order to do this at scale, and in a cloudy world, we’ve found it’s necessary to move beyond shell scripts and ssh for loops. This is where tools like Puppet and Chef come in – they enable and encourage us to abstract the problem, and define it in a high level language – to write code that captures the essence of what we need to achieve. Each of these services will be made up of snippets of code, with varying degrees of complexity. In a large and involved infrastructure, the size of the endeavour is not insignificant. The move to Infrastructure as Code means, quite simply, that the task of building and maintaining a modern server environment begins to look a lot like the task of managing a software project. Of course this is entirely in keeping with the Devops way – as we break down barriers between developers and sysadmins, and start collaborating and working out how to solve problems together, it seems natural that both the methods and the output start to look less like traditional operations tasks, and begin to take on some of the flavour of the programming world.
This is all brilliant and exciting and avant garde and fashionable and hip… but all is not well: a storm is brewing. With great power comes great responsibility, and as I look around, I’m seeing a lot of problems emerging in organisations starting to put these kinds of ideas into practice.
Here are a few symptoms:
- Sprawling masses of Puppet code, duplication, contradiction and a lack of clear understanding of what it all does.
- A fear of change – a sense that we dare not meddle with the manifests or recipes, because we’re not entirely certain how the system will behave.
- Bespoke software that started off well-engineered and thoroughly tested, but is now littered with TODOs, FIXMEs and quick hacks.
- A sense that, despite the lofty goal of capturing the expertise required to understand an infrastructure in the code itself, if one or two key people were to leave, the organisation or team would be in trouble.
There are significant implications associated with starting to treat our infrastructures as code. These implications are all based around a simple observation. If our infrastructure is code, if our environments are effectively software projects, then it’s incumbent upon us to make sure we’re applying the lessons learned by the software development world in the last ten years. It’s incumbent upon us to think critically about some of the practices and principles that have been effective there, and start to introduce our own practices that embrace the same interests and objectives. Unfortunately many of the embracers of infrastructure as code have had little exposure to these ideas. The result is a needless repetition of some of the same mistakes.
As a corrective, I think we need to start thinking about six key areas of our Infrastructure Code: design, collective ownership, code review, standards, testing and refactoring.
One of the principles of Extreme Programming that I find both challenging and elegant is the principle of the simple design. When writing software we should seek to Do The Simplest Thing That Could Possibly Work (DTSTTCPW). This is an extraordinarily beautiful concept, and when applied well, produces great results. The goal is to make our systems simple enough to meet -only- today’s requirements. The benefit of this is that it encourages simplicity in the design and implementation, which trickles down into the code. The simpler the solution, the simpler the manifest, the easier it is to maintain, understand, share and improve.
I suggest we apply this principle, with care, to our infrastructure design. When I do this with my teams, I get everyone together around a whiteboard, and we try to think creatively about possible solutions to the technical challenge we’re presented with. Our emphasis and bias in proposing solutions is to favour known, tested, understood and reliable technology; to think about building the very simplest, leanest thing that will do the thing we need to do, as a first iteration.
Unfortunately it’s very difficult to get right at a number of levels. The trouble is that this frequently gets truncated to: Do the simplest thing. This means that code gets written that may not be fully fit for purpose – it’s no good doing the very simplest thing without the second aspect – that could possibly work. Without care, DTSTTCPW becomes DTDTTIFTO – Do The Dumbest Thing That I First Thought Of.
To guard against this error, sometimes I enforce the practice of solving a problem three ways. We work together to come up with three entirely different approaches to solving the same problem, and then return to the three proposals another day, with the benefit of a bit of subconscious processing, with the objective of making a team decision based on the simplest solution.
I recommend we make this quest for simplicity a defining feature of the Devops approach to infrastructure – that we strive to achieve it by firmly removing unnecessary or deferrable features and technologies from the real concerns at hand.
Collective ownership is all about encouraging cooperation and collaboration within a team. If we can generate a sense of solidarity, and make the infrastructure code we generate feel like something we all share, in my experience the quality seems naturally to improve. In the XP world, the term we use is ‘metaphor’ – the team comes up with an imaginative depiction of what the application is like, in some way. I’m not convinced this maps perfectly onto infrastructure as code as, to a degree, there is less scope for genuinely blue-sky thinking, but nonetheless, getting people together to agree on the shape and direction and feel of the system they are building is hugely important.
I recommend two ways to enhance the sense of collective ownership within Infrastructure as Code. Firstly, apply what David Allen calls The Natural Planning Model when designing aspects of a system or solving a problem. In brief this means working through the following five steps:
- Clarify the purpose and principles. Get together and answer the following two questions:
  - Why are we doing this? Drill right down to the business value, to focus attention on what the key success criteria are, and to increase motivation and understanding of the purpose driving the piece of work.
  - Complete the sentence “I would let any other individuals or teams carry out this piece of work as long as…”. As long as what? As long as it’s completed in 5 days? As long as it’s done in Chef? As long as it’s documented in the Wiki? As long as it works on FreeBSD?
- Visualise extravagant success. Brainstorm what the solution would be like if it were *awesome*. If you could go forward in a time machine to the completion of the piece of work, and look back, and think: “wow, we really nailed this”, what would be the things you saw that would make you say that?
- Brainstorm. Having understood the purpose and principles, and having set your imagination free to visualise the most amazing solution possible, start to brainstorm possible components of the solution. This might include technologies, ideas, possible problems, things to research, people to speak to. Go for quantity not quality.
- Organise. As the brainstorming comes to a conclusion, you will naturally start to notice clusters of ideas. Start to organise these more formally into sub projects – areas to follow up, perhaps suggesting a particular team member take responsibility for a particular area.
- Establish Next Actions. Don’t do anything else until you’ve agreed who is going to do what, when, and what the follow up is, for each of the areas that came out of the organisation phase.
The result of applying this process as a team is that it bonds people, and generates a sense of belonging. The whole team unites around the shared and understood purpose, the vision of what it would be like if it were awesome, and who is responsible for making parts of it happen.
Secondly, I recommend pairing wherever possible when actually writing the Puppet manifests or Chef recipes. Encourage pairing, and enforce rotation of pairs, so everyone gets to be involved with every aspect of the project. I’ll have more to say on the subject of pairing in a minute.
Another important principle in the XP world is that of code review. This simply means getting other people in the team to see, assess, validate, check and if necessary improve upon the code being developed. One obvious way to achieve this is to implement some kind of post commit hook, so when code is checked into version control, all members of the team are notified, perhaps by email or Jabber. This has the benefit of increasing visibility of changes, and has a subconscious benefit too – somehow the brain registers that it’s noticed that James made a change to the memcached module recently when it starts to think about that area of the code. This also discourages lazy commit messages such as “doing more stuff to the MySQL replication class” – that’s meaningless, and when the whole team sees these messages, you’re likely to get someone highlighting this and asking for a bit more information. Finally, it also brings attention to the number of commits being made. If it’s known that a given pair is working on one aspect of the infrastructure for the day, but no commit messages have been seen, this could indicate that that pair is stuck, or is not checking in frequently enough. A similar, less intrusive approach is to keep an eye on a feed of commit messages. This could be via Github, or with a hook that produces an RSS feed, or an IRC or Jabber notification bot.
The very best way, the absolutely most effective way to guarantee code review in your team is to pair. Pairing is difficult to get right – it’s challenging and uncomfortable at first, but once a team starts to get the hang of it, the benefits they reap are extraordinary. We get code review for free – every line of every manifest or recipe has been looked at and thought about by two people. We get fewer mistakes – countless times I’ve had my partner point out a flaw in my idea or implementation. We get more done – the combined effect of being less easily interrupted, less prone to procrastination, and less likely to make dumb mistakes means that an effective pair will, in my experience, always be more productive than two individuals. When the benefits of automatically enhanced ownership, shared knowledge, and collaborative design are added into the mix, I think it makes a compelling case. How to go about making this happen is a different subject, and one which I intend to cover shortly on my own site http://agilesysadmin.net.
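As a rough illustration of the post commit idea above, the sketch below shows two pieces such a hook might need: a check that flags vague commit messages, and a formatter for the text sent to the team. It is a minimal Python sketch under stated assumptions. The function names, the word list and the four-word threshold are all invented for illustration, and both the delivery mechanism (email, Jabber, IRC) and the wiring into your version control system’s hook are left out.

```python
import re

# Hypothetical rules: tune both to whatever your team agrees on.
VAGUE_WORDS = {"stuff", "things", "misc", "wip"}
MIN_SUBJECT_WORDS = 4


def commit_message_problems(message: str) -> list[str]:
    """Return complaints about a commit message; an empty list means it looks fine."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    words = re.findall(r"[a-z0-9_']+", subject.lower())

    problems = []
    if len(words) < MIN_SUBJECT_WORDS:
        problems.append(
            f"subject has only {len(words)} word(s); say what changed and why"
        )
    vague = sorted(set(words) & VAGUE_WORDS)
    if vague:
        problems.append("vague wording: " + ", ".join(vague))
    return problems


def format_notification(author: str, files: list[str], message: str) -> str:
    """Build the text a post commit hook could send to the whole team."""
    lines = [f"{author} committed {len(files)} file(s):"]
    lines += [f"  {path}" for path in files]
    lines.append("")
    lines.append(message.strip())
    # Surface weak messages in the notification itself, so the team sees them.
    for problem in commit_message_problems(message):
        lines.append(f"[please clarify] {problem}")
    return "\n".join(lines)
```

Passing the “doing more stuff to the MySQL replication class” message through `commit_message_problems` flags the word “stuff”. Whether to nag, warn or reject outright is a team decision rather than a technical one.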
To ensure maximum reusability, and to make sharing and pairing as painless as possible, it’s important for any software project to define and then adhere to a set of coding standards. This can be as simple as following conventions around indenting style, line length, variable names, and so forth. For Puppet and Chef, and any other Ruby development, I’d encourage you to follow the general conventions of the Ruby on Rails community. For Python, I’d suggest you follow PEP-8. I’ve worked in places where coding standards are enforced with the use of code sniffers that reject commits if they don’t follow the standards. I have some sympathy for this approach, but I tend to think the best results come when teams have agreed to standards (as a part of collective ownership), and enforce them rigorously themselves.
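To make the idea of a code sniffer concrete, here is a minimal Python sketch of the kind of check one might run before accepting a commit. The three rules (no tabs, no trailing whitespace, an 80 character limit) are assumptions chosen for illustration, not a statement of what any community’s conventions require, and `style_violations` is a hypothetical name.

```python
MAX_LINE_LENGTH = 80  # hypothetical team convention


def style_violations(
    source: str, max_length: int = MAX_LINE_LENGTH
) -> list[tuple[int, str]]:
    """Return (line_number, complaint) pairs for a manifest, recipe or script."""
    violations = []
    for number, line in enumerate(source.splitlines(), start=1):
        if "\t" in line:
            violations.append((number, "tab character; use spaces"))
        if line != line.rstrip():
            violations.append((number, "trailing whitespace"))
        if len(line) > max_length:
            violations.append(
                (number, f"line is {len(line)} characters (limit {max_length})")
            )
    return violations
```

A hook could reject the commit when the list is non-empty, or a pair could simply run the check by hand. The second option fits better with standards the team has agreed and polices itself.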
A bigger area is what is considered ‘best practice’ in terms of how to lay out modules and recipes. A word of warning here – I don’t like the term ‘Best Practice’ – we can’t empirically determine that a practice is best – all we can do is state that given what we know right now, this looks like an effective approach. Best practices are really snapshots of good ideas at a point in time – they’re not static. With this in mind, I recommend that if you come up with an approach that you think represents such a snapshot, make it public. Get other people’s input, stick it on a public blog or wiki, and see what kind of feedback you get. There are a few such things out there – I’d recommend taking a look, and adjusting or adopting according to the experience and feeling of the team. I don’t much believe in ‘best practice’ but I do believe firmly that when smart people start to solve problems, patterns emerge, and that publishing and learning from these patterns is valuable, and begins to bring about a shared vocabulary which helps the community to grow and mature.
The idea of testing infrastructure as code is an interesting one. At a superficial level it seems a little unnecessary, certainly at the level of behaviour-driven testing. It would be somewhat redundant to write a declarative test of the form “it should install nginx” for some code which says package nginx, ensure => installed. Unit testing of Puppet manifests and Chef recipes is possible, by diffing catalogues and resource lists, but there’s a general sense that this may be a misguided approach. The real area where there is value to be had, and risk to be reduced, is in ensuring that one’s code produces the environment needed, and in ensuring that there have not been side effects altering other aspects of the infrastructure. The first could be written as behaviour-driven acceptance tests, but realistically I think what we’re really describing here is application level monitoring. If you don’t care about your service enough to monitor it, why are you bothering with it at all? If your monitoring is done well, you’ll have tests that report on the meaningful behaviour of your application – are transactions being made? Can people create or view other content? Investing time in this level of monitoring for your live infrastructure will be an investment well made. However, in reality we want reassurance that our code does meet requirements before pushing it live – we don’t want to be using our monitoring system against the live site as the primary way to test and prove the viability of our latest Chef cookbooks. Instead I advocate setting up a virtualised copy of your production environment (to sensibly prudent degrees of down-scaling). Apply the same application monitoring – verify that the critical user journeys and pieces of functionality are available, and discover first hand whether any side effects occur in this environment, enabling appropriate corrective action.
The ready availability of virtualisation makes this very simple. As well as various public cloud providers, projects such as Vagrant and OpenQRM make it trivial to fire up virtual machines on demand. When integrated with Chef or Puppet, we have the ability rapidly to deploy a functional clone of the whole live environment and test it in a reasonably live-like manner.
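The check on critical user journeys described above might look something like the following Python sketch, run against the virtualised copy rather than the live site. Everything named here is an assumption for illustration: the `Journey` structure, the `run_journeys` function and the idea that a journey can be reduced to “this page contains this text”. Real journeys such as logging in or completing a transaction would need something richer.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.request import urlopen


@dataclass(frozen=True)
class Journey:
    name: str           # label shown in the results
    path: str           # path on the environment under test
    expected_text: str  # text that must appear in the response body


def http_fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def run_journeys(
    base_url: str,
    journeys: list[Journey],
    fetch: Callable[[str], str] = http_fetch,
) -> dict[str, bool]:
    """Return a pass/fail result per journey for one environment."""
    results = {}
    for journey in journeys:
        url = base_url.rstrip("/") + journey.path
        try:
            body = fetch(url)
        except OSError:
            # Covers connection failures and HTTP errors raised by urllib.
            results[journey.name] = False
            continue
        results[journey.name] = journey.expected_text in body
    return results
```

Because `fetch` is a parameter, the same journeys can be pointed at the staging copy before a release and at production afterwards, which keeps the acceptance test and the monitoring check as one piece of code.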
The main problem with trying to test infrastructures like this is that even with fast machines, and the latest software, it’s an inherently slow endeavour. In mainstream software development, it’s considered an anti-pattern if your tests take more than about five minutes to run. However, firing up a whole suite of machines and running Puppet or Chef to produce a complete, integrated environment is always going to take time. The result of this is that the feedback loop is inevitably going to be slower than in the traditional case. The risk with this approach is that without good discipline and commitment to making this kind of detailed end-to-end integration testing work, people will start hacking and hoping, and technical debt will inevitably accrue.
I think the answer is just to accept that the cadence will be different. Just like in the old days of needing to send off code to be compiled, we need to work within the necessary constraint that a sometimes large and complex testing environment needs to be bootstrapped as part of the process. I’ve said it before and I’ll say it again – we cannot expect there to be a one-to-one mapping of ideas and practices from the software development world – I don’t see any way that we could get the feedback cycle any faster – we just need to understand the value and work within the constraints we have.
What about test-first infrastructure programming? I think there’s something to be said for this. Writing a cucumber-nagios monitor that starts off failing, then deploying the code and watching the monitor go green is an attractive proposition. I think it also helps us to think about monitoring as a critical and integral part of the service we’re developing. On the other hand, there’s something to be said for writing the code that defines the service – its behaviour, requirements etc. – in a way which includes monitoring at deployment time. An example might be that a recipe or manifest for an NFS server might include steps to register with a central monitoring solution, and trigger the application of a given template. I think this is an elegant idea that is not in contradiction to a test-first approach – a combination of both approaches is probably the best. One thing I would caution against is using test-first programming to discover design. Doing this can have the unfortunate effect of calcifying bad design decisions. To an extent in order to benefit from test-driven design, you need to have a high degree of confidence that your approach to the problem is fit for purpose. If you don’t know that your solution has proven to work well elsewhere – if you sense it is in any way speculative – I would advise that a better approach would be first to spike your proposed suggestion, assess whether the approach already works, then throw it away and implement it in a test-first manner.
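The “start red, deploy, go green” cycle can be sketched in a few lines of Python. This is an illustration only, not how cucumber-nagios works. The `check` and `deploy` callables, the function name and the retry settings are all hypothetical stand-ins for your own monitor and your own Chef or Puppet run.

```python
import time
from typing import Callable


def red_green_deploy(
    check: Callable[[], bool],
    deploy: Callable[[], None],
    attempts: int = 5,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Insist the monitor is red, deploy, then wait for it to go green."""
    if check():
        # A monitor that passes before the work is done proves nothing.
        raise RuntimeError("monitor is already green; it is not testing the new work")

    deploy()

    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

The retry loop reflects the slower cadence of infrastructure work: a freshly configured service may take a while to come up, so a single immediate check would report failures that are only a matter of timing.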
Testing infrastructure as code is an important and developing area that still needs more practical testing. I’m not 100% certain that the standard pattern of Red, Green, Refactor fits comfortably, but I think it’s a valuable thought exercise in its own right.
This brings me to my last area – refactoring. As agile developers, we are typically doing one of two things. We’re either adding functionality – which means writing tests to capture the requirement, and then writing code to make them pass, or we’re making changes to already written code to improve understandability, and to reduce the cost of future modifications. A key aspect of this is that we’re improving the code without changing its behaviour – it doesn’t require new testing, it just requires that the current tests continue to pass. There are many reasons to refactor – firstly, refactoring improves design. This is especially my experience with Puppet – without refactoring, the design of a set of manifests will decay. Refactoring manifests frequently results in a simpler approach, with less code, and fewer chances of mistakes. The more code there is the harder it is to modify, and since one of our objectives as agile programmers is to embrace change, anything that reduces the barrier to making changes is good. Secondly, refactoring makes software easier to understand. By building a habit of refactoring recipes and manifests we not only make our own code easier for future people, or other team members to understand – it’s also a good way to become familiar with other people’s code. Thirdly, refactoring helps to find bugs and design flaws – when we habitually rework code, and expose ourselves and our work to critical scrutiny, these things tend to drop out, resulting in fewer unexpected and unwanted problems in production. Finally, refactoring is actually more efficient in the long run – simply because the biggest impediment to rapid progress is a clumsy or bad design. Without refactoring, a bad design and some bad habits can quickly become a millstone around the team’s neck. Introducing and practising refactoring as a core principle in our infrastructure as code endeavours will make us more productive and better programmers.
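One way to check that a refactoring has left behaviour alone is to compare the resources the code would apply before and after the change, in the spirit of the catalogue diffing mentioned earlier. The Python sketch below assumes you can export those resources as a plain dictionary keyed by a resource identifier. That data shape and both function names are invented for illustration and are not the catalogue format of Puppet or Chef.

```python
# Hypothetical shape: resource id -> attributes, e.g.
# {"Package[nginx]": {"ensure": "installed"}}
Catalogue = dict[str, dict[str, str]]


def catalogue_diff(before: Catalogue, after: Catalogue) -> dict[str, list[str]]:
    """List the resources added, removed or changed between two catalogues."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(key for key in shared if before[key] != after[key]),
    }


def behaviour_preserved(before: Catalogue, after: Catalogue) -> bool:
    """True when a refactoring produced exactly the same resources."""
    return not any(catalogue_diff(before, after).values())
```

A pure refactoring should leave all three lists empty. Anything that shows up is either a mistake or a change in functionality that deserves its own test.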
Refactoring is a huge subject with many deep and important principles. I’m calling for its introduction and adoption as a core Devops principle. In order to do this I’d offer the following guidelines:
- Collect patterns. Design patterns and refactoring seem to go hand in hand. Patterns are the encapsulations of good ideas that came about as the result of refactoring. In that sense, patterns are a target for refactoring activities. Martin Fowler describes the relationship between refactoring and patterns as “patterns are where you want to be; refactorings are ways to get there from somewhere else.”
  Patterns emerge over time, by doing, as I suggested above – by making bold statements, by sharing approaches, and by opening our ideas up to public scrutiny. Within the Devops world, this is beginning to happen…
- Let refactoring happen naturally as a by-product of pair programming. I’d avoid ‘scheduled’ refactoring – refactor according to need, such as when adding new functionality, to aid understanding of the system, when fixing a problem, or when sensing that you’re duplicating yourself.
I love Kent Beck’s claim that he’s not a great programmer – he’s a good programmer with great habits. If ever such an attitude was needed, it’s when the rapid influx of ideas from the software development world meets the traditional domain of the sysadmin. I believe that Infrastructure as Code is a great approach, and one which is central to the Devops way of doing things. However, we’re naïve if we think we can gain the benefits of the programming world without incurring some of the risks and costs. By thinking hard about how to address code design, ownership, standards, review, testing and refactoring, we face up to the implications of this new way of working, and we give ourselves the best possible chance of putting in place the good practices that will help us excel.
About the author
Stephen is a Technical Manager and Devop based in Hampshire, UK. He’s the founder of Atalanta Systems and the author of the Agile Sysadmin blog.