Benjamin Chun
Create A Release Management Team
This team’s primary responsibility is constructing the mechanisms to deploy the best configurations into production. They won’t be doing the builds; instead, they’ll be designing them. You’ll want your best and brightest for this critical task. Take the most senior IT operations folks and get them out of reactive firefighting and into a more proactive, structured regimen of release engineering designed to decrease downtime and increase your successful change rate. To reduce complexity and cost while improving the manageability of your operations environment, note the following benefits of rebuilding infrastructure rather than repairing it (in increasing order of importance):
- An automated rebuild takes a known amount of time, whereas firefighting almost always takes longer than the original estimate
- It introduces less configuration variance, as opposed to repeated break & fix cycles (“What all did we just change to get it to work now?”)
- Rebuilt systems are usually better documented and less complicated, so they can be monitored by junior staff, which frees senior staff from firefighting
- When senior staff break free of firefighting, they can work on new build projects that fix other systemic issues (addressing the real root causes instead of just the symptoms)
Create A Repeatable Build Process
From the “Find Fragile Artifacts” step I introduced last week, we have our starting point for this next project – namely, replacing fragile infrastructure with stable builds that can be run by junior staff. Infrastructure blueprints should be stored in an online, protected, backed-up repository known as a Definitive Software Library (DSL). This library houses not only all the relevant software applications your environment requires to operate, but also license keys, patches, etc. The release management team should now consider the following:
- Identify the common set of essential services and components used across your infrastructure (OSs, applications, business rules, and data)
- Create a list of standardized components, called a “build catalog.” Look for ways to create components that can be reused and combined to create standardized configurations (e.g. Apache and Oracle both installed on a Solaris OS)
- For each component in the build catalog, create a repeatable build process that generates it. The goal is to have an automated build system that can provision the package by “pushing a button” (e.g. NIM for AIX or Jumpstart for Solaris) – see the sketch after this list
- Any testing environment should be isolated from the production network to ensure that it does not disrupt production systems and to make sure that all dependencies outside the test environment are fully documented and understood
- Ensure that you can recreate each system with a bare metal build. The goal is a repeatable process that eliminates tedious, error-prone manual work (e.g. virtual images from Xen or VMware)
- For critical HA or load-balanced environments, develop reference builds that can provision a box from bare metal without human intervention (e.g. triggered as a workflow action by a certain Nagios alert)
- When the build engineering process has been completed, store the resulting builds in the DSL
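To make the “push a button” idea concrete, here is a minimal sketch of a catalog-driven build driver. The catalog entries, paths, and provisioning commands are hypothetical stand-ins for whatever NIM, Jumpstart, or your own wrapper scripts provide:

```python
#!/usr/bin/env python
"""Minimal 'push a button' build driver -- a sketch, not a real tool.

Assumes a build catalog mapping component names to idempotent
provisioning commands (e.g. wrappers around NIM or Jumpstart).
Every name and path below is a hypothetical illustration.
"""
import subprocess
import sys

# Hypothetical build catalog: component name -> provisioning command.
BUILD_CATALOG = {
    "solaris-base": ["/opt/builds/bin/provision-solaris-base"],
    "apache-stack": ["/opt/builds/bin/install-apache", "--version", "2.2"],
    "oracle-stack": ["/opt/builds/bin/install-oracle", "--edition", "standard"],
}

def build(host, components):
    """Apply each catalog component to a host, stopping on the first failure."""
    for name in components:
        cmd = BUILD_CATALOG[name] + ["--host", host]
        print("building %s on %s ..." % (name, host))
        rc = subprocess.call(cmd)
        if rc != 0:
            sys.exit("build of %s failed on %s (exit %d)" % (name, host, rc))
    print("%s rebuilt from catalog: %s" % (host, ", ".join(components)))

if __name__ == "__main__":
    # Usage (hypothetical): ./build.py web04 solaris-base apache-stack
    build(sys.argv[1], sys.argv[2:])
```

The point isn’t the particular commands; it’s that every box is described entirely by catalog entries, so a junior admin can rerun the same build and get the same result every time.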
Create And Maintain The Definitive Software Library (DSL)
Let’s face it – it can be downright horrifying having to check out some source code from a development repository for a release at 1am. As a sysadmin, you cringe while executing the build script, cross your fingers, and plead: “Will it build without any errors?” If you weren’t able to convince the development group to adopt a Committer Role, here’s your second chance. Let the release management folks build it for you and commit it into the DSL – at 1pm, when they can easily track down the developer who made that “oh, just one more quick fix” commit – for the operations team to confidently release.
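A daytime “build and commit to the DSL” step can be quite small. In this sketch I assume the source lives in a Subversion repository, builds with make, and the DSL accepts artifacts via a drop directory plus a checksum record; the repository URL and all paths are made up for illustration:

```python
#!/usr/bin/env python
"""Sketch of a daytime 'build and commit to the DSL' step.

Assumptions for illustration: source in Subversion, built with
'make dist', delivered to a DSL drop directory with a checksum
record. All paths and URLs are hypothetical.
"""
import hashlib
import shutil
import subprocess
import sys

REPO = "http://svn.example.com/app/tags/release-1.4"  # hypothetical release tag
WORKDIR = "/tmp/release-1.4"
DSL_DROP = "/srv/dsl/incoming"

def run(cmd, **kw):
    """Run a command, aborting the whole release build on failure."""
    if subprocess.call(cmd, **kw) != 0:
        sys.exit("failed: %s" % " ".join(cmd))

def build_and_commit():
    run(["svn", "export", REPO, WORKDIR])   # build from a tag, never from trunk
    run(["make", "dist"], cwd=WORKDIR)      # assume 'make dist' produces app.tar.gz
    artifact = WORKDIR + "/app.tar.gz"
    digest = hashlib.sha1(open(artifact, "rb").read()).hexdigest()
    shutil.copy(artifact, DSL_DROP)
    # Record the checksum so a later DSL audit can verify the artifact.
    with open(DSL_DROP + "/app.tar.gz.sha1", "w") as f:
        f.write(digest + "\n")
    print("committed app.tar.gz (%s) to DSL drop" % digest)

if __name__ == "__main__":
    build_and_commit()
```

Of course, a drop directory is only useful if the DSL itself is governed: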
- Designate a manager to maintain the DSL who will be responsible for authorizing the acceptance of new applications and packages
- Create an approval process for accepting items into the DSL
- Establish a separate network to house the DSL and the required build servers (e.g. a DMZ)
- Any software accepted into the DSL (both retail and internally developed applications) must be under revision control
- Audit the DSL to ensure that it contains only authorized components (a minimal audit sketch follows this list)
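That audit can be as simple as comparing what is on disk against a signed-off manifest. This sketch assumes a manifest maintained by the DSL manager with one “<sha1>  <filename>” line per approved package; the paths are hypothetical:

```python
#!/usr/bin/env python
"""Sketch of a DSL audit: flag anything not in the approved manifest.

The manifest format and all paths are assumptions for illustration --
one line per approved package: '<sha1>  <filename>'.
"""
import hashlib
import os

DSL_DIR = "/srv/dsl/packages"            # hypothetical DSL location
MANIFEST = "/srv/dsl/approved.manifest"  # maintained by the DSL manager

def sha1(path):
    """Checksum a file in chunks so large packages don't exhaust memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def load_manifest():
    approved = {}
    with open(MANIFEST) as f:
        for line in f:
            digest, name = line.split()
            approved[name] = digest
    return approved

def audit():
    approved = load_manifest()
    on_disk = sorted(os.listdir(DSL_DIR))
    for name in on_disk:
        digest = sha1(os.path.join(DSL_DIR, name))
        if name not in approved:
            print("UNAUTHORIZED: %s (not in manifest)" % name)
        elif approved[name] != digest:
            print("TAMPERED:     %s (checksum mismatch)" % name)
    for name in set(approved) - set(on_disk):
        print("MISSING:      %s (approved but absent)" % name)

if __name__ == "__main__":
    audit()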
Close The Loop Between Production And Pre-Production
So you have a beautiful test environment that faithfully mirrors your production infrastructure – or so you think. Do you have automated detection of changes on all production servers? Are you sure that not a single developer has access to production? Even if only your operations staff has access, most systems nowadays comprise thousands of files and perhaps hundreds of separate configurations, so the odds are against anyone remembering to document every single production change. You have to use automated configuration management tools (like Cfengine or Puppet) to ensure that production builds stay in sync with the golden builds in your DSL.
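As a toy illustration of what those tools automate, the sketch below checksums production config files against a golden manifest exported from the DSL and, on drift, triggers a rebuild rather than a hand-fix. The manifest path and rebuild command are assumptions, not real tooling:

```python
#!/usr/bin/env python
"""Sketch of closing the loop: detect drift, then rebuild rather than
repair. A simplified stand-in for what Cfengine or Puppet automate;
the manifest path and rebuild command are hypothetical.
"""
import hashlib
import subprocess
import sys

GOLDEN_MANIFEST = "/srv/dsl/golden/web-server.manifest"  # '<sha1>  <path>' per line
REBUILD_CMD = ["/opt/builds/bin/rebuild-from-dsl", "web-server"]  # hypothetical

def sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def drifted_files():
    """Yield each file whose checksum no longer matches the golden build."""
    with open(GOLDEN_MANIFEST) as f:
        for line in f:
            want, path = line.split(None, 1)
            path = path.strip()
            try:
                if sha1(path) != want:
                    yield path
            except IOError:
                yield path  # a missing file counts as drift too

if __name__ == "__main__":
    drift = list(drifted_files())
    if drift:
        for path in drift:
            print("DRIFT: %s" % path)
        # Close the loop: reprovision from the golden build in the DSL
        # instead of hand-editing files back into shape.
        sys.exit(subprocess.call(REBUILD_CMD))
    print("in sync with golden build")
```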
If you’ve made it this far, you should have reduced unplanned work to 15% or less. The IT Process Institute reports that by drastically reducing configuration counts, you can significantly shift staffing from unplanned to planned work, and consequently increase the server-to-sysadmin ratio. Now you’ve created a release management team, established a repeatable build process, built a Definitive Software Library, and closed the loop between production and pre-production.
Stay tuned next week for some concluding words about the Visible Ops Handbook and how to pull it all together!
I just discovered this blog. Really enjoying the emphasis on automation and actually doing things rather than talking about them. Last week I had an ITIL software vendor tell me that their CMDB wasn’t actually usable (I couldn’t get at the data to use for application runtime config), but was there for “management”.
That’s why Agile Web Operations is particularly welcome reading right now.
Thanks, Julian!
I also think it’s time that we stop paying lip service to best practices and start doing things the right way. But, convincing those “management” guys to give us the necessary resources (time, tools, hardware) has always been a big stumbling block.
I certainly hope folks take these ideas into work with them and convince their bosses that they really can improve things.