Today we noticed that a feature we released yesterday had a really nasty side effect. While experimenting with a new content structure for AUDI, the only brand we wanted the search engine bots to crawl, we accidentally invited the bots to crawl all content for all brands. As the content of the other brands had not yet been migrated to the new structure, this risked spoiling the search engine ranking of our most important pages and thereby hurting our bottom line. Ouch. It was time for an emergency fix to stop the bots from reading content that might confuse their judgement of our site.
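The fix itself isn't the point of this post, but for context: a typical emergency measure in such a case (purely illustrative, not necessarily what we shipped) is a robots meta tag on the not-yet-migrated pages:

```html
<!-- Hypothetical hotfix: keep crawlers off not-yet-migrated brand pages -->
<meta name="robots" content="noindex, nofollow">
```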
Root Cause Analysis
It’s bad enough that such things happen once. What I want to avoid at all costs is that they happen again. Just fixing it and moving on as usual doesn’t cut it for me. I really want to understand all the factors that led to this error. It’s not about blaming anyone; it’s about learning what happened and, more importantly, what to change to prevent it from happening again.
So, which forces were at play? Our usual process for critical changes like yesterday’s looks like this:
- The Business Owner enters a user story into our backlog. Usually the story describes how a certain feature should look.
- The developer implements the story in a test-driven manner.
- As soon as the story is ready for test, the Business Owner tests it and signs it off.
- If it’s critical enough, I usually do some more rigorous testing for side effects.
- Then the story goes live.
Usually, this process ensures that we avoid releasing badly broken stuff. We have enough quality gates in place to catch nearly all the bad guys. So how could this one sneak through? Thinking about it for a minute, I came up with a list of things to investigate further:
- Was there too much logic in the views? Is the stuff really unit tested?
- We had already developed the same story once before and dropped the result because it didn’t suffice. So this was the second approach to the same issue, and it might be that we didn’t take the side effects as seriously as the first time (a wear-off effect)
- Only “positive” testing. Sign-off testing focuses mainly on “Is the new stuff there?”. I have the feeling that we don’t do enough “negative” testing like “Is an old feature broken now because of the new stuff?”
- As I was busy with other tasks, and we already tested it once, I didn’t give it a second look
Looking at those points, two root causes surface: a lack of well-defined acceptance test criteria and a lack of peer review.
Writing Bullet Proof Acceptance Test Criteria
While fleshing out a story, we write a very detailed concept of the desired effects: What should the new page look like? Should the new links be ‘follow’? And so on. Sometimes we mention invariants like “Keep all other pages as they are” or “This does not apply to that page”. But what we usually leave out are test cases to verify that everything works as expected:
- Given the brand AUDI, the links in the “top-link-box” shall be follow
- Given any other brand, the links in the “top-link-box” shall be nofollow
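Criteria like these also lend themselves to automation. A minimal Ruby sketch, assuming a hypothetical `render_top_link_box` view helper (the real helper and markup will of course differ):

```ruby
# Made-up stand-in for the real view helper that renders the "top-link-box".
def render_top_link_box(brand)
  rel = brand == "AUDI" ? "follow" : "nofollow"
  %(<a href="/#{brand.downcase}/models" rel="#{rel}">Models</a>)
end

# Assert that the rendered links carry the expected rel attribute.
def check_rel(brand, expected)
  html   = render_top_link_box(brand)
  actual = html[/rel="([^"]+)"/, 1]
  raise "#{brand}: expected rel=#{expected}, got rel=#{actual}" unless actual == expected
  actual
end

check_rel("AUDI", "follow")
%w[SEAT SKODA VW].each { |brand| check_rel(brand, "nofollow") }
puts "top-link-box rel attributes OK"
```

A few lines like this, run against the real views, would have caught our bug before release.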
So far it worked out, but this time we got bitten 😦
Good acceptance test criteria guide a tester through all critical points of a new feature. The hardest part is figuring out which side effects might come up and how to test for them. That requires a detailed understanding of the dependencies within the application, and those are best understood by the developers. So it’s a good idea for the developer to add some test criteria as well: he knows better than anyone else what might break.
Peer Review Saves Your Day
As we’re a very small team in which everyone has comparable experience, we trust each other completely. That’s a good thing, of course. But it has the negative side effect that we don’t do enough code reviews. In my experience, everyone profits from a code review: the reviewer practices reading code and giving feedback, and the author practices explaining his code and perhaps gains some additional insight into trouble spots.
The same is true for testing. Too often we, the tech folks, think: business signed it off, so it’s good to go, right? But we should know better. More than once, I’ve double-checked a signed-off story and found bugs in it. So even for testing, peer review is a helpful thing, and it should also improve your test cases (see above).
Actions To Improve The Status Quo
Neither of these root causes can be fixed quickly; both require changing habits rather than just flipping a switch. What might improve the situation with acceptance test criteria is a User Story template which explicitly asks for them. It could look like this:
Steps to verify this story:

*
*
*

Steps to verify nothing related is broken:

*
*
*
Cucumber, a tool that makes feature documentation executable, could be a further step in that direction.
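A hypothetical Cucumber feature for the criteria above might look like this (the step definitions would still have to be written, and the wording is mine, not an existing spec):

```gherkin
Feature: Search engine directives for the top-link-box
  Scenario: AUDI links are crawlable
    Given I am on an AUDI page
    When I inspect the links in the "top-link-box"
    Then their rel attribute should be "follow"

  Scenario: Links of all other brands stay nofollow
    Given I am on a page of any other brand
    When I inspect the links in the "top-link-box"
    Then their rel attribute should be "nofollow"
```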
Finding concrete measures for increasing our frequency of code reviews is a little harder and will need some team discussions. Let’s see what we come up with.
Our tightly knit team is doing a great job already. Still, if a critical bug sneaks through, I want to use it as a chance to further improve our processes and as a reminder to remain vigilant. Continuous improvement of processes is as important as keeping your code clean!