You’ve heard it before, right? The standard answer given by so many developers when faced with a broken feature on the test server: “…but it runs on my box…”. Oh yeah, one of my favorites. You’re supposed to get this released and they can only come up with this lame excuse.
Why does every developer on the planet answer in this same way? It’s probably safe to assume that, indeed, it does run on their box. So what’s the real problem here? The developer wants to see his feature running live in production just as much as you and the product owner. Now you tell him that his code doesn’t even run. Even if you didn’t intend to blame him, he’ll always interpret it that way. Because his brainchild does not perform, he takes it personally. He really put a lot of effort into designing and coding. He created clean, tested code, and made sure that his feature really delivers what the users need. After weeks of effort, you come back to him and say that it doesn’t work. Such motivation killers don’t exactly create a healthy work environment, much less foster professional respect.
To get this feature back on track, the first thing you have to do is align with the developer. Make sure he understands that both of you are on the same team and have the same goal. Then try to work through all the things the developer changed on his machine. What’s different from the deployment environment? Did he install any additional libraries? Did he tweak a config file or registry setting? Which versions of certain libraries are installed on his box and on the target system? Sometimes there are even multiple versions of the same library installed. So which one is really used by the application?
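If the application happens to be written in Python (an assumption purely for illustration, the same idea applies to any stack), the version question can be answered mechanically instead of by memory. The sketch below takes a snapshot of the installed packages on each machine and lists the mismatches. The function names `installed_versions` and `diff_versions` are made up for this example.

```python
from importlib import metadata


def installed_versions():
    """Return {package name: version} for the current Python environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            versions[name.lower()] = dist.version
    return versions


def diff_versions(dev, target):
    """List (package, dev version, target version) for every mismatch.

    A version of None means the package is missing on that machine.
    """
    differences = []
    for name in sorted(set(dev) | set(target)):
        dev_version = dev.get(name)
        target_version = target.get(name)
        if dev_version != target_version:
            differences.append((name, dev_version, target_version))
    return differences
```

You would run `installed_versions()` on the developer’s box and on the test server, save both results (as JSON, for example), and feed them to `diff_versions`. The output goes straight into the wiki page.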
While you walk through all those points, make sure you take notes or, even better, document every command you run in a wiki page for later reference. I copy and paste every command we run during such an exercise, describe the exact location of menu items, note down every input into any configuration screen, and even take some screenshots. You can’t get too detailed. On the contrary, it drives me crazy when I’ve managed to get something running once but, a month later, find myself clicking through a bunch of menus hunting for that one crucial option that makes it run.
Let’s assume you finally discovered all the necessary things to start up the application on the test system. You’ve already come a long way by documenting all the requirements. The next developer will be able to take your wiki page and install the system with ease. Right?
Wrong. Another gotcha is lurking just around the corner. Oftentimes, developers work on an empty development database (read about the trouble with using an empty development database), and their machines have different hardware configurations from the reference and production systems. The feature might even have been developed on Windows while deployment happens on Unix-based servers.
The differences in database load, hardware specifications and operating system environments can lead to many unpleasant side effects. While a single development box might have 4 GB of RAM, the deployment environment might run on several virtual machines, each with only a fraction of that 4 GB available. Since the developer never had to consider memory constraints, the application might bring down the server pretty quickly.
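One way to make a memory constraint visible before deployment is to measure it on the developer’s box. Here is a rough sketch, again assuming a Python code base, that uses the standard library’s `tracemalloc` module to compare the peak memory of a single call against a budget. The helper names and the budget figure are hypothetical, and `tracemalloc` only sees memory allocated through Python’s allocators, so treat the number as a lower bound rather than the full footprint of the process.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return the peak memory Python allocated while it ran."""
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def fits_budget(func, budget_bytes, *args, **kwargs):
    """Return True if func stays within the given memory budget."""
    return peak_memory_bytes(func, *args, **kwargs) <= budget_bytes


def load_everything():
    # Stand-in for a feature that reads a whole data set into memory.
    return bytearray(5_000_000)


if __name__ == "__main__":
    budget = 1_000_000  # hypothetical limit taken from the target VM
    print("within budget:", fits_budget(load_everything, budget))
```

A check like this can run as part of the test suite, so the developer hears about the limit from a failing test instead of from a crashed server.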
So you have to talk with the developer again and make him aware of the limitations and non-functional requirements of running the application in production. I’ve seen a lot of developers who simply aren’t aware of all these things. They were never trained to think about what might happen to the application when it’s actually used. In fact, only very experienced developers seem to have a broad overview of the additional tasks that need to be considered during development. Thus, it’s critical that experienced developers, architects and operations guys sit together with the more junior developers and walk through their code to get a better idea of what it might mean to actually run it in production. In What Sysadmins Want, Dan has addressed some of the requirements of web operations in greater detail.
What are your experiences with the “It runs on my box” syndrome? How do you address such problems in your environment?