Hitting hard limits

25 July 2014
As a general rule I shy away from writing about work-related incidents while still at the company in question, but in this case the company is putting a lot of resources into analysing (and hopefully correcting) what went wrong, so I decided to draft a contemporaneous article while the details were still fresh in my mind. As usual I won't mention the company or the relevant client, but those out there who know the details I have deliberately omitted will already know who and what I am talking about. In short the company failed to deliver an expected product release, and in my view it was an outcome to be expected given the limitations imposed on the development process.

Background

Although it is an aside for the purposes of this article, my current company uses the (notionally Agile-orientated) Scrum development paradigm, the main feature in question here being the three-week time-boxed sprint duration. Most of the actual development of new features is supposed to be done entirely within a single sprint, and each sprint as a whole finishes with the delivery of a CD image containing the latest release of the software product. This software goes through two levels of testing:
Unit testing
The unit tests are white-box tests that are kept alongside the business logic (i.e. the program code of the product itself), and are run before changes become part of the master repository. Developers can also run them on demand against their local working copy of the source code.
Regression testing
The regression tests are the main product QA (Quality Assurance) procedure, the distinguishing feature being that they consist of end-to-end black-box tests aimed at checking external effects rather than internal software state. These are run overnight and are maintained by a separate systems test team.
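
To illustrate the distinction between the two levels (purely a rough sketch; the function and command names here are hypothetical and not from the actual product), a unit test calls internal business logic directly and asserts on its return value, whereas a regression test drives the installed product from the outside and only checks externally visible effects:

    # White-box unit test: exercises an internal function directly.
    # route_for() is a hypothetical internal function, not a real product API.
    from routing import route_for

    def test_route_selection():
        assert route_for(destination="10.0.0.0/8") == "eth1"

    # Black-box regression test: runs the shipped command-line tool and only
    # inspects its external output, never its internal state.
    # "productctl" is likewise a made-up stand-in for the real product.
    import subprocess

    def test_route_selection_end_to_end():
        result = subprocess.run(
            ["productctl", "show-route", "10.0.0.0/8"],
            capture_output=True, text=True, check=True,
        )
        assert "eth1" in result.stdout
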
Feature requests are handled via stories, which go through grooming (a name with awful connotations) & planning (breakdown into tasks), and completion of these stories is governed by a Definition of Done. ‘Done’ is when a feature is formally available for customer use, and it is a common occurrence for features to be technically available long before they are ‘done’, although it is considered a bad sign for a lot of stories to be in such a state at the end of a sprint. This article concentrates on an incident in which regression testing went wrong as the result of a disruptive story.

Asking for trouble

For some time I had been concerned about limits on what could be done due to the limited time that sprints last. Sprints are notionally three weeks long, but only two of those weeks can really be budgeted for implementation of stories (i.e. use-cases/features), and this was rather tight for many of the ones implemented in the recent past. Even then there had been stories where I felt that the scope of what was implemented had been cut further than it ought to have been, leaving a half-completed Frankenstein's monster that was as likely as not to cause problems down the road. In short there was little slack in the system for things cropping up, yet these were precisely the type of conditions that invite pitfalls.

There was one story in my team a few sprints ago that highlighted an aspect of the whole development approach which I believed would sooner or later result in a major screw-up: the attitude towards effort estimates. It seemed as if these were more of a chore than genuine input into project planning, and I have known deliberately inaccurate estimates to be put forward for purposes of expedience in satisfying bureaucracy. To me this seemed like text-book conditions for a SNAFU, and at the very least showed apathy towards how things are officially meant to be done. As it happened questionable effort estimates were not a factor in the sprint failure, but the prevailing conditions they were indicative of meant it was not much of a surprise that something eventually did go wrong.

The incident

The story that kicked everything off was one that was the remit of another team, but due to its nature it affected everyone. It consisted of a fundamental change to how the software operated, replacing a cross-referencing scheme with one based around hierarchical overriding, and the resulting disruption to other teams as they made the changeover meant that a lot of stories failed to reach definition of done. While in hindsight some aspects could have been handled better, on the whole I think such disruption was inevitable. If done in isolation this particular story might have just about fitted into a sprint, but as part of a larger development effort it was basically a no-hoper for going smoothly. In any case disruption to other teams' stories was not the real problem.

Testing train-wreck

While it did not take that long for other teams to fix things up so that the business logic was working and the respective unit tests passing, the real disruption came because the changes were a major spanner thrown into the machinery that is the regression testing framework. Normally if a test fails the automated clean-up scripts revert the test environment (i.e. the system configuration) back to a known-good start state, but in this case some test failures were leaving the environment in a corrupted state that the clean-up procedures could not rectify. As a result all subsequent tests in a test group also failed, when they were not skipped entirely. It was because of this that by the end of the sprint only 30% of regression tests were confirmed as passing, and therefore the product release candidate was not in a fit state for delivery. It was bad enough that only a quarter of stories were within reach of “done”, but a release delivery failure is a serious matter.
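
As a rough sketch of the clean-up pattern involved (the actual framework is proprietary, so everything below is a hypothetical pytest-flavoured stand-in): each test is followed by a step that restores the environment to a known-good configuration, and once that step stops being able to repair the damage, every remaining test in the group starts from a broken state and fails or is skipped regardless of the code under test.

    import pytest

    def restore_known_good_config():
        """Revert the test environment to its known-good start state,
        e.g. by re-deploying a saved configuration snapshot."""
        ...

    @pytest.fixture(autouse=True)
    def clean_environment():
        yield  # run the test itself first
        # Normally this puts the system configuration back to the known-good
        # start state. In the incident some failures corrupted the environment
        # in a way this step could not rectify, so every subsequent test in
        # the group inherited the broken state.
        restore_known_good_config()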

One of the problems with a stuffed-up regression test system is that bugs that would normally be discovered overnight were going unreported for several days, and together with other coincidental complicating events that added to delays, this all resulted in a large buildup of bugs. It was therefore decided that the current sprint be one dedicated to clearing up this backlog, particularly as a significant portion of the bugs were in themselves major issues. One of the hazards with such bugs is that the fix for them may actually break functionality elsewhere, and this hazard is all the worse when a lot of issues are being fixed at the same time, which is the type of problem rigorous testing regimes are meant to check for. Throw in a blocker bug or two that break things like the software installation procedure, and the pace of progress slows to a crawl.

A real show-stopper in fixing all the issues was a shortage of hardware, as many of the more severe bugs were ones that only manifested themselves when using physical hardware rather than virtualised systems on developers' desktops. For one such bug assigned to me I had to make semi-speculative changes and then wait for the systems test people to do the actual testing. This was a setup that made me decide that coming in over the weekend would not have helped, as I had no direct access to the test systems. I'm personally not too bothered about weekend working, but a lot of other people who have families consider it a red line. In any case the overhang of issues was such that my expectation was that the current sprint would be a delivery failure as well.

Root cause

While there were things that could still have been done differently to contain the problems, the whole story that kicked everything off was risky from the start, as it was clearly hitting hard limits imposed by sprint duration, completeness requirements, and resource availability. The whole build system setup was an accident waiting to happen: 95% of the time it works well, but when things do go wrong it takes a lot of effort to un-stuff the whole system.

Likely outcome

At the time of writing I am back in the UK to clear up some personal business, but looking at my email there still seem to be two or three issues any one of which, if not resolved, would fail the current sprint, and I have doubts whether they can all be closed (i.e. fixed and checked) in time. The expected case is a month knocked out of the overall project schedule by all of this. The company doesn't live dangerously like my previous one did, and my team came out of it quite well, being the only one to clear its backlog and still have enough time to tick off all the done criteria for its stories, so there is no personal risk as a result of all this. Nevertheless I see this as the consequence of underlying problems, and if the company mishandles the enquiry it will put a large dent in my confidence in them.