“When your life depends on speed,” says Rangaswami, “you don’t waste time checking.” The message here is about improving the process so that you can be confident enough to spend less time checking the output. It comes back to Seth Godin’s point about Andon systems - cycles we put into improving the process are significantly more powerful than the same cycles put into checking the output.
Currently the IT media is fawning over Wednesday’s public beta release of Microsoft’s Windows Server Longhorn Beta 3. With most customers planning to wait at least a year before upgrading to Longhorn, much of the focus will be on the organizations willing to play “Russian roulette” by upgrading early. For those that do, there will be some additional overheads: hardware (I’m assuming they are going to run in parallel with current systems for a while), software (obviously) and training. But the main focus will be on the experience of installing, reinstalling, configuring and managing this system. Oh, did I mention reinstalling…?
Obviously many, many companies (us included) rely on Microsoft’s products to run their business, and experience has taught us to expect quick bug patches and late Service Pack releases as the norm. There seems to be an almost audible ‘groan’ in the industry when something goes bad with a Microsoft product, whether it is a slipped release date or a security issue, followed by an ‘understanding’ sense of “Well, it’s Microsoft, they’ll get it fixed soon”.
Joe Wilcox’s article in eWeek, Microsoft’s Big Bang is When?, is an unsurprising stream of unreconciled comments from various product managers, vice presidents and the marketing department - another familiar feature of Microsoft’s reputation over the years.
In my opinion, these issues are largely a reflection of the quality of the various Microsoft products’ code bases. The rewrite of the Vista code back in 2004 must have sent shivers down the spine of management at the time. However, that decision was made on the basis that the quality of the existing code would not have been anywhere near what has been released this week, so the fact that Microsoft actually made such a tough call must be applauded. Such decisions are rarely reported in the media, and seeing them discussed moves the industry toward raising the ‘quality’ issue and demonstrating that quality does matter to us. It reminds me of the ‘prototyping’ practice – how many teams have just continued to build an application on top of prototype code rather than scrapping the prototype and rewriting a better-quality application?
The recent flurry of security holes found in, and patches delivered to, key Microsoft products also serves to taint their reputation. But being the organizational giant they are, with millions of happy users around the world, the image they project with a great marketing campaign and their reputation with home users significantly outweigh their quality problems.
My conclusion? You could not honestly say that code quality plays a big role in Microsoft’s success. You’d have to say that they are better at marketing than at creating high-quality applications. But it’s truly encouraging to see them making bold decisions to improve the quality of their output, which sets a great example for the rest of us, who may not have the, uh, inertia of a company like Microsoft.
After I posted about non-helpful error messages, I got a handful of suggestions for worse examples than the one I cited. This one wins the award for most unhelpful, in my opinion (thanks, Michael), and I’m afraid it came from Lotus Notes again. Don’t get me wrong, I like Notes and use it every day. But this message, on trying to delete a calendar entry? Puhlease! I wasn’t even applying for a job!
My initial reaction to Andrew Binstock’s post on Effectiveness of Pair-wise Tests was: “Excellent! A way of getting a better return on developers’ testing time. Surely that’s what every manager wants!”
Resource allocation for testing legacy code (I steal Michael Feathers’ definition here – legacy code is any code that hasn’t been unit tested) is one of the biggest factors when considering code quality improvement initiatives. It’s the same old story: the development team understands that this should be done and [most] understand the benefits it will bring; however, the business demands that applications get to the customer on time regardless (and I do understand the business reasons for this).
In one example, Binstock cites results from BJ Rollinson’s testing of attrib.exe (which takes a path plus six optional arguments) using pairwise testing:
Minimal: 9 tests, 74% code coverage, 358 code blocks covered
Pairwise: 13 tests, 77% code coverage, 370 code blocks covered
Maximal: 972 tests, 77% code coverage, 370 code blocks covered
77% code coverage represents a very good return for the work involved, but “hold on”, I thought, “this means that almost a quarter of this code is still untested”. I haven’t seen the details, but I would be willing to bet that part of the untested code is exception handling code (please correct me if I’m mistaken, Mr. Rollinson).
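For anyone who hasn’t come across the technique, here is a minimal sketch of the greedy “all-pairs” idea in Java. The parameters and values are invented for illustration (loosely modelled on a command with several optional flags) and are not Rollinson’s actual test matrix; the point is simply that covering every pair of values needs far fewer cases than covering every combination.

```java
import java.util.*;

/**
 * Minimal greedy "all-pairs" sketch. The parameters and values below are
 * invented for illustration; real pairwise tools use smarter construction.
 */
public class PairwiseSketch {

    // Each parameter has a small set of possible values ("" means the flag is absent).
    static final String[][] PARAMS = {
            {"+R", "-R", ""},   // read-only flag: set, clear, or absent
            {"+A", "-A", ""},   // archive flag
            {"+S", "-S", ""},   // system flag
            {"+H", "-H", ""},   // hidden flag
            {"/S", ""},         // recurse into subdirectories
            {"/D", ""}          // process directories as well
    };

    public static void main(String[] args) {
        // Enumerate every pair (paramA=valueA, paramB=valueB) that must be covered.
        Set<String> uncovered = new HashSet<>();
        for (int a = 0; a < PARAMS.length; a++)
            for (int b = a + 1; b < PARAMS.length; b++)
                for (String va : PARAMS[a])
                    for (String vb : PARAMS[b])
                        uncovered.add(a + "=" + va + "|" + b + "=" + vb);

        List<String[]> tests = new ArrayList<>();
        Random rnd = new Random(42);

        // Greedily generate random candidate tests; keep any that cover new pairs.
        while (!uncovered.isEmpty()) {
            String[] candidate = new String[PARAMS.length];
            for (int p = 0; p < PARAMS.length; p++)
                candidate[p] = PARAMS[p][rnd.nextInt(PARAMS[p].length)];

            List<String> pairs = new ArrayList<>();
            for (int a = 0; a < PARAMS.length; a++)
                for (int b = a + 1; b < PARAMS.length; b++)
                    pairs.add(a + "=" + candidate[a] + "|" + b + "=" + candidate[b]);

            if (pairs.stream().anyMatch(uncovered::contains)) {
                uncovered.removeAll(pairs);
                tests.add(candidate);
            }
        }

        System.out.println("Exhaustive combinations: " + (3 * 3 * 3 * 3 * 2 * 2));
        System.out.println("Pairwise test cases generated: " + tests.size());
    }
}
```

Even this naive random-greedy loop produces a small fraction of the exhaustive set, which is the whole appeal of the approach.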
There is no doubt that new methods to get best practices such as unit testing into the mainstream development lifecycle (as efficiently as possible) represent much-needed and welcome progress. I also believe that setting a procedure for unit testing legacy code is even more vital (especially in the short term). Not only can lower-hanging fruit be plucked to make commonly used parts of the application more robust, but one of the most common security violations (the dreaded output to the screen of a stack trace when an error occurs) can easily be eliminated.
This lack of attention to exception handlers is an oversight in almost every application I have seen unit tested. Every developer knows that users have an uncanny knack of using our applications in ways we never intended. When that happens, we need to consider what they see. If you are working for a financial company on some sort of web banking/payments application, what information could your users get access to?
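To make that concrete, here is the kind of JUnit test I would like to see more often: it exercises the failure path deliberately and asserts that the user gets something meaningful rather than a raw stack trace. The class and method names are invented for illustration, not taken from any real application.

```java
import static org.junit.Assert.*;
import org.junit.Test;

/**
 * Hypothetical example: the failure path of a payment lookup gets a unit
 * test of its own, so an error never leaks implementation details to users.
 */
public class PaymentErrorHandlingTest {

    // Stand-in for the real service; invented for illustration.
    static class PaymentService {

        // Simulates the data access layer blowing up.
        private String loadFromDatabase(String paymentId) {
            throw new IllegalStateException(
                    "SELECT * FROM payments WHERE id=" + paymentId + " failed");
        }

        String describePayment(String paymentId) {
            try {
                return loadFromDatabase(paymentId);
            } catch (IllegalStateException e) {
                // The handler under test: swallow the internals and show the user
                // a safe, actionable message instead of the raw exception.
                return "We could not retrieve that payment right now. "
                     + "Please try again or contact support (ref " + paymentId + ").";
            }
        }
    }

    @Test
    public void failureShowsSafeMessageNotStackTrace() {
        String message = new PaymentService().describePayment("42");

        // The user gets something they can act on...
        assertTrue(message.contains("contact support"));

        // ...and nothing that hints at the database or internal classes.
        assertFalse(message.contains("SELECT"));
        assertFalse(message.contains("Exception"));
    }
}
```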
Before I disappeared on vacation last week, I was pointed to the Washington Post article about world-class violinist Joshua Bell, who was convinced to participate in an experiment. Bell posed as a busker and played in a DC Metro station one Friday morning for 45 minutes. It’s a longish article, and worth the time to read through, but in case you don’t have that kind of time available, I’ll tell you the results: out of roughly 1,000 commuters who walked past him, only one recognized him, and he collected a grand total of $32.17.
What does that say about the quality of the music that Bell played (on his $3.5 million Stradivarius violin, by the way)?
To me, it says that the perception of quality is context-specific. If you pay $100 for a seat at Carnegie Hall, you are expecting a level of quality and would be disappointed if the violinist didn’t meet your expectations. Your expectations of quality in a DC Metro station are significantly lower, so even when you are presented with a high level of quality in that context, you don’t recognize it.
How does that relate to better understanding software quality? Well, first, I think there is a basic level of quality that can’t be violated. If I had picked up Bell’s Strad and attempted to play Bach’s Chaconne, I wouldn’t have made a red cent. More likely, I’d have been tossed out of the station (I have never touched a violin in my life). At the other end of the spectrum, Bell’s level of quality is expensive, but has its place - sort of like the space shuttle software, which is often cited as the most expensive but also the most reliable software on the planet (well, strictly speaking not on the planet, but you know what I mean).
The lesson for most of us, I believe, is that we should put thought into the level of quality that is appropriate for our application. Most business users will tolerate a certain level of defects, because the cost of zero defect software (or as close to it as we can achieve) is prohibitive. The problem that we have as an industry, I believe, is that we don’t know enough about the predictors of software quality to be able to make informed decisions about fitting the amount of effort we put into quality to the context in which the software will be used.
Postings will be light to non-existent over the coming week as spring vacation is upon us. I for one will have limited access to blogging facilities…
Lotus Notes threw me an error message the other day: “Object variable not set”. That message probably means something to somebody, but it sure doesn’t mean anything to me, let alone someone in the marketing department who may be even less familiar with software foibles than I am. Many error messages can be resolved to something more meaningful through a simple Google search. Which made me think: isn’t this just laziness on the part of the programmer? Why leave it to the user to go and Google the problem to find out what is going on, when the guy or gal who wrote the code could easily have spent a bit more time explaining what is happening and what my options are to resolve it?
Actually, I don’t think it is laziness. To me, this is another example of the kind of thing that happens all the time in commercial software development - timescales get squeezed and the first thing to go is error handling code. All we can do is push back and make people understand the implications of cutting corners.
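Just to illustrate the difference a few extra minutes makes, here is a hypothetical sketch (the names are mine, not Notes’) of the same failure handled with the user in mind rather than letting the runtime speak for itself.

```java
/**
 * Hypothetical sketch: the same failure, handled with the user in mind.
 * All names are invented for illustration - this is not Notes code.
 */
public class CalendarEntryDeleter {

    void delete(CalendarEntry entry) {
        if (entry.document() == null) {
            // The squeezed-timescale version lets the runtime speak for itself:
            //   "Object variable not set"
            // A few extra minutes here gives the user something actionable.
            throw new IllegalStateException(
                    "This calendar entry could not be loaded, so it cannot be deleted. "
                  + "Close and reopen the calendar and try again; if that fails, "
                  + "contact your administrator and mention the entry's date and subject.");
        }
        entry.document().remove();
    }

    // Minimal stand-ins so the sketch is self-contained.
    interface CalendarEntry { Document document(); }
    interface Document { void remove(); }
}
```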
I posted last week about the benefits of making software metrics presentation engaging; the more compelling you can make the presentation, the more likely users are to engage with the data and get value from it. Yesterday, Jeffrey Fredrick’s post caught my eye and pointed me to the Eclipse-XPS plug-in that shows the results of JUnit test runs using the Dell XPS’ built-in LEDs. Cool stuff. It put me in mind of the Ambient Orb - one of a category of devices that Ambient describes as “glanceable internet appliances”.
I wonder if anyone is using an Orb to monitor their percentage code coverage…
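If anyone does, the glue code is nearly trivial. Here is a hypothetical sketch (the call to the device is just a placeholder, since I haven’t looked at Ambient’s actual interface) that maps a coverage percentage onto a traffic-light colour.

```java
/**
 * Hypothetical glue code: map a coverage percentage onto a traffic-light
 * colour for a glanceable display. sendToOrb() is a placeholder, not
 * Ambient's real interface.
 */
public class CoverageOrb {

    enum OrbColor { RED, YELLOW, GREEN }

    static OrbColor colorFor(double coveragePercent) {
        if (coveragePercent < 50) return OrbColor.RED;     // plenty of untested code
        if (coveragePercent < 80) return OrbColor.YELLOW;  // getting there
        return OrbColor.GREEN;                             // comfortable
    }

    static void sendToOrb(OrbColor color) {
        // Placeholder for whatever the device's actual interface requires.
        System.out.println("Orb -> " + color);
    }

    public static void main(String[] args) {
        // e.g. the percentage pulled from last night's coverage report
        double coverage = Double.parseDouble(args[0]);
        sendToOrb(colorFor(coverage));
    }
}
```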
I spent three hours communicating with HP last night. At about 7:00 in the evening, my home-based HP Photosmart D7360 printer died with the error “Carriage jam. Clear the jam and press OK to continue.” Naturally, the first thing I looked for was a paper jam - no sign of any problems. I Googled the problem and was unable to come up with anything useful. OK, when all else fails, you have to resort to the dreaded Vendor Technical Support.
But this post is not about the quality of the printer, or even the software that controls it. Printers die. Hardware fails. These things happen. This post is about HP support’s buggy online chat system, and more specifically, how some thought should be put into what happens when it fails, and how to recover. I think there are some good lessons in my experience for us all.
So, being the hopeless geek that I am, I decided to forego the phone as a means of communicating with HP tech support, preferring instead to use the online chat system. After a minute or so my call is answered, I explain my problem, and start to go through the usual diagnosis steps. Look for paper jams. Power cycle the printer. Disconnect the USB cable and wait 30 seconds. On and on. Finally, the tech decides that what I have is a hardware error, asks for my serial number, and goes away to check the warranty (the printer is only a couple of months old). Then - bam! The chat session dies. Nothing I can think of will bring it back to life.
Frustrating, yes, but not the end of the world. I start a new chat session (choosing IE as the host this time rather than Firefox, just in case). After a minute or so, my call gets connected to a different tech (possibly in a different location - maybe even a different country, who knows). It takes me a while to explain what just happened, but the tech is sympathetic. Do I have a case number, he asks. No. OK, then there is nothing he can do; we have to start over from scratch. “Is there any paper jammed in the printer?” he asks. You can guess the rest of the story from here - after about another 20 minutes of diagnosis, this time the chat window just disappears. Gone. Nothing. Finally, I resort to using the phone, and eventually, after a total of three hours, a new printer is being shipped to me.
So what’s the lesson, from a software quality standpoint?
I think that as software developers, we have a tendency to think that things will never go wrong - or at least that failures happen so infrequently that we don’t really need to worry about the workflow around them. Exception handlers are usually the least tested parts of an application. After all, they are only going to be exercised once in a blue moon - possibly never - so why should we waste time testing them, or even thinking about how users interact with them? If HP’s chat software had a way of resuming a dropped session - or at least of hooking me back up to the same person I was speaking to when the line dropped - I would be a significantly happier customer.
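To show how little it would take, here is a rough sketch of the missing piece: a session keyed by a case number issued the moment the chat starts, so a dropped connection can be resumed. Every name in it is invented; it shows the shape of the idea, not HP’s actual system.

```java
import java.util.*;

/**
 * Rough sketch of a support session that survives a dropped connection.
 * Every name here is invented - the shape of the idea, not HP's system.
 */
public class ResumableSupportSessions {

    static class Session {
        final String caseNumber;
        final String assignedTech;
        final List<String> transcript = new ArrayList<>();

        Session(String caseNumber, String assignedTech) {
            this.caseNumber = caseNumber;
            this.assignedTech = assignedTech;
        }
    }

    // In a real system this would be durable storage, not an in-memory map.
    private final Map<String, Session> store = new HashMap<>();

    /** Issue a case number the moment the chat starts, not after 20 minutes of diagnosis. */
    Session open(String assignedTech) {
        Session s = new Session(UUID.randomUUID().toString(), assignedTech);
        store.put(s.caseNumber, s);
        return s;
    }

    /** On reconnect, hand back the transcript and (ideally) the same tech. */
    Session resume(String caseNumber) {
        Session s = store.get(caseNumber);
        if (s == null) {
            throw new IllegalArgumentException(
                    "Unknown case number - the customer should never have to start over.");
        }
        return s;
    }
}
```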
How much time do you spend considering workflows under error conditions?
Marketing guru Seth Godin wrote yesterday about how, in any industry, deadlines have a tendency to encourage mediocrity. “Every quarter,” he says, “your company ships new products or services. And every quarter, someone says, ‘under the circumstances,’ or ‘given the deadline’ or ‘with the team we had available’… it’s the best we could do.” He talks about the Japanese manufacturing concept of Kanban (later corrected to Andon), which is a way of stopping an entire production line if a fault is detected, so the fault can be fixed before any more units are produced.
That set me thinking about how that concept could (or should) be applied to software development. We all know the scenario: the estimates have been done, the project is on track. Suddenly, something changes. Our competitor just released a new version of their product, or the CEO just had a great idea for a new feature. Either the release date needs to be brought forward, or we need to add new features that weren’t planned. Either way, we need to do more work in less time.
Of course, it can’t be done. Unless we are willing to give up something else - usually design (or redesign) work, unit test development, or adherence to coding standards - something that will adversely affect the quality of the finished product and will probably come back to bite us later. We justify it to ourselves by saying “I’ll come back and fix it later, when I have time”. Guess what… that time never comes.
I wonder how many of us feel empowered enough to “stop the production line” by refusing to ship a product that we know is going to give us problems downstream.