Raising awareness

October 31st, 2007 by Nigel Cheshire. Posted in Enerjy

Bob Charette pointed me to an article in today’s Wall Street Journal about the Microsoft team that processes all those “Report this problem to Microsoft” messages that get sent any time something crashes in Windows. It’s certainly comforting to me that at least someone actually looks at those things. The article goes on to talk about the trials and tribulations of trying to track down bugs in complex software, and in particular how users are often asked to help with the process:

The experience of dealing with a software bug can involve an emotional journey that starts with helpless blind rage and ends with, if not a Kübler-Ross style of acceptance, then at least a little empathy.

The encouraging thing to me about this article is that it brings us another step toward a more general awareness of software quality issues. There are a few things that can happen to make the industry sit up and take notice of this issue: legislation, the attention of the legal profession, and the raising of awareness.

Last week’s announcement of the formation of SAFECode is interpreted by some as an attempt to head off the threat of legislation governing software quality. We have already seen examples of lawyers picking apart software quality as part of their case. And every time software quality is written about in a major non-trade publication, the cause is advanced.

Baseball statistics and the future of software quality

October 29th, 2007 by Nigel Cheshire. Posted in Software Quality

As a relatively recent import to the United States of America, I am more familiar with soccer (“football”) than with baseball. I became a U.S. citizen in 2004 - the same year the Boston Red Sox reversed the 86-year-old Curse of the Bambino, and won the World Series. And now, three years later, they’ve done it again.

[Image: Babe Ruth]

I haven’t yet made the transition, though, from bandwagon jumper to true fan. That is in part, I think, because this game is so steeped not just in jargon, but in an almost impenetrable array of statistics. The game of baseball must have been invented by statisticians; data about every player, game, team, and ballpark in baseball history is diligently recorded and then sliced, diced and analyzed every way imaginable.

But to what end? It seems to me that the best use of statistics is in helping to predict the future. In other words, we know that when David Ortiz steps up to the plate, he brings a career batting average of .289 and an on-base percentage of .384. (Those stats are .332 and .445 for the 2007 season, by the way.) Those may be good numbers, but what do they tell us about how Big Papi is going to hit today? What about next week, or next season?

In March 2003, when, let’s face it, it seemed pretty unlikely that the Sox were going to win the World Series the following season, Newsweek published an interesting article about the team. It talked about how general manager Theo Epstein was doing something almost unheard of in major league baseball. He was assembling a team of “no-name journeymen” at “relatively short money”:

Why? Largely because Epstein, just 29 years old, is turning to sophisticated computer analyses of baseball statistics to evaluate players, rather than relying on century-old standards like batting average, home runs and runs batted in.

Imagine that. We have baseball statistics going back to at least 1871, and yet it wasn’t until about 120 years later that Oakland GM Billy Beane started to pioneer the idea of actually analyzing the stats to find correlations between current statistics and future results; in other words, to work out which statistics are actually predictors of future performance. Epstein picked up on that idea and backed it up with money to build a team based on those predictors.

You don’t need me to tell you what happened to Epstein’s Red Sox - the “idiots” who “just went out to swing the bats and find the holes”. The very next season, they broke the 86-year-old curse to win the World Series.

So, what does this have to do with software quality? Well, it seems to me that we, as an industry, are in the same state as baseball was a few years ago. Actually, we’re not even that far advanced - we collect some statistics, but not many. But the key issue is that we really don’t know what to do with the stats that we do collect. We have no idea whether unit test coverage, cyclomatic complexity, coding standards violations or anything else are really predictors of future defect rates.

The good news is that the only reason we don’t know these things is that no one has ever done the analysis to find out. This is a solvable problem, and that is what we have committed to doing here at Enerjy. Our team of applied probability experts has spent the past 6 months or so analyzing literally hundreds of different statistics across thousands of source code files and correlating those with defect rate statistics to answer the question: what combination of source code metrics gives us the best indication of future bugginess?

We hope to have that question answered early in 2008. Once we do, we’ll let you know. And by then, we’ll already be looking forward to next season’s opening day.
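
To make the idea of correlating a metric with defect rates a little more concrete, here is a minimal sketch in Java. It is not our actual analysis: it computes a plain Pearson correlation for a single metric, whereas the real question is about combinations of metrics. The class name, method name and the per-file numbers in main are all invented for illustration.

```java
class MetricCorrelation {

    /**
     * Pearson correlation coefficient between two equally sized samples.
     * Returns NaN if either sample has zero variance.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical per-file values, made up purely for illustration.
        double[] cyclomaticComplexity = {2, 5, 9, 14, 20};
        double[] defectsReported = {0, 1, 1, 3, 4};
        System.out.printf("r = %.3f%n", pearson(cyclomaticComplexity, defectsReported));
    }
}
```

A value near 1 or -1 would suggest the metric moves with defect counts; a value near 0 would suggest it tells us little on its own. Correlation is only a first step, of course, and says nothing by itself about cause.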

Glitch Watch - Faulty pacemaker software blamed for death

October 26th, 2007 by Nigel Cheshire. Posted in Glitch Watch

Yet another “disgruntled slot machine winner” story hit the Glitch Watch in-tray this week, as Gary Hoffman, a retired Albuquerque, NM city employee, sued the Sandia Resort and Casino in New Mexico. Hoffman thought he had won almost $1.6m while playing the machine earlier this year (not unreasonable, given that the machine announced his win to him), only to be told that the win was a mistake caused by a software error. Hoffman’s case may never even make it to court, since the casino is owned and operated by a Native American tribe. New Mexico law generally does not allow tribes to be sued in state court over a contract dispute.

Now to Philadelphia, where Quantina Moore-Perry of Greensboro, NC, admitted exploiting a software glitch in home shopping network QVC’s system to the tune of more than $400,000 worth of goods. Apparently, Moore-Perry discovered that if she ordered goods from QVC and then immediately canceled the order, she would be credited, but the goods were still delivered to her. She allegedly worked the scam on more than 1,800 items until someone buying the items on eBay noticed that they arrived in QVC packaging, and alerted QVC. As we have observed before, you can’t necessarily rely on users to report software errors that they find.

Finally, a disturbing story from Charleston, WV, where Augustine Bryant has sued a physician after he implanted a recalled pacemaker in her husband, alleging that this is what caused his death. Bryant’s husband died in October 2005, after the pacemaker, which was implanted in August of that year, failed. According to the Charleston Daily Mail, the pacemaker was recalled by its manufacturer in June 2005 because of a software problem that caused it to malfunction. Although Bryant died more than two years ago, it only just became known that his pacemaker was included in the recall.

Calling Washington, DC

October 24th, 2007 by Nigel Cheshire. Posted in Software Quality

If you are in the Washington, DC area, you will want to check out Stelligent’s roundtable event on TDD on October 30th. I was fortunate enough to attend one of their roundtables in Boston on Agile, and it was a fun time. Andy Glover has an uncanny knack of putting together a really interesting group of people and seeding the conversation in a way that makes it flow. Well worth the price of admission (free ;-))

By the way, these guys are smart. They have conveniently scheduled the event between games 5 and 6 of the World Series. Although, by that time, I expect we will already be celebrating a Red Sox victory.

Beware the wild goose chase

October 22nd, 2007 by Nigel Cheshire. Posted in Software Quality

Error messages, to my way of thinking, give us a fabulous opportunity to help users to help themselves. But they also give us an opportunity to send users on a wild goose chase, if we’re not careful.

Case in point. My wife runs a mail order business. She uses UPS to do her shipping, and she loves their Worldship program, which lets her ship a package from her desktop. Over the weekend, she asked me to help her set up a new workstation with the UPS software.

Here’s how it’s supposed to work. PCs on the same network share a SQL Server database to hold zip code, pricing and shipping history information. You set up one machine as the “LAN Administrator”, and other machines as “Remote workstations”. The LAN Administrator machine is where the database resides. The help files guide you through the process of installing the software on a remote workstation and hooking it up to the LAN Administrator. Sounds simple enough. Everything appears to go well with the installation, until we fire up Worldship on the remote workstation. Then we see this:

[Screenshot: dump2.jpg, the Worldship error dialog]

You can guess what happens next. I will spare you the gory details, other than to reveal what the real problem was. It turns out that you have to go to the LAN Administrator workstation and enable remote workstation access from there (two clicks of the mouse). A small but crucial step that was omitted from the installation instructions. But that’s not my point. Most people don’t read instructions anyhow (I only read them after I got the error message).

I can understand why whoever coded that dialog box thought it would be helpful to add a troubleshooting tip. But an error message dialog is not the place for this kind of advice, unless you can actually be 100% sure that you know what the problem is. There could be many reasons why the application was unable to connect to the main database - in this case, unfortunately, firewall settings were not one of them.

Glitch Watch - Software implicated in death of 9 soldiers

October 19th, 2007 by Nigel Cheshire. Posted in Glitch Watch

A mixed bag this week. First, to Tokyo, where 2.6 million commuters were impacted by a software bug in the automated ticket gates at all 662 stations in the Tokyo metropolitan area. The problem was caused by an overflow in the amount of data being sent to the gates to update them with information about stolen cards. Supplier Nippon Signal Co. worked through the night to fix the problem.

Meanwhile, in Jefferson County, Alabama, the county Finance Committee is struggling with the implementation of a new finance system. In addition to the $9.5m already spent on the system, the county is now recommending hiring 9 additional technicians at a cost of more than $700,000 to fix the remaining problems. The original spring launch date was pushed back by months, then once the system was up and running, vendors were not being paid. Commission President Bettye Fine Collins described the problems as “bumps in the road”, and said that “it’s a great system.”

Finally, to South Africa, where the National Defence Force is investigating whether a software glitch was to blame for the death of 9 soldiers and serious injuries to 14 others during an exercise last Friday. According to spokesman Brigadier General Kwena Mangope, an automated, computer-controlled anti-aircraft gun “opened fire uncontrollably, killing and injuring the soldiers.” According to the gun’s manufacturer, Oerlikon, the gun was never designed to be automated, but the South Africans, still operating under an arms embargo at the time, made the modifications without Oerlikon’s involvement.

By the way, this happened the same week that the U.S. Air Force announced that its new F-35 Lightning II fighter will use speech recognition software “to manage various aircraft subsystems”…

Glitch Watch - Railroad edition

October 12th, 2007 by Nigel Cheshire. Posted in Glitch Watch

More than 2,000 commuters on the Long Island Rail Road (LIRR) had their credit cards billed twice last week when the system received more transactions than usual. Apparently a record 30,000 tickets were sold last Monday morning, compared to a typical daily average of not much more than half that number. The flood of extra transactions caused an overflow of an as yet unknown transaction limit, which then caused some number of cards to be charged twice. At least the LIRR was proactive in its response, sending out 18,000 emails alerting passengers to the problem, and posting a note on its web site.

Meanwhile, down under in Adelaide, Australia, (human) guards have been posted on TransAdelaide trains to protect commuters from doors that allegedly “fly open” while the train is moving. The problem has been diagnosed as a software fault, and the company is reportedly spending $35,000 on an upgrade to fix it. According to South Australia’s Transport Minister Patrick Conlon, the guards have been added out of “an abundance of caution,” and he described media reports that the trains are unsafe as “a beat-up”.

Logging - are we focusing on the right issues?

October 10th, 2007 by Rich Sharpe. Posted in Software Quality

Joseph Ottinger recently posted an informational piece, “Logging API Choices”, on TheServerSide.com. What amazed me about this piece was the fact there were over 30 comments in less than 24 hours. Obviously this is an issue of interest to many people. I wonder why.

With the exception of a very small percentage of applications where formal auditing is required, the main reason for logging is to help with debugging. One commenter estimates that 90% of logging is used for debugging, which I would even suggest is on the low side. Does it really matter which logger we use? Clearly, people find this to be an interesting topic, but I wonder whether this is really what we should be focusing on. Does it add business value? No. Does it add anything to the application for the user? No. Does it increase skill sets for developers? No.

The volume of comments is a testament to the generally accepted view that programming is difficult and rarely goes to plan. How many managers have time budgeted for ‘reviewing logs’ in the project plan? Yet, it happens almost daily, on every project. Also, it would be interesting to discover the number of ‘logging’ lines of code as a percentage of your application (or per KLOC) and to compare the areas of logging to entries in your bug tracking system to determine if there is a correlation between logging and buggy code.
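
As a rough sketch of how you might measure that, the Java below counts logging lines per thousand non-blank lines. It rests on assumptions: the class and method names are made up, and the regular expression only recognizes calls on a variable named log or logger (in any letter case) using the usual level methods, so it will miss other conventions and ignores comments entirely.

```java
import java.util.List;
import java.util.regex.Pattern;

class LoggingDensity {

    // Assumed convention: a variable named log/logger (any case) calling a level method.
    private static final Pattern LOG_CALL = Pattern.compile(
            "\\b(?i:log|logger)\\s*\\.\\s*(trace|debug|info|warn|error|fatal)\\s*\\(");

    /** Logging lines per 1,000 non-blank lines; 0 for an empty file. */
    static double loggingLinesPerKloc(List<String> lines) {
        long nonBlank = 0;
        long logging = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not count toward the total
            }
            nonBlank++;
            if (LOG_CALL.matcher(line).find()) {
                logging++;
            }
        }
        return nonBlank == 0 ? 0.0 : 1000.0 * logging / nonBlank;
    }
}
```

Run per file or per package, a figure like this could then be lined up against bug tracker entries for the same areas of the code.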

Rather than spend time discussing which logger to use, I would think it would be better for teams to discuss why the logging is there in the first place. It could be an indication that the code is too complex and could benefit from refactoring.

The missing link - how coding standards relate to bugginess - part 1

October 8th, 2007 by Mark Dixon. Posted in Coding Standards, Software Quality

Nigel posted a couple of times last week about the importance of coding standards. This has been a big deal for me ever since the first day, 15 years ago, that I started working on a project that was large enough to need two developers. The other developer on my team was talented and wrote honest, reliable code. The problem was, it was never clear to me that his code was that good. He just wasn’t an idiomatic programmer and so he might code a loop one way on Monday and the same loop a completely different way on Tuesday. Just simple changes in naming and use of conventions would have increased my productivity significantly as I tried to absorb and understand his code.

Intuitively, it seems clear that code you can read quickly is code that you can understand quickly. Which means faster, more thorough code reviews, and more accurate maintenance. If that’s true then, to cut straight to the point, you’d expect code that doesn’t follow standards to contain more bugs than code that does. This is a hypothesis that we can test, and one of the things we’re working on here at Enerjy is finding out whether there is a link between coding standards and bugginess. In preparation for that, I’ve been running static analysis over several major open source projects. We’re not ready to talk about our final results yet, but in the meantime, I’d like to share some interesting findings from the static analysis.

For context, this is an ongoing project that we will continue to blog about. But as of now, we have analyzed about 8,300 files across 11 open source projects.

First the good news. There are two main categories of violation that never showed up throughout the entire life of the projects. The first was naming standards. One of the best things Sun did from day one with Java was to have an opinion on naming style and to religiously stick to it in all of their tutorials, books and articles. Open source contributors appear to have been paying attention; not once did we find a class, interface, or interface method name that deviated from Sun’s original standards. There was a little more variance in method names, with 86 files using non-standard method names. Field names were even more diverse (256 files with non-standard field names) although these were primarily in three projects that had their own, slightly different coding standards.

The second, slightly surprising, category of violation that never showed up, relates to use of esoteric language features like finalization and customized serialization. Not once did anyone declare a finalize method incorrectly, or attempt to explicitly call one. No-one ever caught ThreadDeath without re-throwing it. And the serialization customization methods readObject and writeObject were at least correctly declared. Although this initially took me by surprise, I now think it makes sense. I’ve been coding Java for maybe seven years and I still don’t fully get serialization, which means I have to reach for the JPL book every time I need to use it. I’m guessing I’m not alone in this, which means that these more obscure features tend to be used carefully, and with reference to the docs each time. I’m not sure making your API so complicated that users need to reach for the manual every time they use it is the best design approach, but it does seem to make sure people use the API correctly!

That was the good news. The bad news is that time was wasted by developers tracking down and fixing bugs that could have been caught instantly (using static analysis) as the code was written. I’m not going to name names, but I thought it might be interesting to share some of the code snippets that could probably have been better written. I won’t spoil the fun by explaining all the bugs - if anything isn’t obvious then feel free to post a comment.

This went 25 days before it was fixed:

String url_str;
...
url_str.replaceAll( " ", "" );

Not to be outdone, this version of the same problem existed for over 3 years until IntelliJ IDEA’s code audit found it:

String line;
while ((line = reader.readLine()) != null) {
    line.trim();
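
For anyone who wants a hint on the two snippets above: Java strings are immutable, so methods like replaceAll and trim return a new string rather than changing the one they are called on. The sketch below contrasts discarding that result with keeping it; the class and method names are invented for the example.

```java
class ImmutableStringDemo {

    /** Mirrors the buggy pattern: the trimmed copy is thrown away. */
    static String discardResult(String s) {
        s.trim(); // returns a new String, which is ignored; s is unchanged
        return s;
    }

    /** Keeps the result by assigning it back. */
    static String keepResult(String s) {
        s = s.trim();
        return s;
    }
}
```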

Anyone care to guess what value this code returns? Fortunately this bug only lasted two days:

try
{
    if(...)
    {
        return "Path Finder";
    }
}
finally
{
    return "Finder";
}
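
If you would rather run it than guess, here is a self-contained sketch of the same shape. The elided condition in the original is replaced by a boolean parameter, which is my assumption, as are the class and method names. A return inside finally replaces any return already pending from the try block.

```java
class FinallyReturnDemo {

    @SuppressWarnings("finally")
    static String name(boolean condition) {
        try {
            if (condition) {
                return "Path Finder"; // this value is set aside while finally runs
            }
        } finally {
            return "Finder"; // always wins, whatever the try block did
        }
    }

    public static void main(String[] args) {
        System.out.println(name(true));
        System.out.println(name(false));
    }
}
```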

I’ll end with one of my favorites - probably because I do it so often myself. This one took 5 months to find:

public void setPropertyOnDelete(String propertyOnDelete) {
    propertyOnDelete = propertyOnDelete;
}

This one only took two months:

protected PluginViewWrapper(PluginViewModel _model)
{
    model = model;
}
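
Both of these are self-assignments. In the setter the parameter shadows the field, so the parameter is assigned to itself; in the constructor the parameter is called _model, so the field is assigned to itself. Either way the field never receives the new value. A minimal sketch of the setter case, with a hypothetical class around it and a getter added so the effect is visible:

```java
class SelfAssignmentDemo {

    private String propertyOnDelete;

    /** Buggy: the parameter shadows the field and is assigned to itself. */
    void setBroken(String propertyOnDelete) {
        propertyOnDelete = propertyOnDelete;
    }

    /** Fixed: 'this.' makes the left-hand side refer to the field. */
    void setFixed(String propertyOnDelete) {
        this.propertyOnDelete = propertyOnDelete;
    }

    String getPropertyOnDelete() {
        return propertyOnDelete;
    }
}
```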

The point here is not to single out any projects in particular. These are mistakes that we’ve all made dozens of times throughout our careers. However, there is no need for these mistakes to last beyond the first time you save your file, and I think it is an important move forward for our industry to understand that.

Glitch Watch - Newt Gingrich encounters naked Second Lifer

October 5th, 2007 by Nigel Cheshire. Posted in Glitch Watch

The ongoing saga of problems at the Los Angeles Unified School District (LAUSD) that we reported about a month ago has triggered the ire of columnist Ray Richmond, writing for the LA Daily News. “How could it be,” asks Richmond, “that teachers - already perpetually underpaid - would be forced to take out second mortgages, bail on loans and uproot their families over a chronic software glitch?” He reports that, in addition to the $55m that has already been paid to Deloitte for the system, the LAUSD has opted to pay Deloitte a further $37m “to repair something that has never in fact worked.” As of today, about 5,000 employees are expected to receive inaccurate paychecks, according to the LA Times - that number showing no decline from the previous two months. Although some have been underpaid, the vast majority of the errors have been overpayments.

To Macau now, where casino gaming supply company Shuffle Master said that 47 of its slot machines have been shut down awaiting software fixes after a player thought he had won 42 million Hong Kong dollars, only to be told that he hadn’t. By the way, 42 million Hong Kong dollars is about $5.4m USD - enough to get excited about, I would think.

42 million Hong Kong dollars is also about 1.4bn Linden Dollars, the currency of Second Life, which is where our favorite story of the week comes from. Former U.S. House Speaker Newt Gingrich decided to host a Q&A session on Second Life last week. According to the Atlanta Journal-Constitution, on arrival at the digital amphitheater, Gingrich was immediately approached by “a lovely young digital lady, who arrived moments before her clothes did.” According to David Cassel, one Second Life user commented: “Materializing nude just moments before your clothes are rendered on the screen in Second Life is just as likely to be a software glitch as a deliberate act.”