- People should be able to easily recognize what the metric means;
- The method of calculating the metric should be shared and easy to understand;
- The data should be derived relatively easily from a trusted source;
- It should be actionable, i.e. when the metric goes off target, it should be clear what actions are needed to correct it.
- Professionalism: As long as the development team understands why the metrics are being used, i.e. the business value they provide or the quality process they are part of, I believe developers will support the effort. As professionals, they should understand that their work needs to be measured against organizational or industry standards (especially in a regulated industry), but oftentimes managers do not spend enough time explaining why quality metrics are being put in place.
- Education: Integrating metrics into a feedback loop provides a great educational tool. Visibility of common mistakes helps developers self-improve, and steps can be taken to fill skill gaps. In the long term, that will lead to more job satisfaction. Too often, metrics are not shared widely enough or discussed with the team on a regular basis.
- Pride: Software development is something that people want to do! I have never met anyone who was forced into the job because they could not find anything else to do. Most developers want to produce the highest quality work possible, and they are proud of their output. If somebody/something can easily point out how they can improve their craft, the more mature developers will pay attention, as long as the metrics have some credibility.
I have always been suspicious of using the McCabe Cyclomatic Complexity metric (which counts the number of linearly independent paths through a module) as a measure of quality. Common sense dictates that the more paths through a program, the more complicated it is; but does that really mean parts of my program have bugs if they have a high McCabe value?
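As a rough illustration (my own, not taken from McCabe’s paper), the value for a single method works out to the number of decision points plus one, although exactly which constructs count as decision points varies a little from tool to tool:

```java
// A contrived example. Counting one decision point for the if, one for the
// for loop and one for each case label gives four, so most tools would report
// a cyclomatic complexity of roughly 4 + 1 = 5 for this method.
public class OrderClassifier {
    public String classify(int quantity, int[] codes) {
        if (quantity <= 0) {                 // decision point 1
            return "rejected";
        }
        for (int code : codes) {             // decision point 2
            switch (code) {
                case 1:  return "priority";  // decision point 3
                case 2:  return "standard";  // decision point 4
                default: break;
            }
        }
        return "review";
    }
}
```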
The metric was created in 1976, and there is little published evidence from real-world projects over the last 30 years to indicate how useful it actually is as a measure of quality. Many people are somewhat familiar with the thresholds in the SEI table, although there has been precious little research that attempts to correlate a McCabe value in excess of, say, 10 with an increased probability of “bugginess.”
Over the last year, we have performed a historical analysis of tens of thousands of source code files, applying individual metrics, McCabe among them, to each file. For each file, we analyzed the metric values alongside the defect rate for that file and correlated the two.
The graph below shows the correlation of Cyclomatic Complexity (CC) values at the file level (x-axis) against the probability of faults being found in those files (y-axis).
The results show that files with a CC value of 11 had the lowest probability of being fault-prone (28%). Files with a CC value of 38 had a 50% probability of being fault-prone. Files with CC values of 74 and up were determined to have a probability of 98% or more of being fault-prone.
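To make the idea concrete, here is a much-simplified sketch of how a chart like that could be derived. This is not the model described in our technical paper; the FileStats type and its fields are made up purely for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Purely illustrative: bucket files by cyclomatic complexity and compute the
// fraction of files at each CC value that had at least one recorded defect.
public class FaultProbabilityByComplexity {

    static class FileStats {
        final int cyclomaticComplexity;
        final boolean hadDefect;
        FileStats(int cyclomaticComplexity, boolean hadDefect) {
            this.cyclomaticComplexity = cyclomaticComplexity;
            this.hadDefect = hadDefect;
        }
    }

    static SortedMap<Integer, Double> faultProbability(List<FileStats> files) {
        SortedMap<Integer, int[]> buckets = new TreeMap<Integer, int[]>(); // cc -> {faulty, total}
        for (FileStats f : files) {
            int[] counts = buckets.get(f.cyclomaticComplexity);
            if (counts == null) {
                counts = new int[2];
                buckets.put(f.cyclomaticComplexity, counts);
            }
            if (f.hadDefect) {
                counts[0]++;
            }
            counts[1]++;
        }
        SortedMap<Integer, Double> probability = new TreeMap<Integer, Double>();
        for (Map.Entry<Integer, int[]> entry : buckets.entrySet()) {
            probability.put(entry.getKey(), (double) entry.getValue()[0] / entry.getValue()[1]);
        }
        return probability;
    }
}
```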
From this analysis, my suspicions about this metric have been laid to rest: if we know nothing else about a file except that it has a high Cyclomatic Complexity value, we now have more reason to believe that it is likely to cause problems (keeping in mind, of course, that there are no guarantees about anything in this life).
Our technical paper provides more information about how the data was collected and how the model to determine fault-prone files was applied.
As you may know, we’ve spent much of the past year analyzing which code metrics actually turn out to be good predictors of bugginess. We found that the McCabe Cyclomatic Complexity metric is pretty effective at predicting how buggy a piece of code is likely to be. It’s one of the many metrics that go to make up the overall Enerjy score.
Rich has pulled out some interesting data from the analysis, and will be blogging about it in more detail later this week. But meanwhile, he stuck his video camera in the faces of a few well-known speakers, authors and bloggers to see what they thought.
Last week, an article on sqazone reported on the results of an independent study commissioned by Forrester Consulting into large development organizations. The conclusion, in a nutshell, was that “the cost and complexity of metrics collection, and the reliance on superficial metrics – conspire to deter application development organizations from attempting to improve their metrics programs.”
This is a sad but true observation on an area we have been evangelizing about for a few years now. Implementing a metrics and measurement program is not easy, and interpreting the data and feeding it back into the SDLC in a meaningful way is harder still.
Coincidentally, I asked several speakers at the Agile Development Practices Conference in Orlando, FL, why they thought organizations were slow to create or adopt a formalized metrics program. Here are their thoughts.
If you’ve been reading here for a while, you will know by now that we have been doing a lot of number-crunching here at Enerjy lately. We’re on a quest to find a correlation between metrics and defect rates, and to answer the question: “what metrics are good predictors of bugginess?”
There is a different, but in a way very similar, quest going on at Netflix, where, for the past 13 months or so, they have been running a competition to try to improve their recommendation engine. That’s the software that figures out that, if I gave Bambi five stars, I would probably also enjoy watching Reservoir Dogs. Clearly, improving the recommendation engine is important to Netflix: back in October 2006 they offered up $1m as a prize to anyone who could improve their current algorithm by 10%. Despite the best efforts of the nearly 24,000 teams working on the problem, no one has achieved the 10% improvement yet, and as of now (December 2007), the best improvement stands at 8.5%.
But the job of a recommendation engine is very similar to what we are doing here. We’re analyzing tens of thousands of source code files, collecting data on a couple of hundred metrics for each one, and then statistically correlating those metrics with the number of defects found in each file. The good news is that we’ve found some strong correlations between certain combinations of metrics and defect rates, which allow us to make a pretty accurate prediction of whether the code we are looking at is going to be bug-prone or not. We’ll be launching a product based on that analysis in the new year. Once we’re done with that, maybe we’ll take a crack at the Netflix Prize…
By the way, if you’re interested in reading more about number crunching, and some of its applications in everyday life, I can highly recommend Super Crunchers by Ian Ayres. Fascinating stuff.
Last week, I had the opportunity to have lunch with Zach Gemignani, co-founder of Juice Analytics. Juice is mainly focused on business analytics, but their interest (and, clearly, talent) in the area of data visualization had me add their corporate blog to my Google Reader a while ago.
If you’ve had any interaction with Enerjy in the past, you’ll know that we are big on metrics. And, after sitting with Zach and eating lunch while looking out the window at Motif Number 1, I thought about his post back in July about the importance of choosing the right metric. This is a great post that really lays out a process for thinking about what metrics to track, something that we don’t spend nearly enough time on, in my opinion.
We often find ourselves with an uphill battle when it comes to code quality metrics, simply because people’s experience of tracking the wrong metrics has given the whole subject a bad name. The most often quoted example in our world is measuring lines of code per unit time. But there are others. I once worked in a development shop where we were actually given bonuses based on the number of bugs we fixed per unit time. In our own code. So, in a sense, the more bugs we introduced - then found and fixed - the more money we made.
Central to Zach’s piece about metrics are what he calls the four dimensions of a good metric, listed at the top of this page.
Good, well-thought-out stuff, and to my mind, the very act of thinking through the implementation of a metric is where many organizations will get the most value.
There are many tools available to provide developers with metrics to help determine the quality of the code they are writing. However, there are situations where using tools in isolation, or not understanding the subtleties of the data they give you, can result in indications that your code quality is higher than it actually is. This is especially true when tools are run over large bodies of legacy code.
This can be best demonstrated using an example:
JDepend is a free tool that ‘traverses Java class file directories and generates design quality metrics for each Java package’ (there is also a plug-in for Eclipse users). It is a good tool that helps detect smells such as high coupling and circular dependencies. No matter how good the tool is, though, its results can be misinterpreted.
One of the features of JDepend is to show a graph that represents each package’s ‘Distance from the main sequence’. This metric is an indicator of the package’s balance between abstractness and stability and can help you get a sense of how maintainable the package will be in the future. The values range from 0 to 1, with 0 meaning the package sits on the main sequence (a good balance of abstractness and stability) and 1 meaning it is as far from that balance as possible. (For more explanation of these types of metrics, see Andy Glover’s article on coupling metrics here.)
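For anyone who wants the arithmetic behind that graph, the distance is derived from the package’s afferent and efferent couplings and from its proportion of abstract types. A minimal sketch of the calculation (the field and method names are mine, not JDepend’s API):

```java
// Minimal sketch of the package metrics behind that graph, as defined by
// Robert C. Martin and reported by JDepend. The counts would normally come
// from the tool itself.
public class PackageMetrics {
    int afferentCouplings;   // Ca: how many other packages depend on this one
    int efferentCouplings;   // Ce: how many other packages this one depends on
    int abstractTypes;       // abstract classes and interfaces in the package
    int totalTypes;          // all classes and interfaces in the package

    // I = Ce / (Ce + Ca): 0 is maximally stable, 1 is maximally unstable
    double instability() {
        int totalCouplings = afferentCouplings + efferentCouplings;
        return totalCouplings == 0 ? 0.0 : (double) efferentCouplings / totalCouplings;
    }

    // A = abstract types / total types
    double abstractness() {
        return totalTypes == 0 ? 0.0 : (double) abstractTypes / totalTypes;
    }

    // D = |A + I - 1|: 0 means the package sits on the main sequence,
    // 1 means it is as far from it as possible
    double distanceFromMainSequence() {
        return Math.abs(abstractness() + instability() - 1.0);
    }
}
```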
The screenshot below shows a selection of packages from the jEdit project (used for example purposes only). The packages with green dots on the graph sit close to that balance between abstractness and stability, while the one black dot represents a package much further from the main sequence that I may want to investigate further.
Narrowing it down, I quickly find the offending package and, to appease my boss (who wants all code to have green dots), I can game the results of JDepend by simply adding an empty interface to the package, thereby increasing its abstractness value and moving the package closer to the Main Sequence:
Adding an empty interface to a package just to satisfy a metric adds redundant code, looks ugly and is, well, plain stupid. I don’t believe anyone would actually do this to satisfy a green dot on a graph.
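For the record, the ‘fix’ in question would be nothing more elaborate than dropping something like this into the offending package (the package and interface names here are made up):

```java
package com.example.offending; // hypothetical package, for illustration only

// An empty marker interface added purely to raise the package's abstractness
// value. It has no methods and no implementors; its only job is to move the
// dot on the graph.
public interface PlaceholderMarker {
}
```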
The reason I use this example is this. Imagine running a tool like JDepend over some legacy code to determine whether the code needs refactoring or a redesign for quality purposes (e.g. near the end of the project). If all the packages show a green result on the graph, can you be satisfied that your code is of high quality? Not really, as you are probably only looking for code that fails to satisfy these thresholds, rather than looking more closely for reasons why the ones that pass do pass.
Can we trust these tools to give us reliable metrics on the quality of our code? Is the practice of using these tools a waste of time? The answer to both these questions is a resounding ‘No!’
When using a tool, any tool, you must truly understand the details of what the tool actually does, look behind the raw information it gives and put the results in context. Using JDepend, does a Coupling ratio of 3 mean your code is great quality or full of bugs? How do you know, unless you understand the context of the code and package? Also, this tool is not actually telling you whether what it is reporting is good or bad, but giving you the information to investigate and decide for yourself. This is an important point as sometimes these types of tools are misinterpreted (or marketed) as providing a high/low quality mark, when really they are not. Therein lies the danger of using colored icons such as black, red or green dots.
In this example, it would be better to run a static code analyzer (again, properly understood and configured) before JDepend. That way, issues such as empty code blocks, empty interfaces and unused imports can be detected and eliminated before JDepend’s metrics are calculated.
One of the most difficult questions I am asked by customers is: “As a manager, what incentives can be used to get developers to pay attention to metrics and hit thresholds I’ve set?” This question was discussed at a round table by a few attendees here at Agile 2007 this week.
One company represented here had offered cash bonus incentives for individuals and teams that hit thresholds for desired metrics. Inevitably, people started to ‘game’ the system and, since there was only one prize offered per month, after the first month the developers found chasing these numbers mundane and de-motivating. The company ended up paying the bonus in a round-robin format to try to keep up morale. Not exactly the desired outcome!
The best answer I can offer to this problem is to appeal to developers’ sense of professionalism, education and pride, in that order.
If a metrics program is put in place, care should be taken to fully explain why the metrics are being used and what, if any, feedback loop is going to accompany them. If developers are told how these metrics are going to help them in their job, and the metrics are based on proven quality measures rather than on productivity, then the metrics are more likely to be embraced. Of all these issues, the feedback loop is the most important. After all, if people feel that the metrics are not being used, why bother collecting them in the first place?
Michael Feathers’ recent blog post Prosthetic Goals and Metrics That Matter, which references Brian Marick’s paper on misusing code coverage, seemed to cast a negative light on code coverage tools. I was glad to see that, in a subsequent comment, he clarified his statement: it is setting coverage numbers as an organizational goal, rather than simply the use of coverage tools, that he objects to.
Feathers states: “You can’t measure quality with [code] coverage.” I don’t necessarily agree with this. Just setting a threshold of, say, 90% as an incentive marker is, I agree, not a good idea, because it encourages developers to look for the ‘easy’ tests to push their numbers up. However, with a good program of test reviews in place, setting goals can be motivational for developers and enhance the quality of the application, especially if thresholds are set low to start with and gradually changed over the life of the project.
Note I said “changed” rather than just “increased.” If a major design change is applied, a lot of code and tests may be removed, which will throw off coverage numbers.
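Coming back to the ‘easy’ tests point, this is exactly the kind of thing a test review should catch. A contrived JUnit example (the class under test is a made-up stand-in): it executes plenty of lines and branches, so the coverage number goes up, but it asserts nothing, so it can never fail for any reason we actually care about:

```java
import org.junit.Test;

public class InvoiceFormatterCoverageTest {

    // Minimal stand-in for the production class being 'tested'.
    static class InvoiceFormatter {
        String format(double amount, String currency) {
            if (amount < 0) {
                return "CREDIT " + currency + " " + Math.abs(amount);
            }
            return currency + " " + amount;
        }
    }

    @Test
    public void formatRunsWithoutBlowingUp() {
        InvoiceFormatter formatter = new InvoiceFormatter();
        formatter.format(100.0, "USD");   // lines get covered...
        formatter.format(-5.0, "USD");    // ...and so does the negative branch...
        // ...but there are no assertions, so this test passes even if the
        // formatted output is complete nonsense.
    }
}
```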
Is comparing these numbers a bad thing? I don’t believe so. It means developers who perhaps were not testing before are now gaining experience in writing tests, and a safety net is being put in place for more of the code (I agree that testing simple Getters and Setters is a waste of time though). If your team includes experienced developers, and peer/team lead reviews are occurring (which should be the case), any persistent gaming of these numbers by individuals will be recognized over time.
Using a code coverage tool increases the quality of the application by showing not only what has been tested, but also what has not. On many consulting engagements, I have used a code coverage tool to point to areas in the code that have, as yet, not been tested. I regularly find that similar patterns present themselves within these applications (a frequent one is a lack of testing of exception handling), and with the tool I use, the owner of the code is identified as well. I can quickly point out obvious areas of the code that would be good (and high-priority) candidates for testing, and discuss suitable training for individuals who consistently miss the areas that should be under test.
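The exception-handling gap usually looks something like this (a made-up example): the try block is exercised by the existing happy-path tests, but nothing ever forces the exception, so the coverage report shows the catch block as code that has never run:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class SettingsLoader {

    public Properties load(InputStream in) {
        Properties settings = new Properties();
        try {
            settings.load(in);                        // covered by the happy-path tests
        } catch (IOException e) {
            settings.setProperty("mode", "default");  // never executed by any test;
                                                      // does this fallback even work?
        }
        return settings;
    }
}
```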
I believe we are at a time in the software industry where quality issues are being discussed more openly, and motivation to look into these quality areas needs to be encouraged. I believe it is the responsibility of management, and of experienced technicians who have seen the lack of quality on prior projects and its consequences, to motivate and, yes, maybe give incentives to teams that are proactive in implementing quality procedures. However, a good review process to stop any gaming of these numbers is crucial for success.
I would argue that any tool, used properly, that can be incorporated into an automated build process (and used by individuals) to enhance the quality of a developer’s work, and therefore of any application they are contributing to, is a Good Thing.
Here at Enerjy, we are putting a lot of effort into improving the way that code metrics are presented. We think that the key (or at least a major key) to improving code quality lies in identifying and tracking code metrics. We also think that metrics are only useful if (a) you choose the right metrics to start with, and (b) they are presented in a meaningful, engaging, easy-to-understand format. With that last point in mind, I often spend time looking for good examples of well-executed data visualization techniques.
Meaningful, engaging, easy-to-understand. That’s what we will be striving for in our next major product release, which we are planning for early 2008.