Why are there so many hurdles to efficient SAM benchmarking?

Two opposite sides
When dealing with Software Analysis and Measurement (SAM) benchmarking, people’s behavior generally falls into one of the following two categories:

“Let’s compare anything and draw conclusions without giving any thought about relevance and applicability”
“There is always something that differs and nothing can ever be compared”

As is often the case, there is no sensible middle ground.

Risk Detection and Benchmarking — Feuding Brothers?

Risk detection is the most valid justification for the Software Analysis and Measurement activity: identifying any threat that can negatively and severely impact the behavior of applications in operations, as well as application maintenance and development.
“Most valid justification” sounds great, but it’s also quite difficult to manage. Few organizations keep track of software issues that originate in the source code and architecture, so it is difficult to define objective target requirements that could support a “zero defects” approach. Working without clear requirements is the best way to invest one’s time and resources in the wrong place: removing too few or too many non-compliant situations in the source code and architecture, or working on the wrong part of the application.
One answer is to benchmark analysis and measurement results so as to build a predictive model, along the lines of: “This application is likely to be OK in operations for this kind of business because all these similar applications show the same results.”
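To make the idea of benchmarking against similar applications a bit more concrete, here is a minimal sketch in Python. It is not how any particular SAM product works. The function name `peer_benchmark`, the `tolerance` parameter, and the scores are all made up for illustration; the only assumption is that each application has a single numeric quality score and that its peers have already been selected as comparable.

```python
import math
from statistics import mean, pstdev


def peer_benchmark(app_score, peer_scores, tolerance=1.0):
    """Compare one application's score with a group of similar applications.

    Returns (z, within), where z is the distance from the peer mean in
    standard deviations and within says whether |z| <= tolerance.
    All names and thresholds here are illustrative assumptions.
    """
    if len(peer_scores) < 2:
        raise ValueError("need at least two peer applications to benchmark")
    mu = mean(peer_scores)
    sigma = pstdev(peer_scores)
    if sigma == 0:
        # All peers are identical: any difference at all is an outlier.
        z = 0.0 if app_score == mu else math.copysign(math.inf, app_score - mu)
    else:
        z = (app_score - mu) / sigma
    return z, abs(z) <= tolerance


# Hypothetical scores for four comparable applications.
peers = [3.0, 3.2, 2.8, 3.0]
z, looks_like_peers = peer_benchmark(3.1, peers)
print(f"z-score: {z:.2f}, behaves like its peers: {looks_like_peers}")
```

A result inside the tolerance only says the application resembles its peers. Whether that is good news depends entirely on how the peers behave in operations.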
Different needs?
On the one hand, benchmarking by nature requires comparing apples with apples and oranges with oranges. In other words, the measurement needs to be applicable to every benchmarked application and stable over time in order to yield a fair and valid benchmarking outcome.
On the other hand, risk detection for any given project:

benefits from the use of state-of-the-art “weapons”, i.e., any means of identifying serious threats, which should be kept up to date every day (as with antivirus signature lists)
should not care about fair comparison. It’s never a good excuse to say that the trading application failed but showed better results than average
should heed contextual information about the application to better identify threats (an acquaintance of mine, a security guru, once told me there are two types of software metrics: generic metrics and useful ones), i.e., information that cannot be found automatically in the source code and architecture but that would turn a non-compliant situation into a major threat. For instance: In which part of the application is it located? How much data is stored in the accessed database tables (in production, not only in the development and testing environments)? What is the functional purpose of this transaction? What is the officially vetted input validation component?

Is this ground for a divorce on account of irreconcilable differences?
Are we bound to keep the activities apart with a state-of-the-art risk detection system and a common-denominator benchmarking capability?
That would be a huge mistake, as management and project teams would use different indicators and draw different conclusions. Worst-case scenario: project teams identify a major threat they need resources to fix, but management indicators say the opposite, so management denies the request.
Now what?
Although it is not simple, there are steps that can be taken to bridge the gap.
The point is to make sure:

that “contextual information” collection is part of the analysis and measurement process
that a lack of such information would show (using the officially vetted input validation component example, not knowing which component it is should be a problem that impacts the results, not an excuse for poor results, which is much too often the case)
that the quality of the information is also assessed by human auditing
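As a rough illustration of the three points above, the sketch below records contextual information alongside a finding and makes any gap visible instead of silently ignoring it. Everything in it is hypothetical: the `FindingContext` class, its field names, and the report layout are assumptions drawn from the example questions earlier in this post, not part of any real tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FindingContext:
    """Context that cannot be read from the source code (hypothetical fields)."""
    application_part: Optional[str] = None
    production_table_rows: Optional[int] = None
    functional_purpose: Optional[str] = None
    vetted_validation_component: Optional[str] = None
    audited_by_human: bool = False  # has a person reviewed this information?


def missing_context(ctx):
    """List the context fields that nobody has filled in yet."""
    return [f.name for f in fields(ctx) if getattr(ctx, f.name) is None]


def context_report(ctx):
    """Summarize completeness so that gaps show up in the results."""
    missing = missing_context(ctx)
    return {
        "complete": not missing,
        "missing": missing,
        "audited": ctx.audited_by_human,
    }


# Hypothetical finding: we know where it is and how big the table is,
# but not what the transaction is for or which validator is official.
ctx = FindingContext(application_part="data access", production_table_rows=5_000_000)
print(context_report(ctx))
```

The design choice worth noting is that missing information is reported as a result in its own right, so it can weigh on an assessment rather than excuse it.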

Are your risk detection and benchmarking butting heads? Let us know in a comment. And keep your eyes on the blog for my next post about the benefits of a well-designed assessment model.

The Current State of Application Quality

On Wednesday, October 20 at 8:30 AM, our SVP, Chief Scientist, and Head of CAST Research Labs, Dr. Bill Curtis, keynoted the TesTrek conference in Toronto, Canada.

Appmarq Makes Software Quality Benchmarking a Reality
Bill’s talk, “The Current State of IT Application Quality,” covered key results from our latest study on application software quality from around the world: 288 applications from 75 companies in all major industry sectors, including the public sector.
For years we’ve had software timeline and budget benchmarks. Now we have what is arguably the most important data: the quality of the product produced.

The results Bill presented are from our Appmarq database – the only repository of software product quality metrics in the world. Appmarq makes benchmarking software quality a reality.
You can check out the results in the executive summary of the CAST Worldwide Application Quality Study – 2010.
Appmarq enables us to calculate the Technical Debt of a typical application in your portfolio. Here’s where you can read more about how we calculate Technical Debt.

Size Always Matters

Corner of 26th St. and 6th Ave in NYC at 2 AM? Good to be big!
Middle seat from Los Angeles to Sydney? Good to be small!
Size always matters. Software is no exception. Software measurement expert Capers Jones has published data on how software size is fundamental. If you know size you can derive a lot of useful things — an accurate estimate of the defect density, for example. To see the complete list, go to Capers Jones’ presentation and check out slide #4.
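To show why size is the denominator that makes other numbers meaningful, here is a tiny worked example of defect density, taken simply as defects divided by size in function points. The function name and the figures are invented for illustration and do not come from Capers Jones’ data.

```python
def defect_density(defect_count, function_points):
    """Defects per function point; the size must be a positive number."""
    if function_points <= 0:
        raise ValueError("function point count must be positive")
    return defect_count / function_points


# Hypothetical application: 150 known defects, 1,200 function points.
density = defect_density(150, 1200)
print(f"{density:.3f} defects per function point")
```

Two applications with the same raw defect count can look very different once each count is divided by its size, which is why an inaccurate size undermines everything derived from it.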
Having an accurate function point count for your critical apps is great. The problem is that it’s expensive, takes too much time, requires expertise you don’t have, and distracts your team from their day job.
The rest of this post is about how CAST’s function point automation works and how it solves the problems covered in the previous paragraph.
Over the last few years, we’ve really bulked up our function point counting capabilities. Think Arnold Schwarzenegger 1975 Mr. Olympia competition.
If you know a bit about function points, you know how incredibly hard it is to automate function point counts starting from source code as the input. That’s because function points capture functionality from the end user’s perspective. This functionality is encapsulated in calls from the GUI layer to the database layer.
To do what CAST does, you have to be able to analyze these calls and reverse engineer the end user experience! That’s a tall order, but that’s exactly what we did over the last 5 years of intense R&D and field testing.
This intense effort has led to three key breakthroughs.
Breakthrough #1: Micro function points. The CAST function point counting algorithm is sophisticated enough to count micro-function point changes — the result of small enhancements that can quickly add up.  These are impossible to count manually, but they’re easily picked up by CAST’s automation.
Breakthrough #2: Enhancement function points. Like Microsoft Word’s Track Changes capability, CAST remembers exactly which data and transaction elements have been added, modified, and deleted in a series of changes made to a project. So you no longer have to worry about ignoring work that is necessary but doesn’t necessarily change the function point count.
Breakthrough #3: Calibration with function point experts in the field. We’ve been working with partners like David Consulting Group to ensure our automated counts are well within the accepted variance of counts.
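The add/modify/delete bookkeeping described under Breakthrough #2 can be pictured with a short sketch. This is not CAST’s algorithm. It assumes, purely for illustration, that each snapshot of an application is a dictionary mapping an element name (a data or transaction element) to some fingerprint of its definition; the function name and the sample data are made up.

```python
def diff_elements(before, after):
    """Classify elements as added, modified, or deleted between two snapshots.

    `before` and `after` map element names to a fingerprint of their
    definition (hypothetical representation).
    """
    added = sorted(name for name in after if name not in before)
    deleted = sorted(name for name in before if name not in after)
    modified = sorted(
        name for name in after if name in before and after[name] != before[name]
    )
    return {"added": added, "modified": modified, "deleted": deleted}


# Two hypothetical snapshots of the same application.
release_1 = {"CreateOrder": "a1", "ListOrders": "b7", "ExportCsv": "c3"}
release_2 = {"CreateOrder": "a2", "ListOrders": "b7", "CancelOrder": "d9"}

print(diff_elements(release_1, release_2))
```

In this made-up case the number of elements is the same in both releases, yet one element was added, one modified, and one deleted. That is the kind of work that a before-and-after total alone would hide.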
Fast, low cost, benchmarkable function point counting? Automated size measurement is the answer!