Software Risk: 4 Case Studies in Software Quality and Software Schedules by Capers Jones

This post is taken from the original paper Software Risk Master (SRM) Estimating Examples For Quality and Schedules by Capers Jones, VP and CTO of Namcook Analytics LLC.




Originally published on: September 29, 2015


Poor software quality is an endemic problem of the software industry.  This short paper provides four case studies of varying quality levels ranging from very high quality to very poor quality.  In turn, these four case studies also have varying schedules that range from 10 months to 18 months for a software application of a nominal 1000 function points in size.

The Software Risk Master (SRM) estimating tool of Namcook Analytics predicts quality and schedules, and shows the combined impacts of a variety of technology and personnel factors.  Software Risk Master (SRM) shows the results of a variety of important software development factors including CMMI levels, methodologies, reuse, complexity, team experience, programming languages, and quality control techniques.

SRM predicts staffing, costs, schedules, and quality levels prior to requirements.  This short paper shows approximate results using graphs to illustrate how various factors interact.

Four Case Studies in Software Quality and Software Schedules

Assume a 1000 function point IT project such as an accounting package and 53,000 Java statements for all cases.  Assume 100 users of the software after deployment.  Assume average complexity (problem, code, and data) in all four cases.  The assumed test stages in the article include 1) unit test, 2) function test, 3) regression test, 4) component test, 5) usability test, 6) performance test, 7) system test, and 8) acceptance test.

Case 1 Best Quality: If you are at CMMI level 5, use TSP, SEMAT, and Java, have an expert team, and use pre-test inspections, static analysis, 20% reuse, and certified test personnel, your defect potential will be about 2.50 bugs per function point and your defect removal efficiency (DRE) will be about 98%.  You will deliver about 50 bugs to clients, of which 8 will be high severity.  There should be 0 security flaws latent in this example’s software.  Schedules would be:

  Requirements/design: 4 months
  Coding: 6 months
  Inspections: 4 months
  Static analysis: 3 months
  Testing: 4 months
  Net schedule (activities overlap): 10 months

The 8 high-severity defects at deployment should be reported by users and eliminated within about 2 months from deployment of the software.  Customer satisfaction would probably be “excellent.”  There should be 0 error-prone modules in this application.


Case 2 Agile Development: If you don’t use CMMI but have a good team and use Agile, Java, 10% reuse, and static analysis, but no inspections and no certified test personnel, your defect potential will be about 3.00 bugs per function point and your defect removal efficiency (DRE) will be about 93%.  You will deliver about 210 bugs to clients, of which 26 will be high severity.  You will have 6 sprints.  (If you also use pair programming, your coding schedule will be 9 months and your net schedule will be 14 months.)  There would probably be 3 latent security flaws in the software at deployment.  Schedules would be:

  Requirements/design: 6 months
  Coding: 6 months
  Static analysis: 3 months
  Testing: 5 months
  Net schedule (activities overlap): 11 months

The 26 latent high-severity bugs and 3 security flaws would probably be found and reported by customers over a 6-month period.  There might be “bad fixes,” or new bugs accidentally included in bug repairs.  The U.S. average for bad-fix injections is about 7%.  Customer satisfaction would probably be “very good.”  There should be 0 error-prone modules in this application.


Case 3 Average: If you are at CMMI level 3, use Iterative development, and have an average team that uses static analysis but no reuse, no inspections, and no certified test personnel, your defect potential will be about 3.5 bugs per function point and your defect removal efficiency (DRE) will be about 90%.  You will deliver about 350 bugs to clients, of which 52 will be high severity.  There would probably be as many as 7 latent security flaws in the application after deployment.  Schedules would be:

  Requirements/design: 7 months
  Coding: 7 months
  Static analysis: 4 months
  Testing: 7 months
  Net schedule (activities overlap): 15 months

The calendar period needed for users to find and report the 52 high-severity defects would probably approach 12 months after deployment.  Customer satisfaction would probably be “not happy.”   There might be as many as 5 new bugs after deployment accidentally introduced as “bad fixes” while attempting to repair bugs in the deployed version.   There would probably be at least 1 error-prone module in this application.


Case 4 Worst: If you are at CMMI level 1, use waterfall development and Java, and have a novice team that uses no inspections, no static analysis, no certified test personnel, and 0% reuse, your defect potential will be about 4.5 bugs per function point and your defect removal efficiency (DRE) will be about 87%.  You will deliver about 585 bugs to clients, of which 100 will be high severity.  There may be as many as 12 latent security flaws in this version of the application.

This project will be later than planned and over budget, and it will have a high probability of being canceled and a high probability of litigation if it is an outsource project.  This project has so many bugs that testing schedules are three times longer than planned.  Schedules would be:

  Requirements/design: 8 months
  Coding: 8 months
  Testing: 12 months
  Net schedule (activities overlap): 18 months

Assuming that the project is actually delivered and the ROI does not turn negative due to the cost and schedule overruns, it would probably take about another 18 months for users to discover and report the 100 high-severity latent defects.

There is also a significant chance of hackers exploiting the 12 latent security flaws.  The number of bad-fix injections, or new bugs accidentally introduced while trying to fix released bugs, would probably approach 10.  Customer satisfaction would probably be “dissatisfied” to the point of threatening litigation if the project is an outsource project.  There would probably be 3 to 5 error-prone modules in this application.

Projects like this are far too common even in 2015, and explain why CEOs often have low opinions of the competence of software teams compared to other forms of engineering.
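The delivered-defect totals quoted in the four cases follow from simple arithmetic: application size times defect potential gives the total defects, and the share that is not removed (1 minus DRE) is what reaches clients. The Python sketch below reproduces those totals. The function and variable names are illustrative assumptions and are not part of Software Risk Master; the high-severity and security-flaw counts are SRM outputs that this arithmetic does not attempt to derive.

```python
# Illustrative arithmetic only; this is not the SRM estimating algorithm.
def delivered_defects(function_points, defects_per_fp, dre):
    """Return the defects expected to reach clients, rounded to a whole number."""
    total_defects = function_points * defects_per_fp
    return round(total_defects * (1.0 - dre))

# (defect potential per function point, DRE) as quoted in each case study
CASES = {
    "Case 1 Best": (2.50, 0.98),
    "Case 2 Agile": (3.00, 0.93),
    "Case 3 Average": (3.50, 0.90),
    "Case 4 Worst": (4.50, 0.87),
}

SIZE_FP = 1000  # nominal application size used in all four cases

for name, (potential, dre) in CASES.items():
    print(f"{name}: {delivered_defects(SIZE_FP, potential, dre)} delivered defects")
```

With the figures quoted above, this yields 50, 210, 350, and 585 delivered defects for Cases 1 through 4.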


Summary and Conclusions

As of 2015 software quality and software schedules range from quite good to extremely bad.  A combination of factors causes these large ranges.

Overall, Software Risk Master (SRM) estimates use combined data on the 5 CMMI levels; on volumes of reusable materials; on work hours from 71 countries; on productivity and quality from 70 industries; on 54 project classes and types; on 58 software development methods such as Agile and TSP; on 79 programming languages such as Java, Ruby, and Objective C; and on project sizes that range from 1 function point to over 300,000 function points.  Overall, data from about 25,000 projects has been examined.  Because of the patent-pending early sizing method in SRM, it can produce full estimates prior to requirements, or 30 to 180 days earlier than other estimating methods.

This short paper illustrates how combinations of experience and technology factors cause large ranges in both quality and schedules.  But note that several other important factors were held constant in this paper in order to highlight the importance of the factors that were modified.  For example complexity was “average” in all four samples, and all four used the same Java programming language.

If a low-level language such as assembly were used in conjunction with very high complexity scores the results would be much worse than shown here.

If a high-level language such as Objective C were used in conjunction with very low complexity the results would be even better than the best case shown here.

Software schedules and quality are based on combinations of a variety of technology factors and team experience levels.  This short paper gives a brief overview of how experience and technology factors interact.

For the overall software industry the main cost drivers in rank order include:

  1. Finding and fixing bugs
  2. Producing paper documents
  3. Coding
  4. Creeping requirements
  5. Project management

Software Risk Master can show the relative schedules and costs for all five of these major cost drivers.  NOTE: creeping requirements are not shown in this paper, but average about 1% per calendar month.
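As a rough illustration of the 1% per month figure, the Python sketch below applies it to the 1000 function point application over each case's net schedule. It assumes simple (non-compounding) growth that runs across the entire net schedule. Both points are assumptions made here for illustration and are not stated in the paper, and the function name is hypothetical.

```python
# Assumption: creep is simple (non-compounding) and spans the whole net schedule.
def size_with_creep(initial_fp, months, monthly_rate=0.01):
    """Return application size in function points after requirements creep."""
    return round(initial_fp * (1 + monthly_rate * months))

# Net schedules in months, taken from the four case studies
for label, months in [("Case 1", 10), ("Case 2", 11), ("Case 3", 15), ("Case 4", 18)]:
    print(label, size_with_creep(1000, months), "function points")
```

Under these assumptions the longest schedule (Case 4) would also see the most growth, ending near 1180 function points.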


Definitions of Terms

Agile development:  As of 2015 Agile is the #1 development methodology in the world.  It is derived from a famous meeting of software experts in 2001 which created the “agile manifesto.”  Agile is characterized by daily “scrum” status meetings; by dividing development into a number of “sprints”; and by special techniques such as “user stories” for requirements; story point metrics, velocity, and other special methods.

Bad fixes:  This term refers to new bugs accidentally injected by bug repairs for existing bugs.  The term was coined by IBM circa 1965. About 7% of U.S. bug repairs have new bugs in them.  In one lawsuit where the author was an expert witness the vendor tried 4 times to fix a bug in a financial package.  Each fix not only failed to fix the original bug but added new bugs.  Finally after 9 months the 5th attempt both fixed the original bug and added no new bugs.  But by then the plaintiff had lost over $3,000,000 in consequential damages.

Capability Maturity Model Integration (CMMI)®:  This term defines a formal method, developed by the Software Engineering Institute in 1986, for evaluating software development capabilities.  It uses formal assessments by means of certified assessors and formal checklists of key process topics.  There are 5 levels of the CMMI®: 1 initial; 2 managed; 3 defined; 4 quantitatively managed; 5 optimizing.  The CMMI is widely deployed among military software organizations but barely known by many civilian industries such as banks and insurance.

Complexity:  The scientific literature contains over 25 kinds of “complexity” including semantic complexity, mnemonic complexity, flow complexity, and many others.  For software there are also many complexity definitions some of which are ambiguous.  Probably the most common software complexity metric is “cyclomatic complexity.”   However Software Risk Master (SRM) uses three forms of complexity for project estimates:  1) problem complexity; 2) code complexity; 3) data complexity.  These are scored by clients on a scale that runs from 1 to 11, with 6 being the mid-point or average value.  A score of 1 is very low complexity while a score of 11 is very high complexity.

Defect potentials: This metric was developed by IBM circa 1970 and is the sum total of probable defects that will be found in requirements, architecture, design, code, user documents, and bad fixes or secondary bugs in defect repairs themselves.  Defect potentials are measured with function point metrics because requirements and design bugs often outnumber code bugs and cannot be measured with lines of code.  U.S. values circa 2015 range from about 2.00 bugs per function point to about 6.00 bugs per function point, with average values for 1000 function-point projects being close to 4.00 bugs per function point.

Defect removal efficiency (DRE):  This metric was developed by IBM circa 1970 and first used to prove the effectiveness of formal inspections of software work products such as requirements and code.  DRE is usually measured by recording all bugs found during development and comparing them to bugs reported by users in the first 90 days of usage.  If developers find 90 bugs and users report 10 bugs, then DRE is 90%.  As of 2015 measured DRE levels range from a low of about 80% to a high of about 99.5%.  An average U.S. value for DRE circa 2015 would be about 92.5%.
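The DRE calculation described above can be written as a short Python sketch. The function name is an illustrative assumption; the inputs are the bugs found before release and the bugs users report in the first 90 days.

```python
def defect_removal_efficiency(found_in_development, reported_by_users):
    """Return DRE as a percentage of all known defects removed before release."""
    total = found_in_development + reported_by_users
    if total == 0:
        raise ValueError("no defects recorded, so DRE is undefined")
    return 100.0 * found_in_development / total

# The example from the definition: 90 bugs found by developers, 10 by users
print(defect_removal_efficiency(90, 10))
```

For the example in the definition, the result is 90.0.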

Error-prone modules (EPM):  Software bugs are not randomly distributed through the modules of large applications, but tend to clump in a very small number of modules.  About 50% of the bugs in IBM operating systems were in fewer than 5% of the modules.  In the IBM IMS database there were 425 modules.  Of these, over 300 were zero-defect modules with no customer bug reports.  About 57% of IMS customer bug reports were in 31 modules developed by one department whose manager did not want to use formal inspections.  Other companies such as AT&T and Motorola also independently noted error-prone modules in their software.  In fact EPM seem to be an endemic problem of large software packages.  Software Risk Master (SRM) predicts the probable number of EPM as a standard quality output.  Pre-test inspections and static analysis can find and remove error-prone modules.  Ordinary testing has not been effective as a deterrent against EPM.

Function point metric: This metric was developed by A.J. Albrecht and colleagues at IBM White Plains circa 1973, and placed in the public domain by IBM in 1978.  Today in 2015 function point counting rules in the U.S. are controlled by the International Function Point Users Group (IFPUG), although other forms of function point are used in Europe.  Function points are the weighted combinations of inputs, outputs, inquiries, interfaces, and logical files.  Function point metrics are the #1 metric used for software circa 2015.  The governments of Brazil, Italy, Mexico, Malaysia, and Japan all require use of function point metrics on bids and government software contracts.  Data in this paper assume IFPUG function points version 4.3.  There are ISO standards for function point metrics, and also formal training and a formal certification examination.  Function points can measure non-code defects and work such as requirements and design that cannot be measured using the older “lines of code” metric.

Inspections: This is a manual form of defect removal invented by IBM circa 1970 and having over 45 years of empirical data.  With inspections small teams including a moderator go over text and code line by line.  Inspections can either be 100% of a deliverable or can focus only on the most critical sections.  Inspections average about 85% in defect removal efficiency.  Inspections are used on requirements, architecture, design, code, documents, and even test plans and test cases.

Iterative development: This methodology is an alternative to the older “waterfall” model of software development, which assumed waiting until the entire application was designed and then  carrying out sequential building tasks one by one.  In iterative development software applications are divided into smaller increments that can be built partly concurrently.

Lines of code (LOC):  This metric is the oldest known software metric and can be dated back to the 1950’s.  Unfortunately this metric lacks ISO standards, lacks formal training, and lacks certification examinations.  Worse, the software literature is inconsistent in LOC usage.  Some journal articles and books use “physical lines of code” which can include blank lines and comments; other articles use “logical code statements” or the actual code that contains computer commands.  There can be a 500% difference between logical and physical code size.  A survey of software journals (IBM Systems Journal, Cutter Journal, Crosstalk, IEEE Software etc.) by the author noted about one third of published articles used physical LOC, one third used logical statements; and the remaining third did not identify which version was used.  A failure to identify which specific metric is used is sloppy and unscientific, but unfortunately far too common in the software literature.
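To make the physical-versus-logical distinction concrete, the Python sketch below counts a small Java fragment both ways. The counting rules are a deliberately crude assumption made for illustration: physical lines are every line in the text, and logical code is approximated by lines that are neither blank nor `//` comments. This is not a standard counting rule, and it ignores block comments and statements that span several lines. The sample fragment and function names are hypothetical.

```python
# Hypothetical Java fragment used only to illustrate the two counts.
JAVA_SAMPLE = """\
// Compute the invoice total.

int total = 0;

// Add each line item.
for (Item item : items) {
    total += item.price();
}
"""

def count_physical_lines(source):
    """Count every line, including blank lines and comments."""
    return len(source.splitlines())

def count_code_lines(source):
    """Rough proxy for logical code: skip blank and //-comment-only lines."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            count += 1
    return count

print("physical lines:", count_physical_lines(JAVA_SAMPLE))
print("code lines:", count_code_lines(JAVA_SAMPLE))
```

Even in this tiny fragment the two counts differ by a factor of two (8 physical lines versus 4 code lines), which is why an article that does not say which count it uses is hard to interpret.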

Software Engineering Methods and Theory (SEMAT):  This is a new initiative launched circa 2009 that is intended to add formal analysis and discipline to the somewhat chaotic concepts of traditional software engineering.  SEMAT can be used in conjunction with methodologies such as Agile and TSP.  SEMAT concepts include an essence and kernel of software concepts, and an expanding set of guidelines and principles.  SEMAT is not a “methodology” but rather a set of formal concepts that can augment methodologies.  As of 2015 empirical data on the success of SEMAT is sparse.

Static analysis: This is a form of software defect removal that does not execute the code or examine it while it is running.  Static analysis tools are rule-based engines that scan code and look for predetermined patterns of common software error conditions.  Commercial and open-source static analysis tools have been available since about 1987, and have substantial proof of efficacy.  Static analysis averages about 55% DRE for code bugs, but sometimes tops 75%.   There are some “false positives” or code sequences misidentified.  Static analysis is fast and inexpensive and normally used prior to code inspections and prior to testing.  Only about 25 languages out of 2,500 are supported by static analysis.  However all of the high-usage languages such as C, C++, Java, etc. are supported.  Older and more obscure languages such as MUMPS, CORAL, and CHILL are not supported.  These could be added under a custom contract with one or more static analysis vendors, if there is an urgent need.

Team experience levels:  Software experience and skill levels vary widely, which is also true for other professions.  In order to show the impact of ranges of experience, Software Risk Master (SRM) uses a 5-point scale for experience, with users evaluating their development teams.  The team experience scale range is 1) expert; 2) above average; 3) average; 4) below average; and 5) novice.  However SRM also applies this range to specific occupation groups including software engineers, testers, project managers, technical writers, quality assurance, and even to clients (inexperienced clients provide very poor requirements).  With the SRM experience scale decimal values are accepted, such as 1.50 or 3.75.

Team Software Process (TSP): This software development methodology was developed by the late Watts Humphrey.  It has been published in several forms since about 2000, including Watts’ own book Introduction to Team Software Process.  There is a related methodology called personal software process (PSP) that is usually combined with TSP.  The TSP methodology is endorsed by the Software Engineering Institute (SEI) and has substantial empirical data.  TSP is one of the most effective methodologies for large software applications > 1000 function points in size.  TSP is a “quality strong” development method.  Inspections are an integral part of TSP.

Testing: This is the classic form of software defect removal that has been performed since the early 1950’s.  As of 2015 there are more than 20 forms of testing.  The assumed test stages in the article include 1) unit test, 2) function test, 3) regression test, 4) component test, 5) usability test, 6) performance test, 7) system test, and 8) acceptance test.  Testing operates on running code and uses pre-determined test cases and test scripts to seek out potential errors.  Testing has several problems that need additional study.  Most forms of testing have only about a 35% DRE, so at least 8 kinds of sequential testing are needed to top 80% in overall testing DRE.   There are also errors in test cases themselves and duplicate test cases as well.  A study by IBM of regression test libraries found about 15% duplicate test cases and 5% of test cases with errors.  Running duplicate tests or running test cases with errors adds nothing to DRE but does add to testing costs.

Waterfall development: This software development methodology is the oldest formal software methodology and dates back to the 1950’s.  It is called “waterfall” because a graphic illustration of this method shows information flowing from one activity to another, in fashion that resembles the steps of a waterfall.  The normal activities of waterfall development include requirements, design, coding, and testing.  Among the historical problems of the waterfall method is attempting to define a complete system before starting, and trying to finish every step before starting the next step; i.e. waiting for all requirements before starting design.   Although waterfall has been reasonably successful for small projects, it is often associated with failures, cost overruns, and schedule delays on applications > 1000 function points in size.  There are many software activities that span multiple phases, including integration, quality assurance, and technical manual production.
