Here we go again. You have probably heard, since it’s been reported everywhere, that American Airlines was grounded Tuesday, leaving passengers stranded for several hours due to a “computer glitch” in the reservation system. Because of the glitch, gate agents were unable to print boarding passes, and some passengers described being stuck for long stretches on planes that were unable to take off or, having landed, unable to reach a gate.
We’ve made it a point on our blog to highlight the fact that software glitches in important IT systems — like NatWest and Google Drive — can no longer be “the cost of doing business” in this day and age. Interestingly, we’re starting to see another concerning trend: more and more crashes blamed on faulty hardware or network problems, while the software itself is ignored. Oddly enough, the number of incidents can differ by a factor of more than ten between applications with similar functional characteristics. Is it possible that the robustness of the software inside the applications has something to do with these apparent hardware failures? I think I see a frustrated data center operator reading this and nodding violently.
Enterprise IT applications are fundamentally about processing data: data defined through multiple types and handled across large volumes of code. The number of lines of code devoted to data handling is therefore high enough to harbor a large number of software bugs, each waiting for a specific event to damage the IT system and impact the business.
Even if we can say that a bug is a bug and will be fixed when it occurs, bugs related to data handling should not be underestimated, for several reasons:
Such bugs are generally not easy to detect among the millions of lines of code that constitute an application. They can consist of a single statement operating on data defined elsewhere, so the subtleties of using that data are not immediately visible. They can also result from the execution of a given control flow combined with a given data flow.
Some of them may sit in the code for a long time and never trigger at all. The problem is identifying which ones belong to that category so you can focus on the others.
They can be activated by the conjunction of specific conditions that are not easy to identify.
When an issue does occur, the impact on business data can be severe: applications can stop, data can be corrupted, and end-user and customer satisfaction can suffer.
Consequences are not always clearly visible, and in that case few users detect them.
Problems are distributed
Issues can be hidden anywhere in application code. Risk management methodologies can help select the most sensitive application areas and reduce the scope of the search. However, in most cases, detecting such potential issues requires the ability to check the types and structure of the data flowing from one component to another, or from one layer to another, as well as the algorithms implemented in your programming language of choice. This spells trouble for everyone.
Why does a bug activate suddenly?
There are different factors that contribute to activating a bug:
Probability increases with the number of lines of code.
The more a component is executed, the more likely its bugs are to be activated.
The more you modify the code, the more likely an unexpected behavior can occur.
Weak decoupling between data and processing logic means that any change to the data impacts the code.
Market pressure stresses the development team. Working quickly is often a good way to create new bugs and activate existing ones!
Algorithms implementing business rules can be complex and distributed over multiple components, fostering the occurrence of bugs.
Evolutions in functional data are not always carried through the whole application implementation and can make code that was working well behave erratically.
The biggest problems come when several of these factors occur at the same time – a difficult challenge for any development team!
The list of situations that can lead to trouble related to data handling is not short. For instance, database access can become fragile when:
Database tables are modified by several components. Data modifications should normally be governed by dedicated update, insert, and delete routines: a specific API or data layer that is fully tested to maintain data integrity. When components bypass that layer, integrity is at risk.
Host variable sizes are not correctly defined compared to the database fields they receive. Some queries can return a larger volume of data than expected, or a change is made to the database structure and is not propagated to the rest of the application (see the sketch below).
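As a minimal sketch of that last point, the C++ snippet below imagines a hypothetical customer-name column that was widened from 30 to 60 characters while the host variable kept its old size; the struct, function, and sample value are all invented for illustration. The copy succeeds silently, but downstream code only ever sees a truncated name.

```cpp
#include <cstring>
#include <iostream>

// Hypothetical host variable: sized for the old VARCHAR(30) column,
// while the database column is now VARCHAR(60).
struct CustomerRecord {
    char name[31];  // 30 characters + terminator: too small for the new column
};

// Stand-in for a database fetch: copies the column value into the host variable.
void fetch_customer_name(const char* column_value, CustomerRecord& rec) {
    // strncpy silently drops everything beyond 30 characters; the program keeps
    // running, but every consumer of rec.name now works with corrupted data.
    std::strncpy(rec.name, column_value, sizeof(rec.name) - 1);
    rec.name[sizeof(rec.name) - 1] = '\0';
}

int main() {
    CustomerRecord rec{};
    fetch_customer_name("Consolidated Amalgamated International Holdings Ltd.", rec);
    std::cout << rec.name << '\n';  // prints a truncated, misleading value
    return 0;
}
```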
When manipulating variables and parameters, potential issues include:
Type mismatches are generally insidious. They can occur, for example, through implicit conversions between two compatible pieces of data, such as those found in different SQL dialects, injecting incorrect values into the database. Similar situations can be found in COBOL programs when alphanumeric fields are moved into numeric fields, leading to abnormal terminations if the target variable is used in a calculation or is simply in a computational format. Improper casts between C++ class pointers (e.g., from a base class to a child class) can lead to lost data and to data corruption propagated through I/O (see the sketch after this list).
Data truncation occurs when variable sizes are not checked before moving one value into another. Part of the value can be lost if the target variable is then used to carry the information onward.
Inconsistencies, in function or program calls, between the arguments sent by the caller and the parameters expected by the callee. This can occur when a change made to the function or program interface has not been propagated to all callers, making them terminate or corrupt data.
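Here is a minimal sketch of the improper C++ downcast mentioned above; the Account and PremiumAccount classes and the credit_rewards function are invented purely for illustration. The cast compiles cleanly, so nothing flags the defect until corrupted data surfaces somewhere else.

```cpp
#include <iostream>

struct Account {                      // base class: what the caller actually allocated
    long balance_cents = 0;
};

struct PremiumAccount : Account {     // child class adds a field
    long reward_points = 0;
};

void credit_rewards(Account* acct) {
    // Defect: this code assumes every Account is really a PremiumAccount.
    // static_cast performs no runtime check, so when a plain Account comes in,
    // writing reward_points scribbles past the real object's storage.
    auto* premium = static_cast<PremiumAccount*>(acct);
    premium->reward_points = 500;     // undefined behavior: silent data corruption
}

int main() {
    Account acct;                     // not a PremiumAccount
    credit_rewards(&acct);            // compiles and "works" until it doesn't
    std::cout << acct.balance_cents << '\n';
    return 0;
}
```

A dynamic_cast (or avoiding the downcast altogether) would surface the mismatch explicitly instead of silently corrupting memory.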
What about consequences?
Unfortunately, there is more than one type of consequence when such bugs activate. One of the big risks for the application is related to the corruption of the data it is manipulating — the worst case being when corruption is spreading throughout the IT system. Generally, this impacts users and the business.
I remember such a situation with a banking application. Everything was working fine when the phone rang: “Hi, the numbers on my weekly reports don’t look right. I checked, but it seems there is a problem. Can you check on your side?”
Well, we tracked down the program that generated the report, but we did not find anything interesting there. We checked its inputs and found incorrect values. Then we looked at the program that produced those inputs, and finally we found the cause of the problem in a third program: a group of variables that were not correctly populated.
Fortunately, the problem was detected and fixed. A more critical situation arises when very small corruptions settle silently and insidiously across the IT system. They are too small or too dispersed to be pinpointed. For instance, decimal values that are improperly truncated or rounded may look like a small issue, but in the end the total discrepancy can be significant!
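To make that concrete, here is a minimal sketch of how sub-cent truncation adds up; the line-item amount, the fee, and the volume are made up for illustration.

```cpp
#include <cstdio>

int main() {
    // Hypothetical batch: 100,000 line items of 10.10 with a 7% fee applied,
    // where each amount is truncated to whole cents before being accumulated.
    double exact_total = 0.0;
    long truncated_cents = 0;

    for (int i = 0; i < 100000; ++i) {
        double amount_with_fee = 10.10 * 1.07;                          // 10.807
        exact_total += amount_with_fee;
        truncated_cents += static_cast<long>(amount_with_fee * 100.0);  // keeps 10.80, drops 0.007
    }

    std::printf("Exact total:     %.2f\n", exact_total);
    std::printf("Truncated total: %.2f\n", truncated_cents / 100.0);
    // Each item loses less than a cent, yet the two totals end up
    // hundreds of currency units apart.
    return 0;
}
```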
Another consequence is related to application behavior. Bad development practices can rapidly lead an application to erratic behavior and sometimes termination. Finally, some issues, like buffer overruns, can even lead to security vulnerabilities if the data is exposed to end users, especially in web applications.
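Below is a minimal sketch of that kind of buffer overrun, using a hypothetical handler for user-supplied text; the function names and buffer size are invented. The unsafe version writes past a fixed buffer as soon as the input exceeds it, while a bounded copy avoids the problem.

```cpp
#include <cstdio>
#include <cstring>

// Unsafe: copies user-supplied input into a fixed buffer with no length check,
// the classic buffer overrun described above.
void handle_comment(const char* user_input) {
    char comment[64];
    std::strcpy(comment, user_input);   // overruns 'comment' once input exceeds 63 characters
    std::printf("Stored comment: %s\n", comment);
}

// Safer: a bounded copy (or std::string) keeps the write inside the buffer.
void handle_comment_safely(const char* user_input) {
    char comment[64];
    std::snprintf(comment, sizeof(comment), "%s", user_input);
    std::printf("Stored comment: %s\n", comment);
}

int main() {
    handle_comment_safely("Great article!");
    return 0;
}
```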
Manual search …
Issues related to data handling are rarely discovered or anticipated when they are sought through manual, isolated effort alone. The volume of code to look at, the number of data structures to check, the complex business rules to take into account, and the subtlety of the bugs (which can seem downright diabolical!) are serious obstacles for developers, who cannot afford to spend much time chasing problems that might never occur.
… or automated system-level analysis?
The most efficient way to detect these types of issues is to analyze the whole application with tools like CAST AIP, correlating findings about data structures with the code logic. Such analysis can establish who calls whom in the code and can inspect the components interacting in the data flow. Issue detection can thus be carried out faster, helping developers secure the code. It can also be automated to check the applications regularly without disturbing the development team’s activities, allowing them to manage prevention at a lower cost.
I have some good news and I have some bad news. First, the good news: Most smart development teams invest a lot of time designing a rock-solid architecture before the first line of code is even written for a new application. Now, the bad news: Once the architecture is designed, the conversation about it often ends. It’s built and then forgotten while the team runs off and builds the app, or when the application is transferred to a new development team.
Thoughtfully designed architectures with solid design principles might begin to degrade almost the instant they are implemented. How can a team maintain a proper architecture, iteration after iteration? There’s really only one way and that is to implement an “architecture status check” after each new component is built and integrated.
Here are three best practices I’ve seen in play at mature shops that perform a regular architecture status check. In the end, these steps ensure a resilient architecture:
Check your architecture at the speed of your development cycle. IT leaders in large organizations must be certain that the software architectural design is being implemented and adhered to. But in an era when more developers are coding and deploying faster than ever, one architectural review per quarter is not going to cut it. Architectural reviews need to happen at the speed of your development teams’ deployments.
Don’t assume the architecture is stable. It’s important that the architecture stay stable for development teams who deploy new builds quickly to meet design deadlines. They’re focusing on the coding of the new application or updates. But here’s the problem: They’re assuming that if the architecture is deployed, it must be stable. Again, if we agree that application architectures start to degrade as soon as they’re deployed, that is a flawed assumption. Define architectural guidelines at the beginning of your project, and then do consistent checks with each new iteration of your application to ensure they’re being upheld.
Use feedback loops. One easy way to set up these checks is to implement a quick feedback loop at the end of each development phase, reviewing the latest changes to determine whether the code is compliant with the architecture. If not, schedule a remediation step. Some architecture violations will require immediate remediation because they might impact the application’s security or performance. Others can be integrated into a future dedicated sprint if you’re using an agile methodology, or into another development phase. This way, your team does not have to rely on manual review alone: a checker will warn them whenever they cross an architectural boundary.
At the end of the day, maintaining application architectures is about measurement and communication. It’s not like you have to set it up with a Twitter account to tweet the architecture’s status every 15 minutes. But architectural compliance should certainly be examined at crucial development phases and before new upgrades.
When it comes to architectural designs, “set it and forget it” is a recipe for disaster — one that can and should be avoided. Why design the architecture in the first place if the team is not required to work within the design’s framework? Implementing a best practice of goals and checks can ensure that a properly built architecture will stay that way iteration after iteration.
Now I hear some of you groaning that checking the architecture can be a time-consuming and painstaking process. But that’s not true. At least, not since CAST released its Architecture Checker. There’s a reason why I’m so passionate about checking architecture status. It’s because I spend most of my day dealing with the issues that an unchecked software architecture causes for development teams. I encourage you to try our Architecture Checker, and if you have any questions, feel free to email me.
My six-year-old can tie her own shoes. I honestly did not realize how big of a deal that was until her teacher told me a few months ago that she had, for a short time, become the designated shoe tier in her classroom. Apparently, thanks to the advent of Velcro closures for kids’ shoes, nobody else in her kindergarten class knew how to tie their shoes.
The problem with being a “star” of your kindergarten class, however, is that all the kids want their shoes tied by her. As a result, she was trying to tie shoes very fast – too fast, in fact – and started making mistakes, which got her frustrated when the knots didn’t come out right.
Seeing this frustration, I calmly reminded her that it is better to do something right than to do it fast. This is a lesson many software development teams also need to remember.
While you would think “getting it right” should be the first mantra of developers, we see more and more examples of teams finding ways to do things “faster” rather than focusing on quality. It is true that, to keep up with competition and demand, the current market dictates shorter development cycles than in decades past; but that does not mean quality needs to be sacrificed or done “on the fly.”
Nevertheless, eschewing quality for speed seems to be exactly what’s going on over at Mozilla. Over on the appropriately named blog “It Will Never Work in Theory,” a section from a paper titled “Do Faster Releases Improve Software Quality? An Empirical Case Study of Mozilla Firefox” by Foutse Khomh, Tejinder Dhaliwal, Ying Zou, and Bram Adams is examined. The paper finds:
Users experience crashes earlier during the execution of versions developed following a rapid release model.
The Firefox rapid release model fixes bugs faster than using the traditional model, but fixes proportionally less bugs.
With a rapid release model, users adopt new versions faster compared to the traditional release model.
The post goes on to evaluate these findings, noting that the third point is good news, item two is kinda good news, but item one is a head scratcher.
I doubt anybody would find fault with the third finding from the Mozilla case study. Adopting new versions faster than under traditional models is certainly a positive in business. However, there’s something missing here: are developers building these versions faster AND keeping them stable, or are they just developing them faster without concern for application software quality? If those versions are just being produced faster and are not being built well, they will require development teams to go back and constantly fix issues, and could possibly lead to major malfunctions that interrupt business continuity. I don’t see how that is a good thing.
What is really confounding, though, is the first finding in the Mozilla case study, “Users experience crashes earlier during the execution of versions developed following a rapid release model.”
How exactly is that positive in any way, shape or form?
Last I checked, the earlier an application crashes, the more poorly written, less reliable, more destructive, and more useless it is. I don’t think there’s a single Marketing department in the world that could successfully promote its application software by saying, “We Crash Faster!”
I would have to guess the authors of the case study believe that if users experience crashes earlier in the execution of versions it means developers at Mozilla can start fixing those bugs sooner. That’s not a great way to run a business, though. Where’s the concern for software quality? Moreover, when did Mozilla start paying its users to serve as its software quality inspectors?
Maybe that’s why Mozilla Firefox is offered as a free application to its users…because users should know they get what they pay for!
Eventually my child’s teacher stopped sending students to her to have their shoes tied because the teacher was just having to retie them. I suspect as Mozilla users experience these earlier crashes, they, too, will look elsewhere to “have their shoes tied.”
Happy Independence Day everybody! I only hope those of you reading this on your Android device have not turned it sideways or performed some other seemingly innocuous action that has made this application fail.
I say this because I recently read yet another blog about “workarounds” to compensate for application failures inherent in Android devices. These pieces have become almost ubiquitous over the past 18 months to the point where one would think Google would just go back and perform the structural quality analysis it needs to do to address the issues.
Their failure to do so reminds me, on this day before Independence Day, of the opening lines of Thomas Paine’s “The American Crisis”:
These are the times that try men’s souls: The summer soldier and the sunshine patriot will, in this crisis, shrink from the service of his country; but he that stands by it now, deserves the love and thanks of man and woman.
As Google continues to “shrink” from its responsibility to provide application software that is of sound structural quality, they are certainly “trying men’s [and women’s] souls.”
I Have Not Yet Begun to Fight!
I continue to be amazed that Google appears more interested in what to call its next Android OS than in how well it works. As “enamored” (can you feel the sarcasm dripping from that word?) as I was last year with “Ice Cream Sandwich,” I am even more captivated by the latest one – Jelly Bean. I am betting the name really fits the product – it looks solid on the outside, but if Android’s history is any indication, it will most certainly be a piece of gelatinous mush on the inside.
Maybe Google continues to fall into the trap of believing its own press clippings – the positive ones, at least – because it seems more concerned with marketing than with software quality. Google’s mobile operating systems continue to feature one flaw after another, with these flaws not being “discovered” until after the system has been rolled out and installed by the consumer. And these flaws are not just minor ones that inconvenience the user, like the ones mentioned in the workarounds blog to which I referred above. They include battery-draining and security flaws that cost time and money for those using the devices.
Nevertheless, they continue to build one iteration after another atop mobile platforms they know to have flaws – or at least by now they should know – and they continue to fail to fix them.
We Find these Truths to be Self-Evident
It’s truly a shame Google won’t use the same methodology as Thomas Jefferson and the Continental Congress did in forging one of the world’s greatest documents – the Declaration of Independence. From the time Jefferson began working on the Declaration of Independence on June 11, 1776, he wrote and rewrote, edited and re-edited versions of the document for almost three weeks until he came to what he felt was a product of optimal quality. And yet even after the document with all those versions and all those edits was presented to the Continental Congress on June 28, 1776, for a vote, those men debated for another five days over the contents of the document and made another 33 changes to it!
Obviously there was no Marketing Department pushing Congress to get out the final product.
The Declaration of Independence – a document that truly did have urgency behind it, as men were fighting and dying for the values it espoused – was edited and changed dozens of times before it was delivered. If that’s the case, why can’t the marketing people at Google allow their developers to perform a bit of automated analysis and measurement on Android software before they declare its “independence” from internal production? Were they to do this (harkening back to Paine again), they would “deserve the love and thanks of man and woman.”
These truly are the times that try our souls.
I’m not one who believes in fortune tellers or those who claim to be able to predict the future. Heck, I don’t even read my horoscope, and I cringe whenever someone attempts to force it upon me. Only when my wife has attempted to read me my horoscope have I offered even as much as a polite “hmm.” Nevertheless, there are many out there who swear by those who claim to be able to predict the future, especially in the financial industry.
And while there were those who predicted a rocky road for Facebook’s IPO, it is doubtful that anybody could have foreseen the NASDAQ technical meltdown that surrounded the Facebook IPO. While the stock price predictions for Facebook may be coming true, surely the technical issues that NASDAQ experienced on Facebook’s IPO day could not have been predicted…or could they?
Not in the Cards
As Scott Sellers points out over on Computerworld, it seemed like NASDAQ understood the kind of volume it would be facing and had taken the necessary precautions. He notes, “Exchange officials claimed that they had tested for all circumstances and for thousands of hours. I believe them.”
I believe them, too, but as we’ve said here many times before, and as Sellers alludes to in his post, testing isn’t enough. As Sellers puts it, there needs to be “a resilient underlying infrastructure.” Functionality does not always mean structural quality, yet functionality is all that is typically verified when applications are put through testing. The functionality issues that might be found in an application are merely the tip of the proverbial iceberg that can sink an application after it sails.
This is what, in all likelihood, happened to NASDAQ on Facebook IPO day, and it will probably happen again. Why? Because application failures have happened on numerous occasions before, and yet NASDAQ did not learn from those who had gone (down) before them. Last year alone, the London Stock Exchange, Euronext, Borsa Italiana (bought by the LSE in 2007) and the Australian Stock Exchange all suffered outages due to technical flaws.
Obviously there’s a lot more to keeping an exchange running than what functional testing can detect, and on this point Sellers adeptly points to the CRASH study on application software health released in December. He notes that:
Exchanges are complex, high-performance systems that can be difficult to build, upgrade and debug. According to CAST Software, “an average mission critical application has just under 400,000 lines of code, 5,000 components, 1000 database tables and just under 1000 stored procedures.”
He later adds that “Having a robust – and well-reviewed architecture nearly always results in a clear competitive advantage.”
Applying the Crystal Ball
Truth is, software failures like the ones experienced by NASDAQ and the other exchanges have become all too commonplace in all industries. Unless it affects a company’s finances directly – as the NASDAQ failure may have done by holding up trading of the Facebook IPO – we treat news of software failures as though they were inevitable and almost expected. In NASDAQ’s case, however, there are now calls for investigations and answers about what happened.
In my book, that’s a good thing. After all, when exactly did we decide that software failure was an unavoidable part of business and an acceptable excuse to leave us hanging and waiting?
NASDAQ, the London Stock Exchange, Euronext…in fact, all exchanges and financial companies need to do a better job of assessing the structural quality of software before it is deployed, rather than merely depending on functional or load testing once it is deployment-ready. There’s no crystal ball needed here, just automated analysis and measurement, which is now readily available in the marketplace on a SaaS basis. Skipping structural analysis throughout the build is like waiting for an application to fall on its face, and fall it will…faster than the share price of Facebook stock.