Last Wednesday the Royal Bank of Scotland (RBS) underwent an IT failure which withheld 600,000 payments from customer accounts. This occurs seven months after RBS was fined ₤56 million due to an IT Crash in 2012 that impeded customers from accessing their online accounts. The poor system performance has caused difficulties for customers and shock from the banking community.
Giovedì 11 Giugno 2015 ha avuto luogo la IV Edizione della CAST CIO Conference. Ringraziamo i numerosi aderenti alla manifestazione che hanno contribuito al successo di questo evento consentendoci di analizzare, traendo spunto dai più recenti casi di malfunzionamento di applicazioni “mission critical”, le strategie di prevenzione dei rischi attraverso la misurazione della qualità strutturale degli asset applicativi critici.
We currently live in a futuristic world that past generations could only dream of. News, weather, updates from friends all over the world come pouring into our computers and smart devices and we don’t even think twice about the IT risk. Whether we’re at home with family, socializing with friends, or even working, technology is constantly surrounding us in one way or another.
Our reliance on technology is so heavy in fact, we often forget about the science behind it and how much goes into the IT risk management to support it. Beneath the surface of our most frequently used apps, social media accounts, games, and programs, highly complex software and code is constantly operating to maintain a satisfied user experience. Even non-tech businesses now realize they would not be able to function in today’s world without effective technological resources.
Most IT organizations wouldn’t consider the software risk in their application portfolio a brand issue; that is, until they experience a tragedy or crisis such as application failure and customers start to worry. Most of the time IT organizations are able to calculate the cost to fix the problem and how it will affect their overall business. However, what often isn’t taken into account is the long term effects on their brand and business going forward. Continue reading
Here we go again. You probably have heard, since it’s been reported everywhere, that American Airlines was grounded Tuesday, leaving passengers stranded for several hours due to a “computer glitch” in the reservation system. Because of the glitch, gate agents were unable to print boarding passes; and some passengers described being stuck for long stretches on planes on the runway unable to take off or, having landed, initially unable to move to a gate.
We’ve made it a point on our blog to highlight the fact that software glitches in important IT systems — like NatWest and Google Drive — can no longer be “the cost of doing business” in this day and age. Interestingly, we’re starting to see another concerning trend: more and more crashes blamed on faulty hardware or network problems, while the software itself is ignored. It’s funny that the difference in incidents can be more than 10 times between applications with similar functional characteristics. Is it possible that the robustness of the software inside the applications has something to do with apparent hardware failures? I think I see a frustrated data center operator reading this and nodding violently.
For enterprise IT applications, it’s all about processing data defined through multiple types and in large volumes of code. Then the number of lines of code devoted to data handling is high enough to encapsulate a large number of software bugs that are waiting for specific events to damage the IT system and impact the business.
Even if we can say that a bug is a bug and it will be fixed when it occurs, bugs related to data handling should not be underestimated and this for several reasons:
Such bugs are generally not easy to detect among the millions of lines of code that constitute an application. They can be constituted by only one statement that is defined elsewhere, making some specificities in using it not immediately visible. They can also result from the execution of a given control flow associated to a given data flow.
Some of them can be there for a long time and will never occur. The problem is to identify which ones belong to this category to focus on others.
They can be activated by the conjunction of specific conditions that are not easy to identify.
When the issue occurs, the impact for business data can be severe: applications can stop, data can be corrupted, and end users and customer satisfaction can decrease.
Consequences are not always clearly visible and, in this case, few users detect them.
Problems are distributed
Issues can be hidden everywhere in application code. Risk management methodologies can help select the most sensible application areas and reduce the scope of the search. However, in most cases, detecting such potential issues requires the ability to check the types and structure of the data flowing from one component to another, or from a layer to another, as well as the algorithms implemented in your programming language of choice. This spells troubles for everyone.
Why does a bug activate suddenly?
There are different factors that contribute to activating a bug:
Probability increases with the number of lines of code.
The more a component is executed, the more its bugs can be activated.
The more you modify the code, the more likely an unexpected behavior can occur.
Low decoupling between data and treatments makes any changes on data impact the code.
Market pressure that stresses the development team. Working quickly is often a good way to create new bugs and activate existing ones!
Algorithm implementing business rules can be complex and distributed over multiple components, fostering the occurrence of bugs.
Functional data evolutions are not always taken into account in whole application implementation and can make code that is working well run in an erratic way.
The biggest challenge comes when several factors occur at the same time – a difficult challenge for any development team!
The list of situations that can lead to troubles related to data handling is not short. For instance, database access can be made fragile when:
Database tables are modified by several components. Data modifications are usually ruled by the use of specific routines to update, insert, and delete a specific API or a data layer that is fully tested to maintain data integrity.
Host variable sizes are not correctly defined compare to database fields. Some queries can get a volume of data that is higher than expected. Or a change is made in the database structure and it has not been propagated to the rest of the application.
When manipulating variables and parameters, potential issues can be:
Type mismatches are generally insidious cases. For example, it can occur with implicit conversions that occur between two compatible pieces of data, such as the ones found in different SQL dialects, injecting incorrect values into the database. Similar situations can also be found in COBOL programs when alphanumeric fields are moved into numeric fields, leading to abnormal terminations if the target variable is used in calculation or simply in a computational format. Improper casts between C++ class pointers (ex: a base class to a child class) can lead to lost data and to data corruption propagated through I/O.
Data truncation when no control of variable size is done when moving one into another. Part of the value can be lost if the target variable is used to transport the information.
Incoherencies in functions or program calls between arguments sent by the caller and expected parameters. This can occur when a change done in the function or program interface has not been ported in all callers, making them terminate or corrupt data.
What about consequences?
Unfortunately, there is more than one type of consequence when such bugs activate. One of the big risks for the application is related to the corruption of the data it is manipulating — the worst case being when corruption is spreading throughout the IT system. Generally, this impacts users and the business.
I remember such situations with a banking application. Everything was working fine when the phone rang: “Hi, the numbers on my weekly reports don’t look right. I checked but it seems there is a problem. Can you check on your side?”
Well, we searched which programs generated the report but we did not find any interesting information. We checked its inputs and we found incorrect values. Then we looked at the program that produced these inputs, and finally we found the cause of the problem in a third program — a group of variables that were not correctly valuated.
Fortunately, the problem was detected and fixed. A more critical situation happens when very small corruptions install silently and insidiously over the IT system. They are too small or too dispersed to be pointed out. For instance, some decimal values that are improperly truncated or rounded can seem like a small issue, but in the end, the total can be significant!
Another consequence is related to application behavior. Bad development practices can rapidly lead an application to erratic behavior and sometimes termination. At last, some issues, like buffer overrun, can even lead to security vulnerabilities if the data is exposed to end users, especially in web applications.
Manual search …
Issues related to data handling are rarely discovered and anticipated when they are sought through manual and isolated operations only. The volume of code to look at, the number of data structures to check, the complex business rules to take into account, and the bug subtlety (that sometimes seems to be diabolic!) are serious obstacles for developers who cannot spend too much time to try and fix problems that might never occur.
… or automated system-level analysis?
The most efficient way to detect these types of issues is to analyze the whole application software with tools like CAST AIP to correlate findings concerning data structures with code logic. That can establish who calls who in the code, and can introspect components interacting in the data flow. Thus, the issue detection can be carried out faster, helping developers secure the code. It can be automated to regularly check the applications without disturbing the development team’s activities, allowing them to manage prevention at a lower cost.