Ai Editorial: IT failure or power outage - how can airlines minimize the adverse impact?

First Published on 29th May, 2017

Ai Editorial: The latest incident at British Airways, in which the carrier’s flights faced massive disruption, indicates that the industry needs to learn how to ensure its core systems and applications don’t fail it, whatever may happen, writes Ai’s Ritesh Gupta

 

IT failure, computer glitch, power outage… these are becoming dreaded words for airlines.

Flights being cancelled for days puts a huge question mark over contingency plans and the current state of business continuity planning at airlines. The extent to which British Airways has struggled, with the impact now running into a third day, exemplifies the issue the industry is facing at large. Delta, Southwest Airlines, United… all have been in the news in the last 12 months or so, featuring in cases where passengers were either stranded or delayed owing to such problems.

In the case of British Airways, the airline’s CEO, Alex Cruz, told Sky News the IT outage was owing to “a power surge that affected messaging across their systems”. He has also been quoted as saying that, as per the initial inspection, there was no indication of a cyber attack.

Unlike the recent fiasco at United, where a passenger was violently dragged off an aircraft in the U.S., British Airways has at least attempted to handle the situation better, with Cruz featuring in two videos in the first 74 hours. Not that there wasn’t any backlash on social media; still, owning up to what happened and explaining what to expect at Gatwick and Heathrow airports was a sincere attempt to offer a realistic update. Of course, with its core offering disrupted on such a gigantic scale, British Airways will be required to offer an explanation, not to speak of the beating the brand has taken and the substantial monetary loss owing to this incident.

Digging into core issues

In the case of British Airways, it needs to be evaluated how a power failure can result in a disaster of such magnitude. Why was it so challenging to recover?

It also needs to be ascertained what kinds of issues have typically impacted the likes of Delta and British Airways.

At a basic level, this means evaluating the computer programs, the servers that run the applications, and the infrastructure that backs those servers.

It is clear that the industry is falling short on one count: to avoid computer failures, airlines must, to start with, ensure the availability of reliable electricity sources. “Airlines need to look at what can result in data centre power outage and what’s in place to overcome such issues. What sort of infrastructure is required to ensure a failure doesn’t convert into a massive disaster? Is cost cutting a fair enough reason to avoid a failover site? In case, requisite power to the servers and applications isn’t reaching out, then what’s the backup plan for running operations? Airlines need to count on a different group of servers and applications, and the location of the same needs to be chosen upon in a diligent manner,” explained a source.

What can be done to simulate failures, especially the kind that are already causing huge disruption, and how can airlines ensure this doesn’t happen again? It is not that airlines haven’t done anything to gear up for such unforeseen events, but sometimes back-up plans fail to reach where they are needed. In the case of one big airline, it was reported that data centre operations weren’t aptly configured for the available back-up power and, as a result, there was an IT system failure. “An unforeseen incident could be owing to a strategic alliance, where applications are not necessarily designed for failover, or could have been updated. So a disaster recovery plan needs to be worked out accordingly,” added the source. Even for the disaster recovery site, experts recommend using the same equipment from the same vendor as at the primary site, and ensuring the same policies can be installed simultaneously on security devices at both sites. “It is time airlines sharpen their probability of failure for any set up. Be it for past record or failure data of similar equipment, including the right benchmark for performance of the whole infrastructure, needs to be in place.”
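To make the point about estimating the probability of failure concrete, here is a minimal sketch of the standard availability arithmetic. It is not how any airline mentioned here models its infrastructure; the function names and every figure (mean time between failures, mean time to repair) are hypothetical, chosen only for illustration.

```python
from math import prod

HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(*parts: float) -> float:
    """Every part must be up at once (power AND servers AND network)."""
    return prod(parts)


def redundant(*sites: float) -> float:
    """At least one site must be up. Assumes failures are independent."""
    return 1.0 - prod(1.0 - a for a in sites)


def expected_downtime_hours(a: float) -> float:
    """Average hours of downtime per year for a given availability."""
    return (1.0 - a) * HOURS_PER_YEAR


# Hypothetical figures, for illustration only.
power = availability(mtbf_hours=8000, mttr_hours=8)
servers = availability(mtbf_hours=4000, mttr_hours=4)
network = availability(mtbf_hours=2000, mttr_hours=2)

single_site = series(power, servers, network)
two_sites = redundant(single_site, single_site)

print(f"single site: {expected_downtime_hours(single_site):.1f} h/year down")
print(f"with failover site: {expected_downtime_hours(two_sites):.3f} h/year down")
```

The redundancy figure holds only if the two sites fail independently. A shared power feed, or a failover that was never configured or tested, breaks that assumption, and the real exposure is then much closer to the single-site number.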

Another question that has been doing the rounds for years now concerns the efficacy of legacy code and mainframes. Is it true that disaster recovery procedures are difficult to apply on legacy platforms? Not really; there isn’t even any major issue when it comes to “finding COBOL programmers”, if needed.

“Understand the role of flight operation systems and mainframe systems. Where are you running these flight-related systems, assess their network connectivity, and no point in comparing them with mainframes,” said an executive.

The way forward is to strike a balance between mainframe and decentralised technology.

(We highlighted in a report last year that, in the cases of Southwest and Delta, at no time was any mainframe system down or suffering from performance problems.)

On another note, a report by bbc.com about British Airways indicated that even when the “power came back on, the systems were unusable because the data was unsynchronised”. What this means is that there could have been “conflicting records of passengers, aircraft and baggage movements”, a tough situation to deal with.
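As a rough illustration of what “conflicting records” can look like once two sites fall out of step, the sketch below compares two copies of the same data and flags the entries that disagree. It is an assumption-laden toy, not a description of British Airways’ systems: the `Record` fields, the version counter and the sample bookings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Record:
    key: str      # hypothetical identifier, e.g. a booking reference or bag tag
    status: str
    version: int  # hypothetical counter, incremented on every update


def find_conflicts(
    primary: Dict[str, Record], standby: Dict[str, Record]
) -> Dict[str, Tuple[Optional[Record], Optional[Record]]]:
    """Return every key whose record differs or exists at one site only."""
    conflicts = {}
    for key in sorted(primary.keys() | standby.keys()):
        a, b = primary.get(key), standby.get(key)
        if a != b:
            conflicts[key] = (a, b)
    return conflicts


def resolve(a: Optional[Record], b: Optional[Record]) -> Optional[Record]:
    """Keep the newer record; return None when a person has to decide."""
    if a is None or b is None:
        return a or b
    if a.version == b.version:
        return None  # same version, different content: manual review
    return a if a.version > b.version else b


# Hypothetical sample data.
primary = {
    "ABC123": Record("ABC123", "CHECKED_IN", 3),
    "XYZ789": Record("XYZ789", "BOARDED", 5),
}
standby = {
    "ABC123": Record("ABC123", "BOOKED", 2),
    "XYZ789": Record("XYZ789", "BOARDED", 5),
    "BAG001": Record("BAG001", "LOADED", 1),
}

for key, (a, b) in find_conflicts(primary, standby).items():
    print(key, "->", resolve(a, b))
```

Picking the higher version is only one possible policy. Records that cannot be ordered this way still need manual reconciliation, which is part of why recovery can take far longer than restoring power.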

Clearly, airlines need to dig deep and curtail such mishaps. Otherwise, the industry will continue to suffer, especially as such incidents have kept recurring of late.

 
