IEEE Spectrum: The Risk Factor blog recent content Thu, 14 Apr 2016 16:00:00 GMT Remembering the Technology Glitches and Failures of Tax Years Past A look back at some of the notable failures that have occurred when mixing taxes and IT
Photo-Illustration: iStockphoto

We're pleased to note that on 1 April, IEEE Spectrum won the Jesse H. Neal Best Infographics Award for our series "Lessons from a Decade of Failures." To celebrate, it seemed like a good time to once again take a dive into the Risk Factor archives and search for additional historical lessons. Because we're nearing the end of tax season here in the United States, I decided to examine the often volatile combination of tax policy and IT systems. Tax-related problems are some of the most painful IT failures, because they tend to hit citizens right where it hurts most: their bank accounts.

Below you'll find some of the most noteworthy operational glitches of the past decade, but as with previous timelines, the incidents listed here are merely the tip of the iceberg, and should be viewed as representative of tax-related IT problems rather than comprehensive. The timeline doesn't even include incidents of tech-assisted fraud, data breaches, or failed modernization projects (like the cancellation of the My IRS Account Project). It's not always easy to identify the exact impact of tax-related glitches: in some cases it's easier to measure the number of people affected, while in others, the monetary cost is more straightforward. Use the dropdown menus to navigate to other incidents that might be hidden in the default view.

In reviewing this list of failures, a few lessons jumped out at me:

  • For ongoing, excruciating, cringe-worthy tax-tech pain, no one beats Her Majesty's Revenue and Customs. As my colleague Bob Charette has chronicled, the multiyear rollout of the Pay-As-You-Earn computerized tax system is a textbook case of technological and bureaucratic hubris in the face of a challenging IT problem. You can see from the timeline the number of people affected by calculation errors, which grew over time.
  • Data validation, verification, and sanity checks remain poor. Increasing computerization has meant an increase in mistakes that should have been caught by common sense. Tax systems need better safety checks and governments need to be more skeptical of sudden, unexpected windfalls.
  • Don't automatically trust automatically generated notices. It seems like the software in tax systems that generates letters and notices is subject to even less scrutiny and oversight than the rest of the systems' components.
  • There's danger in waiting to file your taxes at the last minute, but doing them early can also cause problems. There are many examples of tax services simply being unprepared to process early returns, whether because of last-minute changes to the tax code, or because of data that has not yet been updated.
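The data-validation lesson above can be surprisingly cheap to act on: even a handful of bounds tests run before a notice or payment is generated automatically would have caught many of the absurd results in the timeline. Here is a minimal, purely illustrative sketch; the field names and thresholds are hypothetical, not taken from any real tax system:

```python
# Illustrative sanity checks a tax system might run before issuing any
# automatically generated notice or payment. All names and thresholds
# below are hypothetical.

def sanity_check(return_record):
    """Return a list of human-readable warnings for an assessed return."""
    warnings = []
    income = return_record["reported_income"]
    tax_due = return_record["computed_tax"]
    refund = return_record["computed_refund"]

    if tax_due < 0 or refund < 0:
        warnings.append("negative amount: likely a calculation error")
    if income > 0 and tax_due > income:
        warnings.append("tax due exceeds reported income")
    if refund > 100_000:  # arbitrary threshold for human review
        warnings.append("unusually large refund: hold for manual review")
    return warnings

record = {"reported_income": 40_000, "computed_tax": 55_000, "computed_refund": 0}
print(sanity_check(record))  # -> ['tax due exceeds reported income']
```

The point is not the specific rules but that any implausible result is routed to a human before a letter or a payment goes out the door.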

Clearly there are lots of advantages to digitizing tax calculation and collection, including efficiency and accuracy. But it's worth keeping in mind that in all likelihood, our IT systems are bound to fail occasionally, so we need to make sure our laws and systems are better prepared for those contingencies. In the past decade, our ability to cause harm with tax systems has often outpaced our ability to make things right.

If there's a notable tax-related glitch you'd like to see represented on the timeline, let me know in the comments, and I'll try to add it.

Thu, 14 Apr 2016 16:00:00 GMT
We Need Better IT Project Failure Post-Mortems It's hard to find trustworthy data about IT debacles
Illustration: Getty Images

In pulling together this special interactive report on a decade’s worth of IT development projects and operational failures, the most vexing aspect of our efforts was finding trustworthy data on the failures themselves.

We initially started with a much larger set than the 200 or so projects depicted in this report, but the project failure pool quickly shrank as we tried to get reliably documented, quantifiable information explaining what had occurred, when and why, who was affected, and most importantly, what the various economic and social impacts were.

This was true not only for commercial IT project failures—which one would expect, given that corporations are extremely reticent to advertise their misfortunes in detail if at all—but also for government IT project failures. Numerous times, we reviewed government audit reports and found that a single agency had inexplicably used different data for a project’s initial and subsequent costs, as well as for its schedule and functional objectives. This project information volatility made getting an accurate, complete, and consistent picture of what truly happened on a project problematic to say the least.

Our favorite poster child for a lack of transparency regarding a project’s failure is the ill-fated $1 billion U.S. Air Force Expeditionary Combat Support System (ECSS) program (although the botched rollout of is a strong second). Even after multiple government audits, including a six-month, bipartisan Senate Armed Services Committee investigation into the high-profile fiasco, the full extent of what this seven-year misadventure in project management was trying to accomplish could not be uncovered. Nor could the final cost to the taxpayer be ascertained.

With that in mind, we make our plea to project assessors and auditors asking that they apply a couple of lessons learned the hard way over the past decade of IT development project and operational failures: 

In future assessments or audit reports of IT development projects, would you please publish with each one a very simple chart or timeline? It should show, at a glance: an IT project’s start date (i.e., the time money is first spent on the project); a list of the top three to five functional objectives the project is trying to accomplish; and the predicted versus actual cost, completion date, and delivered functionality at the critical milestones where the project was reviewed, delivered, or canceled.

Further, if the project has been extended, re-scoped or reset, please make the details of such a change absolutely clear. Don’t forget to indicate how this deviation affects any of the aforementioned statistics. Finally, if the project has been canceled, account for the opportunity costs in the final cost accounting. For example, the failure of ECSS is currently costing the Air Force billions of dollars annually because of the continuing need to maintain legacy systems that should have been retired by now. You’d think that this type of project status information would be routinely available. But unfortunately, it is rarely published in its totality; when it is, it’s even less likely to be found all in one place.

Similarly, for records related to IT system operational failures, would you please include all of the consequences being felt—not only financially but to the users of the system, both internally and externally?  Too often an operational failure is dismissed as just a “teething problem,” when it feels more like a “root canal” to the people dependent upon the system working properly. 

A good illustration is Ontario’s C$242 million Social Assistance Management System (SAMS), which was released more than a year ago and is still not working properly. The provincial government remains upbeat and positive about the system’s operation while callously downplaying the impact of a malfunctioning system on the poor in the province.

More than 100 years ago, future U.S. Supreme Court Justice Louis Brandeis argued that, “Publicity is justly commended as a remedy for social and industrial diseases. Sunlight is said to be the best of disinfectants; electric light the most efficient policeman.” Hopefully, the little bit of publicity we have tried to bring to this past decade of IT project failures will help to reduce their number in the future.

Mon, 21 Dec 2015 20:13:00 GMT
Wishful Thinking Plagues IT Project Plans Delusional estimates for time and cost plague large IT projects
Photo: Ralf Hiemisch/Getty Images

“An extraordinary failure in leadership,” a “masterclass in sloppy project management,” and a “test case in maladministration” were a few of the more colorful descriptions of UK government IT failures made by Edward Leigh, MP for Gainsborough, England, when he was Chairman of the Public Accounts Committee.

Leigh repeatedly pointed out that government departments were not only wholly unrealistic about their IT projects’ costs, schedules, and technical feasibility, but also didn’t take any responsibility for the consequences of those unrealistic assumptions.

This same theme appeared frequently during our review of the past decade of IT project development and operational failures. The “over-optimism” disease, aka “Hubble Psychology,” is frequently cited in audit reports as a primary root cause of IT failures. Hubble Psychology is the term NASA Inspector General Paul Martin used a few years ago in his report into the space agency’s project troubles (pdf) to describe the:

“[E]xpectation among NASA personnel that projects that fail to meet cost and schedule goals will receive additional funding and that subsequent scientific and technological success will overshadow any budgetary and schedule problems. They pointed out that although Hubble greatly exceeded its original budget, launched years after promised, and suffered a significant technological problem that required costly repair missions, the telescope is now generally viewed as a national treasure and its initial cost and performance issues have largely been forgotten.”

In other words, as long as you can keep your program alive, you have a very good chance of continuing to receive enough money (and time) to make it work sooner or later. The expectation that “all will be forgiven” doesn’t always come true, as even the government will eventually run out of money and patience, but it works often enough, especially in defense programs, to make it a belief worth acting on. If you have the time to dig into the six governmental IT projects we highlighted in our “Life Cycle of Failed Projects,” you’ll soon discover that each suffered from a version of NASA’s Hubble Psychology.

A skewed bias toward extreme optimism doesn’t just affect program development plans; it also infects decisions about when to take an IT system live. Thumbing through the myriad Risk Factor blog posts will quickly turn up a plethora of IT projects deployed long before they were ready, due to unfounded, if not delusional, optimism concerning their operational state.

Take, for instance, the Los Angeles Unified School District’s (LAUSD) disastrous decision last year to roll out its new $10 million integrated student information system, called MISIS. Dozens of operational snags with MISIS immediately cropped up: Thousands of students did not receive class schedules for weeks; an untold number of teachers were assigned 70 or more students per class; students were placed in classes they had already completed; middle school students were placed in high school classes; high school seniors were unable to send transcripts to the colleges they were applying to; and so forth. It has taken more than a year of hard effort, plus an additional $100 million-plus, to make MISIS stable in operation, though it is still far from delivering the functionality originally promised.

What makes the MISIS debacle so mind-boggling and yet so unsurprising is that the original project schedule, which was already aggressive (two years for $29 million), was compressed by half while the project’s budget was cut by two-thirds. Predictably, severe operational problems began appearing in the weeks before system rollout, caused by an acknowledged lack of system testing. LAUSD teachers, school administrators, and others warned LAUSD senior administration that MISIS was nowhere near ready to deploy, not only because of the technical hitches, but also because only a small minority of LAUSD teachers had been fully trained on how to use the system. Even the LAUSD chief information officer acknowledged a few days before the rollout that it might be “bumpy,” but that did not matter. The LAUSD superintendent, who admitted IT was not his strong suit, was confident that MISIS was ready to be deployed, so it was deployed.

What makes this situation even more incredible was that the MISIS disaster almost exactly mirrored another massive LAUSD IT project disaster involving the botched rollout of a new payroll system back in 2007 that caused a year of pain. For whatever reason, the lessons from that event were ignored completely.

Not learning from failure seems to go hand in hand with being overconfident about your IT project’s status. This seemed especially true over the past ten years in the airline industry. For instance, US Airways was brimming with confidence in March 2007 when it switched over to a new reservation system following its 2005 merger with America West Airlines. Such was its confidence that just before the cut-over to its new system, a US Airways senior vice president of customer service boasted, “We get to demonstrate that these transitions aren't as big and as difficult as historically has been proclaimed.” Well, the new reservation system melted down on the day it went live; it took nearly six months to get everything back to normal.

Then there was the case of British Airways’ new Terminal 5 baggage system at London’s Heathrow Airport. Avoiding a repeat of the ignominy of Denver International Airport’s failed IT baggage system was uppermost in system integrator BAA’s mind. While more successful than DIA’s baggage system, Terminal 5’s baggage system still face-planted spectacularly on its opening day of 27 March 2008 and for days afterwards, with some 430 flights cancelled and more than 20,000 bags mishandled in the first eight days of its operation.

It later came out in a UK Parliamentary inquiry into the baggage system mayhem that BA senior management went ahead with the opening even though it knew that the baggage system wasn’t fully ready and would likely need another six months as its test program and staff training were both “compromised.” But waiting would cost BA money, so a “calculated risk” was taken to open Terminal 5 as planned and hope for the best. Of course, BA didn’t bother to tell its thousands of passengers using Terminal 5 that tidbit of information, instead proclaiming to one and all everything was “tried, tested and ready to go.”

United Airlines was similarly self-assured when it moved to a single passenger service system and website in March 2012 to complete its 2010 merger with Continental Airlines. Then-CEO Jeff Smisek said he was confident the transition would go smoothly, proclaiming that the airline was “exceedingly well prepared for it.” Again, things didn’t go as calmly as expected, to say the least, and United Airlines is still suffering the financial and reputational after-effects to this day.

We should note, in fairness, that the recent cut-over of the merged American Airlines-US Airways reservation system did go well, so perhaps, finally, a measure of humility has been gained to offset the overweening hubris that usually accompanies the implementation of these types of IT systems.

Healthcare IT systems also seem to be prone to the optimism bug. The UK’s £12 billion and Australia’s A$566 million electronic health record system fiascos are prime examples of the belief that a righteous idea alone will create a successful IT system. And of course, the various botched attempts in the US at creating federal and state health exchanges to support the Affordable Care Act (ACA) are case studies in hope over experience and good sense. A fitting description of the entire situation in late 2013 and early 2014 was unwittingly given by then HHS Secretary Kathleen Sebelius when she told a Congressional committee that the federal exchange “works unless you try to use it.”

While the botched rollout of the ACA health exchanges might arguably be the greatest example of IT hubris coupled with self-denial over the past decade, I personally think the development of New York City’s personnel management system, CityTime, is an even better one. Originally slated in 1998 to cost $63 million and be completed within five years, the project had ballooned to over $722 million by March 2010, with a completion date set for June 2011. A government investigation in December 2010 uncovered what at the time looked like $80 million in fraudulent billing, but that figure soon exploded to more than $500 million.

CityTime’s prime contractor, SAIC, agreed to forfeit $500 million of the $690 million it was paid to avoid prosecution for defrauding New York City. It admitted that it failed to investigate internal warnings that things were amiss, a well-practiced lack of curiosity no doubt helped along by the firehose of money it was showered with.

What intrigues me more is how long CityTime stayed alive and unexamined by New York City officials as the project’s costs rapidly climbed: $224 million in 2006, $348 million in 2007, and $628 million in 2009, before breaching the $700 million mark a year later. Even though irregularities in billing were raised back in 2003 and many times over the years thereafter, these warnings were studiously ignored or played down. Not until the very end did New York City’s comptroller audit or question the program, which is astonishing given that the project’s overall benefit was estimated in 1998 at saving the city only $60 million in timesheet fraud!

In fact, there seemed to be a collective shrug-of-the-shoulders acceptance by the Bloomberg Administration that big government IT projects always overrun their estimates, so why be overly concerned about CityTime’s increasing cost? Even as the fraud was being exposed, Mayor Michael Bloomberg cavalierly dismissed the project’s problems and exploding cost as just one of those things that fell through the oversight cracks. Some crack!

If future government IT project failures are ever to be minimized, the Hubble Psychology is going to need to be directly addressed. Making individuals in both government and contractor organizations accountable in a meaningful way is a good start. However, even this will not be easy. As exemplified by US Air Force leadership in the aftermath of the $1 billion spent on the Expeditionary Combat Support System (ECSS) with nothing to show for it, government doesn’t seem to believe in personal accountability when it comes to IT project failure.

If we can’t hold people accountable, maybe the next-best thing is to shed more light on these failures. That will be the topic of the final blog post in our special report on the Lessons of a Decade of IT Failures.

Thu, 17 Dec 2015 16:14:00 GMT
Sorry for the Inconvenience The empty apologies trotted out by companies and governments in the wake of IT debacles add insult to injury
Photo: iStockphoto

After looking back at the project failures chronicled in the Risk Factor for our recent set of interactive features “Lessons From a Decade of IT Failures,” I became intrigued by the formulaic apologies that organizations put out in their press releases when something untoward happens involving their IT systems.  For instance, below are three typical IT mea culpas:

“We would like to apologize for the inconvenience this has caused.”

“We regret any inconvenience this may have caused patients.”

“We apologize for the inconvenience to those affected.”

The first apology came about after Nationwide, the UK’s largest building society, charged 704,426 customers’ debit cards twice for the same transaction, which it blamed on a batch file that was processed twice. The second apology was in response to the crash of Northern California-based Sutter Health’s $1 billion electronic health record system, which went down for more than a day across seven of its major medical facilities because of a faulty system upgrade. The last apology was prompted by the shambolic mess that marked the rollout of California’s EDD unemployment system, which affected at least 185,000 claimants, many of them for weeks.
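Nationwide’s doubly processed batch file is a failure mode with a well-known guard: make batch application idempotent by recording an identifier for every batch already applied and refusing to apply the same one twice. The sketch below is hypothetical (the names, amounts, and in-memory store are illustrative, not Nationwide’s actual system):

```python
# Hypothetical sketch of an idempotency guard against double-processing
# a payments batch. In a real system, processed_batches would be a
# durable store checked and updated in the same transaction as the debits.

processed_batches = set()
accounts = {"alice": 100.0}

def apply_batch(batch_id, charges):
    """Apply a batch of debits exactly once, keyed by batch_id."""
    if batch_id in processed_batches:
        return False  # already applied: a replay is a no-op, not a re-debit
    for account, amount in charges:
        accounts[account] -= amount
    processed_batches.add(batch_id)
    return True

apply_batch("2012-07-14-debits", [("alice", 25.0)])
apply_batch("2012-07-14-debits", [("alice", 25.0)])  # replay is ignored
print(accounts["alice"])  # -> 75.0
```

Had the settlement pipeline enforced a check like this, running the same file twice would have charged no one a second time.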

Apparently, regardless of the impact or duration of an IT failure, the offending organization views it as merely an inconvenience to those experiencing it. Ontario’s Social Services Minister Helena Jaczek went so far as to liken the months of havoc resulting from the botched rollout of the province’s new C$240 million welfare and disability system to “when I have my BlackBerry telling me that there are issues and I need an update. . . it is often very inconvenient.”

Hmm, not receiving a desperately needed disability or subsistence check equates to a BlackBerry software update. Who knew?

Most apologetic public statements by corporate or government officials at least attempt to provide the pretense that executive management feels bad for the consequences of their organization’s IT oofta. However, there have been other times where perhaps not apologizing would have been a better strategy than the apology given. Below are a few of my favorite “best of the worst apologies” from the Risk Factor blog files which clearly indicate that the organization would have been a lot happier if it didn’t have to deal with those pesky customers or taxpayers.

We start off with two of the largest organizations in Ireland, both of which worked overtime to antagonize their customers in the wake of billing system errors. The first is Irish Rail, which discovered that a software upgrade to its vending machines had caused tickets to be issued to some 9,000 passengers over several days without actually deducting the fares from their debit cards. On the Friday a week after the discrepancy was discovered, Irish Rail decided to notify the affected customers by way of a press release on its website, which mentioned that it would begin deducting the fare amounts due it on the following Monday.

Irish Rail’s press release also stated, “We apologies [sic] for any inconvenience this fault causes customers,” which for many could include incurring hefty penalty charges for unauthorized overdrafts on their debit cards. When asked why it couldn’t wait a week so its customers could ensure that their accounts had the funds to cover the charges, Irish Rail responded it had every right to collect the money immediately and was going to do so. Unsurprisingly, the affected Irish Rail customers didn’t think much of the company’s apology.

Another “show me the money” demand disguised as an apology came from Eircom, the largest telecom provider in the Republic of Ireland. It, like Irish Rail, had a billing “system error,” one that failed to correctly direct-debit some 30,000 customer bank accounts even though those customers’ bills indicated otherwise. Eircom deemed the incident “regrettable” and further stated that “it’s embarrassing and we're very sorry that it's happened.” However, Eircom was neither too embarrassed nor too sorry to insist that, although it planned to reimburse customers the €18.45 failed-direct-debit fee, they would still have to pay all monies owed the telecom in their next billing cycle. Ireland’s telecom regulator was as unhappy with Eircom’s payment demand as its customers were, even more so because the utility had also failed to inform it of the billing error.

“Teething issues” also featured prominently in several apologies for IT foul-ups. Take, for instance, EnergyAustralia, which claimed that ongoing “teething problems” with its newly introduced accounting system were why 145,000 of its customers had not been billed for their electricity or gas usage on time, including 21,000 who had never received a bill from the utility. In the apology the company issued, it tried to downplay the extent of the foul-up by saying, “We are sorry to the small number of customers who haven't had the best experience and we're working round the clock to improve our service.” However, for some 12,500 EnergyAustralia customers, the clock spun around for more than a year before their billing issues were finally corrected.

Automotive manufacturer McLaren also apologized that its $229,000 MP4-12C supercar was suffering from “teething problems.” Ron Dennis, the executive chairman of McLaren Automotive and McLaren Group, sent out a letter to customers that stated in part, “As you will have already heard from my staff, we are experiencing some early software bugs resulting in unnecessarily sensitive warning lights, battery drainage in certain conditions and IRIS [infotainment] performance issues. My team and the McLaren retailers are working with the pace and intensity that the McLaren brand demands to fully resolve these bugs rapidly and effectively to ensure that any inconvenience to you is kept to a minimum.” Dennis, however, tried to make up for the inconvenience by promising customers that he was going to give them “a pre-release copy of the new McLaren: The Wins coffee-table book." I wonder how many software bugs it would take to get a personally signed copy of the book.

Additionally, a couple of organizations had to make so many apologies that customers just stopped listening to them, deciding to head for the exits instead. Take, for example, the UK’s RBS Group, which had a major system meltdown in the summer of 2012, caused by a routine software update gone bad, that kept millions of its bank customers from accessing their accounts for days, and some even for months. At the time, then-CEO Stephen Hester apologized, saying, “Our customers rely on us day in and day out to get things right. On this occasion we have let them down… Once again I am very sorry for the inconvenience.”

Various RBS Group spokespersons had to apologize several more times that summer as they promised everything would soon be made right, which quickly turned out not to be true.  At the time, RBS promised to invest hundreds of millions of pounds into upgrading its IT systems to keep major disruptions from happening again.

However, RBS has suffered significant glitches since, including on Cyber Monday in December 2013 and once more in June of this year. Although after each incident RBS management stated that it was “sorry for the inconvenience caused” and that the incident was “unacceptable,” tens of thousands of its discomfited customers have decided to do their banking elsewhere.

While RBS may have seen droves of customers desert it over its IT failures, that is nothing compared to Australia’s Vodafone, which has seen millions of customers leave because of the company’s persistent IT ooftas. The root of the problem can be traced to 2009, when Vodafone merged its network with rival Hutchison’s “3” network. Not surprisingly, the merger’s objective of creating a high-quality, unified network across Australia wasn’t as easy or seamless to achieve as envisioned. Customer complaints about poor Vodafone service grew throughout 2010, but really came to a head when a network software upgrade in late 2010 didn’t work as expected. Instead of speeding up network traffic, the upgrade slowed it down. That problem took weeks to fix, angering legions of Vodafone customers.

Then a different, concurrent software issue caused additional problems across the network. Vodafone, which by now was being referred to in the press as “Vodafail,” had to apologize multiple times to its angry customers, saying the company was “truly sorry” for the continued “dropped calls, delayed SMS and voicemails, slow data speeds, inconsistent coverage, and long waits when you called us.” For the more than 2 million fed-up customers who left Vodafone between 2010 and 2013, the company didn’t improve fast enough. Finally, after spending AU$3 billion to upgrade its networks and customer support, Vodafone Australia announced earlier this year that it had started adding customers again.

There was also an interesting apologetic non-apology in my home state of Virginia. In the summer of 2010, a server problem at the Virginia Information Technologies Agency (VITA) knocked out the IT systems used by 27 of Virginia’s 89 state agencies for several days; a number of agencies were affected for over a week. At the time, the state’s IT infrastructure was in the midst of a $2.3 billion upgrade that was a constant source of contention between Virginia and its contractor, Northrop Grumman.

When the server problem was finally fixed, Northrop Grumman vice president Linda Mills put out the expected pabulum and said the company “deeply regrets the disruption and inconvenience this has caused state agencies and Virginia citizens.” However, Grumman’s “regrets” were immediately undercut by a company spokesperson who, when asked by a Richmond Times-Dispatch newspaper reporter whether Mills’ statement was an apology, declined to comment. Whatever little goodwill that was left in state government for Northrop Grumman quickly vanished.

In May 2011, after an investigation into the outage, Northrop Grumman agreed to pay a $4.7 million fine for the outage, which is an apology of the best kind, in my opinion.

Our final apology was given through firmly clenched teeth. For years, Her Majesty's Revenue and Customs (HMRC) in the UK worked to upgrade (pdf) its troubled computer systems. HMRC promised that when complete, the new PAYE (pay-as-you-earn) system would significantly reduce both over- and underpayments of taxes. When the system was fully introduced in 2010, HMRC announced that some 4.3 million UK taxpayers would receive letters stating that they had paid on average £400 too much in taxes between 2008 and April 2010. Additionally, another 1.5 million would receive letters just before Christmas stating that they had paid on average £1,428 too little over the same period, and HMRC wanted its money now. Furthermore, HMRC indicated that another 6 million taxpayers were likely owed refunds for taxes paid prior to 2008, and another 1.7 million possibly owed more taxes; this group would be receiving letters soon, too.

The underlying reason for the millions of over- and underpayments was that taxpayers had been placed in incorrect tax brackets for years because of errors in the new HMRC PAYE computer system database.

Needless to say, the UK public was not a happy bunch at the news. Fueling their unhappiness was the attitude of HMRC Permanent Secretary Dave Hartnett, who stated in a radio interview that there was no need for him or his department to apologize for the errors or the demands for quick payment of owed taxes, because, at least to him, the PAYE system was working as designed: “I'm not sure I see a need to apologise... We didn’t get it wrong.”

Politicians of both parties were appalled by that statement, calling Hartnett out of touch and arrogant, especially in light of all the reported PAYE system foul-ups. Hartnett was forced by senior Conservative party leaders to retract his statement the following day, saying that he was “deeply sorry that people are facing an unexpected bill.”

A few days later, however, HMRC CEO Dame Lesley Strathie made it very clear that Hartnett’s apology was really a non-apology when she insisted that HMRC staff had made “no mistakes,” and that any and all errors were due to taxpayer mistakes. Dame Strathie also said the criticism of HMRC's performance was unfair since it couldn't pick and choose which customers to serve: it had to deal with everyone, whether her government department liked it or not. That bit of insight didn’t go over well, either.

HMRC’s PAYE system continues to cause grief for UK taxpayers, and HMRC is taking yet another crack at updating its computer systems. Unfortunately, UK taxpayers can’t choose their tax department the way customers of RBS or Vodafone can choose their bank or telecom company.

If you have some “worst IT foul-up apologies of all time” stories to add, please let me know.

Mon, 14 Dec 2015 18:30:00 GMT
The Making of "Lessons From a Decade of IT Failures" Why and how we're looking back at a decade's worth of IT debacles Why and how we're looking back at a decade's worth of IT debacles
Photo: Randi Klett

In the fall of 2005, IEEE Spectrum published a special report on software that explored the question of why software fails and possible ways to prevent such failures. Not long afterwards, we started this blog, which documents IT project and operational failures, ooftas, and other technical hitches from around the world.

The tenth anniversary of that report seems an appropriate time to step back, look through 1,750 blog posts, and give an overall impression of what has and hasn’t changed vis-à-vis software crashes and glitches. Obviously, the rise in both the frequency and cost of cybercrime is one major change, but we decided, at least for now, to concentrate our limited resources on past unintended IT system development and operations failures.

I deliberated at length with Spectrum’s former senior interactive editor Josh Romero, who’s responsible for the data visualizations in our interactive survey “Lessons From a Decade of IT Failures,” about which IT project and operational failures to include. This was a non-trivial task considering that many Risk Factor blog posts were roundups discussing multiple project failures and operational problems.

To make our work manageable, we decided to include those IT projects and systems that experienced significant trouble. For development projects, this meant being cancelled, suffering a major cost or schedule blowout, or delivering far less than promised. For operational IT systems, suffering a major disruption of some kind qualified it for consideration.

To help winnow down the number of possibilities further, we concentrated on those project failures or incidents where there was reliable documentation of what happened, why it happened, and the consequences, in terms of cost and/or people affected. If there is one characteristic that hasn’t changed in terms of IT project development or operational failures, it is the lack of reliable and detailed incident data publicly available. We will discuss this particular issue more thoroughly in a future blog post.

This highlights another aspect of the data we’re using in “Lessons From a Decade of IT Failures.” The data is skewed not only by our choices of what to include and what to leave out, but also by which incidents actually make it into the public domain. The majority of project failures shown are government projects because they tend to be visible thanks to government accountability mechanisms. Private companies tend to bury their IT failures, so except for the rare lawsuit, their operational failures rarely make it into the news unless they impact a significant number of customers or government regulators become involved. It should also be obvious that the data is skewed by our dependence upon English-language news reporting of project failures and operational meltdowns.

Even given the limitations of the data, the lessons we draw from them indicate that IT project failures and operational issues are occurring more regularly and with bigger consequences. This isn’t surprising, as IT in all its various forms now permeates every aspect of global society. It is easy to forget that Facebook launched in 2004, YouTube in 2005, and Apple’s iPhone in 2007, or that there have been three new versions of Microsoft Windows released since 2005. IT systems are definitely getting more complex and larger (in terms of data captured, stored, and manipulated), which means not only are they increasingly difficult and costly to develop, but they’re also harder to maintain. Further, when an operational IT system experiences an outage, many more people are affected now than ever before, sometimes “inconveniencing” (to borrow from the lexicon of PR types who have to try to explain these messes) millions or even tens of millions of people globally, a magnitude of technological carnage that prior to 2005 was a relatively rare event.

On top of that, during the past decade we have seen major IT modernization efforts in the airline, banking, financial and healthcare industries, and especially in government, generally aimed at replacing legacy IT systems that went live in the 1980s and 1990s, if not earlier. Many of these efforts have sought to replace multiple disparate IT systems with a single system, which has typically proven to be much more technically and managerially difficult, let alone expensive, than imagined.

There isn’t one right way to look at the interactive graphs and charts we’ve crafted from the documentation we have available. We suggest you just wander through them and then follow the links to more detailed explanations as the mood strikes you. You may be surprised by how many major IT failures you have never heard about, or have forgotten. Let us know if you think we should add other development or operational failures in future releases, or if you have better data relating to a project’s cost or impact. We will be releasing more charts and graphs over the next few weeks that will provide other perspectives on the IT failures and ooftas gleaned from the Risk Factor blog archives.

Fri, 16 Oct 2015 20:53:00 GMT
Stuxnet-Style Virus Failed to Infiltrate North Korea's Nuclear Program A cousin of the Stuxnet virus that crippled Iran's nuclear program failed to do the same to North Korea, Reuters reports A cousin of the Stuxnet virus that crippled Iran's nuclear program failed to do the same to North Korea, Reuters reports
Illustration: iStockphoto

The famous Stuxnet computer virus that sabotaged Iran’s nuclear program apparently had a cousin designed to do the same to North Korea. But this other U.S. cyber attack failed because agents could not physically access the isolated computers of North Korea’s nuclear program.

Several U.S. intelligence sources told  Reuters that the operation aimed at North Korea took place at the same time as the Stuxnet attack that crippled Iran’s nuclear program in 2009 and 2010. The Stuxnet virus worked by hijacking the control software of fast-spinning centrifuges belonging to Iran’s nuclear program. Once activated, Stuxnet caused physical destruction by forcing the centrifuges to spin out of control and tear themselves apart.  The U.S. National Security Agency led a similar, unsuccessful effort with a modified Stuxnet aimed at taking down North Korean centrifuges.

Both Iran and North Korea likely use similar centrifuges that can enrich uranium for either civilian purposes or to become weapons-grade nuclear material. That means North Korea probably also uses control software developed by Siemens AG running on some version of Microsoft Windows, experts told Reuters.

Stuxnet, a joint effort between the United States and Israel, worked in three phases to infiltrate and attack its targets. First, it continually infected Microsoft Windows machines and networks while replicating itself. Second, it looked for Siemens Step7 software that forms the foundation of industrial control systems for operating equipment such as centrifuges. Third, it hijacked the programmable logic controllers so that it could provide secret surveillance on the centrifuges or command the centrifuges to act in a self-destructive manner.

The computers running the control systems for centrifuges belonging to the nuclear programs of Iran and North Korea are isolated from the Internet to avoid providing easy access for cyber attacks. That’s why Stuxnet relied on first spreading stealthily across many Internet-connected machines, in hopes of a worker sticking a USB thumb drive into an infected machine. Stuxnet could then infect the USB drive and eventually make its way to computers isolated from the Internet. (See IEEE Spectrum’s feature “The Real Story of Stuxnet.”)

But North Korea presented an even greater challenge than Iran because of its extreme isolation. Relatively few North Koreans have access to the open Internet, and computer ownership requires registration with the police. In any case, the United States failed to get its Stuxnet-style virus onto the machines controlling North Korea’s centrifuges.

Many experts interviewed by  Reuters doubted that a Stuxnet-style virus would have made much impact on North Korea’s nuclear program even if it had succeeded. That’s in part because North Korea likely has at least one other hidden nuclear facility beyond the known Yongbyon nuclear complex. But it’s also because North Korea likely has access to plutonium, which does not depend on the complex uranium enrichment process.

Mon, 1 Jun 2015 21:00:00 GMT
Fuzzy Math Obscures Pentagon's Cybersecurity Spending The U.S. military's cybersecurity budgets make it tough to gauge the effectiveness of such spending The U.S. military's cybersecurity budgets make it tough to gauge the effectiveness of such spending
Illustration: Getty Images

U.S. military spending has increasingly focused on cybersecurity in recent years. But some fuzzy math, plus the fact that funding is spread out among many military services, makes it tough to figure out exactly how much money is going toward cybersecurity. That in turn makes it difficult to understand whether each dollar spent really improves the U.S. military’s cyber capabilities.

The U.S. military plans to invest an estimated $5.5 billion in cybersecurity for 2015. But such “cyber budget numbers are squishy” in part because authority over the military’s cyber mission is split among many different organizations and military services, according to a  Nextgov  analysis. Budget analysts also point to confusion in how certain military services define cybersecurity spending within their individual budgets.

The lack of central authority over the military’s overall cybersecurity spending and some unclear budgetary definitions of what counts as cybersecurity could complicate efforts to assess the effectiveness of military spending on cybersecurity, said Peter Singer, coauthor of “Cybersecurity and Cyberwar” and the upcoming novel “Ghost Fleet.” In an interview with IEEE Spectrum, he added:

“This is the next stage. You can no longer keep using the terms ‘cyber 9/11’ or ‘cyber wake-up call.’ That discourse has passed. If you’re still using that discourse, you’re well behind the times. Now is the time for serious conversation; that’s what comes with creating organizations. Now we get to questions of how do we know we’re spending effectively on cybersecurity.”

In 2010, the Pentagon created U.S. Cyber Command, also known as CYBERCOM, as a central organization to coordinate cyber warriors from the Army, Navy, Air Force, and other military branches. Cyber Command is located at Fort Meade, Maryland, next door to the National Security Agency. Both organizations are led by Admiral Michael Rogers, a Navy officer who wears two hats as commander of CYBERCOM and director of the NSA.

But Cyber Command does not have a single line item for its budget, because its funding comes from multiple sources. That proved a recipe for confusion when a Pentagon budget chart gave the initial impression that Cyber Command’s projected 2015 budget was growing by 92 percent,  according to  Nextgov . In fact the budget represented a 7 percent cut compared to the previous year.

To add to the confusion, Cyber Command’s projected budget of $509 million represents just one piece of the U.S. military’s estimated $5.5 billion investment in cybersecurity. That overall number seems to have risen over the past several years. But it’s tough to tell exactly what defense dollars are being spent on because different military organizations and services define cybersecurity differently. For instance, a report by the Federation of American Scientists pointed out that the U.S. military’s cybersecurity spending appeared to increase by $1 billion from 2013 to 2014, but added the cautionary note that “this increase may reflect changes in how DOD programmatic elements have defined ‘cybersecurity’ programs.”

In another example, the U.S. Air Force submitted a $4.6 billion cybersecurity funding request in 2011, roughly 10 times the U.S. Department of Defense’s own estimate of $440 million for Air Force cybersecurity spending. Defense officials explained that the Air Force estimate included “things” that are not typically considered cybersecurity.

Part of that difference in defining cybersecurity within budgets may simply come from internal reorganization of military personnel and resources, explained Singer, a strategist and senior fellow at the  New America Foundation , a nonprofit think tank in Washington, D.C. Other cases may involve military officials relabeling certain programs as “cyber” because that boosts their chances of getting funding. “You have some relabeling for political and budgetary purposes,” said Singer.

It’s natural for the U.S. military to “keep piling people and money” into Cyber Command and other cybersecurity initiatives as it builds up its capabilities, Singer said. But he added that the military and policymakers need to be able to understand whether military cybersecurity spending is delivering bang for the buck in terms of capability. Does raising the budget 1 percent lead to a 1 percent gain in capability? 10 percent? 100 percent? Or has it reached the point of diminishing returns, where it leads to only a 0.5 percent gain in capability?

There is also the question of what cyber capabilities the U.S. military should focus on funding for research and development (R&D) in cybersecurity. R&D accounts for approximately $1 billion of the military’s overall $5.5 billion projected budget for cybersecurity. Until now, U.S. military spending has heavily favored R&D efforts aimed at developing offensive cyber capabilities such as Stuxnet, the computer virus that targeted Iran’s nuclear program and was discovered in 2010.

But Singer prefers rebalancing the U.S. military’s R&D spending in favor of developing breakthroughs or game-changers in cyber defense. He pointed out that the United States currently has a huge strategic vulnerability as perhaps the country most vulnerable to cyber attacks; boosting U.S. cyber defenses could make a big difference. By comparison, the U.S. military already possesses some of the most advanced physical and cyber capabilities for attacking enemies around the world. Developing “Stuxnet 2.0” might only represent a relatively minor increase in offensive capability.

“If we’re looking for more gamechangers, we’d get more out of being less vulnerable than by being a bit better at reaching out and attacking enemies,” Singer said.

Wed, 8 Apr 2015 12:21:00 GMT Is the Lenovo/Superfish Debacle a Call to Arms for Hacktivists? Proposed exemptions to the DMCA could free white hats to make networked devices more secure Proposed exemptions to the DMCA could free white hats to make networked devices more secure
Image: Kutay Tanir/Getty Images

As Lenovo has come under fire for pre-installing on their computers the intrusive Superfish adware — and as lawsuits are now being filed against the laptop-maker for compromising its users’ security — one solution to the problem may have been given short shrift. Maybe it’s time, in other words, to release the hackers.

To be clear, nothing here should be read as an inducement to any sort of malicious hacking or other nefarious cyber-activities. The call to arms is instead to hacking in the old Homebrew Computer Club, touch-of-code-and-dab-of-solder sense. After all, when pop-up ads became a scourge of the late 1990s Internet, coders behind the smaller Opera and Mozilla browsers rolled out their pop-up blockers to restore a touch of sanity. Major commercial web browsers like Internet Explorer and Safari only rushed in after the nimbler first responders proved the consumer demand.

Over the nearly half-century of the modern amateur computing movement, makers, modders and homemade tech enthusiasts have never come up short on creative solutions to big marketplace challenges. What’s needed in response to the proliferation of Lenovo/Superfish, Samsung Smart TV, and many other security debacles in recent months is more openness and encouragement to let hackers (in the old-school sense of hackers as above) be hackers.

“It comes down to device autonomy, whether users have control over the software and hardware they run,” says Parker Higgins, director of copyright activism at the Electronic Frontier Foundation. “I worry that people may lose the understanding that they deserve that kind of autonomy and that level of privacy and that entitlement to be left alone when they want to.”

In fact, just this month the EFF completed its latest round of petitions to the U.S. Copyright Office for exemptions to the Digital Millennium Copyright Act that would allow for car repairs involving a car’s onboard computers, fair-use video remixes, jailbreaking of phones and tablets, and modifying older video games that require authentication from servers that no longer exist.

“There’s a rulemaking process that happens every three years,” Higgins says. “Every three years you have to submit your exemptions de novo. It doesn’t carry over. We’ve gotten exemptions for jailbreaking phones in the past, and we’ve had to apply it completely from scratch this year.”

So as dry as the DMCA’s exemption-making process may be, he says, it’s still necessary to carve out spaces in the marketplace where consumers can continue to develop new and productive uses for technology whose original manufacturers might otherwise try to shut it down via claims of copyright infringement.

Higgins adds that with enough groundswell of frustration at the proliferation of adware, bloatware and consumer snooping in tech today, legislation like the Unlocking Technology Act of 2013 (which would allow for more hacking of the kind described here — but also died in committee) might one day make it onto the books.

And the reason this matters to aggrieved Lenovo or Samsung SmartTV owners (among numerous known and suspected privacy violations in consumer electronics) is that owners of these devices should be able to build and distribute their own workarounds to spyware or other unrequested and unadvertised technologies they find onerous. And maybe then some smart appliance equivalent of the popup ad blocker will bubble up to restore a touch of sanity again. 

Thu, 26 Feb 2015 16:07:00 GMT
Should Data Sharing Be More Like Gambling? A Microsoft Research team explores transforming permissions into accepting some "privacy risk" A Microsoft Research team explores transforming permissions into accepting some "privacy risk"
Photo-illustration: Randi Klett; Images: Getty Images

When you install a new app on your phone, you might find yourself facing a laundry list of things the software says it needs to access: your photos folder, for example, along with your camera, address book, phone log, and GPS location.

In many cases, it’s an all-or-nothing deal.

Eric Horvitz of Microsoft Research says companies could do better. Instead of asking users to provide wholesale access to their data, they could instead ask users to accept a certain level of risk that any given piece of data might be taken and used to, say, improve a product or better target ads.

“Certainly user data is the hot commodity of our time,” Horvitz said earlier this week at the American Association for the Advancement of Science (AAAS) meeting in San Jose. But there is no reason, he says, that services “should be sucking up data willy-nilly.”

Instead, he says, companies could borrow a page from the medical realm and look for a minimally invasive option. Horvitz and his colleagues call their approach “stochastic privacy.” Instead of choosing to share or not to share certain information, a user would instead sign on to accept a certain amount of privacy risk: a 1 in 30,000 chance, for example, that their GPS data might be fed into real-time traffic analysis on any given day. Or a 1 in 100,000 chance that any given Internet search query might be logged and used.
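The mechanics of that idea reduce to a per-event coin flip. Here is a minimal sketch (my own illustration, not the protocol from Horvitz's paper): each day the service samples against the risk level the user accepted, and collects the data only when the draw succeeds.

```python
import random

def collect_today(rng, accepted_risk):
    """Collect this user's data today only with probability `accepted_risk`."""
    return rng.random() < accepted_risk

def days_collected(days, accepted_risk, rng):
    """Count how many days out of `days` this user's data would be taken."""
    return sum(collect_today(rng, accepted_risk) for _ in range(days))

# A user who accepted a 1-in-30,000 daily risk will almost never be sampled
# over a single year...
rng = random.Random()
year = days_collected(365, 1 / 30_000, rng)
```

Yet across a large user base the aggregate is still useful: with 10 million users at that risk level, the service would expect roughly 10,000,000 × 365 / 30,000 ≈ 122,000 GPS-day samples per year, which is the trade stochastic privacy proposes.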

Horvitz and colleagues outlined the approach in a paper presented at a conference of the Association for the Advancement of Artificial Intelligence (AAAI) last year.

If companies were to implement stochastic privacy, they’d likely need to engage in some cost-benefit calculations. What are the benefits of knowing certain information? And how willing would a user be to share that information? 

This sort of exercise can turn up surprising results. In an earlier study, Horvitz and Andreas Krause (then at Caltech, but now at ETH Zurich) surveyed Internet search users to gauge their sensitivity to sharing different kinds of information. More sensitive than marital status, occupation, or whether you have children? Whether the search was conducted during work hours.

Of course, even if a company works out what seem to be reasonable risks for sharing different kinds of data, what it might look like on the user end is still an open question. How do you communicate the difference between a 1/30,000 and a 1/100,000 probability? 

Horvitz said that would be a good problem to have. “Would you want to live in a world where the challenge is to explain these things better,” he asked, “or where companies scarf up everything?”

Fri, 20 Feb 2015 19:00:00 GMT
Rooting Out Malware With a Side-Channel Chip Defense System A new software-agnostic malware detection tool detects cyberattacks by their power consumption A new software-agnostic malware detection tool detects cyberattacks by their power consumption
Photo: John Lamb/Getty Images

The world of malware has been turned on its head this week, as a company in Virginia has introduced a new cybersecurity technology that at first glance looks more like a classic cyberattack. 

The idea hatched by PFP Cybersecurity of Vienna, Va., is taken from the playbook of a famous cryptography-breaking scheme: the side-channel attack. All malware, no matter the details of its code, authorship, or execution, must consume power. And, as PFP has found, the signature of malware’s power usage looks very different from the baseline power draw of a chip’s standard operations.

So this week, PFP is announcing a two-pronged technology (called P2Scan and eMonitor) that physically sits outside the CPU and sniffs the chip’s electromagnetic leakage for telltale signatures of power consumption patterns indicating abnormal behavior.

The result, they say, is a practically undetectable, all-purpose malware discovery protocol, especially for low-level systems that follow a predictable and standard routine. (Computers with users regularly attached to them, like laptops and smartphones, often have no baseline routine from which abnormal behavior can be inferred. So, PFP officials say, their technology is at the moment better suited to things like routers, networks, power grids, critical infrastructure, and other more automated systems.)

“On average, malware exists on a system for 229 days before anyone ever notices anything is there,” Thurston Brooks, PFP’s vice president of engineering and product marketing, told IEEE Spectrum. “What’s really cool about our system is we tell you within milliseconds that something has happened.”

PFP—an acronym for “power fingerprinting”—requires that its users establish a firm baseline of normal operations for the chips the company will be monitoring. So they begin with P2Scan, a credit-card-size physical sensor that monitors a given chip, board, device, embedded system, or network router for its electromagnetic fingerprints when running normally.

Unlike most anti-malware strategies in the marketplace today, PFP takes a strikingly software-agnostic tack to besting malware, hardware Trojans, and other cyberattacks.

“We’re not trying to actually understand what’s going on inside the machine, like the hackers are,” says Brooks. “We’re trying to define what normal behavior looks like. Then, knowing [that], we can detect abnormal behavior.”

The view of malware as seen from outside the chip, in other words, can be a refreshing one. Hackers can’t detect this type of surveillance, because the scanning tools never actually interact with the chip’s operations. And hackers can be as clever as the most sophisticated programmers in the world. Yet, their code will still very likely be detected because, simply by virtue of performing different tasks than the chip normally performs, it will have a different power profile.
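In spirit, the detection reduces to learning a statistical baseline from known-good power traces and flagging any run that strays too many standard deviations from it. The sketch below is my own simplified illustration with synthetic numbers, not PFP's actual signal-processing pipeline:

```python
import statistics

def fit_baseline(traces):
    """Learn per-sample mean and spread from power traces of known-good runs."""
    means = [statistics.mean(col) for col in zip(*traces)]
    spreads = [statistics.pstdev(col) for col in zip(*traces)]
    return means, spreads

def is_anomalous(trace, means, spreads, threshold=4.0):
    """Flag a trace if any sample deviates more than `threshold` sigmas."""
    for sample, mean, spread in zip(trace, means, spreads):
        if spread > 0 and abs(sample - mean) / spread > threshold:
            return True
    return False

# Synthetic baseline: three power traces from a device running its normal routine
good_runs = [[1.0, 2.0, 1.0], [1.1, 2.1, 0.9], [0.9, 1.9, 1.1]]
means, spreads = fit_baseline(good_runs)

# A run whose middle sample draws far more power than normal gets flagged
print(is_anomalous([1.0, 9.0, 1.0], means, spreads))   # anomalous
print(is_anomalous([1.05, 2.05, 0.95], means, spreads))  # within baseline
```

The appeal of this approach, as the article notes, is that the monitored code never sees the monitor: the comparison happens entirely outside the chip.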

“I am a signal processing guy,” says PFP president Jeff Reed, who is also a professor in the ECE department at Virginia Tech. “Our approach is a very different approach than a person who’s normally schooled in security…We’re trying to understand a disturbance in the signal due to the inclusion of malware.”

Reed and Brooks also point out that counterfeit chips are a vast problem in IT, as Spectrum has documented in recent years. By the FBI’s estimates, for instance, chip counterfeiting costs U.S. businesses some $200 to $250 billion annually.

The problem is just as daunting for the U.S. military, as Spectrum has also chronicled. For example, an investigation by the U.S. Senate Committee on Armed Services uncovered counterfeit components in the supply chains for the CH-46 Sea Knight helicopter, C-17 military transport aircraft, P-8A Poseidon sub hunter and F-16 fighter jet.

The problems were expensive but ultimately rooted out. Yet other dangers remain—especially in such high-security realms, where substandard components could endanger troops and missions, or compromised chips could be used to carry out malicious plots.

But any compromised chip—whether hardware-Trojan-laden or part of a single lot of subpar chips coming from the foundry—can be discovered using their system, PFP says.

The trick, says Brooks, is to grab a sample chip from a lot and perform a (typically expensive) decapping, x-ray analysis, and reverse-engineering of the chip’s code. Then, once it’s been confirmed that the chip works as designed and is within spec, it is run through a standard operation, providing an electromagnetic baseline for P2Scan and eMonitor.

Every other chip in the lot can then be rapidly and cheaply tested against the “gold standard” chip by running the same standard operation and comparing the resulting electromagnetic signature to that of the first chip.

“You determine whether you have a good chip or not,” Brooks says. “You only spend the money to do that on one chip…So you amortize the cost of the forensics across all the chips. So if you have a few million chips, you’re talking about a few pennies [per chip] to do the whole thing—and know that you have a million chips that are all good.”
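The arithmetic behind that claim is straightforward. As a hedged sketch with made-up numbers (Brooks didn't give a specific forensics price), a one-time teardown cost amortized over a large lot quickly drops to pennies per chip:

```python
def per_chip_cost(forensics_cost_usd, lot_size):
    """Spread the one-time gold-standard chip forensics across the whole lot."""
    return forensics_cost_usd / lot_size

# Hypothetical: $50,000 of decapping, x-ray analysis, and reverse engineering
# for one reference chip, amortized over a lot of 1 million parts
cost = per_chip_cost(50_000, 1_000_000)
print(f"${cost:.2f} per chip")
```

The same $50,000 spread over a lot of only 500 chips would cost $100 per part, which is why the scheme pays off mainly at production volumes.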

Tue, 27 Jan 2015 15:00:00 GMT
Cyber Espionage Malware Taps Smartphones, Sends Chills Sophisticated malicious code hasn't gotten the notice that the Sony hack has, but that's the point Sophisticated malicious code hasn't gotten the notice that the Sony hack has, but that's the point
Photo-illustration: John Lund/Getty Images

A mysterious malware campaign resembling an attack on Russian officials from earlier this year could be the most sophisticated cyberattack yet discovered.

This fall, around the time hackers were draining crucial digital lifeblood from Sony Pictures, one of the most sophisticated malware attacks in history (completely separate from the Sony hack) was coming to a close. Presumably retreating after being exposed by security researchers, the cyber espionage campaign targeted smartphones of business, government, and embassy officials around the world. Its structure parallels an earlier attack aimed primarily at Russian executives, diplomats, and government officials, but the extent of the damage inflicted by the recent campaign, as well as its prospects of returning under a new guise, is still unknown.

Waylon Grange, senior malware researcher at Blue Coat Labs in Sunnyvale, Calif., says he’s taken apart both the malware that infected Sony Pictures’ internal networks and the malicious code behind the Russian hack. And in terms of the relative complexity and sophistication of the designs—though of course not by the level of damage—there’s no contest.

“In terms of sophistication, the Sony malware is really low on the pecking order,” he says. “The Sony malware was more destructive. This one is very selective. When it runs, this one does very well tracking its steps. If anything is wrong or the system is not configured just right, this malware detects it, quietly backs off, doesn’t make any errors, cleans itself up and is gone.”

As a result, Grange says, it's been a difficult cyber infection to study and trace. And its code base and Internet routing are so full of false leads and red herrings that it has, to date, proved impossible to source back to any group, nation, or band of hackers. Whoever it is, Grange says, has assembled a next-generation attack that should make security researchers sit up and pay attention.

And, especially in light of how much horrible mischief the far simpler Sony attack has wrought, businesses and governments should also be educating their workforces on cybersecurity and installing more and better locks on their cyber doors and windows.

In a blog post earlier this month, Grange’s colleagues at Blue Coat unveiled the details of the attack, whose infection route begins with a spearphishing e-mail to targeted business, government, and diplomatic users in at least 37 countries. The e-mail poses as an update or special offer for users to download the latest version of WhatsApp. Unfortunate users who click the link download infected Android, BlackBerry, and iOS versions of the popular messaging app.

An infected smartphone then records calls made by the user and awaits instructions telling it the Internet address to which it should upload the surreptitiously recorded phone calls.

Such an attack would already be remarkable and impressive, Grange says. But it’s only the first of at least two more layers of command and control structure for the malware campaign.

In the second step, apps check a redundant list of hacked public blogs whose posts contain legitimate text at the top (presumably in order to avoid being de-listed by search engines or otherwise flagged) followed by long strings of encrypted code. The malware then decrypts the code, providing itself a list containing a second set of command and control websites.

These sites, the researchers found, are often compromised Web pages run on outdated content management software in Poland, Russia, and Germany. It’s at these second-tier websites that the malware then decodes its rapidly changing list of drop-zones for offloading the phone call recordings.

Earlier this year, Blue Coat also detected and studied a similar multilayered Windows-based attack that was carried out primarily in Russia. It began with an infected Microsoft Word document that then infected a PC, causing it to follow an even more carefully guarded and circuitous route for receiving instructions. Subsequently infected PCs would first search a series of hacked cloud service accounts, which in turn would point to hacked embedded devices around the world (including wireless routers, satellite TV boxes and digital TV recording devices). Those compromised devices would in turn point back to virtual private networks that contained the instructions for the malware.

Disassembling the infected code, Grange says, led security researchers to multiple conflicting conclusions about its authors. One piece of the infected Android app contained the Hindi character for "error." Several of the infected blog profiles have set their location to Iran. Many infected home routers are in South Korea. Text strings in the Blackberry malware are in Arabic. Another contained the comment “God_Save_The_Queen.”

It was the many layers of red herrings and command and control, Grange says, that inspired Blue Coat to call the original (Russian) malware “Inception,” in homage to the 2010 thriller that contains onion-like layers of story to be peeled away. Blue Coat hasn’t explicitly named the smartphone cyberespionage attack, though its researchers strongly suspect it was carried out either by the same hackers or by someone strongly inspired by the “Inception” malware.

“These people are going to great lengths to protect who they are,” he says. “We’ve seen [attackers] use the cloud. But we’ve never seen routers, and we’ve never seen anyone use cloud, router, and private services to hide their identity.”

Grange says the smokescreens have worked so far; he has yet to establish any solid leads on who could have conducted these sophisticated attacks. Yet the lessons learned from the attacks, he says, are not nearly as mysterious. Among them:

• Don’t click links in your e-mail client—especially in any e-mail from an unknown sender, or in strange e-mails from known senders.

• Don’t root or jailbreak your phone. Because the iPhone, for instance, doesn’t allow app installs from outside the App Store, Inception wouldn’t work on a non-jailbroken iPhone.

• Only update mobile apps through your trusted app store (e.g. iTunes or Google Play).

• Always change the default passwords (“admin,” “password,” etc.) for your household devices.

“We probably haven’t seen the end of these guys,” Grange says. “I’m sure they’ll come back. It’s just a matter of how long have we set them back—and what will be their new toys when they come back.”

Mon, 29 Dec 2014 14:00:00 GMT
New Jersey Finally Cancels $118 Million Social Welfare Computer System State’s auditor questions why incompetently managed project wasn’t canned long ago
Photo: Corbis

IT Hiccups of the Week

We end this year’s IT Hiccups of the Week series much like how we began it, with yet another expensive, incompetently managed, and ultimately out-of-control U.S. state government IT project spiraling into abject failure. This one involves the New Jersey Department of Human Services’ six-year, $118.3 million Consolidated Assistance Support System (CASS). It was supposed to modernize the management of the state’s social welfare programs, but it was CASS itself that was in dire need of assistance.

The Department of Human Services decided to announce that it had pulled the project’s plug over the Thanksgiving holiday—no doubt to try to reduce the bad publicity involved while people were enjoying their much-easier-to-swallow, non-IT turkey. A DHS spokesperson would not explain why the CASS contract was terminated; her only related comment made to a reporter was that “an analysis is in progress to determine next steps.”

Hewlett-Packard, which was the CASS project prime contractor (the contract was originally awarded to EDS in 2007; HP acquired the firm in 2008), was equally mum on the subject. However, an HP spokesperson did seem to hint strongly that any and all project problems were the fault of New Jersey’s DHS, when he stated that, “Out of respect, HP does not comment on customer relationships.”

Last week, an audit report (pdf) by Stephen Eells, New Jersey’s state auditor, showed why both DHS and HP did not want to discuss why a system touted as “New Jersey's comprehensive, cutting-edge social service information system” had turned into a debacle. According to the report, both DHS and HP botched the project nearly from its outset in August 2009. The audit report, for example, found HP’s overall technical performance “poor,” due in part to the company’s “absentee management.” HP has changed project managers on the eight-phase CASS effort three times since 2010; the state rejected one of those managers, Eells stated, for lacking the qualifications “to manage such a large project.”

The audit report also notes that while the CASS contract cost was $118 million (it was originally $83 million), the state’s own project-related costs added up to an additional $109 million. According to one news account, Eells, in testimony last week before New Jersey’s Human Services Committee, made it clear that the state botched its CASS oversight role as well. DHS senior management, he indicated, consistently ignored red flags that the project was in deep trouble, and apparently failed to bring “concerns over the contract to the Department of Treasury, which is responsible for ensuring that problems with contracts are resolved.”

Eells also ruefully noted that the state’s contract with HP didn’t “allow the state to recoup damages from the failure to complete the contracted work.” A minor oversight, one might say.

The Human Services Committee wasn’t able to find out why DHS ignored the warnings that the CASS project was in trouble, or why it failed to report the contract troubles to the state department that really needed to know about them. This void in the record is because DHS Commissioner Jennifer Velez “declined to speak at the hearing, citing the ongoing talks with Hewlett-Packard,” according to news reports.

I tend to doubt that the Commissioner will ever explain why her department’s IT managers chose to ignore the facts screaming out to them that the CASS project was on the fast track to failure, or why her department’s contract managers failed to protect state taxpayers from the cost of failure, as is routinely done. It’s not like the Commissioner is personally accountable for what happens on her watch or anything.

In Other News…

Ontario and IBM Locked in Court Battle Over Bungled Transportation System Project

Fixing Ontario’s Social Services’ Buggy Computer System Will Be Costly

Profits for UK’s Brewin Dolphin Drop on IT Debacle

LA DWP Says Billing Mess Over After Inflicting Customers With Year of Pain

Hertz Car Rental Blames Computer Issues for Failing to Pay $435,777 in Taxes

LAUSD Gets $12 Million More to Fix Wayward School Information Management System

6,000 Health Exchange Insurance Plans in Washington State Canceled by Mistake

Robotic Cameras Go Rogue, Irritate BBC News Presenters

Software Bungles in Oregon Child Welfare Data System Cost State $23 Million

Amazon UK Erroneously Selling Hundreds of Products for a Penny

Second Major Air Traffic Computer Problem in Year Cancels, Delays Scores of UK Flights

MPs Demand Investigation into UK Air Traffic System Meltdown

UK Air Traffic Chief Blames Unprecedented Software Issue for Shutdown

Mon, 15 Dec 2014 14:00:00 GMT
How the Internet-Addicted World Can Survive on Poisoned Fruit The world faces tough tradeoffs in reaping the benefits versus risks of the Internet
Illustrations: Getty Images

There is no “magic bullet” for cybersecurity to ensure that hackers never steal millions of credit card numbers or cripple part of a country’s power grid. The conveniences of living in an interconnected world come with inherent risks. But cybersecurity experts do have ideas for how the world can “survive on a diet of poisoned fruit” and live with its dependence upon computer systems.

Cybersecurity risks have grown with both stunning scale and speed as the global economy has become increasingly dependent upon the Internet and computer networks, according to Richard Danzig, vice chair of the RAND Corporation and former U.S. Secretary of the Navy. He proposed that the United States must prepare to make hard choices and tradeoffs—perhaps giving up some conveniences—in order to tackle such risks. Such ideas became the focus of a cybersecurity talk and panel discussion hosted by New York University’s Polytechnic School of Engineering on Dec. 10.

“You are trading off the virtue in order to buy security,” Danzig said. “To the degree that you indulge in virtue, you breed insecurity. The fruit is poisonous, but also nutritious.”

The Internet and its related computer networks represent incredibly useful technological tools that provide open communication and speedy transfer of digital information across the world. Cybersecurity risks arise because such useful tools can easily be misused. That means countries and corporations face some tough choices. Danzig cited an online commentator’s analogy: Would we ban automobiles from driving around banks just because they’re sometimes used in bank heists? Such added security comes with costs.

For instance, the U.S. National Security Agency adopted a new rule that requires two people’s passwords to download certain files—a belated countermeasure that only came after former NSA contractor Edward Snowden downloaded as many as 1.7 million documents exposing the U.S. intelligence agency’s worldwide surveillance programs. Such a measure provides some added security, but also sacrifices the ability of individuals to download documents by themselves for normal work purposes.

There is also the potentially huge problem of tracking and securing the electronics hardware found in everything from Internet servers to smartphones. As an example, Danzig asked Intel researchers to calculate how many transistors are manufactured worldwide every second. Intel came back with a “disorienting” estimate: 8 trillion.

“I don’t believe policymakers, when talking about Moore’s Law and hardware, have any grasp of the magnitude of the challenge in tracking items that go into these systems,” Danzig said.

So how can U.S. lawmakers and CEOs deal with such daunting challenges? Danzig laid out some recommendations found in his Center for a New American Security report titled “Surviving on a Diet of Poisoned Fruit: Reducing the National Security Risks of America’s Cyber Dependencies”—recommendations that generated both encouragement and debate among the cybersecurity experts gathered at the NYU talk.

One possible defense involves going the “Battlestar Galactica” route of isolating some computers and networks, or reducing dependence upon digital systems in favor of returning to analog. Danzig suggested merging digital systems with analog and human systems so that a cyber attack by itself can’t compromise the security of a nuclear launch facility or power plant—it might still require a human somewhere to throw a physical switch or perform another action.

Other cybersecurity ideas include making “lean systems” that don’t have extra exploitable features such as getting rid of printer features that digitally track all the documents you’ve printed. Or creating more air-gapped system “enclaves” that don’t have any Internet or local network connections to safeguard certain information.

Danzig also recommended the U.S. government take the steps of recognizing the private sector is “too important to fail” in terms of cybersecurity. Rather than have one cyber czar official applying a “one size fits all” solution, he suggested individual government departments could work with their industry counterparts. For instance, the Department of Energy could work with utilities on cybersecurity risks relevant to the power grid.

Both U.S. government agencies and private companies could also consider sharing anonymous data on cybersecurity threats in a collaborative database—not unlike what U.S. airlines do with a shared database on near misses and other risky incidents that didn’t lead to accidents.

Danzig also suggested that the U.S. could also approach China and Russia to discuss agreements on preventing common cyber attacks from morphing into future “cyber-physical attacks” that directly damage power plants or take down airplanes, the Stuxnet worm attack on Iran’s nuclear centrifuges being the most famous such case. Danzig urged the U.S. to encourage clear agreements on “red lines” for cyber attack behavior that is in everyone’s best interests, such as not launching cyber attacks that penetrate the nuclear missile commands of various countries.

The threat of cyber-physical attacks will only grow as more “non-computer” systems such as cars, medical devices, industrial machines and household appliances become connected to the Internet. That future “Internet of Things” could leave individuals, homes and entire economies vulnerable to hackers whose cyber attacks play havoc with physical objects.

Many industrial systems don’t even have authentication or authorization codes, so that hackers could metaphorically walk in the front door of a power plant’s control system. Other vulnerabilities may exist for everything from heart implants to the car systems that may someday become part of the Internet of Things.

“For those things not [yet] Internet-connected, we have enormous vulnerabilities that are not backdoor,” said Andy Ozment, assistant secretary of the Office of Cybersecurity and Communications in the U.S. Department of Homeland Security, during the panel discussion following Danzig’s talk. “They are built-in system vulnerabilities that we are going to really struggle to fix.”

One way to tackle that daunting problem is to narrow the focus of the challenges involved. A Windows operating system may be open to all sorts of malware, but the scope of cyber-physical attacks on a power plant are necessarily limited by what they’re trying to physically accomplish, said Ralph Langner, director and founder of Langner Communications.

“Not all hope is lost, because we don’t have to analyze millions of samples of malware,” Langner explained. “We just need to analyze promising cyber-physical attack vectors, which is much easier than you might think.”

It’s also important for cybersecurity experts to not lose sight of the human element in the threat, Danzig said. He and other experts recommended more behavioral studies that look at the incentives driving the people behind certain cyber attacks—whether those people are Chinese military hackers or Eastern European criminal gangs. By understanding the motives and incentives, cybersecurity experts could come up with defenses better tailored for deterring such attacks.

Cybersecurity experts can often fall into the trap of believing there is a technical fix for everything, said Stefan Savage, professor of computer science and engineering at the University of California, San Diego. He pointed out that computer systems just represent the medium through which human conflict takes place. Therefore experts might do better to consider the who and why of cyber attacks.

“You could spend all day looking for every threat that could exist or every vulnerability that could happen,” Savage said. “To limit that to what’s likely to happen, you have to understand your adversary.”

Thu, 11 Dec 2014 22:23:00 GMT
How Not to Be Sony Pictures Lessons learned from the recent Sony Pictures hack
Photo: Getty Images

The scope of the recent hack of Sony Pictures — in which unidentified infiltrators breached the Hollywood studio’s firewall, absconded with many terabytes of sensitive information and now regularly leak batches of damaging documents to the media — is only beginning to be grasped. It will take years and perhaps some expensive lawsuits too before anyone knows for certain how vast a problem Sony’s digital Valdez may be. 

But the take-away for the rest of the world beyond Sony and Hollywood is plain: Being cavalier about cybersecurity, as Sony’s attitude in recent years has been characterized, is like playing a game of corporate Russian roulette.

According to a new study of the Sony hack, one lesson learned for the rest of the world is as big as the breach itself. Namely, threat-detection is just the first step.

Snuffing out malware, trojans, and phishing attacks is of course an important front-line battle, but it is only one front in a multi-front war. Any organization that thinks cybersecurity is as simple as installing and regularly updating antivirus software risks a nightmare scenario like the one Sony Pictures now stares down.

Fengmin Gong, chief strategy officer and co-founder of Santa Clara, Calif.-based security firm Cyphort, says the best security strategies today also include continuously monitoring networks for suspicious movements of an organization’s most carefully guarded data. The best security, in a sense, presumes that security will sometimes fail.

“The new approach today that people have shifts away from prevention — which everyone knows is not achievable — to a focus on attack sequence and consequence,” he says.

So a company that follows his approach, he says, might build a security strategy in which some leakiness is expected. After all, in an age of pervasive connectivity—from laptops and servers to smartphones and tablets to wearables and smart appliances—it’s increasingly pie-in-the-sky to suppose that a group of determined hackers couldn’t find holes somewhere in a target company’s networks.

Instead, Gong says, the smart company expects occasional hacks to get through but also knows what digital assets it values most. And those are the nodes, computers and networks it monitors most closely. The reported terabytes worth of Sony Pictures scripts, films, spreadsheets, marketing and sales data and communications that hackers downloaded — clearly a centerpiece of the company’s revenues — would never be shipped out through company networks without network monitors also discovering such a massive breach, he says.

And it’s not just Hollywood studios that need to shift their thinking, he says. (Though Gong says he has also been consulting lately with another prominent Hollywood studio, which he says is applying similar lessons to develop smarter cybersecurity practices.)

For instance, Target and Home Depot suffered recent security breaches in their point-of-sale (POS) networks, leading to many customers’ credit card numbers and other sensitive information being released.

“Today we have to make assumptions that something could fail,” he says. “Continuous monitoring allows you to watch what is the data movement into and out of your POS system. That’s what we mean by focusing on consequences. [Y]ou want your organization to be the first one to realize something just happened or is happening. Then you can contain the damage and anything else. Right now the problem is people getting told by someone else many months later that something happened. Then the damage is already done.”
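As a toy illustration of the consequence-focused monitoring Gong describes, here is a minimal Python sketch that baselines outbound traffic from a sensitive segment (say, a POS network) and flags transfers that dwarf the recent norm. The window size and spike threshold are arbitrary placeholders, not anything Cyphort's products actually use.

```python
from collections import deque

class EgressMonitor:
    """Flag data egress that spikes far above the recent baseline."""

    def __init__(self, window: int = 24, spike_factor: float = 10.0):
        # Rolling window of recent egress totals (e.g., one entry per hour, in MB).
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor

    def record(self, megabytes_out: float) -> bool:
        """Record one interval's egress; return True if it looks anomalous."""
        baseline = (sum(self.history) / len(self.history)) if self.history else None
        self.history.append(megabytes_out)
        return baseline is not None and megabytes_out > baseline * self.spike_factor

monitor = EgressMonitor()
for hour in range(24):
    assert not monitor.record(50.0)    # normal traffic builds the baseline
assert monitor.record(500_000.0)       # terabytes leaving at once trips the alarm
```

The point of the design, per Gong, is detection speed: the organization itself notices the abnormal outflow as it happens, rather than learning of the breach from a third party months later.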

In Sony Pictures’ case, Gong says, the structure of the malware itself also points to a larger systemic security failure at the company. Some of the malware files, as Cyphort’s report details, actually contain Sony Pictures employees’ usernames and passwords hard-coded into the malware scripts.

That means there was at least one earlier round of security breaches at Sony that hasn’t yet been fully uncovered—the malware’s authors must have somehow obtained those usernames and passwords before they could write and deploy the malware used in the current breach.
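To illustrate the forensic point, a strings-style scan is often enough to surface hard-coded credentials inside a binary. The Python sketch below is purely illustrative; the regex patterns and sample bytes are invented and don't reflect the actual Sony malware or Cyphort's analysis tooling.

```python
import re

# Credential-looking assignments such as "username=..." or "password: ...".
# The pattern is a deliberately simple illustration, not a production rule.
CRED_PATTERN = re.compile(rb"(user(name)?|pass(word)?|pwd)\s*[:=]\s*(\S+)", re.IGNORECASE)

def find_hardcoded_credentials(binary: bytes):
    hits = []
    # Pull printable ASCII runs (what the Unix `strings` utility does),
    # then match credential-shaped assignments inside each run.
    for run in re.findall(rb"[\x20-\x7e]{6,}", binary):
        for match in CRED_PATTERN.finditer(run):
            hits.append(match.group(0).decode())
    return hits

# Invented sample "binary" with two embedded credentials.
sample = b"\x00\x01MZ\x90junk\x00username=jdoe\x00more\x00password=Hunter2!\x00"
assert find_hardcoded_credentials(sample) == ["username=jdoe", "password=Hunter2!"]
```

Finding such strings in captured malware is what tells analysts the attackers had already harvested working credentials in an earlier, undetected intrusion.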

“When this [breach] happened, it happened over multiple points in time,” Gong says. “We see the hope that if people start adopting these new approaches to their security posture, we feel confident these things would have been discovered and stopped earlier than what is happening now.”

Thu, 11 Dec 2014 20:00:00 GMT
Amazon Plays Santa after IT Glitch, Singapore Airlines Plays Scrooge Student gets to keep unexpected packages, Singapore Airlines wants its airfare money
Photo: Getty Images; Bow: iStockphoto

IT Hiccups of the Week

This week’s edition of IT Hiccups focuses on the two different customer service reactions to IT errors, a nice one by Amazon UK and a not so nice one on the part of Singapore Airlines.  

According to the Daily Mail, a student at the University of Liverpool by the name of Robert Quinn  started to receive a plethora of packages from Amazon at his family’s home in Bromley, South London that he hadn’t ordered. The 51 packages included a baby buggy, a Galaxy Pro tablet, a 55-inch 3-D Samsung television set, a Sony PSP console, an electric wine cooler, a leaf blower, a bed, a bookcase and a chest of drawers, among other things. In total, the 51 items were worth some £3,600 (US $5,650).

The Daily Mail reported that Quinn called up Amazon and asked what was going on. According to Quinn, Amazon told him that people must be “gifting” the items to him. That surprised Quinn, since he didn’t know the people who were supposedly gifting him the items. Quinn told the Mail that he speculated that some sort of computer glitch was affecting Amazon’s purchase-return address labels, since the items all looked as though they were meant to be sent back to Amazon by their original purchasers.

Quinn told the Mail:

 I was worried that people were losing out on their stuff so I phone Amazon again and said I’m happy to accept these gifts if they are footing the cost, but I’m not happy if these people are going to lose out. But Amazon said ‘it’s on us.’

The Mail checked with Amazon, who confirmed Quinn’s story. While not confirming that a computer problem affecting its return labels was the cause for the errant packages, Amazon didn’t go out of its way to deny it.

Quinn, who is an engineering student, later told the Mail that packages were still arriving. Quinn indicated that he was going to give some of the items he has received to charity, and then sell the rest to fund “an ‘innovative’ new [electric] cannabis grinder” he was designing.

Whereas Amazon played Santa, Singapore Airlines decided instead to take on the role of Scrooge last week. According to the Sydney Morning Herald, when Singapore Airlines uploaded its business class fares for trips from Australia to Europe into a global ticket distribution system, it instead mistakenly uploaded its economy fare prices. As a result, instead of paying US $5,000 for a business class ticket, travel agents sold over 900 tickets for $2,900 before Singapore Airlines fixed the problem.

Singapore Airlines decided that its mispricing mistake wasn’t, in fact, its problem, but the travel agents’.  The Herald reported that the airline, “told travel agents who sold the cheap tickets that they will have to seek the difference between the actual price and what they should have sold for from their customers, or foot the bill themselves,” if their customers want to fly in business class.

Singapore Airlines admitted, according to a Fox News story, that while it had “recently reassigned a booking subclass originally designated to economy class bookings to be used for business class bookings from December 8, 2014,” which could cause confusion, “the airfare conditions for the fare clearly stated that it was only valid for economy class travel.” In other words: we may have screwed up, but the travel agents should have caught our error anyway.

Scrooge would indeed be proud.

Last year, both Delta and United Airlines decided to honor online fare errors, in the latter case even when fares were priced at $0.

Update: The Daily Mail is now reporting that Singapore Airlines has decided to honor the mispriced tickets after all. Tiny Tim must be rejoicing.

In Other News ….

Coding Issue Forces 10,000 New York Rail Commuters to Buy New Fare Cards

Microsoft Experiences Déjà vu Update cum Human Azure Error

New $240 Million Ontario Welfare System Pays Out Too Much and Too Little

New Jersey Social Services Glitch Causes Wrong Cash Payments

Singapore Stock Exchange Suffers Third Outage of Year

Air India Suffers Check-in Glitch

Best Buy Website Crashes Twice on Black Friday

Mazda Issues Recall to Fix Tire Pressure Monitoring Software

Washington Health Insurance Exchange Glitches Continue

Mon, 8 Dec 2014 14:00:00 GMT
Blob Front-End Bug Bursts Microsoft Azure Cloud 11-hour intermittent global outages helped along by operator error
Illustration: Getty Images

IT Hiccups of the Week

It being the Thanksgiving holiday week in the United States, I was tempted to write once more about the LA Unified School District’s MiSiS turkey of a project, which the LAUSD Inspector General fully addressed in a report [pdf] released last week. If you like your IT turkey burnt to a crisp, over-stuffed with project management arrogance, served with heapings of senior management incompetence, and topped off with a ladleful of lumpy gravy of technical ineptitude, you’ll feast mightily on the IG report. However, if you are a parent of one of the more than 1,000 LAUSD students who still have not received a class schedule nearly 40 percent of the way into the academic year—or a Los Angeles taxpayer, for that matter—you may get extreme indigestion from reading it.

However, the winner of the latest IT Hiccup of the Week award goes to Microsoft for the intermittent outages that hit its Azure cloud platform last Wednesday, disrupting an untold number of customer websites along with Microsoft Office 365, Xbox Live, and other services across the United States, Europe, Japan, and Asia. The outages occurred over an 11-hour (and in some cases longer) period.

According to a detailed post by Microsoft Azure corporate vice president Jason Zander, the outage was caused by “a bug that got triggered when a configuration change in the Azure Storage Front End component was made, resulting in the inability of the Blob [Binary Large Object] Front-Ends to take traffic.”

The configuration change was made as part of a “performance update” to Azure Storage that, when applied, exposed the bug and “resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services.” The bug, which had escaped detection during “several weeks of testing,” caused the storage Blob Front-Ends to go into an infinite loop, Zander stated. “The net result,” he wrote, “was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.”

Once the error was detected, the configuration change was rolled back immediately. However, the Blob Front-Ends needed a restart to halt their infinite looping, which slowed the recovery time, Zander wrote.

The effects of the bug could have been contained, except that Zander indicated someone apparently didn’t follow standard procedure in rolling out the performance update.

“Unfortunately the issue was wide spread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches.”
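For illustration, the staged-rollout discipline Zander refers to can be sketched in a few lines of Python. The region names, health check, and batching below are placeholders, not Azure's actual deployment tooling; the point is simply that a bad change halted after one batch never becomes a global outage.

```python
import time

REGIONS = ["us-west", "us-east", "eu-north", "japan-east", "asia-south"]

def staged_rollout(apply_change, healthy, regions=REGIONS, batch_size=1, soak_seconds=0):
    """Apply a change region-by-region, stopping at the first unhealthy batch.

    Returns (regions successfully updated, regions in the failed batch).
    """
    done = []
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        for region in batch:
            apply_change(region)
        time.sleep(soak_seconds)          # let the change "soak" before judging it
        if not all(healthy(r) for r in batch):
            return done, batch            # halt: only these regions are affected
        done.extend(batch)
    return done, []

# Simulate a change that breaks every region it touches (like the Blob bug):
applied = []
result = staged_rollout(lambda r: applied.append(r), lambda r: False)
assert applied == ["us-west"]             # the bad change never left the first batch
assert result == ([], ["us-west"])
```

Skipping this protocol, as the operational error did, is equivalent to calling the function with `batch_size=len(regions)`: the health check comes too late to contain anything.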

Zander apologized for the “inconvenience” and said that Microsoft will “closely examine what went wrong and ensure it never happens again.”

In Other News…

Polish President Says Voting Glitch Doesn’t Warrant Vote Rerun

RBS Hit With £56 Million Fine for “Unacceptable” 2012 IT Meltdown

Wal-Mart Ad Match Scammed for $90 PS4s

Computer Problems Close South Australian Government Customer Service Centers

British Columbia Slot Machines’ Software Fixed After Mistaken $100K Payout

Washington State Temporarily Closes Health Exchange Due to Computer Issues

Software Bug in Washington State Department of Licensing Fails to Alert Drivers to Renew Licenses

Mon, 24 Nov 2014 14:00:00 GMT
RBS Group Facing Huge Fine over Massive 2012 IT System Meltdown Bank still fixing decades-long neglect of IT system infrastructure
Photo: eyevine/Redux

IT Hiccups of the Week

We turn our attention in this week’s IT Hiccups to one of the truly major IT ooftas of the past decade—one that was back in the news this week: the meltdown of the IT systems supporting the RBS banking group. (That group includes NatWest, Northern Ireland’s Ulster Bank, and the Royal Bank of Scotland.) The meltdown began in June 2012 but wasn’t fully resolved until nearly two months later. The collapse kept 17 million of the Group’s customers from accessing their accounts for a week, while thousands of customers at Ulster Bank reported access issues for more than six weeks.

Last week, Sky News reported that the UK’s Financial Conduct Authority (FCA) informed RBS that it was facing record-breaking fines in the “tens of millions of pounds” for the malfunction, which was blamed on a faulty software upgrade. In addition, the Sky News story states that the Central Bank of Ireland is looking at imposing fines on Ulster Bank over the same issue. The meltdown has already cost RBS some £175 million in compensation and other corrective costs.

In the wake of another major RBS IT system failure last December, RBS CEO Ross McEwan admitted the bank had neglected its IT infrastructure for decades. Last year, RBS said it would be spending some £450 million to upgrade its IT systems, but that figure was upped to over £1 billion this past June.

According to Sky News, RBS “could receive a discount of up to 30 percent on the proposed penalty if it agrees to settle within the 28-day window under FCA rules.” Seeing how RBS has already admitted that it had been short-changing its IT investment, it is hard to see why the bank would decide to contest the fine.

In a separate story, Sky News reported that the UK’s Prudential Regulation Authority (PRA), which is part of the Bank of England, has sent a letter to the UK’s biggest banks “demanding” that they improve the resilience of their IT systems. The PRA has given the banks until mid-December to report on what they are doing to ensure their IT systems are robust.

Some wags wondered, however, whether the PRA was going to be conducting a resilience assessment of the Bank of England’s IT systems. That bank suffered a highly embarrassing outage of its own a few weeks ago.

In Other News…

eVoting Problems Crop Up Across US in Mid-Term Elections

LA Unified School District IT Problems Just Keep Mounting With No End in Sight

Singapore Exchange Goes Down Due to Power Outage

Computer Failure Halts Deutsche Börse Trading

PayPal Experiences Server Problems

Hundreds of Parents Panic When California School Sends Out Erroneous Missing Student Message

Kansas-based Spirit AeroSystems Has ERP troubles

Ticketmaster Declines to Honor Mispriced Online Circus Tickets

Pawtucket City Rhode Island Sends Out Erroneous Car Tax Bills

New Microsoft Band Has Daylight Saving Time Glitch

Xfinity New X1 Cable Box Users Suffer Multi-state Outage

Mon, 10 Nov 2014 14:00:00 GMT
FCC Chairman Calls April's Seven State Sunny Day 911 Outage "Terrifying" Highlights current fragility of nation’s 911 emergency systems
Photo: Debi Bishop / iStockphoto
IT Hiccups of the Week

This edition of IT Hiccups of the Week revisits the 911 emergency call system outages that affected all of Washington State and parts of Oregon just before midnight, 9 April 2014. As I wrote at the time, CenturyLink—a telecom provider from Louisiana that is contracted by Washington State and the three affected counties in Oregon to provide 911 communication services—blamed the outages, which lasted several hours each, on a “technical error by a third party vendor.”

CenturyLink gave few details in the aftermath of the outages other than to say that the Washington State and Oregon outages were merely an “uncanny” coincidence, and to send out the standard “sorry for the inconvenience” press release apology. The company estimated that approximately 4,500 emergency calls to 911 call centers went unanswered during the course of the Washington State outage. No details were available regarding the number of 911 calls that failed during the two-hour Oregon outage, which affected some 16,000 phone customers.

Well, 10 days ago, the U.S. Federal Communications Commission released its investigative report into the emergency system outages. It cast a much different light on the Washington State “sunny day” outage (i.e., not caused by bad weather or a natural disaster) that CenturyLink initially tried to play down. FCC Chairman Tom Wheeler even went so far as to call the report’s findings “terrifying.”

As it turns out, while the 911 system outages that hit Oregon and Washington State were indeed coincidental, they were also connected in a strange sort of way that caused a lot of confusion at the time, as we will shortly see. More importantly, the 911 outage that affected Washington State on that April night didn’t just affect that state, but also emergency calls being made in California, Florida, Minnesota, North Carolina, Pennsylvania and South Carolina. In total, some 6,600 emergency calls made over a course of six hours across the seven states went unanswered.

As the FCC report notes, because of the multi-state emergency system outage, “Over 11 million Americans … or about three and a half percent of the population of the United States, were at risk of not being able to reach emergency help through 911.” Since the outage happened very late at night into the early morning and there was no severe weather in the affected regions, emergency call volume was very low; luckily, no one died because of an inability to reach 911.

The cause of the outage, the FCC says, was a preventable “software coding error” in a 911 Emergency Call Management Center (ECMC) automated system in Englewood, Colorado, operated by Intrado, a subsidiary of West Corporation. Intrado, the FCC report states, “is a provider of 911 and emergency communications infrastructure, systems, and services to communications service providers and to state and local public safety agencies throughout the United States… Intrado provides some level of 911 function for over 3,000 of the nation’s approximately 6,000 PSAPs.”

As succinctly explained in an article in the Washington Post, “Intrado owns and operates a routing service, taking in 911 calls and directing them to the most appropriate public safety answering point, or PSAP, in industry parlance. Ordinarily, Intrado's automated system assigns a unique identifying code to each incoming call before passing it on—a method of keeping track of phone calls as they move through the system.”

“But on April 9, the software responsible for assigning the codes maxed out at a pre-set limit [at 11:54 p.m. PDT]; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure,” the Post article went on to state.

All told, 81 PSAPs across the seven states were unable to receive calls; dialers to 911 heard only “fast busy” signals.
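The failure mechanism the Post describes can be sketched in a few lines. This is a purely hypothetical illustration, not Intrado's actual code: a router that hands out tracking IDs from a counter with a hard ceiling and, once that ceiling is hit, silently rejects new calls instead of raising a loud alarm.

```python
# Hypothetical sketch of the counter-limit failure mode; all names and
# structure here are invented for illustration.
MAX_TRACKED_CALLS = 40_000_000  # the pre-set limit cited in the FCC report

class CallRouter:
    def __init__(self, limit=MAX_TRACKED_CALLS):
        self.limit = limit
        self.next_id = 0

    def route(self, call):
        if self.next_id >= self.limit:
            # The real system logged only a "low level" alarm here, so
            # operators never learned that routing had stopped entirely.
            return None  # the caller hears only a "fast busy" signal
        self.next_id += 1
        return (self.next_id, call)

router = CallRouter(limit=3)  # tiny limit so the failure is visible
results = [router.route(f"call-{i}") for i in range(5)]
print(results)  # the 4th and 5th calls are silently dropped
```

The lesson generalizes: any hard resource ceiling needs a loud, correctly prioritized alarm when it is reached, or the system fails exactly as quietly as this sketch does.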

When the software hit its 40 million call limit, the FCC report says, the emergency call-routing system did not send out an operator alarm for over an hour. When it finally did, the system monitoring software indicated that the problem was a “low level” problem; surprisingly, it did not immediately alert anyone that emergency calls were no longer being processed. 

As a result, Intrado’s emergency call management center personnel did not realize the severity of the outage, nor did they get any insight into its cause, the FCC report goes on to state. In addition, the ECMC personnel were already distracted by alarms involving the Oregon outage, which also involved CenturyLink.

Worse still, says the FCC, the low-level alarm designation not only failed to get ECMC personnel’s attention, but it also prevented an automatic rerouting of 911 emergency calls to Intrado’s ECMC facility in Miami.

It wasn’t until 2:00 a.m. PDT on 10 April that ECMC personnel became aware of the outage. That, it seems, happened only because CenturyLink called to alert them that its PSAPs in Washington State were complaining of an outage. After the emergency call management center personnel received the CenturyLink call, both they and CenturyLink thought the Washington State and Oregon outages were somehow closely interconnected. It took several hours for them to realize that they were entirely separate and unrelated events, the FCC report states. Apparently, it wasn’t until several other states’ PSAPs and 911 emergency call providers started complaining of outages that call management center personnel and CenturyLink realized the true scope of the 911 call outage and were finally able to zero in on the cause.

Once the root cause was discovered, the Colorado-based ECMC personnel initiated a manual failover of 911 call traffic to Intrado’s ECMC Miami site at 6:00 a.m. PDT. When problems plaguing the Colorado site were fixed later that morning, traffic was rerouted back.

The FCC report states that, “What is most troubling is that this is not an isolated incident or an act of nature. So-called ‘sunny day’ outages are on the rise. That’s because, as 911 has evolved into a system that is more technologically advanced, the interaction of new [Next Generation 911 (NG911)] and old [traditional circuit-switched time division multiplexing (TDM)] systems is introducing fragility into the communications system that is more important in times of dire need.”

IEEE Spectrum published an article in March of this year that explains the evolution of 911 in the U.S. (and Europe) and provides good insights into some of the difficulties of transitioning to NG911. The FCC’s report also goes into some detail on how the transition from traditional 911 service to NG911 can create subtle problems that are difficult to unravel when a problem does occur.

According to one news report, Rear Admiral David Simpson, chief of the FCC’s Public Safety and Homeland Security Bureau, told the FCC during a hearing into the outage that there were three additional major “sunny day” outages in 2014, though none were reported before this year. All three—which I believe involved outages in Hawaii, Vermont, and Indiana—involved NG911 implementations or time division multiplexing–to-IP transitions, Simpson said.

The FCC report indicates that Intrado has made changes to its call routing software and monitoring systems to prevent this situation from happening again, but it also said that 911 emergency service providers need to examine their system architecture designs. The hope is that they’ll better understand how and why their systems may fail, and what can be done to keep the agencies operating when they do. In addition, the communication of outages among all the emergency service providers and PSAPs needs to be improved; the April incident highlighted how miscommunications hampered finding the extent and cause of the outage.

Finally, the five FCC Commissioners unanimously agreed that such an outage was “simply unacceptable” and that future “lapses cannot be permitted.” While no one died this time, they note that next time everyone may not be so lucky.

In Other News…

Sarasota Florida Schools Plagued by Computer Problems

Weather Forecasts Affected as National Weather Satellite Goes Dark?

Bad Software Update Hits Aspen Colorado Area Buses

Bank of England Suffers Embarrassing Payments Crash

Google Drive for Work Goes Down

Google Gmail Experiences Global Outage

Cut Fiber Optic Cables Knock-out Air Surveillance in East India for 13 Hours

Bank of America Customers Using Apple Pay Double Charged

iPhone Owners Complain of Troubles with iOS 8.1

UK Bank Nationwide Apologizes Once More for Mobile and Online Outages

Vehicle Owners Seeking Info on Takata Airbag Recall Crash NHTSA Website

West Virginia Delays Next Phase of WVOASIS Due to Testing Issues

UK’s Universal Credit Program Slips At Least Four Years

Heathrow Airport Suffers Yet Another Baggage System Meltdown

Mon, 27 Oct 2014 13:00:00 GMT
LA School District Superintendent Resigns in Wake of Continuing MiSiS Woes Superintendent may be gone, but hundreds of system problems remain unresolved
Photo: Michael Kovac/Getty Images

We turn our IT Hiccups of the Week attention once again to the Los Angeles Unified School District’s shambolic rollout of its integrated student educational tracking system, called My Integrated Student Information Systems (MiSiS). I first wrote about MiSiS a few months ago, and it has proved nothing but trouble, to the point that it became a major contributing factor in “encouraging” John Deasy to resign his position last week as superintendent of the second largest school system in the United States. He’d been on the job three and a half years.

Deasy claimed in interviews after his resignation that the MiSiS debacle “played no role” in his decision, and instead blamed district teachers and their unions for opposing his crusading efforts to modernize the LAUSD school system. That is a charitable spin on the situation, to put it mildly.

Why? You may recall from my previous post that LAUSD has been under a 2003 federal district court–approved consent decree to implement an automated student tracking system so that disabled and special-needs students’ educational progress can be assessed and tracked from kindergarten to the end of high school. Headway toward complying with the obligations agreed to under the consent decree is assessed by a court-appointed independent monitor who publishes periodic progress reports. Deasy repeatedly failed to deliver on the school district’s promises to the independent monitor over the course of his tenure.

What really helped seal Deasy’s fate was the latest progress report [pdf] from the independent monitor released last week. The report essentially said that despite numerous “trust me” promises by LAUSD officials (including Deasy), MiSiS was still out of compliance. The officials had promised that MiSiS would be completely operationally tested and ready at the beginning of this school year. But, said the report, the system’s incomplete functionality, the ongoing poor reliability due to inadequate testing, and the misunderstood and pernicious data integrity issues were causing unacceptable educational hardships to way too many LAUSD students—especially to those with special educational needs.

An LA Times story, for one, stated that the monitor found that MiSiS, instead of helping special needs students, made it difficult to place them in their required programs. A survey conducted by the independent monitor of 201 LAUSD schools trying to use MiSiS found that “more than 80% had trouble identifying students with special needs and more than two-thirds had difficulty placing students in the right programs,” the Times article stated.

Deasy’s fate had been hanging by a thread for a while. For instance, at several LAUSD schools—especially at Thomas Jefferson High School in south Los Angeles—hundreds of students were still without correct class schedules nearly two months after the school year had started. 

Another story in the LA Times reported that continuing operational issues with MiSiS meant that some Jefferson students were being “sent to overbooked classrooms or were given the same course multiple times a day. Others were assigned to ‘service’ periods where they did nothing at all. Still others were sent home.”

The problems at Jefferson made Deasy’s insistence that issues with MiSiS were merely a matter of “fine tuning” look disingenuous at best.

The MiSiS-fueled difficulties at Jefferson, which extended to several other LAUSD schools, prompted a California Superior Court judge about two weeks ago to intervene and order the state education department to work with LAUSD officials to rectify the situation immediately. In issuing the order, the judge damningly wrote that “there is no evidence of any organized effort to help those students” at Jefferson by LAUSD senior officials.

As a result of the judge’s order, the LAUSD school board last week quickly approved a $1.1 million plan to try to eliminate the disarray at Jefferson High. Additionally, the school board is now undertaking an audit of other district high schools to see how many other students are being impacted by the MiSiS mess and what additional financial resources may be needed to eliminate it.

Fraying Deasy’s already thin thread further was his admission that MiSiS needs some 600 enhancements and bug fixes (up from a reported 150 or so when the system was rolled out in August), which would likely cost millions of dollars on top of the $130 million already spent. Further, he acknowledged that one of the core functions solemnly promised to the independent monitor for this school year—the proper recording of student grades—could take yet another year to debug fully, the LA Times reported.

According to the LA Daily News, LAUSD teachers complain that they not only have a hard time accessing the grade book function, but when they finally do, they find that student grades or even entire courses have disappeared from MiSiS. Hundreds if not thousands of student transcripts could be in complete shambles, which is causing major concern for seniors applying to colleges. Their parents are also unamused, to say the least.

Probably the last fiber of Deasy’s thread was pulled away last week when it turned out that even if MiSiS had been working properly, a majority of LAUSD schools likely wouldn’t have been able to access all of its functionality anyway. According to a story at Contra Costa Times, LAUSD technology director Ron Chandler informed the district’s school board last week that most of the LAUSD schools’ administrative desktop computers were incapable of completely accessing MiSiS because of known compatibility problems.

A clearly frustrated school board wanted to know why this situation was only being disclosed now; Chandler told the board that the initial plan was for the schools to use the Apple iPads previously purchased by the school board to access MiSiS. But questions over Deasy's role in that $1 billion contract put that approach on hold. The school board was more than a bit incredulous at that explanation, since it had not approved the purchase of iPads with the intent that they be used by teachers and school administrators as the primary means of accessing MiSiS.

Reluctantly, the school board approved $3.6 million in additional funding to purchase 3,340 new desktop computers for 784 LAUSD schools to allow them unfettered access to MiSiS.

While Deasy’s resignation will alleviate some of the immediate political pressure on LAUSD officials caused by the MiSiS fiasco, the technical issues will undoubtedly last throughout this academic year and possibly well into the next. For many unlucky LAUSD students, however, the impacts may last for years beyond that.

In Other News…

Baltimore County Maryland Teachers Tackling Student Tracking System Glitches

Tallahassee’s New Emergency Dispatch System Offline Again

Washington State’s Computer Network Suffers Major Outage

Software Glitch Hits Telecommunications Services of Trinidad and Tobago

New Mexico Utility Company Incorrectly Bills Customers

Software Issue Means Oklahoma Utility Company Overbills Customers

Computer Error Allows Pink Panther Gang Member Early Out of Austrian Jail

Dropbox Bug Wipes Out Some Users’ Files

Generic Medicines Might Have Been Approved on Software Error

Australia’s iiNet Apologizes to Hundreds of Thousands of Customers for Three-day Email Outage

Spreadsheet Error Costs Tibco Investors $100 Million

Mon, 20 Oct 2014 13:00:00 GMT
Duke Energy Falsely Reports 500,000 Customers as Delinquent Bill Payers Since 2010 Utility says sorry about trashing credit scores and admits number of customers affected may climb
Photo: Nell Redmond/AP Photo

IT Hiccups of the Week

There were several IT Hiccups to choose from last week. Among them were: problems with the Los Angeles Unified School District’s fouled-up new student information and management system that are so egregious that a judge ordered the district to address them immediately; and the UK Revenue and Customs department’s embarrassing admission that its trouble-plagued modernized tax system has again made multiple errors in computing thousands of tax bills. However, the winner of this week’s title as the worst of the worst was an oofta by Duke Energy, the largest electric power company in the U.S. Duke officials apologized in a press release to over 500,000 of the utility’s 800,000-plus current and former customers (including 5,000 non-residential customers) across Indiana, Kentucky, and Ohio for erroneously reporting them as being delinquent in paying their utility bills since 2010.

Duke Energy admitted that the root cause of the problem was a coding error that occurred when customers opted to pay their monthly utility bills via the utility’s Budget Billing or Percentage of Income Payment Plan Plus (in Ohio only).  A company spokesperson told Bloomberg BusinessWeek that while customers were sent the correct invoices and their on-time payments were properly credited, the billing system indicated that the customers’ bills were paid late.
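Duke Energy hasn't said what the coding error actually was, so the following is a purely hypothetical sketch of one way this class of bug arises: the delinquency flag is computed against the ordinary billing cycle's due date rather than the payment plan's own later due date, so payments that are on time under the plan still get flagged late. The function names, dates, and logic are all invented for illustration.

```python
# Hypothetical illustration only: how an on-time payment can be flagged
# delinquent when the flag checks the wrong due-date field.
from datetime import date

def is_delinquent_buggy(payment_date, standard_due, plan_due):
    # BUG: compares against the standard cycle's due date and ignores
    # the budget-billing plan's own (later) due date.
    return payment_date > standard_due

def is_delinquent_fixed(payment_date, standard_due, plan_due):
    return payment_date > plan_due

payment = date(2014, 3, 20)       # customer pays on time under the plan
standard_due = date(2014, 3, 15)  # ordinary billing-cycle due date
plan_due = date(2014, 3, 25)      # budget-billing plan due date

print(is_delinquent_buggy(payment, standard_due, plan_due))  # True (wrongly flagged)
print(is_delinquent_fixed(payment, standard_due, plan_due))  # False
```

Whatever the real defect was, the pattern matches the reported symptoms: invoices and payment credits were correct, but a separate code path marked the bills as paid late.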

 As a result, that late payment information for residential customers was sent by formal agreement to the National Consumer Telecom & Utilities Exchange (NCTUE). The NCTUE is a consortium of over 70 member companies from the telecommunications, utilities and pay TV industries that serves as a credit data exchange service for its members. Holding over 325 million consumer records, NCTUE provides information to its members regarding the credit risk of their current and potential customers. For non-residential customers, the “late payment” snafu had worse consequences: the delinquency reports were sent to the business credit rating agencies Dun & Bradstreet and Equifax Commercial Services.

Duke Energy’s press release said that the company “deeply regretted” the error that has effectively trashed the credit scores of hundreds of thousands of its residential and business customers for years. The utility says the erroneous information has now been “blocked” for use by the NCTUE, Dun & Bradstreet and Equifax, and it has dropped its membership in all three.

The press release mentioned that the company is still investigating whether additional customers who had “unique” billing circumstances were affected by the coding error.

But what the written statement failed to mention is that the utility found the error only after a former customer discovered that she was having trouble setting up service at another NCTUE utility member because of a supposedly poor payment history at Duke Energy. After contacting Duke Energy and asking why she was being shown as a delinquent bill payer when she was not, the utility realized that the woman’s erroneous credit information was only the tip of a very large IT oofta iceberg.

While Duke Energy claims that “we take responsibility” for the error, it is being rather quiet about explaining what exactly “taking responsibility” means for the hundreds of thousands of customers who may have been unjustly financially affected by the erroneous information sent to the three credit agencies over the past four years. It wouldn’t surprise me to see a class action lawsuit filed against Duke Energy in the near future to help the company gain greater clarity on what its responsibility is.

In Other News…

Judge Orders California to Help LAUSD Fix School Computer Fiasco

UK’s Tax Agency Admits it Can’t Compute Taxes Properly

Tahoe Ski Resort Withdraws Erroneous $1 Season Pass

UK NHS Hospital Patients Offered Harry Potter Names

Florida Utility Insists New Billing System is Right: Empty House Used 614,000 Gallons of Water in 18 Days

Audit Explains How Kansas Botched Its $40 Million DMV Modernization Effort

Indiana BMV Finally Sending Out Overbilling Refund Checks

Nielsen Says Software Error Skews Television Viewer Stats for Months

Mon, 13 Oct 2014 13:00:00 GMT
Japan Trader's $617 Billion “Fat Finger” Near-Miss Rattles Tokyo Market Bad memories of 2005 fat finger failure revived
Photo: Yoshikazu Tsuno/AFP/Getty

IT Hiccups of the Week

This week’s IT Hiccup of the Week concerns yet another so-called “fat finger” trade embroiling the Tokyo Stock Exchange (TSE). This time it involved an unidentified trader who last week mistakenly placed orders for shares in 42 major Japanese corporations.

According to a story at Bloomberg News, the trader placed over-the-counter (OTC) orders adding up to a total value of 67.78 trillion yen ($617 billion) in companies such as Canon, Honda, Toyota and Sony, among others. The share order for Toyota alone was for 1.96 billion shares—or 57 percent of the car company—amounting to about $116 billion.

Bloomberg reported that its analysis “shows that someone traded 306,700 Toyota shares at 6,399 yen apiece at 9.25 a.m. ... The total value of the transaction was 1.96 billion yen. The false report was for an order of 1.96 billion shares. [The Japan Securities Dealers Association] said the broker accidentally put the value of the transaction in the field intended for the number of shares.”

The $617 billion order, which Bloomberg said was “greater than the size of Sweden’s economy and 16 times the Japanese over-the-counter market’s traded value for the entire month of August,” was quickly canceled before the orders could be completed. Given the outsized orders and the fact that OTC orders can be canceled anytime during market hours, it is unlikely that the blunder would have gone unfixed for very long, but the fact that it happened resurrected bad memories for the Tokyo Stock Exchange.
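A simple pre-trade sanity check can catch this class of error: reject any order whose share count exceeds a sensible fraction of the company's shares outstanding. The sketch below is a hedged illustration, not any exchange's actual rule; the threshold and the shares-outstanding figure are rough assumptions.

```python
# Hypothetical pre-trade size check. The 5 percent threshold and the
# Toyota shares-outstanding figure are illustrative assumptions.
def validate_order(shares: int, shares_outstanding: int,
                   max_fraction: float = 0.05) -> bool:
    """Reject orders larger than a set fraction of shares outstanding."""
    return 0 < shares <= shares_outstanding * max_fraction

TOYOTA_SHARES_OUTSTANDING = 3_400_000_000  # rough figure, for illustration

print(validate_order(306_700, TOYOTA_SHARES_OUTSTANDING))        # intended order: passes
print(validate_order(1_960_000_000, TOYOTA_SHARES_OUTSTANDING))  # fat-finger order: rejected
```

Note that the erroneous order, roughly 57 percent of the company, fails the check by an enormous margin, which is exactly why such coarse limits are effective against value-in-the-shares-field mistakes.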

Back in 2005, Mizuho Financial Group made a fat finger trade on the TSE that could not be canceled out. A Financial Times of London story states that, “Mizuho Securities mistakenly tried to sell 610,000 shares in recruitment company J-Com at ¥1 apiece instead of one share at ¥610,000. The brokerage house said it had tried, but failed, to cancel the J-Com order four times.” The mistaken $345 million trade cost the president of the TSE along with two other exchange directors their jobs.

Then in 2009, a Japanese trader for UBS ordered $31 billion worth of bonds instead of buying the $310,000 he had intended, the London Telegraph reported.  Luckily, the order was sent after hours, so it was quickly discovered and corrected.

A little disconcerting, however, was a related Bloomberg News story from last week that quoted Larry Tabb, founder of research firm Tabb Group LLC. According to Tabb, despite all the recent efforts by US regulators and the exchanges themselves to keep rogue trades from occurring (e.g., the Knight Capital implosion), fat finger trades still “could absolutely happen here.”

“While we do have circuit breakers and pre-trade checks for items executed on exchange,” Tabb told Bloomberg, “I do not believe that there are any such checks on block trades negotiated bi-laterally and are just displayed to the market.”

Don’t insights like that from a Wall Street insider just give you a warm and fuzzy feeling about the reliability of financial markets?

In Other News…

Computer Glitch Affects 60,000 Would-be Organ Donors in Canada

Korean Air New Reservations System Irritates Customers

Ford Recalls 850,000 Vehicles to Fix Electronics

Mitsubishi i-MiEV Recalled to Fix Software Brake Issue

Doctors’ “Open Payments” Website Still Needs Many More Government Fixes

Apple iOS 8 Hit by Bluetooth Problems

Electronic Health Record System Blamed for Missing Ebola at Dallas Hospital

Mon, 6 Oct 2014 13:00:00 GMT
JP Morgan Chase: Contacts for 76 Million Households and 7 Million Small Businesses Compromised That's about half of U.S. households, in case you were wondering
Photo: Spencer Platt/Getty Images

Banking giant JP Morgan Chase filed an official notice yesterday to the U.S. Securities and Exchange Commission (SEC) updating the material information concerning the cyberattack the bank uncovered during the summer. According to the bank’s Form 8-K, for customers using its and JPMorganOnline websites as well as the Chase and J.P. Morgan mobile applications:

  • User contact information—name, address, phone number and email address—and internal JPMorgan Chase information relating to such users have been compromised.
  • The compromised data impacts approximately 76 million households and 7 million small businesses.
  • However, there is no evidence that account information for such affected customers—account numbers, passwords, user IDs, dates of birth or Social Security numbers—was compromised during this attack.

To give you some perspective on the size of the breach, there are approximately 112 million households in the United States, along with 29.7 million small businesses.

The bank also reported in its SEC filing that it hasn’t seen any unusual customer fraud related to the data breach and that customers will not be liable for any unauthorized transactions on their accounts, provided that they promptly alert the bank to the bogus transactions.

JP Morgan goes on to say in a customer notice that it is “very sorry that this happened and for any uncertainty this may cause you.” Additionally, it says that, “There are always lessons to be learned, and we will learn from this one and use that knowledge to make our defenses even stronger.”

In the bank's 2013 annual report, JP Morgan CEO Jamie Dimon stated  that the firm was going to be spending $250 million annually on cybersecurity and employ some 1,000 people to help ensure security at the bank.

Cybersecurity experts all seem to agree that the breach of JP Morgan, considered one of—if not the—most sophisticated and best cyber-protected banks in the world, is highly worrying. Less clear is whether the reason customer personal data wasn’t taken was accidental or on purpose. (The Wall Street Journal reports that the bank’s marketing systems rather than its operational banking systems were penetrated.)

A story at the New York Times, for instance, says that the cybercriminals had deep and pervasive access to JP Morgan IT systems for months, even obtaining “the highest level of administrative privilege” to 90 of the bank’s computer servers.  However, the Times states, “investigators in law enforcement remain puzzled” since there is no evidence that money has been taken from customer accounts, nor has there been any launch of a major phishing campaign using the stolen contact information. Phishing a JP Morgan employee seems to be the way the cybercriminals got access to JP Morgan systems, by the way.

Speculation runs the gamut, including that the attack was sponsored by elements of the Russian government as a warning about Western government interference in the Ukrainian Conflict and that it could be a search for confidential information on high value targets, such as President Obama, who is said to be a JP Morgan customer. Other security experts speculate that this attack may have been just an initial foray into the bank’s IT system to understand how it works. If so, they likely will be back, in which case, expect more than contact information to be compromised.

Whatever the real reason, the bottom line is that as the recent compromise of 56 million U.S. and Canadian payment cards at Home Depot exemplifies, cyber-insecurity is pervasive. Security maven Brian Krebs probably said it best when he told the Guardian, “Reality is dawning among regular corporations that you can’t keep these guys out. The most you can do is stop the bleeding.”

Fri, 3 Oct 2014 17:00:00 GMT
FBI’s Sentinel System Still Not In Total Shape to Surveil Bureau promises fixes to $500 million system are on the way but DOJ inspector general wants hard proof
Illustration: iStockphoto

IT Hiccups of the Week

Other than the rather entertaining kerfuffle involving Apple’s new iPhone OS and its initial (non)corrective update (along with the suspicious “bendy phone” accusations), the IT Hiccups front was rather quiet this past week. Luckily, an “old friend” came by to rescue us from writing a post on some rather mundane IT snarl, snag or snafu.

Just in the nick of time, the U.S. Department of Justice's Inspector General released his latest in an ongoing series of reports [pdf] about Sentinel, the FBI’s electronic information and case management system. In this report, the IG focused on how Sentinel users felt about working with the system. Sadly yet unsurprisingly, the IG found that Sentinel is still suffering from some serious operational deficiencies two years after it went live.

Sentinel, you may remember, was finally deployed in 2012 as a replacement for the FBI’s legacy Automated Case Support (ACS) system, which was originally supposed to be superseded by the infamous US $170-million Virtual Case File system in late 2003 or early 2004. The VCF was itself a component of a larger FBI effort called Trilogy that was begun in 2001 to upgrade and modernize the FBI's IT systems. To best understand how and why VCF became infamous, you owe it to yourself to read “Who Killed the Virtual Case File?” Written by IEEE Spectrum's digital editorial director Harry Goldstein, it is a classic government IT project failure story that appeared in the September 2005 issue of Spectrum. For Agatha Christie murder mystery fans, Goldstein's story will immediately remind you of Murder on the Orient Express.

Anyway, the origins of Sentinel—which the DOJ Inspector General succinctly describes as providing “records management, workflow management, evidence management, search and reporting capabilities, and information sharing with other law enforcement agencies and the intelligence community”—go back to 2005. The system has carried with it the FBI's hopes that it would permanently erase the unpleasant memories associated with the Virtual Case File debacle. At one time, those hopes looked like they might be fulfilled, as even the usually highly critical U.S. Government Accountability Office, which lambasted the FBI for its VCF risk mismanagement, called the original $425 million Sentinel acquisition a procurement model for the FBI.

A Rocky Start and a Surprise Recovery

Alas, those hopes faded as troubles began to plague the Sentinel project. Lockheed Martin, which won the prime contractor role in March 2006, was removed from its leadership position in 2010 after a devastating independent assessment by MITRE. The assessment indicated that the project, which at that point was into the taxpayers for $461 million and already past its original 2009 completion date, would take at least another $351 million and six more years to complete.

In the wake of the MITRE assessment, plus withering criticism of the project [pdf] by the IG and others, the FBI decided its best course of action was to take over control of the project itself. The FBI decided to take an agile programming approach to try to finish the project quickly and without blowing the budget. To everyone’s surprise, the Bureau successfully deployed Sentinel in July 2012 at a total cost of around $500 million.

I say around $500 million because even the IG doesn’t seem to have a really good handle on Sentinel’s true costs: IT development, operations and maintenance, plus FBI personnel and other internally-borne IT costs are not easily accounted for. For example, the FBI is still operating legacy IT systems that were supposed to be replaced by Sentinel, and those costs are not counted against the cost of Sentinel. The IG reports today’s obligated cost of Sentinel is $551.4 million, not counting the tens of millions of dollars in related FBI internal costs that were and are still being incurred.

The Sentinel project’s unexpectedly smooth roll-out won it many laurels and accolades (including its own). It was even named a ComputerWorld Honors Program 2013 Laureate. And since its 2012 roll-out, there has been nary a perturbing peep about Sentinel in the press.

It Glitters, But It's Not Gold

However, according to the inspector general's latest report, while “Sentinel has had an overall positive impact on the FBI’s operations, making the FBI better able to carry out its mission, and better able to share information,” there remain major problems with Sentinel’s two critical functions of searching for and indexing case information.

The IG report states that Sentinel’s search function is supposed to provide users the ability to locate cases and specific case-related information within Sentinel, while the indexing function's role is to designate and modify the relationship between any two identifiers, such as the relationship between a person and that person’s address. The proper indexing of Sentinel records is critical if FBI agents are to be able to “connect the dots,” the IG states.

For instance, the IG provides the following example:

“Indexing allows Sentinel to add structure to the data it contains, which in turn enables improved search results. [For example], a search for white males who drive black cars using a search engine like those used for internet searches would return all documents that mention any of the following: white males, black males, white cars, or black cars. By adding structure to the data through indexing, Sentinel’s search function is able to return only white males who drive a black car. When a user indexes an entity, the system will suggest potential matches already indexed in Sentinel.”
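The contrast the IG is drawing, between bag-of-words keyword matching and structured, field-aware search, can be sketched in a few lines of Python. The records and field names below are invented for illustration; they are not how Sentinel actually stores data.

```python
# Toy contrast between keyword search and indexed (structured) search,
# using the IG's "white males who drive black cars" example.
records = [
    {"person": "white male", "car": "black"},
    {"person": "white male", "car": "white"},
    {"person": "black male", "car": "white"},
]

def keyword_search(terms):
    """Bag-of-words matching: any record mentioning any term is a hit."""
    return [r for r in records
            if any(t in " ".join(r.values()) for t in terms)]

def indexed_search(person, car):
    """Structured matching: each term is checked against its own field."""
    return [r for r in records
            if r["person"] == person and r["car"] == car]

print(len(keyword_search(["white", "black"])))     # 3: every record matches
print(len(indexed_search("white male", "black")))  # 1: only the true match
```

The structure is what lets a query tell a white male driving a black car apart from a black male driving a white car, which is exactly the improvement the IG describes.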

But the IG found in a survey of FBI agents that only 42 percent stated they “often received the results they needed” from Sentinel. Some 59 percent reported that they “sometimes, rarely, or never received the results they needed.” The IG also said that some survey respondents commented that the search function of the old ACS system was better than Sentinel’s! Furthermore, the IG stated that two issues kept frustrating the system’s users: “Sentinel returned too many search results for a person to reasonably review or no results at all for a document the user knew existed.”

In addition, the IG noted that soon before Sentinel was rolled out in July 2012 to all that acclaim, FBI management was boasting to the IG [pdf] that “the search function is both flexible and powerful enough to accommodate the substantial volume and wide variety of information available for retrieval.” However, according to information uncovered by the IG, even as FBI management was singing the search function’s praises, it knew the function had major deficiencies.

The inspector general didn’t outright state it in his report, but reading between the lines you can see an IG clearly peeved that the FBI wasn’t honest in 2012 (or over the past two years, for that matter) about the true operational state of Sentinel's search capability and how it has hindered FBI personnel. The IG also cast some indirect doubt on how well the FBI’s highly touted agile approach worked: it may have helped save money and get Sentinel up and running quickly, but the question one has to ask now is, at what operational cost?

The IG also found that Sentinel’s indexing function wasn’t popular with users either. FBI Special Agents who now have to index their own case files (they used to hand the function off to an administrative staff member) complain that the process is a major administrative burden, and are frustrated that it “leaves less time for investigative activities.” A full 41 percent of survey respondents “reported that they spent more time indexing in Sentinel than they did in ACS.”

I can’t be sure, but I'd bet that with Sentinel (and probably going back to VCF), FBI management wanted to reduce personnel costs by moving the task of indexing from administrative staff to Special Agents. They also probably reasoned that the move would improve the accuracy of the indexing. But as often happens in these cases, the “cost savings” usually turn out to be an illusion.

In fact, FBI management initially told the IG “that the FBI is not currently able to provide Special Agents in the field with assistance in reducing the time it takes to index large structured documents such as bank records, or unstructured documents such as a report of investigation form … or email.” In other words, there is no money to hire admins to do the case indexing grunge work. The IG responded that the FBI had better find a technological solution, and soon.

There are several other operational deficiencies listed in the IG report, which I won’t go over. Interestingly, some of these seem to be less design flaws and more a combination of comfort with technology and organizational memory of how things used to be. For instance, users with fewer than 10 years of FBI tenure, and especially those with fewer than 5 years, found it easier to use Sentinel’s search function than those with much longer tenures at the Bureau. That said, 25 percent of “tech-savvy” users still found the search function difficult to learn.

The IG made one telling statement in his report that should trouble all taxpayers in the United States: “Based on the feedback received from Sentinel users, we are concerned that Sentinel does not appear to have met users’ expectations and needs.”

FBI management admitted that there are indeed some issues with Sentinel, but also told the IG that fixes are on the way. In October, the FBI said, there is going to be a major Sentinel software release that will help address many of the concerns the IG raised in his report.

The problems with Sentinel’s search may take longer to resolve, however. FBI management promised the IG that it will begin soliciting user feedback in regard to how to improve the search function. The Bureau will then develop and implement solutions to increase search functionality and operational efficiency. The IG agreed to the FBI's proposed approach but indicated that he was now going to be taking a “trust, but verify” perspective on the statements made by FBI management. The IG stated that in the future he wanted from the FBI “a detailed description of the change[s] made and how the search function was improved as a result.”

As the old saying goes, fool me once, shame on you; fool me twice, shame on me. The IG has made it crystal clear to FBI management that he isn’t going to be fooled again.

In Other News...

Apple Says Sorry for iPhone Software Update Mess

Dallas Police Officials Dispute Claims of New Computer Problems

Morgan Stanley Agrees to Pay for Prospectus IT Snafu

UK Tax Department Demands Taxes from Thousands of Firms that Don’t Owe Anything

Barclays Software Maintenance Issue Causes Online, Telephone, and Mobile Banking Failures

Louisiana State Police Computer Failure Keeps Bonded Prisoners in Jail

Software Issues Blamed for Voting Results Problem in New Brunswick, Canada

Cover Oregon Health Insurance Exchange Finds Yet Another Tax Credit Problem

Website Cost Jumps $1 Billion Plus

Mon, 29 Sep 2014 13:00:00 GMT
Home Depot: Everything is Secure Now, Except Maybe in Canada Hardware retailer finally fesses up to a data breach that compromised 56 million payment cards
Photo: Daniel Acker/Bloomberg/Getty

This past Thursday, after weeks of speculation, Home Depot, which calls itself the world’s largest home improvement retailer, finally announced [pdf] the total damage from a breach of its payment system: At its 1,157 stores in the U.S. and Canada, 56 million unique credit and debit cards were compromised. This is said to be among the three largest IT security breaches of a retailer, and ranks with some of the largest security breaches of all time.

According to Home Depot’s press release, the company confirmed that the criminal cyber intrusion began in April and ran into September, and “used unique, custom-built malware to evade detection. The malware had not been seen previously in other attacks, according to Home Depot’s security partners.”

The company says that it has now removed all the malware that infected its payment terminals, and that it “has rolled out enhanced encryption of payment data to all U.S. stores.” The enhanced encryption approach, Home Depot states, “takes raw payment card information and scrambles it to make it unreadable and virtually useless to hackers.” It is a bit curious that the company says “virtually useless” and not “completely useless,” though.

Canadian stores, on the other hand, will have to wait a bit longer. While Home Depot’s Canadian stores have point-of-sale EMV chip-and-PIN card terminals, “the rollout of enhanced encryption to Canadian stores will be completed by early 2015,” the company says. Canadian Home Depot stores were at first thought to be less vulnerable because the chip-and-PIN terminals were already in place, but that apparently hasn't been the case. For some reason, the company is refusing to disclose the number of Canadian payment cards compromised, the Globe and Mail says. The Globe and Mail estimates the total number of cards compromised to be around 4 million.

Home Depot goes on to say in its press release that it has no evidence “that debit PIN numbers were compromised or that the breach has impacted stores in Mexico or customers who shopped online at or”

As usual in these situations, Home Depot “is offering free identity protection services [for one year], including credit monitoring, to any customer who used a payment card at a Home Depot store in 2014, from April on.” The company also apologized to its customers “for the inconvenience and anxiety this has caused.”

Home Depot’s data breach was first made public on 2 September by Brian Krebs, the former longtime Washington Post reporter with amazing IT security contacts, who now publishes a must-read security website called Krebs on Security. Several banking sources told Krebs about “a massive new batch of stolen credit and debit cards that went on sale [that] morning in the cybercrime underground,” with Home Depot looking like the source. Krebs went on to write that:

There are signs that the perpetrators of this apparent breach may be the same group of Russian and Ukrainian hackers responsible for the data breaches at Target, Sally Beauty and P.F. Chang’s, among others. The banks contacted by this reporter all purchased their customers’ cards from the same underground store — rescator[dot]cc — which on Sept. 2 moved two massive new batches of stolen cards onto the market.

In fact, it wasn’t until 8 September that Home Depot confirmed that it had in fact suffered a breach. Krebs, who has since written about the breach several times, recently wrote that the breach may not be as severe as indicated (nor as severe as it could have been). Sources have indicated that the malware used — which looks like a variant of what smacked Target late last year — was “installed mainly on payment systems in the self-checkout lanes at retail stores.” The reasoning is that if the malware had penetrated Home Depot’s payment system to the extent that Target’s systems were breached, many more than 56 million payment cards would have been compromised.

Sellers of compromised Home Depot card data are targeting specific states and ZIP codes in the hopes that buyers of the stolen cards will raise fewer red flags in the credit card and banking fraud algorithms. For instance, some 52,000 cards from Maine Home Depot stores, 282,000 from stores in Wisconsin, and 12,000 from stores in Minnesota have been offered for sale. Card prices seem to be ranging mostly from $9 to $52 apiece, although for $8.16 million, one could purchase all of the stolen payment card numbers from Wisconsin, the Milwaukee-Wisconsin Journal Sentinel reported. The Journal Sentinel noted that its investigation found that:

Prices start at $2.26 for a Visa debit card with an expiration date of September 2014. The most valuable cards are MasterCard platinum debit cards and business credit cards. The most expensive card compromised in Wisconsin, a MasterCard valid through December 2015, was advertised at $127.50.

Interestingly, while Home Depot’s 56 million payment card breach is larger than Target’s 40 million payment card breach, the severity of the blowback so far is much more muted on the part of customers and investors. Part of the reason seems to be that the discovery of the breach happened at the end of summer, a slow shopping time for Home Depot, while Target’s was announced during the prime holiday buying period, which spooked its customers.

Further, investors have figured that Target’s breach cost the company some $150 million, excluding the $90 million in insurance reimbursements—a sum the company could ill afford given its ongoing retail difficulties. A similar sum may dent Home Depot’s bottom line, but the company is better placed financially to absorb the damage. The company stated in its press release that it has spent at least $62 million in dealing with the breach so far, with some $27 million of it covered by insurance. Home Depot says it doesn’t know how much more it will need to spend, but I suspect it could be an additional couple of hundred million dollars before all is said and done.

A third reason for the muted response may be that customers are now becoming inured after so many point-of-sale data breaches. For example, last May, the Ponemon Institute was cited in a CBS News report as stating that some 47 percent of adult Americans have had their personal information compromised in the past year. Given the Home Depot breach, as well as many others since, the number is probably even higher now. How many people had their personal information compromised multiple times is unknown, but I suspect it isn’t an insignificant number.

Home Depot’s financial and reputational pain might increase significantly, however, if the joint Connecticut, Illinois, and California state attorneys general investigation into the breach decides there is sufficient cause to sue Home Depot. As expected, at least one class action lawsuit each has been filed in both the United States and Canada, and more can be expected. Banks may also decide to sue Home Depot to cover the cost of any credit or debit cards they have to replace and for other financial damages, like some did against Target and earlier against TJX.

As reported by both The New York Times and Bloomberg Businessweek, Home Depot had been repeatedly warned by its own IT security personnel about its poor and outdated IT security since 2008. Corporate management reportedly decided not to immediately upgrade the company’s security capabilities using readily available systems, even in the aftermath of the Target breach and the hacking of a couple of Home Depot stores last year, incidents that were not publicly disclosed until now. While the company did eventually decide to upgrade its payment security systems, the implementation effort didn’t get started until April, the same month as the breach. In addition, the papers report, Home Depot seemed to have weak security monitoring of its payment system, even though company management knew it was highly vulnerable to attack.

That Home Depot’s payment system was left vulnerable is interesting, because the company spent hundreds of millions of dollars improving its IT infrastructure over the past decade. Perhaps with revenues of $79 billion in 2013 the company felt it could easily afford the costs of an attack, and therefore, there was no urgent rush to improve its security posture. Brian Krebs notes this apparent lack of urgency as well. He says that even though the company was alerted by banks to something being massively amiss, “thieves were stealing card data from Home Depot’s cash registers up until Sept. 7, 2014, a full five days after news of the breach first broke.”

That alone speaks of an arrogance that belies Home Depot's public statements about how it takes the privacy and security of its customers’ personal information “very seriously.” Local Home Depot store personnel I have spoken with seem very ill-informed concerning the breach and what customers should do about it, which also strikes me as a sign of less than the customer-caring attitude Home Depot advertises.

Home Depot’s seemingly cavalier IT security attitude isn’t unique, of course. Target didn’t bother to investigate alerts from its advanced warning system showing that it was being hacked until it was JTL — just too late. Just last week, eBay was being slammed again for its “lackadaisical attitude” toward IT security after multiple instances of malicious cross-site scripting, unabated since February, were found on its UK website. Only after the BBC started asking eBay questions about the scripting issue did it decide that perhaps it should take them seriously. You may remember, it was only last March when eBay, which also proclaims to take customer security “very seriously,” asked all of its users to change their passwords after a cyberattack compromised its database of 233 million usernames, contact information, dates of birth, and encrypted passwords.

To tell you the truth, every time I read or hear a company or government agency claim in a press release that, “We take your security seriously,” in the wake of some security breach, I shake my head in disbelief.  Why not just state honestly, “We promised to take your security seriously and we obviously failed to take it seriously enough. We’re sorry and we will be better prepared from now on.” Alas, that level of candor is probably much too much to ask.

Tue, 23 Sep 2014 13:00:00 GMT
Indiana’s Bureau of Motor Vehicles Overcharged 180,000 Customers for 10 years BMV says it will refund $29 million plus interest over the coming months
Illustration Credit: iStockphoto

IT Hiccups of the Week

Put aside, for a moment, the record theft of credit card accounts from Home Depot. I'll tell you all about that in a later post. Instead let me pick another interesting IT Hiccup from last week's hodgepodge of IT problems, snarls, and screw-ups: Indiana’s Bureau of Motor Vehicles (BMV) plans to refund some US $29 million plus interest to 180,000 customers for charging them an incorrectly calculated excise tax when they registered their vehicles. The BMV claimed the problem began during the initial changeover in 2004 to its then-new $32 million System Tracking and Record Support (STARS) computer system.

According to the BMV’s press release [pdf], when a car is registered with the agency, state law requires that the vehicle be placed in a specific tax classification category based on its value. The value is calculated “using the price of the vehicle and applying an adjustment factor based upon Consumer Price Index [CPI] data related to increases in new automobile price.” The result of the value calculation is then entered into STARS which in turn uses it to automatically determine the excise tax that needs to be paid by the vehicle’s owner. For reasons that the BMV did not disclose, the STARS programming seems to have failed to take into account the adjustment factor for 180,000 out of over 60 million vehicles registered using the software system, resulting in that small subset of registrations being overcharged.
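The failure mode the BMV describes can be sketched in a few lines, assuming hypothetical tax brackets and adjustment factors (Indiana's real tables are more involved): the tax is keyed to a CPI-adjusted value, so skipping the adjustment factor can put a vehicle in the wrong, more expensive classification.

```python
# Hypothetical excise-tax brackets: (minimum adjusted value, annual tax).
# The real Indiana tables differ; these numbers only illustrate the bug.
TAX_BRACKETS = [
    (0, 50),
    (15_000, 150),
    (30_000, 300),
]

def excise_tax(sticker_price: float, cpi_factor: float = 1.0) -> int:
    """Classify the vehicle by its CPI-adjusted value, then look up the tax."""
    adjusted_value = sticker_price * cpi_factor
    tax = 0
    for floor, amount in TAX_BRACKETS:
        if adjusted_value >= floor:
            tax = amount
    return tax

# Correct calculation: the adjustment factor deflates the taxable value.
print(excise_tax(20_000, cpi_factor=0.7))  # 50
# The reported bug: the factor is never applied, so the owner lands in a
# higher bracket and is overcharged.
print(excise_tax(20_000))                  # 150
```

In this toy version the unadjusted value crosses a bracket boundary, which is presumably why only a small subset of registrations, those near a boundary, were affected.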

The BMV says it will be mailing out letters to those customers affected within the next month or so. However, those overcharged customers will still need to fill out the form enclosed in the letter and send it back to the state if they want to receive their refunds.

The miscalculation of the excise tax doesn’t just affect BMV customers, either. A percentage of the excise tax collected by the BMV is transferred to Indiana’s local and county governments, which are now going to have to repay an estimated $6 million back to the state. The BMV press release states that to help reduce the financial impact on local and county government budgets, the state will pay back the overcharged BMV customers and the interest they are owed. The local and county governments will then see their excise tax distributions reduced over the next two years to make up for the previous incorrect tax payments they received.

Like most of these types of incidents, this one has an interesting back story: According to the Post-Tribune, the excise tax error was only recently discovered when the latest Consumer Price Index data was being entered into STARS. Someone apparently started asking how the excise tax was being calculated, and this led to the discovery of the tax overcharges going back to 2004.

In addition, the BMV press release states that Indiana Governor Mike Pence has authorized the BMV to appoint an independent consulting company to audit the agency to ensure there are no more fee miscalculations. The reason, BMV Commissioner Don Snemis told the Associated Press, was that, “I don't want to discover any more errors after the damage has been done.” Snemis might have added that what he really didn’t want were any more errors disclosed as a result of yet another lawsuit against the BMV.

You see, the BMV settled a $30 million class action lawsuit last year for also overcharging the fees paid to it by some 4.5 million BMV customers when they obtained or renewed their driver's licenses between 7 March 2007 and 27 June 2013. The BMV was publicly embarrassed by the suit, and agreed to credit customer accounts for what it called the “inadvertent” overcharges. The lawyers suing the BMV, however, said the fees were actually imposed on Indiana drivers on purpose, something the BMV strongly denied.

Then in September of last year, the BMV was even more embarrassed when it had to admit that it had also discovered that it been overcharging customers on some 30 other fees. The BMV said that it would immediately be crediting customer accounts for any charges imposed going back six years, which was the statute of limitations. The BMV blamed “the errors on misapplying complex state laws governing more than 300 fees for various BMV services,” the Indianapolis Star reported.

Soon after the BMV admission, the same law firm of Cohen and Malad LLP that forced the earlier settlement (and earned $6 million off of it) brought a second class-action lawsuit against the BMV in October of last year asking the courts to force BMV to not merely issue credits, but instead issue full refunds. The lawsuit also demanded that there be a full accounting of all BMV overcharges, including when the overcharges were first discovered by BMV management. The law firm contended that BMV knew about the 30-plus fee overcharges for some time, and did nothing about them until the first lawsuit was filed. The BMV once more denied the charge, and countered that BMV customers were already made whole by the issuing of the credits to their accounts. Snemis also accused the law firm of Cohen and Malad of bringing the lawsuit as just a way of “seeking a very big fee.”

However, the Indianapolis Star reported in June that a videotaped deposition given by a former BMV Deputy Director in fact indicated that the BMV had known about the overcharging for quite a while, but “secretly kept doing it for at least two years to avoid budget troubles.” The Journal Gazette further reported that the BMV is now trying to keep secret other depositions taken in the latest lawsuit against it, supposedly because the BMV claims that allowing them to become public would discourage others from testifying in the lawsuit, an excuse no one seems to be buying.

Another wrinkle to this story is that the STARS system itself has a long, controversial history of its own. The STARS system, as was mentioned, was initially introduced into the BMV in early 2004. However, continuing problems with it meant that the system it was supposed to replace, called BOSS, had to be used concurrently with STARS until July 2006, when BMV Commissioner Joel Silverman decided that the agency was going to use STARS exclusively. It may have been during this cross-over period that the latest fee fiasco occurred.

Silverman’s decision did not turn out well. Angry BMV customers quickly found themselves waiting in long lines as many routine transactions could not be completed; online and self-service options were also unavailable. Indiana state police were also said to be worried that they might arrest drivers without just cause, or have to let drivers who should be arrested go free, because the information being sent to them from the BMV appeared to be inaccurate.

It wasn’t until late August that BMV operations started resembling anything approaching normal. By then, however, former Gov. Mitch Daniels was forced to apologize for the ongoing BMV problems in an attempt to cool the ensuing political firestorm raised by the debacle.

By early September, Silverman decided that perhaps it was in everyone’s—and especially the governor’s—best interest to resign. No doubt helping Silverman’s decision along was his loud and forceful declaration within a week of STARS going live that the system was fixed when it obviously was not, as well as admitting (pdf) that none of the extensive system testing that was supposedly done showed any indication of problems later encountered. Obviously, all that extensive testing missed the current excise tax problem as well.

It took over a year for STARS to finally start to operate satisfactorily.

In Other News….

Software Failure Crashes Rice University Network

Fat Finger Trade Hits BP Shares

Canadian Geography Vexes Apple IPhone 6

Johannesburg Stock Exchange Experiences Network Issues

Canadian Hospital Computer Snafu Delays 2,700 Test Results

Australian Dept. of Human Services Fixing Aged Care Processing Glitch

Technology Errors with New Parking System Hit Train Commuters in Perth Australia

New Smart Parking Meters in Walnut Creek California Dole Out Fines Even for Fully Paid Up Parking

Microsoft Pulls OneDrive for Business Patch

Computer and Human Errors Allow Passenger to Board Plane with Wrong Boarding Pass

Flawed Federal Health Insurance Calculator Allows Large Employers to Offer Substandard Health Plans

US Labor Department Releases Important CPI Data Early Due to Unknown "Technical Issues"

Mon, 22 Sep 2014 04:58:00 GMT
GM: The Number of Models That Could Shut Off While You’re Driving Has Tripled Faulty ignition switches could be a hazard in 100 million vehicles
Photo: Mark Wilson/Getty Images
Kim Langley [left], Laura Christian [second from left], Randal Rademaker [second from right] and Mary Ruddy [right] stand with other family members of drivers who were killed while driving GM cars during a news conference at the U.S. Capitol on 1 April 2014. The families want to know why it took GM so long to recall the faulty ignition switch on certain models.

Guess what I got in the mail yesterday! Nope. But that was a good guess. The letter in my mailbox was a safety recall notice from General Motors, the manufacturer of the car I drive. Why should you care, you ask? I'm one of half a million people who have received the notice about the problem, but we represent less than one percent of the number of drivers affected.

In mid-March, I wrote a post for the Risk Factor blog discussing how GM belatedly got religion over its egregious failure to remedy a dangerous problem. Here’s how I described it then:

Millions of cars were equipped with a part that didn’t provide enough resistance to, say, a key ring swinging and rotating the car key so that the ignition was suddenly turned from the on (run) position to the off (accessory) position. There’s nothing to prevent that turn from happening except the tension provided by the spring in the part, known as a detent plunger.

What’s so bad about that? When the car suddenly turns off, power assist for steering and braking are lost, leaving a driver desperately struggling to keep the car from crashing.

That having been said, two things puzzled me about the recall notice I received:

1) The fact that it is now mid-September and the letter notes that the automaker expects “to have sufficient parts to begin repairs by October 1, 2014.” Only then, the letter says, should I contact my local GM dealer to arrange a service appointment. For those keeping score at home, that’s six months since I reported about GM’s CEO making a halfhearted, quasi-admission that the company had dragged its feet and let more than a dozen people die from sudden engine shutoffs. (According to the Wall Street Journal, the official death toll has risen from 13 to 19, but 125 wrongful death claims have been filed in court. Additionally, 445 claims have been made against a US $400 million fund GM has set up to cover such claims.) I can’t count the number of trips I’ve made between my home and the park-and-ride on my way to and from work, or the number of times I’ve picked my sons up from school, taken them on camping trips or to ballgames. GULP!

(Okay. I know what you’re thinking: If you wrote about this, you were the LAST person who should have been unaware of the danger! But that brings me to the second thing that caused me to cock my head in bewildered surprise.)

2) The GM models that had been linked to deaths because of faulty ignition switches were the Chevy Cobalt, Chevy HHR, Pontiac G5, Pontiac Pursuit, Pontiac Solstice, the Saturn Ion, and the Saturn Sky. But it’s now abundantly clear that the Chevrolet Malibu, the GM vehicle I drive, should have been on that list as well.

It didn’t take much digging to discover a few mind-blowing things. That list of seven GM models known to have problems with faulty ignition switches has now ballooned to 20. (Click here to go to the GM ignition recall safety information website; the front page contains a list of the models and affected model years.)

I called the Chevrolet Customer Assistance Center toll-free number provided on the recall notice. A representative told me that the problem with the 13 additional models was discovered in May. He indicated that, as of last week, the automaker had sent out 500,000 notices related to the ignition switch recall for the models added to the list at that point. Asked why it took four months to send me the piece of paper notifying me that my vehicle could potentially suffer sudden shutoff, he said GM has been sending letters around to the states and was just getting around to me. But I guess I should count myself lucky. The customer service representative revealed that although he couldn’t state categorically the exact number of affected vehicles, he estimates that it is “100 million, perhaps.” That’s a lot of letters—many of which have yet to be posted—and a lot of drivers unaware of the potentially deadly problem they encounter each time they get behind the wheel.

Wed, 17 Sep 2014 18:00:00 GMT
Looking for the Key to Security in the Internet of Things New standards will be necessary to keep the coming horde of devices from introducing myriad problems
Photo-illustration: iStockphoto

As the number of Internet-connected devices in any home skyrockets from a few, to a few dozen, to perhaps even a few hundred—including interconnected thermostats, appliances, health and fitness monitors, and personal accessories like smart watches—security concerns for this emerging Internet of Things (IoT) will skyrocket too. Cisco projects that there will be 50 billion connected devices by 2020; each such node should ideally be protected against malware, spyware, worms, and trojans, as well as overzealous government and commercial interests who might produce their own privacy-compromising intrusions.

It’s a tall order, says Allen Storey, product director at the UK security firm Intercede. But the biggest challenges today are not so much technical problems as matters of awareness and education. Consumers need to know, says Storey, that IoT security is a real concern as the first wave of gadgets rolls out into the marketplace. And unlike faster processors or bigger memories, security is a product feature that the marketplace may not reward on its own.

Writing in the journal Network Security in July, Storey said that “Without the threat of end-user backlash, there is no strong business case for manufacturers to add a ubiquitous security element into the development process.” Moreover, he said, commercial pressures could actually erode IoT security as many small players rush to be first to market. It's also likely that players will pursue siloed security standards, leaving substantial security holes as those devices interconnect with still other Internet-enabled devices (e.g. routers, smartphones, smart watches).

In the absence of any clear industry-wide IoT security standards, Intercede CTO Chris Edwards says consumers should shop for devices that rely on tried and tested security schemes, especially public key cryptography.

“When you’re looking at authenticating devices, the only real standards at the moment that offer any real interoperability tend to be Public Key Infrastructure (PKI),” he says. “The idea here is that you have a secure hardware element in that device that is able to generate and store and use private cryptographic keys that cannot be exported. So you can’t clone that device.”

So PKI chips, like those found in most smart cards, can help secure IoT communications. One other security standard that could be important in the IoT’s early years, Edwards says, is that of the FIDO (Fast IDentity Online) Alliance.

FIDO, a commercial consortium whose members include Microsoft, Google, PayPal, Lenovo, BlackBerry, and MasterCard, offers a lower-overhead variation of PKI that authenticates users and devices in part via biometrics (e.g. fingerprint-sensing chips) and PINs. This in turn makes FIDO more readily scalable to home networks with many devices on them, some of which may not have the battery or processor power to do classic private-public key cryptography for every communication.
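The flow Edwards describes, in which a verifier issues a fresh random challenge and a device proves possession of a secret that never leaves it, can be sketched in a few lines. This is a simplified illustration, not the FIDO protocol: for brevity it uses a shared-secret HMAC from Python's standard library in place of the hardware-backed private key and digital signature that FIDO and PKI actually use, and the `Device` and `Verifier` classes are hypothetical.

```python
import hashlib
import hmac
import secrets


class Device:
    """A 'thing' holding a secret that never leaves it. This stands in
    for a hardware-backed private key; real FIDO/PKI would answer the
    challenge with an asymmetric signature instead of an HMAC."""

    def __init__(self, key: bytes):
        self._key = key  # never exported

    def respond(self, challenge: bytes) -> bytes:
        # Prove possession of the secret by MACing the fresh challenge.
        return hmac.new(self._key, challenge, hashlib.sha256).digest()


class Verifier:
    """The relying party, e.g. the 'front door' that trusts the watch."""

    def __init__(self, key: bytes):
        self._key = key

    def authenticate(self, device: Device) -> bool:
        challenge = secrets.token_bytes(32)  # fresh randomness defeats replay
        expected = hmac.new(self._key, challenge, hashlib.sha256).digest()
        # Constant-time comparison avoids a timing side channel.
        return hmac.compare_digest(device.respond(challenge), expected)


shared = secrets.token_bytes(32)  # provisioned when watch and door are paired
watch, door = Device(shared), Verifier(shared)
print(door.authenticate(watch))             # True: correct secret
print(door.authenticate(Device(b"clone")))  # False: wrong secret
```

In real FIDO the response would be a signature checkable against the device's public key, so the verifier holds no secret worth stealing; the shared-key version here only illustrates the message flow.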

“I don’t want the whole world to trust my watch,” Edwards says. “I just want to make sure the front door trusts my watch.”

Apple is conspicuously absent from FIDO's membership roll, which means that the Apple Watch's security will involve a yet-to-be-disclosed set of proprietary security standards. Those protocols will thus probably form an important second web of security standards for the most secure IoT devices.

As an example of an IoT network that uses both PKI and FIDO, Edwards imagines a smartphone that communicates with a smart refrigerator in its owner’s home. The phone and refrigerator have already been introduced to each other and thus don’t need the highest PKI security levels. In that situation, FIDO would suffice for communications between the two devices, such as the smartphone telling the fridge to go into low-power mode when the family goes on vacation, or the fridge reporting to the phone that it's time to pick up some milk from the grocery store.

On the other hand, if the fridge communicates directly to the store to order more milk, the grocery store isn’t going to want to deal with FIDO certifications for its hundreds of customers. It’s more likely to insist on PKI security and authentication when a nearby fridge orders a gallon of milk or a case of beer.

In all, Storey says, the landscape of IoT security standards demands a company that can manage all such secure transactions behind the scenes for the cornucopia of third-party IoT device makers—perhaps like antivirus software today is managed and regularly updated by a small set of private, specialized companies.

“Given the absence of one standards agency producing cover-all protocols, an opportunity has emerged for security vendors and service providers to offer their own umbrella solutions that enable the individual to take control,” Storey wrote. “This is an exciting new dawn, but the industry must first come together to ensure it is a secure one for everyone concerned.”

Tue, 16 Sep 2014 14:00:00 GMT
Detroit's IT Systems “Beyond Fundamentally Broken”
Being only fundamentally broken would be an improvement, city CIO says
Photo: Tom Szczerbowski/Getty Images

IT Hiccups of the Week

Last week’s IT Hiccups parade was a bit slower than normal, but there were a couple of IT snafus that caught my eye. For instance, there was the embarrassed admission by Los Angeles Unified School District (LAUSD) chief strategic officer Matt Hill that the new-but-still-problem-plagued MiSiS student tracking system I wrote about a few weeks ago should have had “a lot more testing” before it was ever rolled out. There also was the poorly thought out pasta promotion by Olive Garden restaurants that ended up crashing its website. However, what sparked my curiosity most was the disclosure by Beth Niblock, Detroit’s Chief Information Officer, that the city’s IT systems were broken.

How broken are they? According to Niblock:

“Fundamentally broken, or beyond fundamentally broken. In some cases, fundamentally broken would be good.”

Niblock’s comment was part of her testimony during Detroit’s bankruptcy hearings. Detroit filed for bankruptcy last July and has since been in bankruptcy court trying to work out debt settlements with its creditors, some of whom are unhappy over the terms the city offered. Niblock was a witness at a court hearing looking into whether the city’s bankruptcy plan was feasible and fair to its many creditors, and whether the plan would put the city on more sound financial and operational footing.

Critical to Detroit returning to financial and operational soundness is the state of the city’s IT systems. However, since the 1990s the city’s IT systems have generally been a shambles, and that is putting it charitably. Currently, according to Niblock (who took on the CIO job in February after turning it down twice, and who may be wishing she had a third time), the city’s IT systems are “atrocious,” “unreliable,” and “deficient,” Reuters reported.

Reuters went on to report Niblock's testimony that the city’s Unisys mainframe systems are “so old that they are no longer updated by their developers and have security vulnerabilities.” She added that the desktop computers, which mostly use Windows XP or something older, “take 10 minutes” to boot. It probably doesn’t matter anyway, since the computers run so many different versions of software that city workers can’t share documents or communicate, Niblock says. That also may not be so bad, given that city computers have apparently been infected several times by malware.

Detroit’s financial IT systems are so bad that the city really hasn’t known what it is owed, or in turn what it owes, for years. A Bloomberg News story last year, for example, described a $1 million check from a local school district that wasn’t deposited by Detroit for over a month; during that time, the check sat in a city hall desk drawer. That isn’t surprising, the Bloomberg story noted, as the city has a hard time keeping track even of funds electronically wired to it. The financial systems are so poor that city income-tax receipts need to be processed by hand; in fact, some 70 percent of all of the city’s financial accounting entries are still done manually. The costs of doing things manually are staggering: it costs Detroit $62 to process each city paycheck, as opposed to the $18 or so it should cost. Bloomberg stated that a 2012 Internal Revenue Service audit of the city’s tax collection system termed it “catastrophic.”

While the financial IT system woes are severe, the fire and police departments' IT systems may be in even worse shape. According to the Detroit Free Press, there is no citywide computer-aided dispatch system to communicate emergency alerts to fire stations. Instead, fire stations receive the alerts by fax machine. To make sure the alarm is actually heard, firefighters have rigged Radio Shack buzzers and doorbells, among other homemade Rube Goldberg devices, to be triggered by the paper coming out of the fax machine. Detroit's Deputy Fire Commissioner told the Free Press that, “It sounds unbelievable, but it’s truly what the guys have been doing and dealing with for a long, long time.”

You really need to check out the video accompanying the Free Press story, which shows firefighters balancing a soda can filled with coins and screws on the edge of the fax machine so that it will be knocked off by the paper coming out of the machine when an emergency alert is received at the fire station. Makes one wonder what happens if the fax runs out of paper.

The Detroit police department's IT infrastructure, what there is of it, isn’t in much better shape. Roughly 300 of its 1,150 computers are less than three years old. Apparently even those “modern” computers have not received software updates, and in many cases, the software the police department relies on is no longer supported by vendors. The police lack an automated case management system, which means officers spend untold hours manually filling out, filing, and later trying to find paperwork. Many Detroit police cars also lack basic Mobile Data Computers (MDC), which means officers have to rely on dispatchers to perform even basic functions they should be able to do themselves. An internal review (pdf) of the state of Detroit’s police department was published in January, and it makes for very sad, if not scary, reading.

If you are interested in how Detroit’s IT systems became “beyond fundamentally broken,” there is a great case study that appeared in a 2002 issue of Baseline magazine. It details Detroit’s failed attempt, beginning in 1997, to upgrade and integrate its various payroll, human resources, and financial IT systems into a single be-all Detroit Resource Management System (DRMS) that went by the name “Dreams.” The tale told is a familiar one to Risk Factor readers: attempting to replace 22 computer systems used across 43 city departments with one city-wide system resulted in a massive cost overrun and little to show for it five years on. Crain’s Detroit Business also took a look back at the DRMS implementation nightmare in a July article.

Detroit hopes, the Detroit News reports, that the bankruptcy judge will approve its proposed $101 million IT “get well” plan, which includes $84.8 million for IT upgrades and $16.3 million for additional IT staff. (In February, according to a story in the Detroit Free Press, the city wanted to invest $150 million, but that amount apparently had to be scaled back because of budgetary constraints.) Spending $101 million, Niblock admitted, will not buy world-class IT systems, but ones that are, “on the grading scale… a ‘B’ or a B-minus” at best. And Niblock concedes that getting to a “B” grade will require a lot of things going perfectly right, which is not likely to happen.

On one final note, I’d be remiss not to mention that last week was also the 25th anniversary of the infamous Parisian IT Hiccup. For those who don’t remember, in September 1989, some 41,000 Parisians who were guilty of simple traffic offenses were mailed legal notices that accused them of committing everything from manslaughter to hiring prostitutes or both.  As a story in the Deseret News from the time noted:

“A man who had made an illegal U-turn on the Champs-Élysées was ordered to pay a $230 fine for using family ties to procure prostitutes and ‘manslaughter by a ship captain and leaving the scene of a crime.’”

Local French officials blamed the problem on “human error by computer operators.”

Plus ça change, plus c'est la même chose.

In Other News ….

Coding Error Exposes Minnesota Students' Personal Information

Computer Glitch Sounds Air Raid Sirens in Polish Town

Computer Problems Change Florida County Vote Totals

Billing Error Affects Patients at Tennessee Regional Hospital

Dallas Police Department Computer Problems Causing Public Safety Concerns

New York Thruway Near Albany Overbills 35,000 EZ‐Pass Customers

Olive Garden Shoots Self in Foot With Website Promotion

Apple Store Crashes Under iPhone6 Demand

Scandinavian Airlines says Website Now Fixed After Two Days of Trouble

Housing New Zealand Tenants Shocked by $10,000 a Week Rent Increases

GM's China JV Recalling 38,328 Cadillacs to Fix Brake Software

LAUSD MiSiS System Still Full of Glitches

Mon, 15 Sep 2014 14:00:00 GMT
FCC Fines Verizon $7.4 Million Over Six-Year Privacy Rights “IT Glitch”
2 million customers not told they could opt out of targeted marketing
Photo: Denis Doyle/Bloomberg/Getty Images

IT Hiccups of the Week

The number of IT snafus, problems and burps moved back to a more normal rate last week. There were a surprising number of coincidental outages that hit Apple, eBay, Tumblr and Facebook, but other than these, the most interesting IT Hiccup of the Week was the news that the U.S. Federal Communications Commission (FCC) fined Verizon Communications a record $7.4 million for failing to notify two million customers of their opt-out rights concerning the use of their personal information for certain company marketing campaigns.

According to the Washington Post, Verizon is supposed to inform new customers via a notice in their first bill that they could opt-out of having their personal information used by the company to craft targeted marketing campaigns of products and services to them. However, since 2006, Verizon failed to include the opt-out notices.

A Verizon spokesperson said the oversight was “largely due to an inadvertent IT glitch,” the Post reported. The spokesman, however, didn’t make clear why the company failed to notice the problem until September 2012, nor why it didn’t inform the FCC of the problem until 18 January 2013, some 121 days later than the agency requires. (Companies are required to inform the FCC of issues like this within five business days of their discovery.)

The FCC’s press release announcing the fine showed that the agency was clearly irritated by Verizon’s tardiness. Travis LeBlanc, the acting chief of the FCC Enforcement Bureau, said that, “In today’s increasingly connected world, it is critical that every phone company honor its duty to inform customers of their privacy choices and then to respect those choices. It is plainly unacceptable for any phone company to use its customers’ personal information for thousands of marketing campaigns without even giving them the choice to opt out.”

Of course, a better solution would be for the FCC to force companies to allow customers only to opt-in to the use of their personal information, but that discussion is for another day.

On top of the $7.4 million fine, which the FCC took pains to point out is the “largest such payment in FCC history for settling an investigation related solely to the privacy of telephone customers’ personal information,” Verizon will have to include opt-out notices in every bill, as well as put a system in place to monitor and test its billing system to ensure that they actually go out.

Verizon tried to downplay the privacy rights violation, of course, even implying that its customers benefited from the glitch by being able to receive “marketing materials from Verizon for other Verizon services that might be of interest to them.”

Readers of the Risk Factor may remember another inadvertent Verizon IT glitch, disclosed in 2010, in which Verizon admitted that it had over-billed customers $52.8 million in “mystery fees” over three years. During that time, Verizon customers who called the company to complain about the fees were basically told to shut up and pay them. The FCC smacked Verizon with a then-record $25 million fine for that little episode of customer non-service and IT ineptitude.

Last year, Verizon agreed to pay New York City $50 million for botching its involvement in the development of a new 911 emergency system. Alas, that wasn’t a record-setting settlement; SAIC owns that honor after paying the city $466 million to settle fraud charges related to its CityTime system development.

In Other News…

eBay Access Blocked by IT Problems

Facebook Experiences Third Outage in a Month

Tumblr Disrupted by Outage

Apple iTunes Outage Lasts 5 Hours

Twitter Sets Up Software Bug Bounty Program

Children Weight Entry Error Placed Australian Jet at Risk

Spanish ATC Computer Problem Scrambles Flights

Yorkshire Bank IT Problems Affects Payments

Computer Problem Hits Boston MBTA Corporate Pass Tickets

Unreliable Washington, DC Health Exchange Still Frustrates Users

South African Standard Bank Systems Go Offline

New Zealand Hospital Suffers Major Computer Crash

Computer Crash Forces Irish Hospital to Re-Check Hundreds of Blood Tests

Fiji Airways Says No to $0 Tickets Caused by Computer Glitch

Portugal’s New Court System Still Buggy

Hurricane Projected Landfall Only 2,500 Miles Off

Mon, 8 Sep 2014 13:00:00 GMT
Vulnerable "Smart" Devices Make an Internet of Insecure Things
The weakest security links in the Internet of Things are often the Things themselves
Photo: Robyn Beck/AFP/Getty Images

According to recent research [PDF], 70 percent of Americans plan to own, in the next five years, at least one smart appliance like an internet-connected refrigerator or thermostat. That's a skyrocketing adoption rate considering the number of smart appliance owners in the United States today is just four percent. 

Yet backdoors and other insecure channels have been found in many such devices, opening them to possible hacks, botnets, and other cyber mischief. Although the widely touted hack of smart refrigerators earlier this year has since been debunked, there’s still no shortage of vulnerabilities in the emerging, so-called Internet of Things.

Enter, then, the Horst Görtz Institute for IT Security at Ruhr-University Bochum in Germany, one of the world’s top research centers devoted to IT security, boasting 700 students in this growing field. A research group at HGI, led by Christof Paar—professor and chair for embedded security at the Institute—has been discovering and helping manufacturers patch security holes in Internet-of-Things devices like appliances, cars, and the wireless routers they connect with.

Paar, who is also adjunct professor of electrical and computer engineering at the University of Massachusetts at Amherst, says there are good engineering, technological, and even cultural reasons why security of the Internet of Things is a very hard problem.

For starters, it’s hard enough to get people to update their laptops and smartphones with the latest security patches. Imagine, then, a world where everything from your garage door opener to your coffeemaker, your eyeglasses, and even your running shoes has possible vulnerabilities. And the onus is entirely on you to download and install firmware updates—if there are any.

Furthermore, most Internet-connected “things” are net-savvier iterations of designs with long pre-Internet legacies—legacies in which digital security had never been a major concern. But, Paar says, security is not just another new feature to be added onto an Internet-connected device. Internet security requires that designers and engineers embrace a different culture altogether.

“There’s essentially no tolerance for error in security engineering,” Paar says. “If you write software, and the software is not quite optimum, you might be ten percent slower. You’re ten percent worse, but you still have pretty decent results. If you make one little mistake in security engineering, and the attacker gets in, the whole system collapses immediately. That’s kind of unique to security and crypto-security in general.”

Paar’s research team, which published some of its latest findings in Internet-of-Things security this summer, spends a lot of time on physical and electrical engineering-based attacks on IoT devices, also called side-channel attacks.

For instance, in 2013 Paar and six colleagues discovered an exploit in an Internet-connected digital lock made by Simons Voss. It involved a predictable, non-random number the lock’s algorithm used when challenging a user for the passcode. And the flaws in the security algorithm were discoverable, they found, via the wireless link between the lock and its remote control.

The way they handled the discovery was how they handle all security exploit discoveries at the Institute, Paar says. They first revealed the weakness to the manufacturers and offered to help patch the error before they publicized the exploit.

“They fixed the system, and the new generation of their tokens is better,” he says. “They had homegrown crypto, which failed. And they had side-channel [security], which failed. So we had two or three vulnerabilities which we could exploit. And we could repair all of them."
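To see why a predictable challenge is fatal, consider a toy sketch. This is an illustration of the general flaw, not the actual Simons Voss protocol: a standard-library HMAC with a hypothetical shared key stands in for the lock's homegrown crypto. If the lock derives its "random" challenge from a counter, an attacker with brief access to a legitimate token can precompute the answer to an upcoming challenge and present it later, without ever learning the key:

```python
import hashlib
import hmac
import secrets

KEY = secrets.token_bytes(32)  # secret shared by the lock and legitimate token


def response(challenge: bytes) -> bytes:
    # The token's reply to a challenge; the attacker never learns KEY,
    # it can only observe or solicit responses.
    return hmac.new(KEY, challenge, hashlib.sha256).digest()


def predictable_challenge(counter: int) -> bytes:
    # Flawed design: the "random" challenge is derived from a counter.
    return counter.to_bytes(8, "big")


# The lock's next challenge is predictable, so an attacker with brief
# access to the legitimate token precomputes the answer in advance...
n = 41
upcoming = predictable_challenge(n + 1)
precomputed = response(upcoming)  # attacker queries the real token once

# ...and later presents it when the lock actually issues that challenge.
assert precomputed == response(upcoming)

# With an unpredictable challenge, precomputation is useless: the stored
# answer will not match a fresh, randomly drawn challenge.
fresh = secrets.token_bytes(8)
assert precomputed != response(fresh)
```

Drawing challenges from a cryptographically strong source, as in the last two lines, closes this particular hole; the broader lesson of the Simons Voss case is that homegrown crypto tends to hide exactly this kind of flaw.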

Among the findings in the scores of papers and research reports the Embedded Security group publishes, Paar says, one of the most often overlooked factors behind hacking is not technological vulnerability but economics.

“There’s a reason that a lot of this hacking happens in countries that are economically not that well off,” Paar says. “I think most people would way prefer having a good job in Silicon Valley or in a well-paying European company—rather than doing illegal stuff and trying to sell their services.”

But as long as there are hackers, whatever their circumstances and countries of origin, Paar says smart engineering and present-day technology can stop most of them in their tracks.

“Our premise is that it’s not that easy to do embedded security right, and that essentially has been confirmed,” he says. “There are very few systems we looked at that we couldn’t break. The shocking thing is the technology is there to get the security right. If you use state of the art technology, you can build systems that are very secure for practical applications.”

Wed, 3 Sep 2014 17:00:00 GMT
310,000 Enrollees Must Provide Proof Now or Lose Insurance
Government claims eligibility info missing, but insurance agents say online glitches are preventing information from being received by CMS
Photo: Getty Images

IT Hiccups of the Week

Last week, there were so many reported IT snags, snarls and snafus that I felt like the couple who finally won the 20-year jackpot on the Lion’s Share slot machine at the Las Vegas MGM Grand casino. Among the IT Hiccups of note: the routine-maintenance oofta at Time Warner Cable Wednesday morning that knocked out Internet and on-demand service for over 11 million of its customers across the US and continued to cause other service issues for several days afterward; the “coding error” missed for six years by Germany’s Deutsche Bank that caused 29.4 million equity swaps to be misreported to the UK government, with buys reported as sales and vice versa; and the rather hilarious software bugs in the new Madden NFL 15 American football game, which has players flying around the field in interesting ways.

However, for this week, we just can’t ignore yet another snafu of major proportions. Last week, USAToday reported that the Centers for Medicare and Medicaid Services sent letters to 310,000 people who enrolled for health insurance through the federal website asking for proof of citizenship or immigration status by 5 September or they were going to lose their health insurance at the end of September.

Unfortunately, an untold number of the enrollees sent letters had indeed previously provided the information to CMS, but the information either hasn’t been properly processed or has been lost by CMS. Even resending the requested information electronically doesn’t always work because of ongoing technical issues with the website, USAToday reports. Insurance agents, for example, complain that when they upload the documentation CMS requested, it seems to be accepted without an issue, but CMS later says it never received it.

Other enrollees are affected because of data integrity issues with Department of Homeland Security and Social Security Administration databases, for instance, first and last name transpositions. Compounding the problem, USAToday says, is that back in April all user passwords were reset because of the Heartbleed bug. Not unexpectedly, lots of people can’t remember the answers to their security questions, which means that they can’t reset their passwords, which in turn means no online access to their accounts and no way to upload their documentation.  

The problem of losing information is not confined to the federal site, either. Last week, the Las Vegas Review-Journal reported that some 10,000 of the 30,000 enrollees in Nevada’s health insurance exchange are having billing and other coverage issues with their insurance plans. It is not uncommon, the Review-Journal stated, for enrollees to have paid their premiums in full and on time, and to have bank statements showing the withdrawals to prove it, yet still be told they never paid their premiums. Inevitably, the premium “nonpayment” leads to a cancellation of their insurance policies, with the enrollees unfairly stuck paying their medical bills.

In addition, while the state knows of the problems, the Review-Journal says, it doesn't know how to deal with them. It seems that the health exchange developer, Xerox, has fouled up the enrollee premium billing system so badly that the issue may not be fully resolved until next year, when Xerox’s contract expires and the federal government takes over Nevada’s health insurance exchange. In May, Nevada fired Xerox over its inability to fix over 1,500 software issues with the insurance exchange.

Also last week, things heated up between Oracle and the state of Oregon over the “who’s at fault” argument concerning the $305 million Cover Oregon health insurance exchange debacle. In early August, Oracle sued (pdf) Oregon in federal district court for $23 million over unpaid bills and interest, alleging that Oregon breached its contract with the company.

Well, Oregon has fired back with a lawsuit of its own. Saying that, “Oracle sold the State of Oregon a lie,” Oregon Attorney General Ellen Rosenblum filed a fraud and racketeering lawsuit (pdf) against Oracle for a whopping $5.5 billion. The state also wants to keep Oracle from contracting with any of Oregon’s public corporations or government organizations in the future.

Predictably, each party asserts that the claims against it have no merit and that it will prevail in court. I doubt either party really wants to go to trial, since a trial would no doubt highly embarrass them both for their utter IT project management amateurism and risk mismanagement. I expect — but not before a pack of ravenous lawyers gets paid a whole bunch of money — an out-of-court settlement where both sides can claim victory. However, that won’t likely happen before each side tries to cause as much grief for the other as possible.

In one last note of interest, the U.S. Department of Health and Human Services Inspector General published a report (pdf) on the support contracts awarded to build the less-than-stellar federal health insurance exchange. The IG report says that as of February, 33 contractors operating under 60 different contracts had been used to implement the exchange. The contractors had been paid more than $500 million as of February, with more payments expected before the end of the year. Some 20 of the 60 contracts have exceeded their contracted amounts, with 7 of those contracts (so far) exceeding the amount by over 100 percent.

At least $700 million more is estimated to be needed for system maintenance and upgrades, but I suspect that number is highly underestimated given the system’s complexity, the unruly herd of uncoordinated contractors, and the overmatched government oversight involved.

In Other News …

Database Update Problem Affects 8,200 Walgreen Pharmacies

CME Group Says Trading Disruption Caused by Maintenance Gone Bad

Android App Claims National Weather Service website

New IT System Glitch at Ireland’s Bus Éireann Means 30,000 Students Have No Bus Tickets

Computer Problems Keep Alaska’s Credit Union 1 Customers from Accessing Accounts

UK’s East Midland Ambulance Service Loses 41,000 Patient Records

EDF Pays Millions in Fines over UK Customer Complaints Traced to Botched IT System

Florida Alachua County Promises to Fix Voting Glitches

Two Arizona Counties Struggle with Voting Troubles

Electronic Voting Loses Favor in Japan

TennCare Management Explains Ineptitude in Failed System Rollout

StatsCanada Says Being Asleep at Switch Led to Errors

Computer Fault Blamed for US Hypersonic Weapons Test Failure

6,000 LA Residents Still Don’t Have Correct Water Bills after Seven Months

Glitch Causes California State Workers to Receive 200,000 Hours of Unearned Leave Credits

LA Unified School District Students Walk Out of School over Ongoing Student Information Problems

San Diego Unified School District Admits Computer Problems Hurt Students’ College Admission Chances

Computer Glitch Makes Everyone on Modest UK Town Street Property Millionaires

UK’s Virgin Media Repairs Email Glitch

Bank of Ireland Promises to Pay Public Servants After New Processing Issues Surface

South Carolina DMV Experiences Computer Problems

40,000 Speed Camera Tickets Dismissed in Nassau County, New York over Camera Malfunctions

Despite Claims, Florida’s Unemployment System Still Not Fixed

Microsoft Reissues Security Patch After Incapacitating Numerous Windows PCs

Google Software Bug Causes Repeat-Image Issue to Appear in Search

Drivers Unhappy over Unreliable Car Technology

Software Bug Blamed for Two EU Satellites Ending up in Wrong Orbit

Tue, 2 Sep 2014 13:00:00 GMT
LA School District Continues Suffering MiSiS Misery
New legally mandated student tracking system plagued with problems
Photo: Robyn Beck/AFP/Getty Images

IT Hiccups of the Week

With schools starting to open for the 2014-2015 academic year across the United States, one can confidently predict that there will be several news stories of snarls, snafus, and hitches as new academic IT support systems go live for the first time. (You may recall that happening in Maryland, New York, and Illinois a few years ago.)

While most of these “teething problems” are resolved during the first week or so of school, significant IT issues affecting the performance of the new integrated student educational tracking system recently rolled out in the Los Angeles Unified School District—the second largest in the country, with 650,000 students—have already stretched beyond the first few weeks of the school term with no definitive end in sight. Furthermore, many of the software bugs being encountered were known to LAUSD administrators, who decided to roll out the system anyway.

The new system, called My Integrated Student Information Systems (MiSiS), was launched during the teachers' preparatory week before classes officially began  on 11 August. Unsurprisingly, it did not go well. In fact, a few days prior to launch, LAUSD’s chief information officer Ron Chandler was quoted in the LA Times as saying that he expected the MiSiS launch “to be bumpy.” Chandler also added for emphasis, “It’s going to be messy. The question is just how messy.”

While some modules of MiSiS were rolled out last year, a more complete version was piloted during summer school this year and was claimed by LAUSD administrators as performing acceptably. Even under these more benign conditions, summer school staff noted several operational defects that needed fixing. While many were apparently fixed by early this August, a large number still remained open as MiSiS went live. In addition, MiSiS apparently was not fully stress tested as a complete system and under expected load conditions before its launch.

Image: Los Angeles Unified School District

According to the Contra Costa Times, soon after MiSiS went live, staff across the school district began reporting issues ranging from painfully slow system performance to being unable to access student records at all. Other staff members reported finding that many of their students had been given the wrong class assignments or, worse, had not been assigned any classes at all.

Compounding the reported technical problems, the required user training on MiSiS for the more than 29,000 LAUSD teachers and school administrators was not completed before classes started, meaning many school staff were not fully trained on how to use MiSiS at its launch. So when problems were reported to the LAUSD IT department, it wasn’t clear how many were true technical problems versus user-unfamiliarity problems.

During the first week of school, local newspapers started reporting even more problems cropping up with MiSiS. Some middle school students, for instance, were placed in high school classes, while some teachers found 70 students assigned to a single class. At the end of the week, LAUSD administrators told teachers not to use MiSiS to take attendance until this week, so LAUSD IT specialists could have time to work on the system. LAUSD acknowledged at the same time that there were 130 known issues with MiSiS that needed to be worked through.

The second full week of school was less chaotic, but school staff and teachers were still reporting problems, even though LAUSD administrators insisted that MiSiS was working acceptably. School staff, on the other hand, reported they were still finding students with incorrect or missing class assignments, which created classroom disruptions. Special education teachers reported that MiSiS issues were proving especially troublesome. MiSiS performance wasn’t markedly faster, either.

However, LAUSD administrators tried to downplay the problems being reported, calling them merely a “blip” that everyone would soon forget. Administrators insisted that only about 6,500 students were affected by MiSiS-related issues, which they said wasn’t bad considering that MiSiS, according to CIO Chandler, was “easily one of the most complex technology programs going on in the planet right now.”

A bit of hyperbole that, I think.

LAUSD teachers unions strongly challenged the administrators’ figure of 6,500, calling it a gross under-reporting of the true number of students affected. They demanded that MiSiS immediately be scrapped and that LAUSD revert to the old student record management system, which the unions claimed worked much better.

That is not likely to happen unless MiSiS completely falls over dead. The reason is that MiSiS is the end product of a very long and convoluted series of court cases going back to 1993. In that year, the mother of a student named Chanda Smith sued the LAUSD for allowing her daughter to reach—and twice flunk—the 10th grade when it was known that her daughter had a documented learning disability for which she received special tutoring in middle school. Although Smith’s mother tried repeatedly to get Chanda into special education classes in high school, the LA school administrators refused to do so. The lawsuit soon turned into a class action suit as other parents of disabled students also complained about their children not receiving the educational help they required or had received previously.

To make a long story short, in 1996, LA’s school board acknowledged that the school system had violated state and federal laws on the treatment of disabled students, and entered into a court-approved consent decree [pdf] detailing how it would improve the education of the school district’s disabled students. There was a list of about a dozen and a half improvements LAUSD agreed it would make over the next five years, including the implementation of an automated student tracking system so that abled and disabled children’s educational progress could be assessed and tracked from kindergarten to the end of high school.

However, LAUSD was slow to implement the changes it had promised, saying they were too expensive. In 2001, it was sued again, this time for non-compliance with the 1996 consent decree. In 2003, after much legal wrangling, a modified consent decree (pdf) was signed, under which LAUSD was to make good on a new set of agreed improvements. The LAUSD promised these would be implemented by the end of 2006, including (once again) the implementation of a comprehensive student tracking system. The courts appointed an independent monitor with significant power to assess and say whether the LAUSD was indeed meeting the terms of the modified consent decree.

Progress on meeting the improvement objectives was steady, but still extremely slow. One of the bottlenecks was the implementation of that comprehensive student tracking system. From 2003 to 2009, LAUSD worked to implement an integrated student information system (ISIS), purchasing a commercial product called SchoolMax as a way to speed the process along. The LA Times reports that LAUSD spent $112 million on this effort.

However, LAUSD found, in its words, “many challenges with software development and SchoolMAX’s performance.” So, in 2012 LAUSD approached the independent monitor with a plan to internally redevelop the student tracking system used at the Fresno Unified School District, which he approved. The LAUSD claims that the new system—now called MiSiS, and which the LA Times said cost $20 million to develop—would offer “greater flexibility, user-friendliness, and cost effectiveness.”

The hope is—assuming that MiSiS can be fixed—that LAUSD might finally get out from under the consent decrees stretching back to 1996.

This helps explain why the LAUSD administrators made the decision to roll out MiSiS a few weeks ago, warts and all, and why it will never revert to a student tracking system that isn’t compliant with the 2003 modified consent decree. It will be interesting to see whether the independent monitor believes MiSiS actually does now meet the terms of the decree’s mandate. If the system is still buggy by the end of this year, and if special education teachers are still complaining about it, I suspect he will be none too sympathetic to lifting the modified consent decree.

One person who is definitely not pleased at the moment is LAUSD school board member Tamar Galatzan, who is calling for a full audit of MiSiS and the decisions behind its “bumpy” launch. Galatzan claims the school board was not informed of the potential MiSiS problems until the news hit the media.

MiSiS isn’t the only IT issue involving the LAUSD, either. Last week, the LA Times reported that an internal LAUSD analysis of the school district’s $1 billion initiative to provide an Apple iPad to every student “was beset by inadequate planning, a lack of transparency and a flawed bidding process.”

That sounds surprisingly similar to what occurred a few years ago in regard to the LAUSD payroll system fiasco. A school district that doesn’t learn from past mistakes—more than a bit ironic, wouldn’t you say?

In Other News…

Computer Problem Affects Multiple  Irish Hospitals

17,000 New Jersey Red Light Tickets Voided After “Technical Glitch”

Australian Commonwealth Bank Beset by Online Issues

Microsoft Suffers Multiple Azure Cloud Outages

Microsoft’s Update Comes with Blue Screen of Death

Iowa’s Workforce Development Office Tries to Hide News of Computer Glitch

Finance Company Fined Over Trying to Exploit Computer Flaw

FBI Spied on Wrong Suspects on Account of Typos

Passport Control System Traps Miami Arriving International Passengers

Manila’s Jollibee Restaurants Reopen after Supply Chain System Problems Fixed

Mon, 25 Aug 2014 13:00:00 GMT
The Routing Wall of Shame
Internet traffic hit bumps as IPv4 “limit” on older routers’ addressing capability reached
Photo-illustration: Randi Klett; Images: Getty Images

IT Hiccups of the Week

While I have been en vacances the past few weeks, there have been several potential IT Hiccups of the Week stories of interest, including the 200-to-500 year old Indian women getting free sewing machines and Philippine fast food giant Jollibee Foods having to temporarily close 72 of its restaurants in the Manila region because of problems the company experienced migrating to a new IT system—much to the disappointment of its Chickenjoy fans. However, the one hiccup that stands above the rest was the Internet difficulties reportedly experienced last week by the likes of eBay, Amazon, and LinkedIn, among many others.

One of the first inklings that something was amiss was the UK’s Inquirer story Tuesday reporting that eBay was experiencing spotty service and in some cases complete outages in parts of the UK and Europe for several hours. eBay said without elaboration that the performance issue had to do with “third party internet service provider access issues” and that it was “sorry for the inconvenience" being caused.

UK eBay sellers were furious at the reportedly tenth disruption of the year and were demanding compensation, the London Telegraph reported on Wednesday. However, the Telegraph also reported in an accompanying story that Internet difficulties not only affected the auction company, but hit the newspaper’s online operations as well. A story at SiliconANGLE reported that Amazon and LinkedIn were also affected by service disruptions, while ZDNet indicated that U.S. Internet service providers Level 3, AT&T, Cogent, Comcast, Sprint, Time-Warner and Verizon all experienced sporadic performance problems across the United States and parts of Canada. The Register also reported that Canadian ISP Shaw Communications had suffered from fairly severe network disruptions as well.

The reported culprit behind last week’s disruptions appears to be a well-known network technical risk that turned into a not-unexpected annoying problem: the global Internet routing table apparently exceeded 512,000 routes. As a result, many older routers that cannot support more than that number of routes because of memory and other limitations are at risk of sporadically causing some level of local Internet service instability until they are upgraded or replaced to handle the ever increasing number of Internet routes [pdf]. Speculation was that Tuesday’s disruptions were caused in part—or at least exacerbated—by the network activity of Verizon, which pushed the routing table to exceed the 512K threshold for a short time. The 512K mark is expected to be crossed permanently any time now, however.
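The failure mode at work here can be sketched in a few lines: a router with a fixed-capacity hardware forwarding table must handle any overflow routes on a much slower software path, and that slow path is where the instability comes from. The following is a toy model only—the capacity constant and function name are mine, not any vendor's actual implementation:

```python
# Illustrative model of a fixed-capacity hardware forwarding table (TCAM).
# Routes that fit are handled on the fast hardware path; overflow routes
# fall back to slow software processing. Toy sketch, not real router logic.

TCAM_CAPACITY = 512_000  # a common default allocation on older routers

def install_routes(num_routes, capacity=TCAM_CAPACITY):
    """Return how many routes land in hardware vs. the slow software path."""
    in_hardware = min(num_routes, capacity)
    in_software = max(0, num_routes - capacity)
    return in_hardware, in_software

hw, sw = install_routes(511_000)
print(hw, sw)  # 511000 0 -- everything still fits in hardware

hw, sw = install_routes(512_500)
print(hw, sw)  # 512000 500 -- 500 routes overflow to the slow path
```

The point of the sketch is that nothing fails outright at the threshold; instead, a small and shifting subset of destinations suddenly gets much slower, which matches the sporadic, hard-to-diagnose disruptions reported last week.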

Router supplier Cisco warned about the need to upgrade routers on its blog back in May, when the global routing table passed 500,000 routes. It also laid out what its customers could do to upgrade their Cisco kit or perform workarounds. Last week, with the 512K milestone seemingly reached, Cisco posted another blog saying that it was really time to take action to “avoid any performance degradation, routing instability, or impact to availability.”

In highly simplified terms, the root issue is not the IPv4 protocol itself but the hardware that routes it: many older routers allocate a fixed amount of fast memory for the IPv4 routing table, with 512K routes being a common default, and an unknown number of them are unable (even with workarounds) to hold more routes than that. The newer IPv6 protocol provides a vastly larger address space of 340 undecillion unique addresses, but using it requires investing in new equipment. If you want to know how big 340 undecillion is, and what some of the concerns are in moving from IPv4 to IPv6, there is a great interview from 2011 done by former IEEE Spectrum editor Steven Cherry with IPv6 evangelist Owen DeLong of Hurricane Electric, a company which claims to be “the largest IPv6 backbone in the world as measured by number of networks connected.” I also suggest reading a Spectrum story from 2006 on some of the reasons why migration to IPv6 has been a slow process, as well as this piece at the Register from last week.
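For a sense of scale, 340 undecillion is simply 2^128—the size of the IPv6 address space—written out in short-scale number names (one undecillion is 10^36). A quick check:

```python
# IPv6 addresses are 128 bits wide, versus 32 bits for IPv4.
# "340 undecillion" is 2**128 in short-scale names (1 undecillion = 10**36).
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

print(ipv4_addresses)              # 4294967296 (about 4.3 billion)
print(ipv6_addresses)              # 340282366920938463463374607431768211456
print(ipv6_addresses // 10 ** 36)  # 340 -> "340 undecillion"
```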

Whether passing the 512K routing milestone will cause many more Internet disruptions remains to be seen, although the betting seems to be that it won’t. For instance, a story at the Guardian quotes James Blessing, chair of the UK Internet Service Providers Association, as saying, “In the grand scheme of things, it’s tiny. It’s a glitch, glitches happen…  If someone at an ISP hasn’t noticed it by now, it’s too late as the default table is over 512,000, so nothing that had this problem is now connected to the internet and working… We’ve had the glitch and nothing further will happen now concerning the 512,000 bug.”

Others, however, are not nearly as sanguine, and warn that further service disruptions shouldn’t be discounted until routers are upgraded or replaced, as happened when the 128K and 256K table limits were reached.

It won’t take long to find out whether last week's Internet hiccup was a one-time event or the beginning of a few weeks of Internet service burps.

In Other News…

U.S. Slowly  Making Progress on Passport Visa System Issue

Texas Toll Road IT System Still a Shambles

San Francisco Bay Area Bridges Hit by Plate-reader Glitches

Oracle Sues Oregon over Health Exchange Fiasco

Billing Problems Vex Vermont Health Exchange

Hawaii Health Exchange Still in Trouble

Washington State Slowly Fixing its Health Exchange Software Errors

CMS Finally Fixes Pharma Website Problems; IT Costs Approach $1 Billion

UK Power Company Sees 62,000 Customers Leave From Billing Debacle

UK Driver Vehicle Agency Move from North Ireland to Wales Creates Car Tax Frustrations

Law Students Sue over Bar Exam Computer Problems

40,000 Pennsylvania Taxpayers Still Waiting for Refund Checks

Colorado State Computer Issues Hamper Unclaimed Property Payments

Australian Hospital Issues 200 Erroneous Death Notices

LA School Officials’ Prediction of New IT System Glitches Proves Highly Accurate

Mon, 18 Aug 2014 13:00:00 GMT
Black Hat 2014: How to Hack the Cloud to Mine Crypto Currency
Cyber security researchers devise a hack to demonstrate the need for improved anti-botnet security measures
Illustration: iStockphoto

Using a combination of faked e-mail addresses and free introductory trial offers for cloud computing, a pair of security researchers have devised a shady crypto currency mining scheme that they say could theoretically net hundreds of dollars a day in free money using only guile and some clever scripting.

The duo, who are presenting their findings at this week’s Black Hat 2014 cyber security conference in Las Vegas, shut down their proof-of-concept scheme before it could yield any more than a token amount of Litecoins (an alternative to Bitcoin). The monetary value of both virtual currencies is based on enforced scarcity that comes from the difficulty of running processor-intensive algorithms.

Rob Ragan, senior security associate at the consulting firm Bishop Fox in Phoenix, Ariz., says the idea for the hack came to him and his collaborator Oscar Salazar when they were hired to test the security around an online sweepstakes.

“We figured if we could get 100,000 e-mails entered into the sweepstakes, we could have a really good chance of winning,” he says. “So we generated a script that would allow us to generate unique e-mail addresses and then automatically click the confirmation link.”

Once Ragan and Salazar had finished securing the sweepstakes against automated attacks, they were still left with all those e-mail addresses.

“We realized that … for about two-thirds of cloud service providers, their free trials only required a user to confirm an e-mail address,” he says. So the duo discovered they effectively had the keys to many thousands of separate free trial offers of cloud service providers’ networked storage and computing.

In other words, they had access to many introductory accounts at sites like Google’s Cloud Platform, Joyent, CloudBees, iKnode, CloudFoundry, CloudControl, ElasticBox and Microsoft Windows Azure.

Some of these sites, each offering its own enticement of free storage and free computing as a limited introductory offer, could be spoofed, the researchers discovered. Using a hard-to-detect automated process they developed, troves of unique e-mail addresses could be readily made on the fly and then used to get free storage and processor time.

A spoof e-mail address, of course, has two components, Ragan says: the local part (the stuff to the left of the “@“ sign) and the domain (to the right). To appear like a random stream of e-mail addresses signing up for any given service, Ragan says they scraped real local addresses from legit e-mail address dumps on sites like Pirate Bay. The domain side they set up using “FreeDNS” servers that attach e-mail addresses to existing domains, a service that can be exploited for domains that have poor security measures in place.

So, say there’s an address dump file on the Internet containing the legit e-mail addresses “CatLover290 at gmail” and “CarGuy909 at Yahoo.” Ragan and Salazar’s algorithm would attach “CatLover290” and “CarGuy909” to one of thousands of spoof URLs they’d set up through the FreeDNS sites. The original e-mail accounts would then be unaffected. But the resulting portmanteau e-mail addresses would appear to be coming from a random stream of humans on the Internet.
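The recombination step described above is simple to sketch: harvested local parts are crossed with attacker-controlled domains so that every signup presents a distinct, plausible-looking address. The local parts and domains below are invented examples, and the function is mine, not the researchers' actual tooling:

```python
import itertools

# Sketch of the address-pairing technique: local parts harvested from a
# public dump are recombined with attacker-controlled domains, so each
# signup appears to come from a distinct human. All names are made up.

scraped_local_parts = ["CatLover290", "CarGuy909", "SkiFan41"]
spoof_domains = ["example-mail.net", "mail-example.org"]  # hypothetical FreeDNS-style domains

def portmanteau_addresses(local_parts, domains):
    """Yield every local-part/domain combination as an e-mail address."""
    for local, domain in itertools.product(local_parts, domains):
        yield f"{local}@{domain}"

addresses = list(portmanteau_addresses(scraped_local_parts, spoof_domains))
print(len(addresses))  # 6 -- three local parts times two domains
print(addresses[0])    # CatLover290@example-mail.net
```

Even this toy version shows the economics: each new spoof domain multiplies the pool of working addresses without touching the original accounts.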

Thus, Ragan says, not even a human observer watching the e-mails registering for free cloud computing accounts—none appearing to be produced by a simple algorithm or automated process—would detect anything overtly suspicious. And to further throw off suspicion, they used Internet anonymizing software like Tor and virtual private networks to disguise where the trial account requests were coming from. (Ragan says that generating real-seeming names using name-randomizing algorithms would probably be good enough.)

“A lot of the e-mail confirmation and authentication features rely on the old concept that one person has one e-mail address—and that is simply not the case anymore,” Ragan says. “We’ve developed a platform that would allow anyone to have 30,000 e-mail addresses.”

So they signed up for hundreds of free cloud service trial accounts and, in the process, strung together a free, ersatz virtual supercomputer.

“We demonstrated that we could generate a high amount of crypto hashes for a high return on Litecoin mining, using these servers that didn’t belong to us,” Ragan says. “We didn’t have an electricity bill, and we were basically able to generate money for free out of thin air.”

Ragan says at their scheme’s peak, they had 1000 accounts that were each generating 25 cents per day: $250 of free Litecoin. He says they shut the system down before it generated any real monetary value or made any noticeable performance dent in the cloud service systems.

And Ragan stressed that the devious schemes he and Salazar developed are being disclosed in order to raise awareness of problems in security measures that real criminal elements around the world can and probably already are taking advantage of.

“Not planning for and anticipating automated attacks is one of the biggest downfalls a lot of online services are currently experiencing,” Ragan says.

One measure Ragan says he and Salazar wanted to see that would combat their scheme’s spoofing of cloud service providers was the introduction of random anti-automation controls. Captchas, credit card verification, and phone verification can all be spoofed, he says, if they’re at predictable places in the cloud service signup and setup process.

“Some services don’t want to add a Captcha, because it annoys users,” Ragan says. “But…there are compromises that can be [employed], like once an abnormal behavior is detected from a user account, they then prompt for a Captcha. Rather than prompting every user for a Captcha every time, they can find that balance. There’s always a balance to be made between security and usability.”
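The compromise Ragan describes—challenging users only when behavior looks abnormal, rather than on every request—can be sketched as a simple risk check. The signals and threshold below are invented for illustration; a real service would use many more:

```python
# Sketch of a risk-based anti-automation control: instead of showing every
# user a Captcha, challenge an account only after its behavior crosses an
# abnormality threshold. The signals and limit here are illustrative only.

SIGNUPS_PER_HOUR_LIMIT = 5

def needs_captcha(signups_last_hour, from_known_anonymizer):
    """Challenge only when behavior looks automated, not on every request."""
    if from_known_anonymizer:
        return True
    return signups_last_hour > SIGNUPS_PER_HOUR_LIMIT

print(needs_captcha(1, False))   # False -- a normal user is never interrupted
print(needs_captcha(40, False))  # True -- burst signups trigger a challenge
```

The design point is exactly the balance Ragan names: ordinary users see no friction, while the cost of automation rises sharply once a script's volume gives it away.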

Ragan says that’s what he and Salazar want the takeaway from their talk to be: that a lot more consideration is given to how to better implement anti-automation controls and features.

Fri, 8 Aug 2014 19:00:00 GMT
Black Hat 2014: A New Smartcard Hack
Researchers hack chip-based credit and debit cards. Banks hack terms-of-service changes so consumers would be stuck with the bill
Photo: Getty Images

According to new research, chip-based “Smartcard” credit and debit cards—the next-generation replacement for magnetic stripe cards—are vulnerable to unanticipated hacks and financial fraud. Stricter security measures are needed, the researchers say, as well as increased awareness of changing terms-of-service that could make consumers bear more of the financial brunt for their hacked cards. 

The work is being presented at this week’s Black Hat 2014 digital security conference in Las Vegas. Ross Anderson, professor of security engineering at Cambridge University, and co-authors have been studying the so-called Europay-Mastercard-Visa (EMV) security protocols behind emerging Smartcard systems.

Though the chip-based EMV technology is only now being rolled out in North America, India, and elsewhere, it has been in use since 2003 in the UK and in more recent years across continental Europe as well. The history of EMV hacks and financial fraud in Europe, Anderson says, paints not nearly as rosy a picture of the technology as its promoters may claim.

“The idea behind EMV is simple enough: The card is authenticated by a chip that is much more difficult to forge than the magnetic strip,” Anderson and co-author Steven Murdoch wrote in June in the Communications of the ACM [PDF]. “The card-holder may be identified by a signature as before, or by a PIN… The U.S. scheme is a mixture, with some banks issuing chip-and-PIN cards and others going down the signature route. We may therefore be about to see a large natural experiment as to whether it is better to authenticate transactions with a signature or a PIN. The key question will be, ‘Better for whom?’”

Neither is ideal, Anderson says. But signature-based authentication does put a shared burden of security on both bank and consumer and thus may be a fairer standard for consumers to urge their banks to adopt.

“Any forged signature will likely be shown to be a forgery by later expert examination,” Anderson wrote in his ACM article. “In contrast, if the correct PIN was entered the fraud victim is left in the impossible position of having to prove that he did not negligently disclose it.”

And PIN authentication schemes, Anderson says, have a number of already discovered vulnerabilities, a few of which can be scaled up by professional crooks into substantial digital heists.

In May, Anderson and four colleagues presented a paper at the IEEE Symposium on Security and Privacy on what they called a “chip and skim” (PIN-based) attack. This attack takes advantage of some ATMs and credit card payment stations at stores that unfortunately take shortcuts in customer security: The EMV protocol requires ATMs and point-of-sale terminals to broadcast a random number back to the card as an ID for the coming transaction. The problem is many terminals and ATMs in countries where Smartcards are already used issue lazy “random” numbers generated by things like counters, timestamps, and simple homespun algorithms that are easily hacked.
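The weakness the researchers exploited is easy to illustrate: a counter- or timestamp-derived "unpredictable number" can be enumerated by anyone who knows roughly when the transaction happened, whereas the protocol intends a genuinely random 32-bit value. This is an illustrative contrast, not the researchers' actual code:

```python
import secrets
import time

# Contrast between the lazy "unpredictable numbers" the researchers found
# in deployed terminals (derived from counters or timestamps, so guessable)
# and a cryptographically random one. The EMV unpredictable number is a
# 32-bit value; everything else here is illustrative.

def weak_unpredictable_number(counter):
    # e.g. a per-terminal transaction counter -- trivially guessable
    return counter & 0xFFFFFFFF

def timestamp_unpredictable_number(ts=None):
    # timestamp-derived -- guessable to within a small search window
    return int(ts if ts is not None else time.time()) & 0xFFFFFFFF

def strong_unpredictable_number():
    # what the protocol intends: 32 bits from a cryptographic RNG
    return secrets.randbits(32)

# An attacker who knows the counter was near 1000 needs only a few guesses:
candidates = [weak_unpredictable_number(c) for c in range(995, 1005)]
print(len(candidates))  # 10 guesses cover the whole search space
```

Against a proper random source, the same attacker would face roughly four billion possibilities per transaction instead of a handful.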

As a result, a customer can—just in buying something at one of these less-than-diligent stores or using one of these corner-cutting ATMs—fall prey to an attack that nearby criminals could set up. The attack would allow them to “clone” a customer’s Smartcard and then buy things on the sly with the compromised card. Worse still, some banks’ terms and conditions rate card cloning—which EMV theoretically has eliminated—as the customer’s own fault. So this sort of theft might leave an innocent victim with no recourse and no way of refunding their loss.

“At present, if you dispute a charge, the bank reverses it back to the merchant,” Anderson says. “Merchants are too dispersed to go after customers much. But EMV shifts the liability to the bank, and the banks in anticipation are rewriting their terms and conditions so they can blame the customer if they feel you might have been negligent. I suggest you check out your own bank's terms and conditions.”

Thu, 7 Aug 2014 19:00:00 GMT
U.S. State Department Global Passport, Visa Issuing Operations Disrupted
Computer problems could create backlog of hundreds of thousands of applicants
Photo: iStockphoto

IT Hiccups of the Week

Last week saw an overflowing cornucopia of IT problems, challenges, and failures being reported. From these rich pickings, we decided to focus this week’s edition of IT Hiccups first on a multi-day computer problem affecting the U.S. Department of State’s passport and visa operations, followed by a quick rundown of the numerous US and UK government IT project failures that were also disclosed last week.

According to the Associated Press, beginning on Saturday, 19 July, the U.S. Department of State has been experiencing unspecified computer problems, including “significant performance issues, including outages” with its Consular Consolidated Database [pdf], which has interfered with the “processing of passports, visas, and reports of Americans born abroad.” A story at ComputerWorld indicates that the problems began after maintenance was performed on the database. State Department spokeswoman Marie Harf told the AP that the effects of the computer problem were being felt across the globe.

The AP story says that a huge passport and visa application backlog is already forming, with one unidentified country reporting that its backlog of applications had reached 50,000 as of Wednesday. The growing backlog has also “hampered efforts to get the system fully back on line,” Harf told the AP.

The rapidly expanding backlog is easy to understand, as the Oracle-based database, which was completed in 2010, “is the backbone of all consular applications and services and supports domestic and overseas passport and visa activities,” according to a State Department document [pdf]. In 2013, for example, the database was used in the issuing of some 13 million passports and 9 million visitor visas.

Department spokeswoman Harf was quoted by the AP as saying, “We apologize to applicants and recognize this may cause hardship to applicants waiting on visas and passports. We are working to correct the issue as quickly as possible.” However, she did not give any indications when the problems would be fixed or the backlog would be erased. Stories of families stuck overseas and not able to return to the US are rapidly growing.

Earlier this summer, the UK saw a similar passport backlog develop over the mismanagement of the closures of passport offices at British Embassies during the past year. The backlog, which blossomed into a political embarrassment to Prime Minister Cameron’s Government, is still not fully under control. It remains to be seen whether the U.S. passport and visa problems will do the same for the Obama Administration—if it lasts for a couple of weeks, it very well could.

More likely to cause embarrassment to the Obama and the Cameron administrations are the numerous government IT failures reported last week. For example, the AP reported that the U.S. Army had to withdraw  its controversial Distributed Common Ground System (DCGS-A) from an important testing exercise later this year because of “software glitches.” DCGS-A, the Army website says, “is the Army’s primary system for posting of data, processing of information, and disseminating Intelligence, Surveillance and Reconnaissance information about the threat, weather, and terrain to all components and echelons.”

The nearly $5 billion spent on DCGS-A so far has not impressed many of its Army operational users in Afghanistan, who have complained that the system is complex to use and unreliable, among other things. They also point out that there is a less costly and more effective system available called Palantir, but the Army leadership is not interested in using it after spending so much money and effort on DCGS-A.

The AP also reported last week that a six-year, $288 million U.S. Social Security Administration Disability Case Processing System (DCPS) project had virtually collapsed, and that the SSA was trying to figure out how to salvage it. DCPS, which was supposed to replace 54 legacy computer systems, was intended to allow SSA workers across the country “to process claims and track them as benefits are awarded or denied and claims are appealed,” the AP said.

The AP story says that the SSA may have tried to keep quiet a June report [pdf] by McKinsey and Co. into the program’s problems so as not to embarrass Acting Social Security Commissioner Carolyn Colvin, whom President Obama recently nominated to head the SSA. The McKinsey report indicates that one reason for the mess is that no one could be found to be in charge of the project. The report also states that “for past 5 years, Release 1.0 [has been] consistently projected to be 24-32 months away.” Colvin was deputy commissioner for 3½ years before becoming acting commissioner in February 2013, the AP says, so the DCPS debacle is squarely on her watch.

Then there was a story in the Fiscal Times concerning a Department of Homeland Security (DHS) Inspector General report [pdf] indicating that the Electronic Immigration System (ELIS), which was intended to “provide a more efficient and higher quality adjudication [immigration] process,” was doing the opposite. The IG wrote that, “instead of improved efficiency, time studies conducted by service centers show that adjudicating on paper is at least two times faster than adjudicating in ELIS.”

Why, you may ask? The IG states that, “Immigration services officers take longer to adjudicate in ELIS in part because of the estimated 100 to 150 clicks required to move among sublevels and open documents to complete the process. Staff also reported that ELIS does not provide system features such as tabs and highlighting, and that the search function is restricted and does not produce usable results.”

Hey, what did those immigration service officers expect for the $1.7 billion spent so far on ELIS, something that actually worked?  DHS is now supposed to deploy an upgraded version of ELIS later this year, the IG says, but he is also warning that major improvements in efficiency should not be expected.

As I mentioned, reports of project failure were the story of the week in the UK as well. Computing published an article concerning the UK National Audit Office’s report into the 10-year-and-counting Aspire outsourcing contract for the ongoing modernization and operation of some 650 HM Revenue & Customs tax systems. While the NAO has said that the work performed by the consortium led by Capgemini has resulted in a “high level of satisfactory implementations,” the cost of doing so has been staggering.

HMRC let the Aspire contract in 2004, after ending a ten-year outsourcing contract with EDS (now HP) when the relationship soured. HMRC said at the time that the ten-year cost of the Aspire contract would be between £3.6 billion and £4.9 billion; however, the NAO says the cost has topped £7.9 billion through the end of March this year, and may reach £10.4 billion by June 2017, when the contract, which was extended in 2007, expires. Public Accounts Committee (PAC) chair Margaret Hodge MP says the cost overrun is an example of HMRC’s management of the Aspire contract being “unacceptably poor.”
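A quick back-of-the-envelope check of those figures (my arithmetic, using the numbers quoted above, not the NAO's own calculation) shows how far the contract has drifted even from HMRC's high-end estimate:

```python
# Aspire contract figures as reported, in billions of pounds
original_low, original_high = 3.6, 4.9   # HMRC's 2004 ten-year estimate
cost_to_date = 7.9                       # NAO figure through March 2014
projected_total = 10.4                   # NAO projection through June 2017

# Compare against the *high* end of HMRC's original range
print(round(cost_to_date / original_high, 1))      # ~1.6x the high estimate so far
print(round(projected_total / original_high, 1))   # ~2.1x: roughly a doubling
```

Even measured against the most generous original estimate, the projected final cost is more than double.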

On top of being unhappy about the doubling in contract costs, and the high level of profits the suppliers made on it, the NAO also warned HMRC that it needs to get serious about a replacement contract when the Aspire contract ends. Hodge says that while HMRC has started planning Aspire’s replacement, “its new project is still half-baked, with no business case and no idea of the skills or resources needed to make it work.”

Apparently the NAO found another half-baked UK government IT project as well. According to the London Telegraph, the NAO published a report [pdf] describing how the UK Home Office has managed to waste nearly £347 million since 2010 on its “flagship IT programme,” the Immigration Case Work system, which is intended to deal “with immigration and asylum applications.” The NAO says that the Home Office has now abandoned the effort, thereby “forcing staff to revert to using an old system that regularly freezes.”

In addition, the NAO says that the Home Office is planning to spend at least another £209 million by 2017 on what it hopes will be a working immigration case work system. Until that new system comes online, however, the Home Office will need to spend an undetermined amount of money trying to keep the increasingly unreliable legacy immigration system from falling over dead. The legacy system support contract ends in 2016, the NAO states, so the Home Office doesn’t have a lot of wiggle room to get its replacement immigration system operational.

Finally, the London Telegraph reported that the UK National Health Service may have reached a deal to pay Fujitsu £700 million as compensation for the NHS unilaterally changing the terms of its National Programme for IT (NPfIT) electronic health record contract with the Japanese company. The changes sought by the NHS led Fujitsu to walk off the program (as did Accenture) in 2008. The NPfIT project, a brainchild of then Prime Minister Tony Blair in 2002, was cancelled in 2011 after burning through some £7.5 billion.

In Other News…

Vancouver’s SkyTrain Suffers Failures over Multiple Days

North Carolina’s Fayetteville Public Works Commission Experiences New System Billing Problems

UK Nationwide Bank Customers Locked Out of Accounts

Nebraska Throws Out Writing Test Scores in Wake of Computer Testing Problems

GAO Finds It Easy to Fraudulently Sign up for Obamacare

Washington State Obamacare Exchange Glitches Hit 6,000 Applicants

Pennsylvania State Payroll Computer Glitch Fixed

UK Couple Receives £500 Million Electricity Bill

Mon, 28 Jul 2014 14:30:00 GMT
Senate Condemns US Air Force ECSS Program Management’s Incompetence Underwhelming executive leadership studiously ignored its own risk mitigation plans
Photo: Paul J. Richards/Getty Images

IT Hiccups of the Week

With no compelling IT system snafus, snags, or snarls last week to report on, we thought we’d return to an oldie but goodie project failure of the first order: the disastrous U.S. Air Force Expeditionary Combat Support System (ECSS) program.

The reason for our revisit is the public release a short time ago of the U.S. Senate staff report [pdf] into the fiasco. Last December, Senators Carl Levin and John McCain, respectively the chairman and ranking member of the Senate Armed Services Committee, requested the report. The request was made in the wake of the Air Force’s publication of the executive summary [pdf] of its own investigative report, with which the Senators apparently were not altogether happy. You may recall that Levin and McCain christened the billion-dollar program failure—which the Air Force admitted failed to produce any significant military capability after almost eight years in development—as being “one of the most egregious examples of mismanagement in recent memory.” Given the number of massive DoD IT failures to choose from, that is saying something.

Not surprisingly, the Senate staff report identified basically the same contributing factors for the debacle as the internal Air Force report, albeit with different emphasis. Whereas the Air Force report listed four contributing factors for the ECSS program’s demise (poor program governance; inappropriate program management tactics, techniques, and procedures; difficulties in creating organizational change; and excessive personnel and organizational churn), the Senate staff report condensed them into three contributing factors:

  • Cultural resistance to change within the Air Force;
  • Lack of leadership to implement needed changes; and
  • Inadequate mitigation of identified risks at the outset of the procurement.

The Senate report focused much of its attention on the last bullet concerning ECSS program risk mismanagement. In large part, the report blamed the calamity on the Air Force’s failure to adhere to business process reengineering guidelines “mandated by several legislative and internal DOD directives and [that] are designed to ensure a successful and seamless transition from old methods to new, more efficient ways of doing business.” From reading the report, one gets the image of an exasperated parent scolding a recalcitrant child: Congress seemed as miffed at the Air Force for ignoring its many IT-related best practices directives as for the failure itself.

Clearly adding to the sense of frustration is that the Air Force “identified cultural resistance to change and lack of leadership as potential [ECSS] problems in 2004” when the service carried out a mandated risk assessment as the program was being initially planned. Nevertheless, the risk mitigation approaches the service ended up developing were “woefully inadequate.” In fact, the report said that the Air Force identified cultural resistance as an ongoing risk issue throughout the program. However, the lack of action to address it permitted the “potential problem” to become an acute problem.

To its credit, the ECSS program did set out an approach in 2006 to contain the technical risks involved in developing an integrated logistics system to replace hundreds of legacy systems then in use across the Air Force. Two key risk reduction aspects of the plan were to “forego any modifications” to the Oracle software selected for ECSS and to “conduct significant testing and evaluation” of the system. However, by the time the ECSS project was canceled in 2012, the report notes, Oracle’s software was not only being heavily customized, but it also wasn’t being properly tested.

Several things contributed to this 180-degree turn in project risk reduction, according to the report. One was the Air Force conducting what can only be called a bait-and-switch procurement. As the report states:

"In its March 2005 solicitation, the Air Force requested an “integrated product solution.” The Air Force solicitation stated that it wanted to obtain “COTS [commercial off-the-shelf] software [that is] truly ‘off-the-shelf’: unmodified and available to anyone.” Oracle was awarded the software contract in October 2005, and provided the Air Force with three stand-alone integratable COTS software components that were “truly off the shelf.” Oracle also provided the Air Force with tools to put the three components together into a single software “suite,” which would “[require] a Systems Integrator (SI) to integrate the functions of the three [components].” Essentially, this meant the various new software pieces did not initially work together as a finished product and required additional integration to work as intended.


"In December 2005, the Air Force issued its solicitation for a systems integrator (SI) … portrayed the three separate Oracle COTS software components as a single, already-integrated COTS product which was to be provided to the winning bidder as government funded equipment (GFE). Confusion about the software suite plagued ECSS, contributing significantly to program delays. Not only was time and effort dedicated to integrating the three separate software components into a single integrated solution, but there were disagreements about who was responsible for that integration. While CSC [the system integrator] claimed in its bid to have expertise with Oracle products, the company has said that it assumed that the products it would receive from the Air Force would already be integrated. Among the root causes of the integration-related delay was the Air Force’s failure to clearly understand and communicate program requirements."

Adding to the general confusion was the small issue of exactly how many legacy systems were going to be replaced. The report states:

"When the Air Force began planning for ECSS, it did not even know how many legacy systems the new system would replace. The Air Force has, on different occasions, used wildly different estimates on the number of existing legacy programs, ranging from “175 legacy systems” to “hundreds of legacy systems” to “over 900 legacy systems.”"

Curiously, the Senate report doesn’t note that even if the Air Force was trying to get rid of “only” 175 legacy systems, that was still some 20 times more than in the Air Force’s last failed ERP attempt a few years earlier. The staff report seems to assume that such a business process reengineering undertaking was feasible from the start (and during a period of conflict as well), which is a highly dubious assumption.

Probably the most damning sentence in the whole report is the following:

"To date, the Air Force still cannot provide the exact number of legacy systems ECSS would have replaced."

Two years after ECSS was terminated, after two major investigations into why ECSS failed, and while the Air Force is actively engaged in planning for another try, this fact is still rather amazing.

I’ll let you read the report to dig through the other gory details involving the risk-related issues of cultural resistance and lack of leadership, but suffice it to say you have to wonder where top Air Force and Department of Defense leadership was during the eight years this project blunder unfolded. As I have noted elsewhere, the DoD CIO at the time claimed to be “closely” monitoring the program, and up to the day ECSS was terminated, the CIO viewed it as being only a moderately risky program.

Congress showed the same lack of curiosity, however. DoD ERP system developments have been well documented by the US Government Accountability Office [pdf] for over two decades as being prone to self-immolation. But Congress has kept the money flowing to them anyway without bothering to perform much in the way of oversight. Predictably, the Senate report avoids looking into Congress's own role in permitting the ECSS failure to occur.

The Senate report goes on to list several other DoD ERP programs that are trying their best to imitate ECSS. In this time of tight government budgets, that list might actually move Congress to quit acting as an uninterested party to their future outcomes. In fact, Federal Computer Week ran an article last week indicating that the Senate Appropriations Defense Subcommittee was slicing $500 million off of DoD’s IT budget, which is clearly a warning shot across DoD’s bow.

Another warning shot of note: Senators Levin and McCain have jointly stated that “No one within the Air Force and the Department of Defense has been held accountable for ECSS’s appalling mismanagement. No one has been fired. And, not a single government employee has been held responsible for wasting over $1 billion dollars in taxpayer funds.” The Senators have stated they plan to introduce legislation to hold program managers more accountable in the future.

I suspect—and dearly hope—that if another ECSS happens in defense (or in other governmental agencies or departments, for that matter), more than a few civil and military careers will be, like ECSS, terminated.

In Other News …

Birmingham England Traffic Wardens Unable to Issue Tickets

Chicago Car Sticker Enforcement Delayed After Computer Glitch

Ohio’s Lorain City Municipal Court Records are Computer "Nightmare"

Immigration System Crash Leads to Chaos at Santo Domingo’s Las Americas Airport

Texas TxTag Toll System Upgrade Causes Problems

Melbourne Members Equity Bank System Upgrade Issues Vex Customers

Reservation System Issue Hits Las Vegas-based Allegiant Air Flights

Vancouver’s SkyTrain Shutdown Angers Commuters

Computer Assigns Univ of Central Florida Freshman to Live in Bathrooms and Closets

Australia’s Woolworths Stores Suffer Store-wide Checkout Glitch

Mon, 21 Jul 2014 13:00:00 GMT
UK Retailer Marks & Spencer’s Revenue Results Smacked by Website Woes Company promises improvements, but will they be in time?
Photo: Marks & Spencer

IT Hiccups of the Week

We concentrate this week’s edition of IT snarls, snags, and snafus on the lessons being learned the hard way by Marks & Spencer—the U.K.'s largest clothing retailer and one of the top five retailers in the country—on what happens when your online strategy goes awry. What makes this more than a run-of-the-mill website goes bad story, at least in the U.K., is that as London's Daily Mail put it late last year, “Marks & Spencer, to coin a phrase, is not just any shop. It is the British shop, as much a part of our cultural heritage as the Women’s Institute, the BBC and the Queen.”

M&S launched with great fanfare a new £150 million website in February as a primary means to stem declining sales and profitability, as well as to accelerate the 128-year-old company’s objective of becoming an international multichannel retailer. However, last week, CEO Marc Bolland announced shortly before the company’s annual meeting that ongoing “settling in” problems with its website contributed to an 8.1 percent drop in online sales over the previous quarter. The decline in online sales, which was larger than expected, helped M&S chalk up its 12th quarter in a row of declining sales in its housewares and clothing division.

Bolland reiterated a statement he initially made in May that the company expected website issues to keep weighing on online sales until the 2014 holiday season, an admission that shareholders were no doubt unhappy to hear. Share prices in M&S have declined some 20 points to the mid-420s on the London exchange since late May.

Paradoxically, Bolland also insisted to RetailWeek that despite the admitted negative impact the new website has had on online sales, “there is no problem with our new website.” M&S Chairman Robert Swannell further tried to put a positive spin on the situation at the annual meeting, according to a story at City A.M. While granting that “every project of this scale will cause some disruption,” Swannell contended that the new website had, in fact, created the “newest and biggest flagship store, open to everyone, 365 days a year.” Alan Stewart, the company’s finance director, even went so far as to claim at the annual meeting that nothing “had gone wrong” with the website’s launch. In May, the London Telegraph reported, Stewart set out to trivialize the website problems M&S customers were complaining about, comparing them to shopping at a supermarket where the milk has been moved and not being able to find it right away. Stewart repeated the misplaced-milk analogy at the annual meeting, which was probably not a good idea since it didn’t go over well with M&S online shoppers the first time.

M&S customers (and shareholders) can be forgiven for being more than a bit jaded after hearing the three executives’ upbeat utterances at the annual meeting, especially knowing that the company spent three years developing and testing the website, all the while promising a unique as well as reliable shopping experience. Yet, within days of the website’s 18 February launch, it crashed spectacularly. Some M&S customers were so angry, the Daily Mail reported at the time, that they promised to boycott M&S until the old website, which had been operated by Amazon for the previous seven years, was brought back. M&S executives seem to have conveniently forgotten about that incident in their "no problems with the website" remarks last week.

While the website crash was exasperating, adding to M&S online customers’ angst was the discovery that they would have to re-register their personal credentials on the new site. That would be a minor annoyance in the general scheme of things, except that many long-time customers reported that they were having (and continue to have) difficulties re-registering. In fact, M&S admits that only half of the 6 million previously registered customers have re-registered on its new site.

Of course, the 3 million who haven’t re-registered may have decided it wasn’t worth the effort. According to the Guardian, things are so bad that many M&S online shoppers are giving up on the new site because they view it as “tricky to use” and “unfriendly.” Customers have complained, among other things, that searching for M&S products has proven not only clumsy and slow but also erroneous. Adding to shoppers’ frustration is the fact that a non-trivial amount of merchandise offered for sale online seems to be tagged as being “out of stock.”

In May, an article in the Economist indicated that Bolland blamed the current merchandise stocking issues on a “15-year legacy of not investing” in IT and logistics by the company’s previous management. Shareholders might want to ask, then, what was the outcome of the company’s £400 million IT investment strategy begun in 2009 specifically aimed at improving IT and logistics? Was the investment strategy a complete bust?

M&S doesn’t have a lot of time to overcome its existing customers’ exasperation with its website, as well as begin the process of wooing new customers to it. If the site isn’t fixed before the holiday season as promised and it continues to be a drag on company revenue, shareholders may lose patience with the company’s management and call for their resignations. M&S executives may have decided to beat shareholders to the punch, however.

Soon after the poor quarterly results were declared at the annual meeting, M&S announced that its financial director Alan Stewart had decided to “defect” to rival U.K. retailer Tesco.  Financial analysts told the Financial Times that the move was not a good sign: It may be an indication that despite all of Stewart’s positive talk, he was really “not optimistic about the success of the turnaround at M&S.”

Maybe Stewart’s misplaced milk actually really is out of stock, and no one knows how to re-order it.

In Other News …

Hilo Hawaii Dud 4th of July Fireworks Display Blamed on Computer Issues

UK Domino’s Pizza Charges £180,000 for a Large Margherita

Nevada USGS Earthquake Reporting System Had “Glitch”

South African First National Bank Has Transactions Trouble

Danish Railway Experiences IT Problems

Billing Troubles Hit Virginia Tunnel Users Again

Ticket Woes Anger Irish Hurling Fans

US Government Tells Six States to Fix Medicaid Systems or Else

Computer Glitch Causes Thousands in Connecticut to Lose Health Insurance

Software Flaw Can Cause Unintended Acceleration in Honda Fit and Vezel Hybrids

14,000 Men Aged 117 to 121 Ordered to Register for Draft

Mon, 14 Jul 2014 13:00:00 GMT
Thousands of Bags Miss Flights at Heathrow Terminal 5 Again Awakens bad memories of T5 baggage system meltdown of 2008
Photo: Nathan King/Alamy

IT Hiccups of the Week

Here's some glitch déjà vu from 2008, namely another baggage system miscue involving British Airways (BA) at Heathrow International Airport in London. As you may remember, in March 2008, BA and Heathrow operator British Airports Authority (now known as Heathrow Airport Holdings) opened the long-awaited BA Terminal 5 with great fanfare, with BAA loudly proclaiming the “world-class” baggage system was “tried, tested and ready to go.” No Denver International Airport baggage system-like problems for them! And BA would finally shed its deservedly poor reputation as the top airline for losing luggage.

Of course, such publicly stated optimism about the reliability of automation rarely goes unpunished. Almost immediately, a massive meltdown of the baggage system on the first day of T5’s operation led to more than 28,000 passenger bags piled high across the terminal, hundreds more being lost, and some 15 percent of BA flights being cancelled over the course of nearly a week. It took three weeks before the majority of bags were reunited with passengers. The embarrassment for both BA and Heathrow management was acute, as was BA passenger rage, to say the least.

The nightmares of that week have slowly receded from BA passengers' memories. That is, until Friday, 27 June, when London papers like the Daily Mail reported that T5’s automated baggage system had suffered another major IT failure, with bags having to be handled manually again. As a result, thousands of BA passengers were sent (unknowingly) on their way without their luggage, including those passengers transiting through London via T5. The Mail quoted a BA spokesperson as saying, “On Thursday morning, the baggage system in Terminal 5 suffered an IT problem which affected how many bags could be accepted for each flight… We are very sorry for the difficulties this has caused and we have been working hard with the airport to make sure we reunite all of our customers with their luggage as quickly as possible.”

The BA spokesperson failed to point out that the phrase “how many bags could be accepted for each flight” actually meant no bags were accompanying their owners on an untold number of BA flights. BA also insisted to the press that they stop saying that passenger bags were lost; the bags merely “missed” their flights, BA pouted.  

A short two-paragraph Heathrow Airport Holdings press release did BA one better at trying to downplay the baggage system problem, stating that it affected only “some bags,” and that flights were in fact operating “normally.” You have to love press statements that are totally true but also totally disingenuous.

BA passengers on Thursday were naturally displeased at traveling without their bags, but at least they got to their destinations, unlike those flying out of T5 last September, when another, very short-lived IT problem with the baggage system prevented hundreds of passengers from ever boarding their flights; they had to be rebooked onto new ones, many the next day.

While BA passengers from June 27 were naturally miffed, what BA and Heathrow’s operator failed to make clear until early this week was that the “intermittent” IT problems with T5’s baggage system had actually begun on Thursday, 26 June and continued well into Sunday, 29 June. I am sure that many BA passengers flying out of T5 on June 28 and 29 would have changed airlines if they knew the full extent of the baggage problems. Conveniently, neither BA nor the airport operator came forward with the information about the multi-day operational problem until Tuesday, 1 July. Nor have they disclosed the total number of bags or passengers inconvenienced.

Both BA and Heathrow Airport Holdings are in damage control mode as BA passengers, many of them famous, have taken to social media to lambast them both. Many passengers, for example, have complained that when they finally did receive their bags, they had been ransacked with items stolen from them. Others complained that their journeys were over by the time their bags finally reached them.

BA put out another press release blaming international airline security rules for bags being opened as well as delayed, and further promised to look into the ransacking claims. A BA spokesperson went on to apologize, stating, “We are very sorry that this process is taking longer than anticipated, and we fully understand the frustration that this is causing.” Heathrow Airport Holdings’ new CEO John Holland-Kaye also apologized, saying the IT problem had taken too long to resolve and that the airport needs “to do better.” Disclosing IT problems while they are occurring would be a good start.

The BA spokesperson went on to warn that it would still take “several days” before all the bags that “missed” their flights are reunited with their owners. BA also indicated that because of the number of bags involved, its bag tracking system was not working as it should, which could further add to the delays.

BA is reminding its customers flying out of T5 that, “You may wish to carry essential items in hand baggage where possible.” That is probably good advice. ComputerWorldUK reports that Heathrow Airport Holdings is remaining very tight-lipped over what caused the baggage system fault and why it took four days to fix, which is rarely a sign that everything is under control.

In Other News…

Florida’s DMV Computer System Back Online

Bombay Stock Exchange Recovers from Outage

New Zealand Exchange Suffers IT Glitch

DNS Error Hits British Telecom

Irish Drivers Avoid Parking Fines in County Clare Due to Computer Error

PayPal Error Blocks CERN and MIT anti-Spying ProtonMail Fundraising Efforts

Microsoft Anti-crime Operation Disrupts Legitimate Servers

UK Adult Content Filters Hit 20 Percent of Legal Popular Sites

Goldman Sachs Gets Court to Order Google to Block Misdirected Email

HHS IG Reports Say Federal and State Health Insurance Exchange Controls Very Weak


Thu, 3 Jul 2014 15:11:00 GMT
Outages Galore: Microsoft, Facebook, Oz Telecom Users Are an Unhappy Lot Multiple IT systems fall down and have a hard time getting back up
Illustration: Randi Klett

IT Hiccups of the Week

We go on an IT Hiccups hiatus for a week and, wouldn’t you know it, Facebook does a worldwide IT face plant for thirty minutes while mobile phone users of two of the three largest telecom providers in Australia, Optus and Vodafone, coincidentally suffer concurrent nationwide network outages for hours on the same day. Microsoft follows that with back-to-back Office 365-related outages, each lasting more than six hours. In addition, there were system operational troubles in Finland, India and New York, to name but a few. So, we decided to focus this week’s edition of IT problems, snafus and snarls on the recent outbreak of reported service disruptions around the world, as well as the sincere-sounding but ultimately vacuous apologies that now always accompany them.

Our operational oofta review begins last Tuesday, when Microsoft’s Exchange Online was disrupted for some users from around 0630 until almost 1630 East Coast time, leaving those affected without email, calendar and contact information capability. The disruption was somewhat embarrassing for Microsoft, which likes to tout that its cloud version of Office 365 is effectively always available (or at least 99.9 percent of the time).
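For context, a 99.9 percent availability promise is a tight budget. A quick back-of-the-envelope calculation (my arithmetic, not the wording of Microsoft's actual SLA) shows how badly a roughly ten-hour disruption overshoots it:

```python
# What does "available 99.9% of the time" allow in a 30-day month?
MINUTES_PER_MONTH = 30 * 24 * 60               # 43,200 minutes
allowed_downtime = MINUTES_PER_MONTH * 0.001   # the 0.1% that may be down

# The reported disruption ran from roughly 0630 to 1630 East Coast time
outage_minutes = 10 * 60

print(f"Monthly downtime budget: {allowed_downtime:.0f} minutes")
print(f"This outage: {outage_minutes} minutes, "
      f"about {outage_minutes / allowed_downtime:.0f}x the budget")
```

In other words, one such incident burns through more than a year's worth of 99.9-percent downtime allowance.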

PC Advisor reported that Microsoft's investigation into the outage “determined that a portion of the networking infrastructure entered into a degraded state. Engineers made configuration changes on the affected capacity to remediate end-user impact.” Microsoft later explained that the failure uncovered a “previously unknown error” that took some time to correct. Microsoft has chosen to remain mum, however, about how many Exchange users were affected by the interruption of service.

Redmond Magazine published Microsoft’s upbeat apology for the disruption which stated, “We sincerely apologize to customers for any inconvenience this incident may have caused and continuously strive to improve our service and using these opportunities to drive even greater excellence in our service delivery.”

One service improvement suggestion Exchange users vigorously made to Microsoft during the outage was to actually indicate on its service health dashboard that there was, in fact, an outage occurring—something that apparently wasn’t effectively done. Microsoft later admitted sheepishly that the lack of timely notice was due to a problem with the publishing process for the dashboard itself. Users also strongly suggested that Microsoft refrain from using the words “delay” and “opportunities” together when describing future day-long outages, since those words didn’t seem to fit their experience.

Microsoft’s problems on Tuesday were preceded on Monday by “issues” experienced in North America (and reportedly by some outside NA) by users of its Lync instant messaging service, which is also part of the Office 365 suite (which, we should note, is available as a standalone product, as is Exchange).  Microsoft indicated that the issues involved its “network routing infrastructure.” The outage began at about 0700 East Coast time, with service disruptions still being reported into early Monday evening.

Interestingly, no one seemed to report on Microsoft’s apology for the Lync outage, which may be because the Exchange outage came so quickly on its heels, or perhaps there are so few users solely dependent upon Lync for their communications that there was really no one around to apologize to.

Microsoft’s dual outages were themselves preceded by a global outage of Facebook that took place the previous week on Thursday, 19 June. That oofta took place at about 0400 East Coast time, and lasted for only about 30 minutes. However, even that short period of time apparently left some of its 1.2 billion users “frustrated,” at least according to London’s Daily Mail.

Facebook initially sent out the expected pro-forma apology, “We're sorry for any inconvenience this may have caused.” Later, however, Facebook spokesman Jay Nancarrow expanded on Facebook's “Sorry, something went wrong” website message users encountered during the service interruption:

“Late last night, we ran into an issue while updating the configuration of one of our software systems. Not long after we made the change, some people started to have trouble accessing Facebook. We quickly spotted and fixed the problem...  This doesn't happen often, but when it does we make sure we learn from the experience so we can make Facebook that much more reliable.”

Also on that same Thursday and in a weird coincidence, first Vodafone and then Optus mobile phone customers in Australia reported that they were unable to make calls or send text messages for most of the day. According to a story at the Sydney Morning Herald, Vodafone’s problem began Thursday morning in Western Australia and then soon spread across the country. The paper said that the service disruption stemmed from a combination of a faulty repeater on the primary fiber link connecting Western Australia to the rest of Australia and a back-up cable failing as well. It took until 1830 AEST for the outage to be completely fixed, the Herald reported.

According to a story in the Financial Review, the Optus problem involved an issue with the IMEI (International Mobile Equipment Identity) numbers, which are the distinctive identifiers given each mobile phone. The Review stated that beginning at about 1300 AEST, “the unique IMEI numbers of several thousand mobile handsets were accidentally deemed to be stolen” in the Optus central database, thus, “blocking them from using Optus's mobile network for several hours.” The problem wasn’t completely sorted until after 2100 AEST. Unfortunately for us curious types, how the IMEI error occurred was not explained.
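The mechanics are simple to sketch: mobile networks keep a register of IMEIs flagged as stolen and refuse service to any handset on it, so a database error that wrongly flags a phone locks its owner out. A minimal illustration of that kind of check (all names and numbers here are hypothetical, not Optus's actual system):

```python
# Hypothetical sketch of an equipment-identity blocklist check, the
# kind of lookup a carrier runs against its central database before
# letting a handset onto the network.

BLOCKLIST = {"356938035643809"}  # IMEIs flagged as stolen (made-up value)

def may_attach(imei: str) -> bool:
    """Allow a handset onto the network unless its IMEI is flagged."""
    return imei not in BLOCKLIST

# A database error that wrongly flags an IMEI blocks an innocent phone:
BLOCKLIST.add("490154203237518")      # accidental "stolen" flag
print(may_attach("490154203237518"))  # the handset is now refused service
```

Multiply that one bad flag by "several thousand mobile handsets" and you get an afternoon's worth of blocked Optus customers.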

Optus said it was “sorry” for the problem and indicated that the company would give a service credit to compensate customers for their troubles. Similarly, Vodafone’s Chief Technology Officer said he was “sorry” for the telecom's service oofta and indicated that the company's customers would receive unlimited data usage within Australia for a weekend as its compensation.

And in even more of a fluke, a hardware failure that same Thursday afternoon in the Melbourne region of Australia took out Telstra's ADSL service for a couple of hours as well. This outage meant that the three largest telecom companies in Australia were experiencing some sort of newsworthy IT-related failure simultaneously.

Returning to last week, the Finnish broadcasting company (Yle) news program Yle Uutiset reported Tuesday that the Finland-wide government passport system fault had been resolved. Yle stated that the fault involved “upgrades to the encryption system over midsummer [that] had prevented the passage of information between police and the Population Register Centre.” While Finnish officials at first feared that the fault would prevent the issuance of new passports for possibly weeks, Yle reported that a solution to the problem had been found within a day.

Then last Thursday morning, the India Times reported that the “technical problems” that knocked out the websites of several Indian government ministries, including finance and defense, as well as the Prime Minister’s Office for several hours on Wednesday evening had been fixed. There were no details in the Times as to the root cause of the outage, other than an indication that it was equipment-related and not a cyber-attack.

Finally, Time Warner Cable announced last week that the email problem that affected many of its customers in Central New York State for several weeks was finally resolved. The problem was eventually traced to a “database used by its email servers.” Even though Time Warner early on flagged the issue as a “top priority” to be fixed, a solution proved frustratingly harder to implement than first thought.

Time Warner spokesman Scott Pryzwansky said in a statement to its customers that, “We apologize if you have been affected by this inconvenience. Providing you with reliable service is our top priority.”

Time Warner may want to work on improving the sincerity of its apology, though. Pryzwansky neglected to mention that Time Warner would embrace the opportunity the weeks-long service oofta offered to improve its customers’ service delivery, à la Facebook and Microsoft.

Well, maybe next time.

In Other News…

Australian Westpac Subsidiary St George Bank Loses Online Banking Services

GLONASS April Outage Explained

Intercontinental Exchange Glitches Twice Suspend Trading

Illinois Driver License Facilities Shut Due to Mainframe Issues

UK Energy Regulator Tells Npower to Fix Persistent Billing Problems or Else

Computer Problems Delay Flights From Israel’s Ben-Gurion Airport

Tennessee Department of Children’s Services IT System Still Suffering Costly Glitches

Problems Linger in New Hampshire's Medicaid Billing Computer System

Target Experiences Nationwide Checkout Problem

Nebraska Racetrack Overpays $6,000 on Wager

Sprint Cable Cut Takes out Emergency 911 in Parts of Oregon and Washington State

Computer Glitch Frees 25 Inmates from Dallas Jail

UK Tax Agency Computer System Riddled with Errors

Dutch Safety Agency Warns About Potential ILS Interference with Autopilots

Glitch and Bad Decisions Led to Drone Crashing into US Navy Ship

GM Recalls 425,000 SUVs and Trucks for Software Fix

USA-Germany World Cup Match Overloads ESPN Streaming Services

Veteran Affairs Computer Insists Veteran is Dead; He Disagrees

Mon, 30 Jun 2014 12:00:00 GMT