Amazon blames website glitches on AI

Evan Schuman

Report indicates Amazon’s engineering teams are actively developing strategies to mitigate site and system outages that executives attribute to AI integration.

The Amazon logo displayed on a company building. Credit: Prathmesh T / Shutterstock

Amazon reportedly called an urgent engineering meeting this past Tuesday to address a string of operational failures linked to its AI tools, as detailed in a *Financial Times* report.

“The e-commerce giant noted a ‘pattern of incidents’ recently, characterized by a ‘broad impact’ and ‘generative-AI-supported modifications,’” according to internal documentation for the mandatory session, the FT revealed. “Among the ‘contributing factors,’ the memo listed ‘novel genAI usage for which best practices and safeguards are not yet fully established.’”

The article cited Dave Treadwell, a senior vice president within Amazon’s engineering division, who stated in the same note that “junior and mid-level engineers will now require approval from more senior engineers for any AI-assisted changes.”

However, Chirag Mehta, a principal analyst at Constellation Research, suggested that mandating senior engineer sign-off might inadvertently negate the primary advantage of AI strategies: improved efficiency.

“If every AI-assisted modification now demands a senior engineer’s scrutiny of every difference, the organization forfeits a significant portion of the speed gains it initially sought,” Mehta explained. “The true solution involves shifting the review process earlier and enforcing it automatically: deploying policy checks pre-release, implementing tighter controls on the impact scope for high-risk services, making canary deployments mandatory, enabling automatic rollbacks, and establishing stronger traceability so teams always know which changes were AI-powered, who authorized them, and what production behaviors subsequently altered.”
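Mehta’s prescription maps onto a familiar deployment-gate pattern. As a rough sketch of what such an automated gate might look like, here is a minimal Python illustration; the data model, helper names, telemetry values, and thresholds below are hypothetical, not anything Amazon or Constellation Research has published:

```python
from dataclasses import dataclass

@dataclass
class Change:
    change_id: str
    ai_assisted: bool   # traceability: was this diff generated with AI assistance?
    approved_by: str    # reviewer who signed off; empty string if nobody did
    blast_radius: str   # "low" | "medium" | "high": impact scope of the target service

MAX_CANARY_ERROR_RATE = 0.01   # hypothetical rollback threshold

# Stand-in for real canary telemetry; a production gate would measure
# error rates on a small slice of live traffic instead.
FAKE_CANARY_ERROR_RATES = {"c-101": 0.002, "c-102": 0.004, "c-103": 0.035}

def deploy_with_guardrails(change: Change, log: list[str]) -> bool:
    # 1. Pre-release policy check: high-impact AI-assisted changes are
    #    blocked unless a senior reviewer has explicitly signed off.
    if change.ai_assisted and change.blast_radius == "high" and not change.approved_by:
        log.append(f"{change.change_id}: rejected (unapproved high-impact AI change)")
        return False

    # 2. Mandatory canary with an automatic rollback trigger.
    rate = FAKE_CANARY_ERROR_RATES[change.change_id]
    if rate > MAX_CANARY_ERROR_RATE:
        log.append(f"{change.change_id}: rolled back (canary error rate {rate:.1%})")
        return False

    # 3. Traceability: record that the change shipped, whether it was
    #    AI-assisted, and who approved it.
    log.append(f"{change.change_id}: promoted (AI-assisted={change.ai_assisted}, "
               f"approved_by={change.approved_by or 'n/a'})")
    return True

log: list[str] = []
deploy_with_guardrails(Change("c-101", True, "", "high"), log)          # blocked pre-release
deploy_with_guardrails(Change("c-102", True, "sr-eng-7", "high"), log)  # promoted
deploy_with_guardrails(Change("c-103", True, "sr-eng-7", "low"), log)   # auto-rolled back
print("\n".join(log))
```

The sample run shows both of the failure modes Mehta describes being caught without a human in the path: the unapproved high-impact change is blocked before release, and the regressing canary is rolled back before it reaches the full fleet.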

The approval mandate follows a series of AI-related incidents that disrupted Amazon and AWS services, including a nearly six-hour Amazon site outage earlier this month and a 13-hour disruption of an AWS service in December.

System Glitches Are to Be Expected

Analysts and consultants contend it’s unsurprising that companies like Amazon are encountering embarrassing issues as non-deterministic systems are deployed at scale. Human oversight is a sound strategy only if there are enough people to realistically cover the scope of such deployments. In healthcare, for instance, expecting one person to approve 20,000 test results in an eight-hour shift (roughly 42 results per minute, or under 1.5 seconds each) doesn’t constitute effective control; it merely positions that individual to bear responsibility for inevitable testing errors.

Acceligence CIO Yuri Goryunov emphasized that glitches like these were always going to be an unavoidable part of the process.

“From my perspective, these are typical growing pains and logical next steps as a relatively new technology integrates into existing workflows. The benefits in productivity and quality are immediate and impressive,” Goryunov observed. “However, there are absolutely unforeseen complexities that require investigation, understanding, and resolution. As long as productivity gains outweigh the necessary remediation and validation efforts within established parameters, we will be fine. Otherwise, we’ll need to revert to conventional methods for that specific application.”

An ‘Irresponsible’ Strategy

Conversely, Nader Henein, a Gartner VP analyst, predicts that this issue will intensify. 

“These types of incidents are only going to become more frequent. The reality is that most organizations assume they can implement AI-assisted capabilities just as they would bring on a new employee, without adjusting the surrounding organizational framework,” Henein stated. “When we assign a task and a set of rules to an AI system, we might believe we have everything under control. But in truth, AI will pursue its objective within those rules by any means necessary, even if it uncovers inventive and occasionally alarming workarounds.

“It’s not that AI possesses malice. It simply doesn’t care. It lacks the boundaries, the empathy, or the intuitive judgment that most individuals develop over time.”

Given these considerations, Flavio Villanustre, CISO for the LexisNexis Risk Solutions Group, labeled the typical enterprise AI strategy as “irresponsible.”

“One could liken an AI system to a prodigy with a limited and unpredictable grasp of safety, to whom you grant permission to execute actions that could cause significant harm, all based on the promise of increased performance or cost reduction. This is very close to the definition of recklessness,” Villanustre asserted.

“At minimum, if this were approached conventionally, you would first test it in an isolated environment, verify the outcomes, and then deploy the actions to a production setting,” he noted. “Even if incorporating human oversight might reduce speed and somewhat diminish AI’s benefits, it represents the correct methodology for applying this technology today.”

Alternative Practical Tactics

Nevertheless, human involvement alone isn’t a comprehensive solution. There are other effective strategies to mitigate AI-related risks, according to cybersecurity consultant Brian Levine, executive director of FormerGov.

“Traditional quality assurance processes were never designed for systems capable of generating novel errors that no human has previously encountered. This is why simply increasing human oversight fails to resolve the core issue; it merely slows everything down while the underlying risk persists,” Levine explained. “AI introduces a new category of failure: ‘unknown-unknowns’ occurring at machine speed. These are not conventional bugs; they are emergent behaviors. You cannot simply patch your way out of this predicament.”

Worse still, Levine contended that these initial errors tend to spawn many more subsequent issues.

“AI doesn’t just make errors; it makes errors that propagate instantaneously. Enterprises require a distinct deployment pipeline for AI-assisted changes, featuring more stringent gating and automated rollback triggers,” he stated. “If AI is capable of writing code, your systems need the equivalent of financial-market circuit breakers to prevent cascading failures. This translates to automated anomaly detection that halts deployments before customers experience any impact.”
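Levine’s “circuit breaker” analogy comes from financial markets, where trading halts automatically once prices move too far, too fast. A minimal sketch of the same idea applied to rollouts follows; the class, thresholds, and behavior are illustrative assumptions, not a description of any system Levine or Amazon runs:

```python
import time

class DeploymentCircuitBreaker:
    """Trips when anomalies arrive faster than a set rate, halting further
    rollout until a human resets it -- analogous to a market trading halt."""

    def __init__(self, max_anomalies: int, window_seconds: float):
        self.max_anomalies = max_anomalies
        self.window_seconds = window_seconds
        self.anomaly_times: list[float] = []
        self.tripped = False

    def record_anomaly(self) -> None:
        now = time.monotonic()
        self.anomaly_times.append(now)
        # Keep only anomalies inside the sliding window.
        cutoff = now - self.window_seconds
        self.anomaly_times = [t for t in self.anomaly_times if t >= cutoff]
        if len(self.anomaly_times) >= self.max_anomalies:
            self.tripped = True  # halt every further rollout step

    def allow_rollout_step(self) -> bool:
        return not self.tripped

# Example: trip after 3 anomalies within 60 seconds.
breaker = DeploymentCircuitBreaker(max_anomalies=3, window_seconds=60)
for _ in range(3):
    breaker.record_anomaly()
print(breaker.allow_rollout_step())  # False: rollout halts before customers are hit
```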

He observed that the objective isn’t to monitor AI more intensely, but rather to afford it “fewer opportunities to cause malfunctions.” Strategies like sandboxing, capability throttling, and designing with guardrails first prove significantly more effective than attempting to manually review every modification.
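Capability throttling in this sense usually means an explicit allowlist: the agent can only invoke operations it has been granted, no matter what it generates. A toy sketch under that assumption, with invented operation names:

```python
# Deliberately absent from the grant: anything touching production data or payments.
ALLOWED_OPERATIONS = {"read_catalog", "draft_reply", "update_staging_config"}

class CapabilityError(PermissionError):
    pass

def execute_agent_action(operation: str, payload: dict) -> str:
    """Guardrails-first: the check runs before the action, so a creative
    workaround by the model still cannot reach an ungranted capability."""
    if operation not in ALLOWED_OPERATIONS:
        raise CapabilityError(f"operation {operation!r} is not granted to this agent")
    return f"executed {operation} with {payload}"

print(execute_agent_action("read_catalog", {"sku": "B00TEST"}))
try:
    execute_agent_action("delete_production_table", {"table": "orders"})
except CapabilityError as e:
    print(f"blocked: {e}")
```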

Levine further advised: “While AI can expedite development, your foundational infrastructure should always incorporate a human-authored fallback. This guarantees resilience when AI-generated changes exhibit unpredictable behavior.”

A Dedicated Operating Model Is Essential

Manish Jain, a principal research director at Info-Tech Research Group, concurred. He argued that the Amazon situation doesn’t primarily demonstrate AI’s propensity for more errors, but rather that AI now operates at a scale where even minor mistakes can have “a vast impact radius” and potentially pose “an existential threat” to the organization.

“The risk isn’t necessarily that AI will make mistakes,” he asserted. “The real danger is that it dramatically reduces the window for humans to intervene and correct a catastrophic trajectory. With the rise of agentic AI, time-to-market has shrunk dramatically. However, governance frameworks have not evolved sufficiently to contain the risks generated by this accelerated technological pace.”

Jain emphasized, however, that simply adding more personnel isn’t a solution by itself. Human oversight must be implemented thoughtfully, which means honestly assessing how much review one individual can realistically provide.

“While a human-in-the-loop approach appears sensible, it is not a cure-all,” Jain remarked. “At scale, the loop quickly outpaces human processing capabilities. Human-in-the-loop cannot be the sole solution for every challenge presented by agentic AI. It must be supplemented by ‘human-over-the-loop’ controls, informed by factors such as autonomy, impact scope, and irreversibility.”
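One way to read Jain’s distinction: rather than reviewing every action, humans set a risk policy and are pulled in only when an action crosses it. A hypothetical scoring sketch along the three factors he names; the scores, thresholds, and tier names are invented for illustration:

```python
def review_tier(autonomy: int, impact_scope: int, irreversible: bool) -> str:
    """Route an agent action to a level of human involvement based on
    autonomy (1-5), impact scope (1-5), and irreversibility."""
    score = autonomy + impact_scope + (5 if irreversible else 0)
    if score >= 10:
        return "human-in-the-loop"    # blocking approval before the action runs
    if score >= 6:
        return "human-over-the-loop"  # action runs; humans monitor and can revert
    return "autonomous"               # logged only

# A reversible, low-impact action runs on its own; an irreversible,
# high-impact one still requires a blocking human approval.
print(review_tier(autonomy=2, impact_scope=1, irreversible=False))  # autonomous
print(review_tier(autonomy=4, impact_scope=5, irreversible=True))   # human-in-the-loop
```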

Mehta added, “AI fundamentally alters the nature of operational risk, not just its quantity. These systems can generate code or modify instructions that appear credible, pass cursory reviews, yet still introduce hazardous assumptions in edge cases.

“This implies that companies need a distinct operational framework for AI-assisted production changes, particularly in critical customer journeys like checkout, identity verification, payments, and pricing. These are precisely the types of workflows where the tolerance for experimentation should be extremely low.”
