When AWS Has a Bad Day: What to Do When Cloud Outages Drive Unexpected Costs

 

By Resourcive 



Executive Summary

On October 20, 2025, AWS experienced a widespread service event in its US-East-1 region — impacting compute, networking, and application services for much of the day.

While Amazon resolved the issue that evening, many organizations are now dealing with something less visible but equally frustrating: the financial ripple effects of an outage.

At Resourcive, we’ve seen these scenarios before. Outages don’t just disrupt operations — they create cost noise: unexpected usage spikes, idle workloads that quietly burn money, and temporary failover charges that never should’ve existed in the first place.

Let’s unpack what happened, how it might show up in your AWS bill, and what actions you can take right now to protect your bottom line.



What Actually Happened

Between 3:00 AM and 6:00 PM ET, AWS services in US-East-1 experienced degraded performance caused by a DNS resolution failure.

Core services like EC2, Lambda, API Gateway, and Elastic Load Balancing were affected.

As connectivity wavered, applications began retrying operations, failover systems kicked in, and workloads accumulated idle time waiting for resolution.

The outage is over, but for many, the billing impact is still unfolding.



Where Outages Create Hidden Spend

  1. Idle Resources Running Without Output: Instances, databases, or load balancers stayed “on” but delivered no value. The meter kept running, but productivity didn’t.

  2. Retry Storms and Over-Execution: Automated retry logic caused functions and APIs to execute multiple times.
    • Lambda timeouts and re-runs
    • API Gateway request floods
    • Excessive CloudWatch logging
    • Data transfer bursts
    Each of these is small on its own, but multiplied at scale they add up fast.

  3. Failover and Recovery Costs: When workloads shifted to other regions or backup environments, extra compute, storage, and cross-region data transfer charges followed. These are easy to overlook but often material.



How to Identify If You Were Impacted

Start with Cost Explorer or your preferred cloud analytics platform.

Filter for US-East-1 and the outage window (3 AM–6 PM ET on Oct 20). Cost Explorer reports usage in UTC, so that window is 07:00 to 22:00 UTC.

Look for:

  • Unusual cost or usage spikes

  • Elevated retry metrics in CloudWatch

  • Logs showing connection errors or timeout loops

  • Region failover events

If you see variance outside your normal pattern, capture it now — AWS will require supporting data for any SLA or billing claim.
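To make "variance outside your normal pattern" concrete, here is a minimal sketch in Python. It assumes you have already exported hourly cost figures for US-East-1 into a dictionary keyed by timezone-aware timestamps; the function name `flag_outage_spikes`, the 1.5x threshold, and the choice of a same-hour median as the baseline are illustrative assumptions, not an AWS feature.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

# Assumption: US Eastern time was on daylight time (UTC-4) on Oct 20, 2025.
EDT = timezone(timedelta(hours=-4))
WINDOW_START = datetime(2025, 10, 20, 3, 0, tzinfo=EDT)
WINDOW_END = datetime(2025, 10, 20, 18, 0, tzinfo=EDT)


def flag_outage_spikes(hourly_costs, factor=1.5):
    """Return (timestamp, cost, baseline) for outage-window hours that
    cost more than `factor` times the usual cost for that hour of day.

    hourly_costs: dict mapping a timezone-aware hour start to cost in USD.
    The baseline is the median cost of the same hour on earlier days.
    """
    flagged = []
    for ts, cost in sorted(hourly_costs.items()):
        if not (WINDOW_START <= ts < WINDOW_END):
            continue  # only inspect hours inside the outage window
        hour_of_day = ts.astimezone(EDT).hour
        prior = [
            c for t, c in hourly_costs.items()
            if t < WINDOW_START and t.astimezone(EDT).hour == hour_of_day
        ]
        if not prior:
            continue  # no history for this hour, so nothing to compare
        baseline = median(prior)
        if cost > factor * baseline:
            flagged.append((ts, cost, baseline))
    return flagged
```

Each flagged hour, together with its baseline, is the kind of evidence worth saving alongside the matching CloudWatch metrics and logs.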



Filing for SLA Credits (and Beyond)

AWS’s compute SLA generally provides service credits when monthly uptime drops below defined thresholds. For this event, the roughly 15-hour window out of October's 744 hours suggests potential credit eligibility around the 98% uptime range (~10% service credit). Confirm the credit percentage against the SLA tiers in effect when you file.
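The uptime figure can be sanity-checked with a few lines of Python. This is a rough sketch that assumes the entire 3 AM to 6 PM window (15 hours) counts as unavailable, which the SLA's own definition of unavailability may not support for every workload; the function name `monthly_uptime_percent` is illustrative.

```python
from calendar import monthrange


def monthly_uptime_percent(year, month, downtime_hours):
    """Monthly uptime as a percentage, given hours of downtime."""
    days_in_month = monthrange(year, month)[1]
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


# Assumption: the full 15-hour outage window counts as downtime.
uptime = monthly_uptime_percent(2025, 10, 15)
print(f"{uptime:.2f}%")  # 97.98%
```

Substitute the downtime your own workloads actually recorded, since a shorter measured impact moves the percentage and possibly the credit tier.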

Here’s what to do:

  1. Document the evidence – hourly cost data, metrics, and error logs.

  2. Submit a support case via AWS Support Center.

    • Use the subject: “AWS Compute SLA Credit Request – October 20, 2025 Incident.”

    • Include all timestamps and impacted services.

    • Deadline to submit: December 31, 2025.

  3. Follow up if the credit isn’t reflected within two billing cycles.

Keep in mind: SLA credits typically address downtime — not cost overages from retries, data transfer, or failovers.
If you experienced those, file a separate cost recovery claim referencing the same incident.



Lessons for Cloud Cost Governance

Incidents like this reinforce why cost governance isn’t just about optimization — it’s about resilience.
You can’t control when AWS stumbles, but you can control how your environment responds financially.

A few takeaways:

  • Build multi-region redundancy into mission-critical workloads.

  • Monitor retry behavior and cap it (maximum attempts, backoff between attempts) to avoid runaway costs.

  • Use budget alerts and anomaly detection to surface issues early.

  • Treat incident-driven costs like any other operational risk — quantify them, report them, and recover them.
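One way to put limits on retries is a capped exponential backoff with jitter. The sketch below is a generic pattern, not a setting from any AWS SDK; the function name `call_with_bounded_retries` and the default values are assumptions chosen for illustration.

```python
import random
import time


def call_with_bounded_retries(operation, max_attempts=4, base_delay=0.5,
                              max_delay=8.0, sleep=time.sleep,
                              rng=random.random):
    """Run operation(), retrying on any exception up to max_attempts.

    The wait before each retry is a random fraction of a ceiling that
    doubles per attempt and never exceeds max_delay (full jitter).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Retry budget exhausted: surface the error instead of
                # looping (and billing) indefinitely.
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)
```

With the defaults, a call that keeps failing is attempted four times and waits at most 0.5 + 1 + 2 = 3.5 seconds in total, so a regional outage produces a bounded number of billable invocations per request rather than an open-ended storm.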

 



Final Thought

AWS handled the technical resolution quickly, but the financial cleanup falls on you.

Don’t let a transient service event turn into a permanent expense.

Resourcive helps enterprise and mid-market IT teams improve cloud cost visibility, governance, and procurement strategy, so when outages like this happen, you’re not left sorting through the wreckage alone.

Let’s talk.