The AWS Cost Optimizations That Actually Move the Needle

Most AWS cost optimization advice focuses on the basics:

Delete unused snapshots
Clean up old EBS volumes
Purchase Savings Plans
Remove idle resources

These are important and should absolutely be part of every team's cost optimization strategy. But in my experience, that's where many teams stop.

Once those basic optimizations are implemented, there's often a belief that the AWS environment is now fully optimized and there isn't much more to save.

That's rarely true.

The biggest cost reductions I've seen didn't come from cleaning up resources or buying discounts. They came from questioning architectural decisions and understanding the tradeoffs behind them.

Here are the mistakes I see most often.

1. Using Kubernetes Too Early

Many startups adopt EKS long before they actually need Kubernetes.

Sometimes they're planning for future scale. Sometimes it's driven by FOMO.

The result:

Higher infrastructure costs
Increased operational complexity
More engineering time spent maintaining clusters

For many teams, ECS is sufficient for years.

2. Paying for Production Reliability Everywhere

Development and staging environments often use the same architecture as production.

That means:

NAT Gateways (in some non-production environments, a NAT Instance may be a more cost-effective alternative)
Multi-AZ databases
Large instance sizes

even when the workloads don't justify them.
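One way to make that decision explicit is to keep a separate profile per environment instead of copying production settings everywhere. Below is a minimal Python sketch of the idea. The profile fields, the environment names, and the instance classes are illustrative assumptions, not recommendations for any particular workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvProfile:
    """Hypothetical infrastructure settings for one environment."""
    multi_az_database: bool
    db_instance_class: str
    nat: str  # "gateway" or "instance"


# Illustrative values only; size these from your own workload data.
PROFILES = {
    "production": EnvProfile(True, "db.r6g.xlarge", "gateway"),
    "staging": EnvProfile(False, "db.t3.medium", "instance"),
    "development": EnvProfile(False, "db.t3.small", "instance"),
}


def profile_for(environment: str) -> EnvProfile:
    """Return the profile for an environment, failing loudly on unknown names."""
    try:
        return PROFILES[environment]
    except KeyError:
        raise ValueError(f"no profile defined for {environment!r}") from None
```

Failing on unknown names is a deliberate choice here: a new environment has to declare its reliability level rather than silently inheriting production's.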

3. Running Non-Production Environments 24/7

This is one of the easiest cost optimizations I see teams overlook.

Many development and staging environments run continuously, even though nobody is using them during nights, weekends, or holidays.

Recently, I worked with an early-stage startup whose non-production infrastructure was running 24/7 despite the team only using it during business hours.

By introducing automated start and stop schedules, we were able to significantly reduce compute costs without affecting developer productivity.

If your staging environment is only used Monday through Friday, ask yourself:

Does it really need to be running on Saturday morning?

For many startups, scheduling non-production environments can reduce infrastructure costs by 20-30% with very little engineering effort.

Some of the AWS services that are commonly left running unnecessarily include:

EC2 Instances
ECS Services
RDS Instances
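The scheduling rule itself is simple. Here is a minimal Python sketch of the decision a scheduler would make, assuming a Monday-to-Friday, 08:00-20:00 window in the team's local time. The window, the constant names, and the function names are assumptions for illustration; the actual start and stop calls to AWS are left out.

```python
from datetime import datetime

# Assumed working window; adjust to how your team actually works.
WORKDAYS = {0, 1, 2, 3, 4}  # Monday=0 .. Friday=4
START_HOUR = 8              # inclusive, local time
STOP_HOUR = 20              # exclusive, local time


def should_be_running(now: datetime) -> bool:
    """True if a non-production environment should be up at this moment."""
    return now.weekday() in WORKDAYS and START_HOUR <= now.hour < STOP_HOUR


def weekly_uptime_fraction() -> float:
    """Share of the 168-hour week that the environment stays up."""
    running_hours = len(WORKDAYS) * (STOP_HOUR - START_HOUR)
    return running_hours / (7 * 24)


if __name__ == "__main__":
    saturday_morning = datetime(2024, 1, 6, 10, 0)
    print(should_be_running(saturday_morning))  # False
    print(f"{weekly_uptime_fraction():.0%} of the week running")
```

With this window the environment runs 60 of 168 hours. The saving on the overall bill is smaller than the drop in compute hours, because storage and any always-on resources keep billing while instances are stopped.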

4. Ignoring VPC Endpoints

Many workloads route traffic through NAT Gateways unnecessarily. AWS charges for every gigabyte of data processed by a NAT Gateway.

Using VPC endpoints for S3, ECR, and other AWS services keeps that traffic off the NAT Gateway and can significantly reduce networking costs.

A Real Example: Reducing NAT Gateway Traffic for ECR

During an AWS cost review for one of our clients (a large environment: 300+ nodes in an EKS cluster), we noticed a surprising amount of traffic flowing through NAT Gateways.

After investigating, we found that EKS pods were pulling container images from ECR through the NAT Gateway.

Since Amazon ECR ultimately stores image layers in Amazon S3, we introduced:

An ECR API interface endpoint (ecr.api)
An ECR Docker registry interface endpoint (ecr.dkr)
An S3 gateway endpoint

After updating the environment, image pulls no longer depended on the NAT Gateway.

This reduced NAT Gateway data processing charges and lowered overall networking costs without changing the application itself.

Example calculation:

300 worker nodes each pull ~2 GB of data from ECR per day
300 nodes × 2 GB = 600 GB/day
600 GB × 30 days = 18,000 GB/month
NAT data processing charges: 18,000 GB × $0.045/GB = $810/month
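To rerun the same estimate with your own numbers, the arithmetic above fits in a few lines of Python. This is a rough sketch: the function name and parameters are made up for illustration, and the default rate is the $0.045 per GB used in the example, which you should check against current pricing for your region.

```python
def nat_processing_cost(nodes, gb_per_node_per_day, days=30, rate_per_gb=0.045):
    """Estimate monthly NAT Gateway data processing charges in USD.

    rate_per_gb defaults to the figure used in the example above;
    treat it as an assumption, not a quoted price.
    """
    monthly_gb = nodes * gb_per_node_per_day * days
    return round(monthly_gb * rate_per_gb, 2)


if __name__ == "__main__":
    monthly = nat_processing_cost(nodes=300, gb_per_node_per_day=2)
    print(f"${monthly:,.2f}/month")  # $810.00/month
```

Keep in mind that interface endpoints carry their own hourly and per-GB charges, so the net saving is somewhat lower than the NAT figure alone.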

5. Choosing Managed Services Without Considering the Premium

Managed services are valuable.

But every managed service introduces a convenience tax.

Sometimes that tradeoff is worth it.

Sometimes it isn't.

6. Overprovisioning Databases

I regularly see databases sized for future traffic instead of current traffic.

Most startups can scale databases later.

Paying for unused capacity today rarely makes sense.

Final Thought

Most teams approach AWS cost optimization by looking for discounts.

I prefer to start by questioning decisions.

Do we really need Kubernetes?
Does this environment need production-grade reliability?
Should this workload be running 24/7?

To me, that's the difference between operating infrastructure and engineering infrastructure.

Operations teams optimize resources.

Great platform teams optimize decisions.

Until next time…