AWS · Field Guide · March 16, 2026 · 6 min read · 1,218 words

How to Optimize AWS Costs (FinOps Practical Guide)


MOJAHID UL HAQUE

DevOps Engineer


AWS bills rarely jump because of one dramatic mistake. They grow through defaults nobody revisits: oversized compute, logs kept forever, public networking choices that accumulate fees quietly, idle staging resources, and storage that keeps billing long after the workload it supported disappeared. FinOps is valuable because it makes those patterns visible enough for engineers to improve them deliberately rather than noticing them only after a budget surprise.

A practical FinOps program is not about blind cost cutting. It is about linking spend to architecture, ownership, and business value. The strongest teams know which costs are strategic, which are accidental, and which are leftovers from decisions that no longer make sense.

Why this matters in production

Cost optimization matters because cloud waste competes directly with other engineering investment. Money spent on idle resources, overprovisioned data layers, or unnecessary transfer patterns is money not spent on performance, resilience, or product work. FinOps is also a governance discipline. Once costs are owned and measured, cloud design conversations become more rigorous and less emotional.

Implementation approach

A practical AWS optimization path starts with ownership and visibility, then moves to rightsizing, retention tuning, and only then to commitments such as Savings Plans. Compute, storage, logs, and networking should each have explicit review loops. Teams should understand utilization before they buy commitments and understand dependencies before they eliminate spend in ways that would damage resilience. Good FinOps is iterative and evidence-driven, not a one-time cleanup project.

bash
# List unattached EBS volumes that are still accruing charges
aws ec2 describe-volumes --filters Name=status,Values=available

# Show each log group's retention; null means "keep forever"
aws logs describe-log-groups --query 'logGroups[*].[logGroupName,retentionInDays]'

# Pull March 2026 spend (the End date is exclusive in Cost Explorer)
aws ce get-cost-and-usage \
  --time-period Start=2026-03-01,End=2026-04-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost

Real-world use case

Consider an application stack running ECS services, RDS, CloudWatch Logs, S3 backups, and a NAT-heavy private networking model. The quickest wins might come from moving suitable workloads to Graviton, shortening non-critical log retention, shutting down idle staging resources on a schedule, and redesigning public-IP-heavy networking into private routing with a cleaner ingress layer. None of those changes are glamorous, but together they can materially reduce spend while improving security and governance at the same time.
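
As a sketch of the first two wins, assuming a non-critical log group named /ecs/staging-app and a staging ECS service at staging-cluster/web (both placeholder names):

bash
# Shorten retention on a non-critical log group to 14 days
aws logs put-retention-policy \
  --log-group-name /ecs/staging-app \
  --retention-in-days 14

# Allow the staging service to scale down to zero tasks
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/staging-cluster/web \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 0 --max-capacity 2

# Stop staging every weekday evening (19:00 UTC); a matching
# morning action would restore capacity
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/staging-cluster/web \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name staging-nightly-stop \
  --schedule "cron(0 19 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=0,MaxCapacity=0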

Common mistakes and operating risks

The biggest mistakes are treating cost review as finance-only work and buying commitments before removing obvious waste. Another common trap is focusing only on compute while ignoring networking, logging, and storage behavior that often hide substantial spend. Teams also fail when optimization lacks ownership. A dashboard without a responsible engineer rarely changes architecture.
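
Before any commitment purchase, it is worth pulling the recommendations AWS already computes from real usage; a quick sketch:

bash
# Surface EC2 rightsizing recommendations before pricing commitments
aws ce get-rightsizing-recommendation --service AmazonEC2

# Preview a Compute Savings Plan recommendation against actual usage
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS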

When this pattern fits best

These practices fit any AWS environment large enough for cloud costs to shape planning or team behavior. They are especially useful for multi-account setups, container platforms, and organizations where platform engineering already manages shared defaults. Even smaller teams benefit because tagging, rightsizing, and retention policy prevent bad habits from turning into expensive normal behavior later.
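
A minimal starting point for the tagging half, assuming owner and environment as the tag keys (any consistent keys work, and they must already exist on resources before they can be activated):

bash
# Activate user-defined keys as cost allocation tags so they show up
# in Cost Explorer breakdowns; activation is not retroactive
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status TagKey=owner,Status=Active TagKey=environment,Status=Active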

Checklist

  • Tag resources by owner, environment, and application.
  • Review compute, logs, storage, and networking as separate cost domains.
  • Rightsize before buying Savings Plans or reservations.
  • Set default retention policies instead of relying on manual cleanup.
  • Treat cost anomalies as engineering signals, not only finance reports.

How to roll this out safely

The safest rollout path is usually narrower than teams expect. Start with one service, one environment, or one clear platform boundary and baseline the metrics that matter before changing everything at once. Document ownership, define rollback or fallback behavior, and review the first few changes with the people who will support the system during real incidents. That approach prevents architecture optimism from outpacing operational reality. Mature patterns spread well because they are tested in small steps first, not because they looked complete in a design document.
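
Baselining can be as simple as capturing per-owner spend for the pilot scope before the first change, assuming the owner cost allocation tag from earlier is active:

bash
# Capture a pre-change baseline of daily spend per owner tag
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=owner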

What to measure after adoption

Success should be visible in operating outcomes, not only in implementation status. Good patterns reduce surprise, shorten diagnosis time, improve release confidence, or create a more predictable cost and performance profile. If the change only adds process, dashboards, or YAML without improving those outcomes, the design is probably too heavy. Measure the behaviors that matter to responders and service owners, then simplify aggressively anywhere the pattern creates ceremony without making production safer or easier to understand.
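
One way to make cost drift a signal rather than a month-end surprise is Cost Anomaly Detection; a sketch of a service-level monitor:

bash
# Create a per-service anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName":"service-spend","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'

# Review anomalies found in the last review window
aws ce get-anomalies \
  --date-interval StartDate=2026-03-01,EndDate=2026-03-31

Pair the monitor with an anomaly subscription so alerts reach the owning team, not just a dashboard.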

What teams usually learn after the first real test

The first serious deployment, spike, or incident almost always reveals something the design discussion missed. Maybe ownership was less clear than expected, maybe the observability path was too thin, or maybe the new process worked but took longer than planned because one dependency was not included in the original mental model. That is normal. Production patterns mature when teams capture that feedback immediately and adjust the defaults before the next rollout. In practice, the best patterns are not the most complicated ones. They are the ones that survive contact with real operations and become easier to use with every review.

Ownership and review cadence

Every useful platform practice needs a review loop. After the first few real uses, revisit the pattern with fresh evidence from deployments, incidents, and operator feedback. Ask what was confusing, what created noise, what saved time, and what controls were worth keeping. The strongest engineering patterns usually become smaller and clearer over time because teams trim the parts that do not change behavior. Review cadence turns a one-time implementation into a dependable operating habit.

That final review step is easy to skip when the initial rollout appears successful, but it is usually where the best long-term improvements are found. Small refinements in defaults, ownership, and observability often create more value than another wave of tooling.

A good rule is to treat the first month after adoption as part of the implementation rather than as an afterthought. Watch how the pattern behaves under normal changes, under stress, and during one real support event. If it remains understandable in all three cases, it is probably strong enough to become a team standard.

If the pattern is difficult to explain to a new engineer after that first month, it still needs refinement. Clarity is one of the most reliable indicators that a production practice is ready to scale across teams.

Documentation should evolve along with the pattern. Keep the shortest possible notes that explain ownership, the expected success signals, the rollback or fallback path, and the dashboards or logs responders should check first. Teams often over-document implementation detail and under-document the operational decisions that matter during a real event. A concise, current operating note is usually more valuable than a long design artifact nobody opens once the initial rollout is complete.

That knowledge-transfer step is especially important when more than one team or on-call rotation will depend on the pattern. A practice is not really finished until another engineer can use it confidently without needing the original author in the room.

Continue the thread

Related archive posts that connect this guide back to the original LinkedIn stream.

AWS · LinkedIn Post · Feb 26, 2026

How I reduced AWS networking costs by 93% while removing public attack surface

I recently tackled a common but expensive challenge in AWS: the hidden cost of public IPv4 addresses. In a setup with dozens of ECS Fargate tasks, my "In-use Public IP" charges were hitting hundreds of dollars per month. Beyond the cost, having backend workers exposed to the public internet was a security risk I wanted to eliminate.

The fix: I transitioned the entire architecture to a private-first model.

  1. Disabled public IPs: moved all Fargate tasks to private mode within the VPC.
  2. VPC peering: connected multiple VPCs using VPC Peering to enable secure, private communication between services across environments, no internet routing required.
  3. Optimized routing: navigated complex DNS and routing requirements to ensure seamless communication between services without needing a NAT Gateway.
  4. Added a public load balancer: introduced an internet-facing Application Load Balancer to handle inbound traffic. Only the load balancer is publicly accessible; backend services remain private.

The results:

  • Cost: monthly networking spend for public IPs was eliminated entirely, replaced by a much smaller, fixed endpoint fee.
  • Security: drastically reduced the attack surface by ensuring backend workers are no longer reachable from the internet.
  • Efficiency: the system is now more robust, secure, and cost-predictable.
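
The core of step 1 maps to a single ECS service setting; a minimal sketch with placeholder cluster, service, subnet, and security group names:

bash
# Move a Fargate service onto private subnets with no public IP
# (all IDs below are illustrative)
aws ecs update-service \
  --cluster prod \
  --service worker \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc,subnet-0def],securityGroups=[sg-0123],assignPublicIp=DISABLED}'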

AWS · LinkedIn Post · Jul 16, 2025

Running Containers on Graviton with ECS: Faster, Cheaper, and Worth It

Alright, let's talk shop. If you're deploying containerized workloads in AWS and not paying attention to Graviton processors, you're probably leaving performance and cost savings on the table.
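
On the ECS side, adopting Graviton is mostly a task definition setting, provided the container image is built for arm64; a sketch with placeholder family, image, and sizes:

bash
# Register an ARM64 (Graviton) Fargate task definition;
# the image referenced must have an arm64 build
aws ecs register-task-definition \
  --family web-arm64 \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 512 --memory 1024 \
  --runtime-platform cpuArchitecture=ARM64,operatingSystemFamily=LINUX \
  --container-definitions '[{"name":"web","image":"myrepo/web:arm64","essential":true}]'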


Next step

Need help with DevOps setup? Contact me.

FAQ

Quick answers to the questions teams usually ask when implementing this pattern.

What is the first FinOps control to implement?

Tagging and ownership. If you cannot tell who owns a resource or spend category, optimization discussions stay abstract and weak.

Are Savings Plans enough?

No. Commitments help with steady-state compute spend, but they do not solve idle resources, bad storage policy, unnecessary logging, or expensive network paths.

How often should costs be reviewed?

Monthly is the minimum, but fast-moving teams often benefit from weekly visibility so cost drift gets corrected before it becomes the new baseline.

Who owns cloud cost optimization?

Engineering and finance share the outcome, but engineering teams must own the technical decisions that create the majority of cloud spend.
