DevOps Rescue Story: Recovering an EC2 Instance Without a PEM Key
MOJAHID UL HAQUE
DevOps Engineer
"Lost PEM? No SSH? SSM dead? Don't panic — AWS always leaves a backdoor for those who know where to look."
Yesterday I ran into one of those heart-sinking moments: an EC2 instance was completely locked out.
- PEM key gone → SSH impossible
- SSM Agent broken → the root volume was full, so the agent wouldn't start even after the EBS volume had been expanded
- EC2 Instance Connect failing
Basically… the instance was bricked. Or so it seemed.
The Recovery Playbook I Followed
- Spun up a helper EC2 instance with a fresh key pair.
- Detached the root volume from the locked instance → attached it to the helper (CLI sketch after this list).
- Mounted the volume → discovered the partition was still capped at 100GB even though the EBS volume was already 150GB.
- Ran growpart + resize2fs → the filesystem finally stretched to the full 150GB, with 49GB free instantly (resize sketch below).
- Cleared old logs and temp files for breathing room (cleanup sketch below).
- Added a new SSH public key into ~/.ssh/authorized_keys (key-injection sketch below).
- Detached the fixed root volume → reattached it to the original instance.
- Rebooted → boom! SSH worked with the new PEM, and the SSM Agent sprang back to life.
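Here's roughly what the volume swap looks like with the AWS CLI. The instance and volume IDs are placeholders, and the sketch assumes the locked instance can be stopped safely:

```bash
# Placeholder IDs; substitute the real instance and volume IDs
LOCKED=i-0123456789abcdef0     # the bricked instance
HELPER=i-0fedcba9876543210     # the fresh helper instance
ROOTVOL=vol-0123456789abcdef0  # the locked instance's root volume

# Stop the locked instance so its root volume can be detached
aws ec2 stop-instances --instance-ids "$LOCKED"
aws ec2 wait instance-stopped --instance-ids "$LOCKED"

# Move the root volume over to the helper as a secondary disk
aws ec2 detach-volume --volume-id "$ROOTVOL"
aws ec2 wait volume-available --volume-ids "$ROOTVOL"
aws ec2 attach-volume --volume-id "$ROOTVOL" \
  --instance-id "$HELPER" --device /dev/sdf
```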
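On the helper, the attached volume shows up as a regular block device. Device names vary (Nitro instances expose it as /dev/nvme1n1 rather than /dev/xvdf), so check lsblk first; the paths below assume the older xvd naming and an ext4 root partition:

```bash
lsblk                                # confirm the device and partition names
sudo mkdir -p /mnt/rescue
sudo mount /dev/xvdf1 /mnt/rescue    # mount the root partition

# The partition table still said 100GB even though the volume was 150GB:
# growpart grows partition 1 to fill the disk, and resize2fs grows the
# ext4 filesystem (it resizes online, so no unmount needed)
sudo growpart /dev/xvdf 1
sudo resize2fs /dev/xvdf1
df -h /mnt/rescue                    # should now report the full 150GB
```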
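The cleanup step was nothing fancy, just finding and trimming the biggest space hogs on the mounted volume. The paths below are the usual suspects and will differ per distro:

```bash
# Find what's actually eating the disk
sudo du -xh /mnt/rescue/var/log | sort -rh | head -20

# Drop rotated logs and temp files; truncate (don't delete) active log files
sudo find /mnt/rescue/var/log -name "*.gz" -delete
sudo truncate -s 0 /mnt/rescue/var/log/syslog
sudo rm -rf /mnt/rescue/tmp/*
```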
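Injecting the key is just an append to authorized_keys through the mounted filesystem, then the volume goes back where it came from (reusing the placeholder IDs from the first sketch). The "ubuntu" user and the root device name are assumptions; check the instance's root device attribute (often /dev/xvda or /dev/sda1) before reattaching:

```bash
# Append the new public key for the default user ("ubuntu" here is a guess)
cat new-key.pub | sudo tee -a /mnt/rescue/home/ubuntu/.ssh/authorized_keys
sudo chmod 600 /mnt/rescue/home/ubuntu/.ssh/authorized_keys

# Unmount and hand the volume back as the original instance's root device
sudo umount /mnt/rescue
aws ec2 detach-volume --volume-id "$ROOTVOL"
aws ec2 wait volume-available --volume-ids "$ROOTVOL"
aws ec2 attach-volume --volume-id "$ROOTVOL" \
  --instance-id "$LOCKED" --device /dev/xvda
aws ec2 start-instances --instance-ids "$LOCKED"
```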
Originally posted on LinkedIn
Related Posts
How I reduced AWS networking costs by 93% while removing public attack surface
I recently tackled a common but expensive challenge in AWS: the hidden cost of public IPv4 addresses. In a setup with dozens of ECS Fargate tasks, my "In-use Public IP" charges were hitting hundreds of dollars per month. Beyond the cost, having backend workers exposed to the public internet was a security risk I wanted to eliminate.

The Fix: I transitioned the entire architecture to a private-first model.
1. Disabled Public IPs: Moved all Fargate tasks to private mode within the VPC (CLI sketch below).
2. VPC Peering: Connected multiple VPCs using VPC Peering to enable secure, private communication between services across environments, with no internet routing required.
3. Optimized Routing: Navigated complex DNS and routing requirements to ensure seamless communication between services without needing a NAT Gateway.
4. Added a Public Load Balancer: Introduced an internet-facing Application Load Balancer to handle inbound traffic. Only the load balancer is publicly accessible; backend services remain private.

The Results:
- Cost: Monthly networking spend for public IPs was eliminated entirely, replaced by a much smaller, fixed endpoint fee.
- Security: Drastically reduced the attack surface by ensuring backend workers are no longer reachable from the internet.
- Efficiency: The system is now more robust, secure, and cost-predictable.
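For reference, step 1 is a one-flag change on the service's network configuration. A minimal sketch with placeholder cluster, service, subnet, and security-group names; note that tasks in private subnets still need a path to ECR (e.g. VPC endpoints) to pull images:

```bash
# Placeholder names/IDs; assignPublicIp=DISABLED is the flag that matters
aws ecs update-service \
  --cluster prod-cluster \
  --service backend-worker \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0aaa1111,subnet-0bbb2222],securityGroups=[sg-0ccc3333],assignPublicIp=DISABLED}"
```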
AWS ECS Mumbai has mood swings - DevOps engineer perspective
As a DevOps engineer, I've basically accepted that AWS ECS Mumbai has mood swings. Once or twice a month, it just… decides it's done with life. Deploy? Maybe. Pull images? If it feels like it. Random crash? Always a crowd pleaser.

And of course, the AWS status page sits there smiling like everything's perfectly normal. Meanwhile, I'm digging through IAM, logs, task defs, pipelines, wondering if I forgot how computers work… only to realize it's just Mumbai taking a personal day again.

But who gets blamed? "DevOps can't deploy." Yes. Clearly, I woke up and told ECS to stop doing its job. At this point, we just want a little stability and a status page that doesn't gaslight us while the region is on vacation.
Stop Leaving AWS Credits Unclaimed - That Outage Might've Owed You Money
Remember the AWS outage on October 20th? Six hours down. 100+ services affected. Millions of users impacted. Everyone's talked about the RCA, multi-region setups, and resilience planning. But here's what most teams completely miss: 👉 You might be owed money.

The SLA Reality Check
AWS makes uptime promises like:
- Cognito: 99.9%
- DynamoDB: 99.99% (and 99.999% for Global Tables)
- EC2, Lambda, CloudFront… all have their own SLAs.

Now, do the math: 6 hours of downtime in a 30-day month (714 of 720 hours up) = 99.17% uptime. That's below every single SLA above.

What That Means for You
If your services were affected, you're entitled to service credits, typically 10–25% of your monthly bill. So if you spend $10K/month on Cognito or DynamoDB… that's real money sitting unclaimed.

How to Claim It (Takes 10 Minutes)
1. Go to your AWS Support Center
2. Open a new case
3. List the affected services
4. Reference the SLA breach
5. Submit before the end of the second billing cycle

AWS won't credit you automatically. You have to ask.

The Takeaway
Yes, improve your DR and multi-region strategy. But also, don't forget to claim what you're owed. It's quick, it's legit, and your FinOps team will thank you.
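If you want to sanity-check the math for your own billing month, it's a one-liner (the 30 days and 6 hours here mirror the October outage; swap in your own numbers):

```bash
# Uptime for a 30-day month with 6 hours of downtime
awk 'BEGIN { m = 30*24; d = 6; printf "Uptime: %.2f%%\n", (m-d)/m*100 }'
# -> Uptime: 99.17%
```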