GitLab's 2017 'oops' moment - One command. Wrong server. 6 hours of data gone.
MOJAHID UL HAQUE
DevOps Engineer
One command. Wrong server. 6 hours of production data… gone.
What went wrong?
- A spam attack overloaded their DB → replication lag.
- An engineer tried to resync the replica… but ran the wipe command on the primary.
- Backups? Many were broken or untested.
- Final fix: restoring from a 6-hour-old staging copy (painfully slow).
Lessons for us DevOps folks:
1. Backups mean nothing until you've tested restores.
2. Guardrails on destructive ops save careers (see the sketch below).
3. Treat RPO/RTO as facts, not assumptions.
4. Blameless culture = faster learning, fewer cover-ups.
If you've never practiced a restore, you don't have a backup — you have a bedtime story.
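On the guardrails point (lesson 2), here's a minimal sketch of the idea, not GitLab's actual tooling: a wrapper that refuses to run a destructive command when the current hostname looks like a production primary, and forces the operator to retype the hostname before doing anything at all. The hostname patterns and script name are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""guarded.py - illustrative guardrail for destructive commands.

Refuses to run on hosts matching a production pattern and requires the
operator to retype the hostname to confirm. Patterns are placeholders.
"""
import re
import socket
import subprocess
import sys

# Hypothetical naming convention: primaries contain "db-prod" or "primary".
PRODUCTION_PATTERNS = [r"db-prod", r"primary"]


def confirm_host(hostname: str) -> bool:
    """Make the operator retype the hostname before anything destructive."""
    typed = input(f"Type the hostname '{hostname}' to confirm: ")
    return typed.strip() == hostname


def main() -> int:
    if len(sys.argv) < 2:
        print("usage: guarded.py <command> [args...]", file=sys.stderr)
        return 2

    hostname = socket.gethostname()

    # Hard stop on anything that looks like a production primary.
    if any(re.search(p, hostname) for p in PRODUCTION_PATTERNS):
        print(f"Refusing to run on '{hostname}': matches a production pattern.",
              file=sys.stderr)
        return 1

    # Even on non-production hosts, force a deliberate confirmation.
    if not confirm_host(hostname):
        print("Hostname mismatch, aborting.", file=sys.stderr)
        return 1

    return subprocess.call(sys.argv[1:])


if __name__ == "__main__":
    sys.exit(main())
```

Run it as `python3 guarded.py <your-command>`. The specific checks matter less than the friction: a wipe command should never be one keystroke away from the primary.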
Originally posted on LinkedIn
Related Posts
Most DevOps problems aren't tech problems - They're organizational chaos wearing a YAML hoodie
Most "DevOps problems" aren't tech problems. They're just organizational chaos wearing a YAML hoodie. We love to buy tools to fix culture. It never works. If you want to actually ship faster, try this Rule of Three: 1. Repeat it 3 times? Automate it. 2. Need a 12-step README to run it? You didn't automate it, you just outsourced the confusion. 3. Takes more time to maintain than it saves? Delete it. You've built a monument, not a tool. The Boring DevOps Checklist: Pipelines: Should be idempotent and predictable. If it's "flaky," it's broken. Infra: Treat it like code. If it's not versioned and reviewed, it's a liability. Alerts: If it doesn't require immediate human action? Don't send a page. The Goal: One command to deploy. One dashboard to verify. Stop looking for more tools. Start looking for less surprise.
Common Production Issues in DevOps and Fixes
Review common production issues in DevOps and the practical fixes that reduce outage time, from DNS and certificates to capacity, queues, and bad deploys.
Stop Leaving AWS Credits Unclaimed - That Outage Might've Owed You Money
Remember the AWS outage on October 20th? Six hours down. Over 100 services affected. Millions of users impacted. Everyone's talked about the RCA, multi-region setups, and resilience planning. But here's what most teams completely miss: 👉 You might be owed money.
The SLA Reality Check
AWS makes uptime promises like:
- Cognito — 99.9%
- DynamoDB — 99.99% (and 99.999% for Global Tables)
- EC2, Lambda, CloudFront… all have their own SLAs.
Now, do the math: 6 hours of downtime in a 30-day month = 99.17% uptime. That's below every single SLA above (worked calculation below).
What That Means for You
If your services were affected, you're entitled to service credits — typically 10–25% of your monthly bill. So if you spend $10K/month on Cognito or DynamoDB… that's real money sitting unclaimed.
How to Claim It (Takes 10 Minutes)
1. Go to your AWS Support Center
2. Open a new case
3. List the affected services
4. Reference the SLA breach
5. Submit before the end of the second billing cycle
AWS won't credit you automatically. You have to ask.
The Takeaway
Yes — improve your DR and multi-region strategy. But also — don't forget to claim what you're owed. It's quick, it's legit, and your FinOps team will thank you.
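A quick worked version of that math, using only the figures quoted above. The credit percentages and tier boundaries vary by service, so check each service's actual AWS SLA page before filing; this is illustrative arithmetic, not an official calculator.

```python
#!/usr/bin/env python3
"""Worked uptime math from the post above (illustrative only)."""

HOURS_IN_MONTH = 30 * 24   # 720 hours in a 30-day month
DOWNTIME_HOURS = 6         # outage length quoted above

uptime_pct = (HOURS_IN_MONTH - DOWNTIME_HOURS) / HOURS_IN_MONTH * 100
print(f"Monthly uptime: {uptime_pct:.2f}%")  # -> 99.17%

# SLA targets quoted in the post; all sit above 99.17%.
sla_targets = {
    "Cognito": 99.9,
    "DynamoDB": 99.99,
    "DynamoDB Global Tables": 99.999,
}
for service, target in sla_targets.items():
    status = "breached" if uptime_pct < target else "met"
    print(f"{service}: target {target}% -> {status}")

# Rough value of the 10-25% credit range mentioned above on a $10K bill.
monthly_bill = 10_000
for rate in (0.10, 0.25):
    print(f"{int(rate * 100)}% credit on ${monthly_bill:,}: ${monthly_bill * rate:,.0f}")
```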