DevOps · LinkedIn Post · September 24, 2025 · 1 min read · 113 words


MOJAHID UL HAQUE

DevOps Engineer

7 likes · 0 comments · 476 views

GitLab's 2017 "oops" moment

One command. Wrong server. 6 hours of production data… gone.

What went wrong?

- A spam attack overloaded their DB → replication lag.
- An engineer tried to resync the replica… but ran the wipe command on the primary.
- Backups? Many were broken or untested.
- Final fix: restoring from a 6-hour-old staging copy (painfully slow).

Lessons for us DevOps folks:

1. Backups mean nothing until you've tested restores.
2. Guardrails on destructive ops save careers.
3. Treat RPO/RTO as facts, not assumptions.
4. Blameless culture = faster learning, fewer cover-ups.
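What a guardrail on destructive ops can look like, as a minimal bash sketch: check the host's role before anything irreversible runs. The `DB_ROLE` variable and `check_role` helper are hypothetical, standing in for however your config management labels machines.

```shell
# Guardrail sketch (hypothetical): refuse destructive ops unless the
# current host's advertised role matches the one you expect.
check_role() {
  local expected="$1"
  local actual="${DB_ROLE:-unknown}"   # hypothetical role variable
  if [ "$actual" != "$expected" ]; then
    echo "ABORT: host role is '$actual', expected '$expected'" >&2
    return 1
  fi
  echo "Role check passed: '$actual'"
}

# Usage: the wipe only fires if the role check passes.
# check_role replica && rm -rf /var/lib/postgresql/data
```

A two-line function like this is exactly the kind of friction that would have turned GitLab's wrong-server wipe into a harmless error message.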

If you've never practiced a restore, you don't have a backup — you have a bedtime story.
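A restore drill doesn't have to be elaborate. Here's a self-contained bash sketch of the idea using plain tar archives and temp directories (a stand-in for your real backup tooling, which this post doesn't specify): back up, restore into a scratch location, and verify the data actually came back.

```shell
set -euo pipefail

# Restore-drill sketch using tar as a stand-in for real backup tooling.
backup_dir=$(mktemp -d)
data_dir=$(mktemp -d)
restore_dir=$(mktemp -d)

# Simulate "production" data and take a backup of it.
echo "order-123" > "$data_dir/orders.txt"
tar -czf "$backup_dir/backup.tgz" -C "$data_dir" .

# The drill: restore into a scratch dir and verify the contents survived.
tar -xzf "$backup_dir/backup.tgz" -C "$restore_dir"
grep -q "order-123" "$restore_dir/orders.txt" && echo "restore drill passed"
```

The point is the last step: a backup only counts once a restore-and-verify like this passes on a schedule, not just once.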

Originally posted on LinkedIn

