GitLab's 2017 'oops' moment - One command. Wrong server. 6 hours of data gone.
Mojahid Ul Haque
DevOps Engineer
GitLab's 2017 "oops" moment
One command. Wrong server. 6 hours of production data… gone.
What went wrong?
- A spam attack overloaded their DB → replication lag.
- An engineer tried to resync the replica… but ran the wipe command on the primary.
- Backups? Many were broken or untested.
- Final fix: restoring from a 6-hour-old staging copy (painfully slow).
Lessons for us DevOps folks:
1. Backups mean nothing until you've tested restores.
2. Guardrails on destructive ops save careers.
3. Treat RPO/RTO as facts, not assumptions.
4. Blameless culture = faster learning, fewer cover-ups.
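To make lesson 2 concrete, here is a minimal sketch of a guardrail that refuses a destructive step unless the machine you are on is the one the runbook expects and you have typed its name back. This is not GitLab's tooling: `confirm_destructive`, `WrongHostError`, and the `db1`/`db2` hostnames are made up for illustration.

```python
import socket


class WrongHostError(RuntimeError):
    """Raised when a destructive operation is attempted on an unexpected host."""


def confirm_destructive(expected_host, typed_host, current_host=None):
    """Allow a destructive step only when all three host names agree.

    expected_host: the host the runbook says should be wiped (hypothetical).
    typed_host:    what the operator typed when asked to confirm.
    current_host:  overridable for tests; defaults to this machine's hostname.
    """
    current = current_host if current_host is not None else socket.gethostname()
    if current != expected_host:
        raise WrongHostError(
            f"refusing: running on {current!r}, runbook expects {expected_host!r}"
        )
    if typed_host != current:
        raise WrongHostError(
            f"refusing: typed {typed_host!r}, but this host is {current!r}"
        )
    return True
```

Called before the wipe, for example `confirm_destructive("db2.example.internal", input("Type this host's name: "))`, the check raises if the shell is actually open on the primary. It does not replace access controls; it only adds one deliberate pause before an irreversible command.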
If you've never practiced a restore, you don't have a backup — you have a bedtime story.
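One way to treat RPO as a fact rather than an assumption is to measure it: compare the age of the newest backup that has actually passed a restore test against the RPO you claim. The sketch below assumes you record that timestamp somewhere; `rpo_breached` and the one-hour target are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(newest_verified_backup, rpo, now=None):
    """Return True if the newest restore-tested backup is older than the RPO.

    newest_verified_backup: timezone-aware time of the last backup that was
                            successfully restored in a drill (assumed to be
                            recorded by your own tooling).
    rpo:                    the recovery point objective as a timedelta.
    now:                    overridable for tests; defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - newest_verified_backup) > rpo


# Example: a copy that is 6 hours old, checked against a hypothetical 1-hour RPO.
now = datetime(2017, 2, 1, 0, 0, tzinfo=timezone.utc)
six_hours_old = now - timedelta(hours=6)
breached = rpo_breached(six_hours_old, timedelta(hours=1), now=now)
```

Run on a schedule and wired to an alert, a check like this surfaces the gap between the RPO you assume and the one you can actually deliver before an incident does.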
Originally posted on LinkedIn