What's the worst way you ever broke production?

@[email protected] · 1 year ago

What's the worst way you ever broke production?

Quazatron · 1 year ago

Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.

Billegh · 1 year ago

It doesn’t help that the webui used to hide stop. I think it still does.

Flax · 1 year ago

Explain more?

@[email protected] · 1 year ago

Apparently Terminate means stop and destroy. Definitely something to use with care.

@[email protected] · 1 year ago

Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.

synae[he/him] · 1 year ago

Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something

Quazatron · 1 year ago

Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.

Crappy company, running production infrastructure in AWS without giving proper training and securing a suitable backup process.

@[email protected] · 1 year ago

“Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.

“Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).
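
For anyone scripting this, the Stop/Terminate split and the two safeguards suggested upthread (termination protection and a typed confirmation) could look roughly like the sketch below. It assumes `ec2` is a boto3 EC2 client, i.e. `boto3.client("ec2")`. The three wrapper names and the "type the instance ID back" rule are purely illustrative and not something AWS ships.

```python
def enable_termination_protection(ec2, instance_id):
    """Turn on termination protection so terminate requests are rejected."""
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )


def stop_instance(ec2, instance_id):
    """Stop (not terminate): the instance and its configuration are kept."""
    return ec2.stop_instances(InstanceIds=[instance_id])


def terminate_instance(ec2, instance_id, typed_confirmation):
    """Terminate only if the caller typed the instance ID back exactly.

    The confirmation rule is a hypothetical guard, not an AWS feature.
    """
    if typed_confirmation != instance_id:
        raise ValueError("confirmation does not match instance ID; refusing to terminate")
    return ec2.terminate_instances(InstanceIds=[instance_id])
```

With `DisableApiTermination` set, a terminate request is rejected until someone deliberately switches the attribute off again, which is exactly the extra step you want between a tired admin and the "nuke it from orbit" button.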

@[email protected] · 1 year ago

Was troubleshooting a failed drive in a RAID array on a small business DC/File Serv/Print/Everything else box. The replaced drive still showed failed. Moved it to another bay thinking it was the slot, not the drive. Accidentally hit yes when asked to initialize the array. Blew the whole thing away. It was an OLD server the customer was working on replacing, so I told them it finally gave up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.

Futs · 1 year ago

Advertised an OS deployment to the ‘All Workstations’ collection by mistake. I only realized after 30 minutes, when people’s workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.

@[email protected] · edit-2 1 year ago

Extracted a sizeable archive to a pretty small root/OS volume
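
A pre-flight check along these lines would catch that before the volume fills up. This is only a sketch for tar archives: `uncompressed_size` and `fits_on_volume` are made-up helper names, and the 10% headroom default is an arbitrary assumption.

```python
import shutil
import tarfile


def uncompressed_size(archive_path):
    """Sum the sizes of all regular files recorded in a tar archive."""
    with tarfile.open(archive_path) as tar:
        return sum(member.size for member in tar.getmembers() if member.isfile())


def fits_on_volume(archive_path, dest_dir, headroom=0.10):
    """Return True if the extracted files fit in dest_dir's volume.

    `headroom` is the fraction of the whole volume that must stay free
    after extraction (an assumed safety margin, tune to taste).
    """
    usage = shutil.disk_usage(dest_dir)
    reserve = int(usage.total * headroom)
    return uncompressed_size(archive_path) + reserve <= usage.free
```

Call `fits_on_volume("backup.tar.gz", "/")` first and only extract if it returns `True`. The sum covers file contents only, so filesystem overhead is not counted.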

@[email protected] · 1 year ago

I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, “admin”.

Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.

@[email protected] · 1 year ago

A then-colleague upgraded glibc by copying it in via scp. Then we couldn’t ssh in anymore. :) Not sure how important that server was. I think it was reinstalled soon-ish.

@[email protected] · edit-2 1 year ago

Worked for an MSP, we had a large storage array which was our cloud backup repository for all of our clients. It locked up and was doing this semi-regularly, so we decided to run an “OS reinstall”. Basically these things install the OS across all of the disks, on a separate partition to where the data lives. “OS Reinstall” clones the OS from the flash drive plugged into the mainboard back to all the disks and retains all configuration and data. “Factory default”, however, does not.

This array was particularly… special… In that you booted it up, held a paperclip into the reset pin, and the LEDs would flash a pattern to let you know you’re in the boot menu. You click the pin to move through the boot menu options, each time you click it the lights flash a different pattern to tell you which option is selected. First option was normal boot, second or third was OS reinstall, the very next option was factory default.

I head into the data centre. I had the manual, I watched those lights like a hawk and verified the “OS reinstall” LED flash pattern matched up, then I held the pin in for a few seconds to select the option.

All the disks lit up, away we go. 10 minutes pass. Nothing. Not responding on its interface. 15 minutes. 20 minutes, I start sweating. I plug directly into the NIC and head to the default IP filled with dread. It loads. I enter the default password, it works.

There staring back at me: “0B of 45TB used”.

Fuck.

This was in the days where 50M fibre was rare and most clients had 1-20M ADSL. Yes, asymmetric. We had to send guys out as far as 3 hour trips with portable hard disks to re-seed the backups over a painful 30ish days of re-ingesting them into the NAS.

The worst part? Years later I discovered that, completely undocumented, you can plug a VGA cable in and you get a text menu on the screen that shows you which option you have selected.

I (somehow) did not get fired.

@[email protected] · 1 year ago

You still remember so. That means you learned and probably won’t do it again.

EmasXP · 1 year ago

Two things pop up

I once left an alert() asking “what the fuck?”. That was mostly laughed at, so no worry.
I accidentally dropped the production database and replaced it with the staging one. That was not laughed at.

@[email protected] · 1 year ago

I once dropped a table in the production database. I did not replace it with the same table from staging.

On the bright side, we discovered our vendor wasn’t doing daily backups.

@[email protected] · 1 year ago

Was doing two deployments at the same time. On the first one, I got to the point where I had to clear the cache. I was typing out the command to remove the temp folder, looked down at the other deployment instructions I had in front of me, typed the folder for the prod deployments and hit enter, deleting all of the currently installed code. It was a clustered machine, and the other machine removed its files within milliseconds.

When I realized what I had done, I just jumped up from my desk and said out loud “I’m fired!!” over and over. Once I calmed down, I had to get back on the call and ask everyone to check their apps. Sure enough, they were all failing. I told them what I had done, and we immediately went to the clustered machine and the files were gone there too.

It took about 8 hours for the backup team to restore everything. They kept having to go find tapes to put in the machine, and it took way longer than anyone expected. Once we got the files restored, we determined that we were all back to the previous day, and everyone’s work from that night was gone, so we had to start the night’s deployments over.

I got grilled about it, and had to write a script to clear the cache from that point on. No more manually removing files. The other good thing that came out of this was no more doing two deployments at the same time. I told them exactly what happened and that when you push people like this, mistakes get made.
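
A cache-clearing script in that spirit might look something like this. It is a sketch, not the script from the story: `clear_cache` and its `allowed_root` parameter are hypothetical, and the idea is just that the script refuses to touch anything that does not resolve to a path inside the one directory it is supposed to clean.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir, allowed_root):
    """Empty cache_dir, but only if it resolves to a path strictly inside allowed_root.

    Returns the number of top-level entries removed.
    """
    cache = Path(cache_dir).resolve()
    root = Path(allowed_root).resolve()
    # Refuse anything outside the root, and refuse the root itself.
    if root not in cache.parents:
        raise ValueError(f"{cache} is not inside {root}; refusing to delete")
    removed = 0
    for entry in cache.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

Typing the prod folder by mistake then gets you a `ValueError` instead of an 8-hour tape restore, as long as prod does not live under the cache root.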

@[email protected] · 1 year ago

I accidentally destroyed the production system completely through an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.

So I spun up the new server from scratch, restored the database with some slightly outdated dump, installed the code (which was thankfully managed thru git), and configured everything to run all in an hour or two.

The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.

@[email protected] · 1 year ago

Was wondering if anybody here had made the news.

Nomecks · edit-2 1 year ago

There was a nasty bug with some storage system software that I had the bad fortune to find, which resulted in me deleting 6.4TB of live VMs. All just gone in a flash. It took months to restore everything.

Kata1yst · edit-2 1 year ago

It was the bad old days of sysadmin, where literally every critical service ran on an iron box in the basement.

I was on my first oncall rotation. Got my first call from helpdesk: Exchange was down, it was 3AM, and the oncall backup and Exchange SMEs weren’t responding to pages.

Now I knew Exchange well enough, but I was new to this role and this architecture. I knew the system was clustered, so I quickly pulled the documentation and logged into the cluster manager.

I reviewed the docs several times. We had Exchange server 1 named something thoughtful like exh-001, and server 2 named exh-002 or something.

Well, I’d reviewed the docs, and helpdesk and stakeholders were desperate to move forward, so I initiated a failover from clustered mode with 001 as the primary to unclustered mode pointing directly at server 10.x.x.xx2.

What’s that you ask? Why did I suddenly switch to the IP address rather than the DNS name? Well that’s how the servers were registered in the cluster manager. Nothing to worry about.

Well… Anyone want to guess which DNS name 10.x.x.xx2 was registered to?

Yeah. Not exh-002. For some crazy legacy reason the DNS names had been remapped in the distant past.

So anyway that’s how I made a 15 minute outage into a 5 hour one.

On the plus side, I learned a lot and didn’t get fired.

@[email protected] · 1 year ago

Two exhibitors, both alike in ~~dignity~~ naming. One needed a critical sw update on their Doremi to fix an issue. The other was running The Force Awakens to a packed auditorium.

slazer2au · 1 year ago

I took down an ISP for a couple of hours because I forgot the ‘add’ keyword at the end of a Cisco configuration line.

@[email protected] · 1 year ago

That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supported commits and diffing.