claidheamhmor: (AthlonX2)
[personal profile] claidheamhmor
This is one of my personal nightmares - having some or other script I've written go out and trash large areas of the network because I wasn't careful enough about the parameters.

Bourne Into Oblivion
2009-07-21
by Mark Bowytz in Feature Articles

Jerry wasn't the sort of guy who would normally vent frustration out loud at work, yet here he was - cursing into the air at two individuals in particular - the first round of expletives being directed at the toolbag, somewhere, who had botched months of server backups by reusing the same set of tapes for months, and the other at a long-departed developer whose name he was continually being subjected to in the comments of the rotten shell script he was now stepping through.

What had started out as a 7:30am ticket from an early-bird user getting an error message when trying to open a spreadsheet test plan from the week before had turned into a full-on, corporate-wide DEFCON 1.

To make matters worse, Jerry had just delivered his two-week notice a few days prior, which meant that in every meeting Jerry was getting "thanked" for the company's current nuclear crisis and told that he should have set his little "time bomb" to go off AFTER he was gone. Naturally, while his being "blamed" helped to improve the morale of everyone else, it didn't do much to help Jerry's outlook - especially since it appeared as if this was someone else's "parting gift".

Questions? Please Refer to the Scriptonomicon

For as long as anyone could remember, everyone just kind of coped with the Bourne shell script that was the framework to a test environment. It was originally designed to run automated tests for a single product, but management was so thrilled at how well it worked that they got other projects to adapt the framework.

Over the next few years, it became the de facto test framework used by applications throughout the corporation. However, in order to make "one size fit all", it had morphed into something... different. It became one of those gnarly applications that everybody acknowledged was a bit sketchy behind the scenes, but it worked. So long as you stuck to the S.O.P., knew the different locations where the same value had to be defined, and accepted that P_OPERATOR_ID was a unique network identifier (NOT a normal network ID) that you had to get from Chuck in the Infrastructure Group, you'd be ok.

However, recently, the developer who had originally created the framework had left the company in search of greener pastures and, rather than handing off the task of running the scripts to a developer, it was given to a co-op student. After all, running the script was like checking off steps on a list, right? The co-op set up the configuration, scheduled it to run over the weekend, and merrily left it to return the following week. As it turned out, he missed a few details.

Cleaning Up

From a high level, the Bourne script would essentially ssh into each target machine, do its thing, and then exit. As part of its "thing", the designer of the framework wanted to make sure the script cleaned up after itself so subsequent runs of the framework would not re-process old data. To accomplish this, one of the enhancements after the initial release was to add two cryptic variables that (redundantly) contained the project name and the version being tested. Utilizing an unpatched flaw in sudo's setup to gain real root access, the script would then do the following as part of the clean up:

rm -rf $var1/$var2

Ordinarily, this worked just fine, but the co-op student was unaware these SPECIFIC variables needed to be set. With them being left blank, the following was the end result upon execution of the script:

rm -rf /
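A minimal sketch of how a cleanup step like this could refuse to run when its variables are blank. This is not the framework's actual code: safe_cleanup is a hypothetical helper name, and its two arguments stand in for the article's $var1 and $var2.

```shell
#!/bin/sh
# Hypothetical guarded cleanup (illustration only, not the original script).
# The two arguments play the role of $var1 and $var2 from the article.

safe_cleanup() {
    base=$1
    sub=$2

    # Refuse to continue if either path component is unset or empty;
    # two empty components are exactly what collapses the path to "/".
    if [ -z "$base" ] || [ -z "$sub" ]; then
        echo "safe_cleanup: refusing to run with an empty path component" >&2
        return 1
    fi

    target="$base/$sub"

    # Refuse targets made up only of slashes and dots ("/", "//", "/.", "/..").
    case "$target" in
        *[!/.]*) ;;  # contains at least one other character: carry on
        *)
            echo "safe_cleanup: refusing suspicious target '$target'" >&2
            return 1
            ;;
    esac

    # Quote the expansion so whitespace cannot split it, and pass "--"
    # so a leading dash is not parsed as an option.
    rm -rf -- "$target"
}
```

It would be called as safe_cleanup "$var1" "$var2". A shorter alternative is the standard parameter expansion form rm -rf -- "${var1:?}/${var2:?}", which makes the shell abort the command with an error when either variable is unset or empty.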

With the script running as root on a setup with NFS (which, in turn, granted access to everything on the entire UNIX/Linux network and a few Windows Servers via SAMBA), the script had a chance to do a good bit of damage... and it did. Home directories, file repositories, customer data, test results, all seemingly evaporated into nothingness.

All told, it took 6 hours to wipe out the entire network. It took 4 hours to figure out what happened (turns out the script ssh'd onto its own server and the rm -rf wiped out the scripts which did the rm -rf and most of the evidence of what happened) and it only took 10 seconds to realize that the latest backups were completely SNAFU'd.

So, as his parting gift, while the most critical drives were being sent off for possible forensic recovery, Jerry was asked to review the test framework and look for any possible flaws where something similar could re-occur. After hitting the 10th instance where deviating from the normal routine would result in some degree of disaster, Jerry knew one thing - even though he had less than two weeks to go, this is one script that would be haunting his nightmares for a long time to come.

Source: The Daily WTF

Date: Tuesday, 21 July 2009 15:45 (UTC)
From: [identity profile] pcb.livejournal.com
I picked up the
rm -rf $var1/$var2
on peripheral vision before I'd even got that far down the page and physically flinched...

Date: Tuesday, 21 July 2009 22:05 (UTC)
From: [identity profile] pcb.livejournal.com
You've an almost English appreciation of the art of understatement ;-)

Date: Wednesday, 22 July 2009 09:52 (UTC)

Date: Tuesday, 21 July 2009 21:19 (UTC)
From: [identity profile] winterhawk.livejournal.com
I picked up the
rm -rf $var1/$var2
on peripheral vision before I'd even got that far down the page and physically flinched...


Me too...::eep::

Date: Tuesday, 21 July 2009 22:14 (UTC)
From: [identity profile] pcb.livejournal.com
And almost as fun as the time somebody managed to run something along the lines of
find . -print | xargs mv '{}' /tmp
in the customer's / - as root, of course. Ah the glad days before sudo.
That one didn't actually lose anything, which was worse than losing the lot with an rm -rf... because then it all had to be unpicked and returned. This was back in the days when a system could conceivably be moved into a /tmp filesystem and not run out of space. Not that that condition lasted for long ;-)
Y'know, you know it's been a long day when you start trying to parse ;-) as a command line argument...

Date: Wednesday, 22 July 2009 09:56 (UTC)
From: [identity profile] yasutani.livejournal.com
people who have a habit of closing their Unix session with 'kill -9 -1' - which means "kill all processes you can kill" ...

except there they did not pay attention and the terminal they typed the command into had a "#" not a "$" as the end of the prompt...

result == one Sun E10000 server coming down like a ton of bricks, pulling RAM, CPU and disk out from under relational databases....

Took about 3 days to recover...

Date: Wednesday, 22 July 2009 11:47 (UTC)
From: [identity profile] pcb.livejournal.com
kill -9: proving that a little knowledge is a DAMN dangerous thing since 1970!

(I'm reckoning it took, perhaps, a year for somebody to let a user near that sort of thing, back in the days when computers were impressive)

Date: Wednesday, 22 July 2009 10:26 (UTC)
From: [identity profile] lyonza.livejournal.com
I flinched and I don't even work in IT... I'd give 24hr notice and let someone else deal with it
