How to destroy a network in one easy step
Tuesday, 21 July 2009 16:46
This is one of my personal nightmares - having some or other script I've written go out and trash large areas of the network because I wasn't careful enough about the parameters.
Bourne Into Oblivion
2009-07-21
by Mark Bowytz in Feature Articles
Jerry wasn't the sort of guy who would normally vent frustration out loud at work, yet here he was, cursing into the air at two individuals in particular. The first round of expletives was directed at the toolbag, somewhere, who had botched months of server backups by reusing the same set of tapes. The second was aimed at a long-departed developer whose name he was continually being subjected to in the comments of the rotten shell script he was now stepping through.
What had started out as a 7:30am ticket from an early-bird user getting an error message when trying to open a spreadsheet test plan from the week before had turned into a full-on, corporate-wide DEFCON 1.
To make matters worse, Jerry had delivered his two-week notice a few days prior, which meant that in every meeting he was getting "thanked" for the company's current nuclear crisis and told that he should have set his little "time bomb" to go off AFTER he was gone. Naturally, while his being "blamed" helped to improve the morale of everyone else, it didn't do much to help Jerry's outlook - especially since it appeared as if this was someone else's "parting gift".
Questions? Please Refer to the Scriptonomicon
For as long as anyone could remember, everyone just kind of coped with the Bourne shell script that was the framework for a test environment. It was originally designed to run automated tests for a single product, but management was so thrilled at how well it worked that they got other projects to adapt the framework.
Over the next few years, it became the de facto test framework used by applications throughout the corporation. However, in order to make "one size fit all", it had morphed into something... different. It became one of those gnarly applications that everybody acknowledged was a bit sketchy behind the scenes, but it worked. So long as you stuck to the S.O.P., knew the different locations where the same value had to be defined, and accepted that P_OPERATOR_ID was a unique network identifier (NOT a normal network ID) that you had to get from Chuck in the Infrastructure Group, you'd be ok.
However, recently, the developer who had originally created the framework had left the company in search of greener pastures and, rather than handing off the task of running the scripts to a developer, it was given to a co-op student. After all, running the script was like checking off steps on a list, right? The co-op set up the configuration, scheduled it to run over the weekend, and merrily left it to return the following week. As it turned out, he missed a few details.
Cleaning Up
From a high level, the Bourne script would essentially ssh into each target machine, do its thing, and then exit. As part of its "thing", the designer of the framework wanted to make sure the script cleaned up after itself so subsequent runs of the framework would not re-process old data. To accomplish this, one of the enhancements after the initial release was to add two cryptic variables that (redundantly) contained the project name and the version being tested. Utilizing an unpatched flaw in sudo's setup to gain real root access, the script would then do the following as part of the clean up:
rm -rf $var1/$var2
Ordinarily, this worked just fine, but the co-op student was unaware these SPECIFIC variables needed to be set. With them being left blank, the following was the end result upon execution of the script:
rm -rf /
With the script running as root on a setup with NFS (which, in turn, granted access to everything on the entire UNIX/Linux network and a few Windows Servers via SAMBA), the script had a chance to do a good bit of damage... and it did. Home directories, file repositories, customer data, test results, all seemingly evaporated into nothingness.
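For what it's worth, the failure mode above is the unguarded expansion: with both variables empty, `$var1/$var2` collapses to `/`. A minimal sketch of a guarded cleanup step follows, assuming a POSIX shell. The function name `safe_cleanup` and the variables `project` and `version` are hypothetical stand-ins, not names from the original framework, and this is not a claim about how the real script was (or should have been) fixed.

```shell
#!/bin/sh
# Hypothetical guarded cleanup. safe_cleanup, project and version are
# illustrative names only; the original script's names are not known.
safe_cleanup() {
    project=$1
    version=$2

    # Refuse to continue if either value is empty or unset, so the
    # target can never collapse to "/".
    if [ -z "$project" ] || [ -z "$version" ]; then
        echo "safe_cleanup: project and version must both be set" >&2
        return 1
    fi

    target="$project/$version"

    # Only remove something that actually exists as a directory.
    if [ ! -d "$target" ]; then
        echo "safe_cleanup: $target is not a directory" >&2
        return 1
    fi

    # Quote the expansion, and use -- so a name starting with "-"
    # is not parsed as an option.
    rm -rf -- "$target"
}
```

A terser alternative is the standard `${var:?}` expansion, as in `rm -rf -- "${var1:?}/${var2:?}"`, which makes a non-interactive shell abort with an error when the variable is unset or empty. Neither form protects against a wrong-but-non-empty value, so this is a seatbelt rather than a cure.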
All told, it took 6 hours to wipe out the entire network. It took 4 hours to figure out what happened (turns out the script ssh'd onto its own server and the rm -rf wiped out the scripts which did the rm -rf and most of the evidence of what happened) and it only took 10 seconds to realize that the latest backups were completely SNAFU'd.
So, as his parting gift, while the most critical drives were being sent off for possible forensic recovery, Jerry was asked to review the test framework and look for any possible flaws where something similar could re-occur. After hitting the 10th instance where deviating from the normal routine would result in some degree of disaster, Jerry knew one thing - even though he had less than two weeks to go, this is one script that would be haunting his nightmares for a long time to come.
Source: The Daily WTF
no subject
Date: Tuesday, 21 July 2009 15:45 (UTC)
rm -rf $var1/$var2
on peripheral vision before I'd even got that far down the page and physically flinched...
no subject
Date: Tuesday, 21 July 2009 18:57 (UTC)
no subject
Date: Tuesday, 21 July 2009 22:05 (UTC)
no subject
Date: Wednesday, 22 July 2009 09:52 (UTC)
no subject
Date: Tuesday, 21 July 2009 21:19 (UTC)
rm -rf $var1/$var2
on peripheral vision before I'd even got that far down the page and physically flinched...
Me too...::eep::
no subject
Date: Tuesday, 21 July 2009 22:14 (UTC)
find . -print | xargs mv '{}' /tmp
in the customer's / - as root, of course. Ah, the glad days before sudo.
That one didn't actually lose anything, which was worse than losing the lot with an rm -rf... because then it all had to be unpicked and returned. This was back in the days when a system could conceivably be moved into a /tmp filesystem and not run out of space. Not that that condition lasted for long ;-)
Y'know, you know it's been a long day when you start trying to parse ;-) as a command line argument...
no subject
Date: Wednesday, 22 July 2009 09:56 (UTC)
except there they did not pay attention and the terminal they typed the command into had a "#" not a "$" as the end of the prompt...
result == one Sun E10000 server coming down like a ton of bricks, pulling RAM, CPU and disk out from under relational databases....
Took about 3 days to recover...
no subject
Date: Wednesday, 22 July 2009 11:47 (UTC)
(I'm reckoning it took, perhaps, a year for somebody to let a user near that sort of thing, back in the days when computers were impressive)
no subject
Date: Wednesday, 22 July 2009 10:26 (UTC)