Unix Technology IT

The Decline and Fall of System Administration

snydeq writes "Deep End's Paul Venezia questions whether server virtualization technologies are contributing to the decline of real server administration skills, as more and more sysadmins argue in favor of re-imaging as a solution to Unix server woes. 'This has always been the (many times undeserved) joke about clueless Windows admins: They have a small arsenal of possible fixes, and once they've exhausted the supply, they punt and rebuild the server from scratch rather than dig deeper. On the Unix side of the house, that concept has been met with derision since the dawn of time, but as Linux has moved into the mainstream — and the number of marginal Linux admins has grown — those ideas are suddenly somehow rational.'"
This discussion has been archived. No new comments can be posted.

  • by Xacid ( 560407 ) on Wednesday March 02, 2011 @09:59AM (#35356114) Journal

    "they punt and rebuild the server from scratch rather than dig deeper."

    From personal experience, this is normally due to management jumping down our throats to simply "get it done," which unfortunately runs counter to our inquisitive desires to actually solve the problem.

    I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.

  • Re:Gee, ya think? (Score:5, Insightful)

    by rhsanborn ( 773855 ) on Wednesday March 02, 2011 @10:04AM (#35356170)
    There are a lot of cases where pressing the button means that the problem will go away...for a few weeks. It will work right until you hit the same conditions that caused the problem in the first place. Suddenly, you're using the refresh to cover up either a poor implementation or a standing bug, and it isn't going to go away until you call that guy in suspenders.
  • Re:Clone my car! (Score:5, Insightful)

    by shawb ( 16347 ) on Wednesday March 02, 2011 @10:07AM (#35356210)
    The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.
  • by Anonymous Coward on Wednesday March 02, 2011 @10:13AM (#35356260)

    To a small degree, you are correct. The bigger problem is that in *nix pretty much all the tools you need are available to you, but in the Windows world everything costs money. So often the solution comes down to either spending money to fix it or spending time to rebuild it. Since management thinks computers are simple push-button things, "just reboot" becomes the go-to solution.

  • by Nerdfest ( 867930 ) on Wednesday March 02, 2011 @10:19AM (#35356334)
    As I've said below, there is a benefit ... you can actually investigate and fix the problem rather than the symptom. The bonus with VMs though is that you can frequently do both. You can create a copy of the VM to dig into, and create a new fresh instance for production to get them working again.
  • by laffer1 ( 701823 ) <luke@@@foolishgames...com> on Wednesday March 02, 2011 @10:29AM (#35356416) Homepage Journal

    There is a benefit, because restoring the VM still means downtime, even if only a few minutes. What if the software running in the VM is old and someone has been attacking it? Restoring will result in the same problem a few days or hours later.

    If there is a bug in a specific kernel version that's not playing nice with the VM, it will cause stability problems again.

    Redeploying and finding the problem is the only real answer. In the long run, it may save work.

  • Re:Sad but smart (Score:4, Insightful)

    by causality ( 777677 ) on Wednesday March 02, 2011 @10:31AM (#35356442)

    I’m not a system admin but I don’t see how this is a bad approach.

    I see value in finding out what the problem is and why it happened... if you just blindly re-image, then the problem might pop up again at a less opportune time.

    But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration, then why would anyone waste time trying to repair it?

    I think the issue here is that the need for a business to get a production system back up and operational with as little downtime as possible can sometimes conflict with the principles that most effectively assure sound system administration.

    Unix/Linux systems don't just break for no reason, particularly servers with enterprise hardware. The idea that a system just breaks for no apparent reason and a reboot, reset, or re-image is going to actually fix the cause and somehow prevent a future recurrence is alien to this realm. That's a mentality that comes from running Windows (esp. previous incarnations) on commodity hardware.

    Something on that "known working" image is faulty or capable of breaking. Otherwise, normal use would not have led to a state of system breakage.

    The ideal course of action would be to do whatever is necessary to get the system back online, which may include re-imaging, and then discover what is wrong with the "known working" image that eventually broke. That could be greatly assisted, of course, by saving the data (at least the logs) from the known-faulty system prior to re-imaging.
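
    A rough sketch, in Python, of that "save the logs before you re-image" step; the source paths and the destination share below are only examples, not anyone's actual setup:

        #!/usr/bin/env python3
        # Grab /var/log (and anything else worth an autopsy) off the broken box
        # before it gets re-imaged, so there is still something to examine later.
        import tarfile
        import time
        from pathlib import Path

        # Example paths only; adjust for your own environment.
        SOURCES = [Path("/var/log"), Path("/etc")]
        DEST_DIR = Path("/mnt/forensics")   # e.g. an NFS share that survives the re-image

        def snapshot(sources, dest_dir):
            dest_dir.mkdir(parents=True, exist_ok=True)
            stamp = time.strftime("%Y%m%d-%H%M%S")
            archive = dest_dir / f"pre-reimage-{stamp}.tar.gz"
            with tarfile.open(archive, "w:gz") as tar:
                for src in sources:
                    if src.exists():
                        tar.add(src, arcname=src.name)
            return archive

        if __name__ == "__main__":
            print("Saved", snapshot(SOURCES, DEST_DIR))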

  • Re:Clone my car! (Score:4, Insightful)

    by bsDaemon ( 87307 ) on Wednesday March 02, 2011 @10:33AM (#35356464)

    Traditionally? College. Way back when, long before I was born, system admins tended to be graduate students in computer science or other department staff, and those in industry did it in college first. System administration itself wasn't taught, but that's not the point. The point is several technologies grew up together and are generally described in terms of one another: Unix, C, TCP/IP, etc. -- You don't really get what's going on with one without the others in most cases.

    C, of course, is the foundational building block. Unix is the cathedral and TCP/IP is the road that connects each building together. Most of the so-called system admins I've seen in the past have been "web developers" who have been put in over their head and forced to deal with things they don't fully understand. I learned C and Unix concurrently, starting by teaching myself in jr. and high school. Try explaining an mbuf to some kid who only knows PHP some time -- it's painful.

    The lack of fundamental understanding which would enable them to be competent admins is the same lack of fundamental understanding which keeps them from writing secure code, debugging network issues, etc. But, because there is a large influx of semi-skilled people who think that the fact they installed Ubuntu on their PC at home makes them a server admin, employers are less willing to offer up the salaries necessary to attract competent admins, and frankly the salaries need to be even higher to make dealing with idiots less of a hassle.

    I'm so glad I'm not in web hosting anymore; I can't possibly overstate it.

  • Re:Clone my car! (Score:5, Insightful)

    by Ephemeriis ( 315124 ) on Wednesday March 02, 2011 @10:34AM (#35356470)

    The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.

    Exactly.

    If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.

    Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.
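
    On a libvirt/KVM host the mechanics might look roughly like the sketch below; the domain names, the template, and the use of virsh/virt-clone are assumptions for illustration, not anyone's actual tooling:

        #!/usr/bin/env python3
        # Rough sketch of the "fresh VM for production, old VM kept for the autopsy"
        # shuffle on a libvirt/KVM host. Names and tools are examples only.
        import subprocess

        BROKEN = "web01"            # the misbehaving production guest (example name)
        TEMPLATE = "web-template"   # a known-good image to clone from (example name)

        def run(*cmd):
            print("+", " ".join(cmd))
            subprocess.run(cmd, check=True)

        # 1. Pull the broken guest out of production (hard stop). The domain stays
        #    defined, so it can still be booted later in an isolated dev network.
        run("virsh", "destroy", BROKEN)

        # 2. Clone a fresh guest from the known-good template and bring it up.
        run("virt-clone", "--original", TEMPLATE, "--name", BROKEN + "-new", "--auto-clone")
        run("virsh", "start", BROKEN + "-new")
        # (Data migration and the DNS/load-balancer cutover are left out entirely.)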

  • by Darth_brooks ( 180756 ) <clipper377@@@gmail...com> on Wednesday March 02, 2011 @10:34AM (#35356484) Homepage

    ....and his was the right answer. With XP, you're almost certainly talking about a client machine. Why bother dicking with it? It's a hundred dollar OS on a four hundred dollar piece of hardware. Wipe, reload, move on to big boy problems. Even if you're talking about a problem that ends up affecting a number of users, and it happens to be a client side problem, you're farther ahead to nuke and reload.

    In my last position I was the only end user support guy for 150 to 200 people. If I sat around and fucked with every little nuance of XP and its associated ills, I'd have ended up even farther behind than I was when I left. I wrote up a quick backup script that grabbed anything the user didn't (against company policy) store on the network drive, grabbed their local e-mail (Notes), then nuked the machine and reloaded. I could take a user who was dead in the water and have them back up and running in 15-20 minutes. If they had a lot of data to restore, maybe 35-45. Spending an hour 'troubleshooting' was a waste of company time, and my time.
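
    The original was presumably a batch file, but the same pre-wipe grab might look roughly like this Python sketch; every path and name here is an example, not the poster's actual script:

        #!/usr/bin/env python3
        # Copy anything the user kept locally (plus their local Notes mail) to the
        # network share before the machine is nuked and reloaded. Example paths only.
        import shutil
        from pathlib import Path

        USER = "jdoe"                                        # example user
        SOURCES = [
            Path(rf"C:\Documents and Settings\{USER}\My Documents"),
            Path(rf"C:\Documents and Settings\{USER}\Desktop"),
            Path(r"C:\Program Files\lotus\notes\data"),      # assumed Notes data path
        ]
        DEST = Path(rf"\\fileserver\rescue\{USER}")          # example network share

        for src in SOURCES:
            if src.exists():
                shutil.copytree(src, DEST / src.name, dirs_exist_ok=True)
                print("copied", src)
            else:
                print("skipped (missing)", src)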

  • by jcoy42 ( 412359 ) on Wednesday March 02, 2011 @10:35AM (#35356494) Homepage Journal

    deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

    There can be a benefit. I generally try to get the system working first, then figure out what went wrong. And sometimes it takes a few days of poking at it to figure it out, but when a problem like that comes up again, I'm ready for it.

    That's the benefit of an experienced system administrator. Anyone can just make it work again, but someone who has been doing that for a few years is going to be used to writing scripts that hunt for said issues and either correct the problem on the fly or send a notification with some details about where to look first.

    I've seen the "make it work and move on" approach result in systems that become increasingly unstable because no one ever tracks down the root problem.
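
    As a toy example of that kind of "hunt for the known issue and send a heads-up" script, something like the following would do; the log path, the pattern, and the mail settings are all assumptions:

        #!/usr/bin/env python3
        # Scan a log for the signature of a problem seen before and mail a notification
        # with a little surrounding context. Paths, pattern, and addresses are examples.
        import re
        import smtplib
        from email.message import EmailMessage

        LOG = "/var/log/syslog"                       # example log file
        PATTERN = re.compile(r"Out of memory: Kill")  # example signature of a past incident
        MAIL_FROM, MAIL_TO = "root@example.com", "oncall@example.com"

        def find_hits(path, pattern, context=2):
            with open(path, errors="replace") as fh:
                lines = fh.readlines()
            hits = []
            for i, line in enumerate(lines):
                if pattern.search(line):
                    hits.append("".join(lines[max(0, i - context):i + context + 1]))
            return hits

        hits = find_hits(LOG, PATTERN)
        if hits:
            msg = EmailMessage()
            msg["Subject"] = f"{len(hits)} hit(s) for known issue in {LOG}"
            msg["From"], msg["To"] = MAIL_FROM, MAIL_TO
            msg.set_content("\n---\n".join(hits))
            with smtplib.SMTP("localhost") as smtp:   # assumes a local MTA
                smtp.send_message(msg)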

  • Re:Clone my car! (Score:5, Insightful)

    by Isca ( 550291 ) on Wednesday March 02, 2011 @10:35AM (#35356498)
    That's assuming your new tool that's vitally important actually has a man page. Very little is documented as well as it was 10 years ago.
  • endless cycle (Score:5, Insightful)

    by roc97007 ( 608802 ) on Wednesday March 02, 2011 @10:37AM (#35356518) Journal

    I'm not sure I buy everything in TFA, but have to admit to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance, and when it doesn't work, regen it again, and again, and again, because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template or due to something external like a duplicate IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.

    Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time didn't come out of the IT budget.

    I'm hoping that at some point these practices will be recognized as the false economics they are. But I'm not holding my breath.

  • by powerlord ( 28156 ) on Wednesday March 02, 2011 @10:45AM (#35356612) Journal

    Oh, and re-installing the machine means 24h of downtime

    I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.

    I agree that if the system is as critical as they say, they should have a better failover in place; however, in a lot of companies very little importance is placed on live failover systems. More than likely he's including lots more than the OS/application build in that 24-hour timeframe.

    Probably database reload/recovery time, or file system initialization (inadequate RAID controller to Disk design?).

  • by Tom ( 822 ) on Wednesday March 02, 2011 @10:46AM (#35356614) Homepage Journal

    In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.

    Except, of course, finding what the heck was wrong in the first place and fixing it, preventing future outages.

    Sometimes, rebuilding is faster than fixing, and in some contexts, it makes sense. Even then, the original machine should still be examined and the "root cause" (if you need a management buzzword) identified. At the very least, a reasonable amount of time should be given towards the attempt. It's true that it is pointless to dig around for days and days - but that is not a reason to not at least start looking, as it might turn out you only need a few hours. And more often than not, finding the real problem tells you something that helps you
    a) fix other bugs,
    b) avoid the same problem on the next server,
    c) avoid a repeat performance,
    d) realize that what you thought was a random server crash was really a break-in / hardware failure / systematic problem, and that other, additional steps need to be taken.
    All of the above have happened before; you would be far from the first.

    A proper incident management process does allocate resources towards follow-up examination. The right thing to do is not to suppress it with generic blabla about wasted time, but to set aside the proper amount of resources for your organisation. Maybe it's half an hour and no money, so some sysadmin can check the logs and do a quick check-up. Maybe it's a full-out forensics analysis. That depends on your needs, your resources, your environment and context.

  • by causality ( 777677 ) on Wednesday March 02, 2011 @10:51AM (#35356666)

    Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P

    Not only was he completely serious, he probably can't understand why you might have thought he was joking.

    The idea that it's a black box and you shouldn't expect to understand how or why something happened is definitely one of the more subtle costs of Microsoft systems. It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box. It discourages middle ground for intermediate skill levels, the kind of thing that would otherwise occur naturally as users gain experience over time.

    Most of all, it supports the falsehood that it's unreasonable to expect the most basic competence from non-experts.

  • Re:Hyperviser (Score:5, Insightful)

    by __aamnbm3774 ( 989827 ) on Wednesday March 02, 2011 @10:57AM (#35356736)
    This whole argument is retarded. I always pick the most appropriate response to the problem at hand. If your server is hosed and not booting, I don't have time to mess around with some Knoppix DVD, trying to figure out exactly where in the boot process it is dying. Especially if you have nightly backups! Sometimes a clean sweep and restore is perfectly acceptable and reasonable. Why even incur downtime trying to troubleshoot an issue that could be resolved within minutes?!

    Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process.

    It's like we've lost common sense in favor of our technical ego.
  • Re:Hyperviser (Score:5, Insightful)

    by jc42 ( 318812 ) on Wednesday March 02, 2011 @11:00AM (#35356774) Homepage Journal

    ... documentation for a GUI, if it exists at all, is often useless...

    How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.

    Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.

    Meanwhile, the people who build the CLI know that nobody can ever remember it all, so they include tools for finding your way around. They also tend to make the defaults for the commands fit the most common cases, so you don't have to use the manuals all that often. And most tools have a -help option (though they can't quite agree on how to spell it), to provide quick reminders. And the CLI includes a current directory, search paths and aliasing, so you don't have to remember full paths to everything.

    One of the ongoing frustrations with every GUI is constantly seeing a new window pop up, which is positioned back at the root directories, and I have to laboriously poke at things to get down to the directory that I'm working in. Then, when I do what the window was opened for, it closes, all that navigation is lost, and I have to do it all over again the next time I want to access a file in the same directory.

    GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed. But I trust that people are working on finding ways to make it even clumsier and slower. This seems to be happening with the "cloud" approach, for example.

  • Re:Hyperviser (Score:4, Insightful)

    by Courageous ( 228506 ) on Wednesday March 02, 2011 @11:03AM (#35356808)

    Someone still has to maintain the machines that are actually running the VMs.

    This is true. What's also true is that those admins can be fairly intensely busy running those machines. The summary mixes the concepts of the growing age of virtualization with "marginal admins." The summarizer doesn't really know what's going on, I think. In intensive virtualization operations, the talent pool is shrinking, but growing more concentrated. Cross training is now becoming more common, with the few critical people one has for the core operation being trained in operating systems (both Windows and Linux), storage administration, and network administration.

    These admins are often far too busy to spend a great deal of time on a specific VM. There might be literally thousands of virtual machines in a large operation. For just one VM to draw their attention, it has to be something important and shared. Domain controllers, DNS systems, RADIUS servers, or other shared production systems will often get close attention, but if a quick reboot might resolve things and isn't any more disruptive than the current problem, of course you are going to do that.

    What I think the summarizer isn't really grokking is that in this growing age of virtualization, the number of admins per server is going down a lot, and the focus of these admins has changed.

    C//

  • Re:Hyperviser (Score:4, Insightful)

    by drsmithy ( 35869 ) <drsmithy@ g m ail.com> on Wednesday March 02, 2011 @12:23PM (#35357700)

    How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.

    No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as important as, if not more important than, the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".

    GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed.

    Ridiculously untrue, particularly in the context of non-specialised, non-expert users.

  • Re:Hyperviser (Score:4, Insightful)

    by locofungus ( 179280 ) on Wednesday March 02, 2011 @12:26PM (#35357776)

    Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.

    I think it's even worse than that.

    If you have a problem to solve with a CLI then you might spend several days trying to make something work the way you need it to, but, once it's sorted it's very easy to document it for next time. (where next time might be several years down the line)

    With GUI it's almost impossible to know what you've actually done at the time, let alone several years later.

    Need to change a config file - take a copy, make changes, experiment, etc. Once you've worked out what it is you actually need to do, restore the copy and then make the required changes. (Or just diff the original with the new version: "Hmmm, don't think I should have changed that setting, I'll change it back.")
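
    In script form, that "copy it, experiment, then diff it" habit might look something like this rough sketch (the config path is only an example):

        #!/usr/bin/env python3
        # Take a copy of a config file before experimenting, then show what actually
        # changed afterwards. The path is an example; cp and diff do the same job.
        import difflib
        import shutil
        import sys

        CONF = "/etc/example/app.conf"        # example config file
        BACKUP = CONF + ".orig"

        if sys.argv[1:] == ["save"]:
            shutil.copy2(CONF, BACKUP)        # take the copy before you start poking
        elif sys.argv[1:] == ["diff"]:
            with open(BACKUP) as a, open(CONF) as b:
                sys.stdout.writelines(
                    difflib.unified_diff(a.readlines(), b.readlines(),
                                         fromfile=BACKUP, tofile=CONF))
        else:
            print("usage: confdiff.py save|diff")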

    With a GUI that "try this, try that" means that you have no idea what you might inadvertently/incorrectly have changed on your way to fixing the issue that you were really interested in.

    And five years later when you need to do it again - CLI, all the options have changed subtly but your notes immediately give you a point to google and half an hour later you've worked out the correct set of switches to achieve what you need with the current version.

    With the GUI, even if you've got perfect notes on what you did back then, if it's even slightly non-obvious then it's very likely that the configuration option you need doesn't even exist any more (but no way to tell that, of course).

    Tim.

  • by DerekLyons ( 302214 ) <fairwater AT gmail DOT com> on Wednesday March 02, 2011 @01:46PM (#35358948) Homepage

    Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.

    This isn't about Windows vs. Unix. This is about admins not understanding their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.

  • Re:Hyperviser (Score:5, Insightful)

    by drsmithy ( 35869 ) <drsmithy@ g m ail.com> on Wednesday March 02, 2011 @02:03PM (#35359206)

    For example, Linux is extremely easy to use -- if you understand it. Windows is a hell of a lot easier to learn but knowing all about it won't make it much easier to use.

    That is entirely a matter of opinion.

    Your comment there describes what is easy to learn.

    No, it doesn't. Your comment assumes that an interface should *have* to be learnt, to be easy to use.

    The CLI appeals to people who are willing to learn, who like learning new things and consider it worthwhile.

    No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.

    The terminal is for non-trivial tasks.

    The implication that GUIs are only used for "trivial" tasks is ridiculous on its face.

    The average Windows user who views learning as an unreasonable burden that should never be expected of anyone who wants to use a complex machine ... they avoid the up-front investment of learning to understand the system. Instead, they can jump in and start using the system right now. But they continuously pay for it over time in the form of enjoying few or none of those advantages.

    There is nothing unique to Windows, or even computers, about this. Do you know the intricacies of how your car works? How about your blender or oven? Could you fabricate a new bed or sofa from raw materials, and without modern tools? Do you grow your own produce? Could you butcher a cow or chicken? Could you set a complex fracture or create your own painkillers? Can you brew your own beer?

    It's like the difference between people who live within their means and use plastic only as a form of payment, saving up until they can actually afford something before they purchase it, versus those who live all the time on credit. The person living on credit gets the stuff they want right now but ultimately pays quite a bit more for it and can quickly find themselves in over their head. The discipline and delayed gratification that the latter is trying so hard to avoid is something that the former considers to be virtues worth cultivating.

    No, it's nothing like that at all. One is an example of financial irresponsibility and the other is simply realising that you do not need a deep and intricate understanding of a given thing to use or take advantage of the services or benefits it provides.

"The only way I can lose this election is if I'm caught in bed with a dead girl or a live boy." -- Louisiana governor Edwin Edwards

Working...