The Decline and Fall of System Administration
snydeq writes "Deep End's Paul Venezia questions whether server virtualization technologies are contributing to the decline of real server administration skills, as more and more sysadmins argue in favor of re-imaging as a solution to Unix server woes. 'This has always been the (many times undeserved) joke about clueless Windows admins: They have a small arsenal of possible fixes, and once they've exhausted the supply, they punt and rebuild the server from scratch rather than dig deeper. On the Unix side of the house, that concept has been met with derision since the dawn of time, but as Linux has moved into the mainstream — and the number of marginal Linux admins has grown — those ideas are suddenly somehow rational.'"
Hypervisor (Score:2, Informative)
Re: (Score:3)
WHOOSH!
Re: (Score:3)
With bare-metal virtualization, there is not that much to maintain, and there is pointy-clicky software to do it. No real admin skills required.
Re:Hypervisor (Score:4, Interesting)
Because pointing and clicking inherently takes more skill than using a CLI, right? Never mind that most CLI commands will readily assist you with syntax if you format incorrectly, whereas documentation for a GUI, if it exists at all, is often useless...
Re:Hypervisor (Score:5, Insightful)
... documentation for a GUI, if it exists at all, is often useless...
How true. The popular explanation of the difference between a CLI and a GUI is that CLIs are so complicated that you need a manual to use them, whereas GUIs are so simple and intuitively obvious that no manual is needed.
Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.
Meanwhile, the people who build the CLI know that nobody can ever remember it all, so they include tools for finding your way around. They also tend to make the defaults for the commands fit the most common cases, so you don't have to use the manuals all that often. And most tools have a -help option (though they can't quite agree on how to spell it), to provide quick reminders. And the CLI includes a current directory, search paths and aliasing, so you don't have to remember full paths to everything.
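For instance (illustrative commands only; option spellings vary between systems):

    apropos firewall       # search man page summaries for a keyword
    man -k firewall        # the traditional spelling of the same thing
    type ls                # where does this command actually come from?
    alias ll='ls -l'       # stop retyping the common case
    ls --help              # quick usage reminder (or -h, or -help...)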
One of the ongoing frustrations with every GUI is constantly seeing a new window pop up, which is positioned back at the root directories, and I have to laboriously poke at things to get down to the directory that I'm working in. Then, when I do what the window was opened for, it closes, all that navigation is lost, and I have to do it all over again the next time I want to access a file in the same directory.
GUIs may have some aesthetic appeal (aka "pretty pictures"), but they remain the slowest, clumsiest way to use a computer that we've yet developed. But I trust that people are working on finding ways to make it even clumsier and slower. This seems to be happening with the "cloud" approach, for example.
Re:Hypervisor (Score:4, Insightful)
No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as important as, if not more important than, the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".
Ridiculously untrue, particularly in the context of non-specialised, non-expert users.
Re: (Score:3)
No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as important as, if not more important than, the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".
Yes. The only problem is the "where" gets a little jumbled up every now and then, but that's the result of a sloppy implementation, not a flaw inherent to GUIs.
Re:Hypervisor (Score:5, Interesting)
No, the difference is that a CLI is nearly impossible to use if you aren't familiar with it - the semantics and syntax are as important as, if not more important than, the concepts - whereas a GUI requires much less focussing on the "how", allowing much more focussing on the "what".
While there's a certain truth to this, GUIs are in general a lot less "intuitive" than people tend to believe. Without documentation and training, most users are unaware of most of their GUI's capabilities, and have great difficulty in learning much more than the basics.
An example I've read a number of warnings about in web-design documents is that a significant number (often estimated at around 50%) of "non-geek" users don't understand scroll bars. This is usually mentioned along with the advice to put the important part of your web pages close to the top, because the non-scrolling users won't be able to see anything below that.
Yes, I was dubious when I first read this. But over the years, I've run into several clear examples. I've been involved in building web sites for some very non-geeky organizations. The orgs' leaders generally want a lot of stuff on their main page, and at the top they usually want some text about the organization, its purposes, its main activities, etc. They also agree that it's good to have a list of upcoming public events on the main page, and inevitably that's positioned below the introductory text, so it's often not visible unless the user has a rather large window.
In each case, there were eventually meetings with discussions of how to improve the web site. One thing that would come up was suggestions from users (including members) that the home page should have a list of upcoming events. The leaders have always been dumbfounded by this. "But, but, ... There is such a list on the home page." "What?? No, there isn't."
Eventually, I have to interrupt and explain to the org's leaders that they're hearing from people who don't understand scrollbars and have never seen the events table because they don't scroll down to it. The users are, of course, confused; they know that there's no such table because they've never seen it. We bring up the site on a handy machine (preferably a laptop or tablet with a small screen), and I show the users that it's there by scrolling down to it. Their response again is confusion, because they don't know what I did or how I did it. "Why's it hidden like that?"
So I teach them about scrollbars, and a few users have learned something useful. But this has a more important effect: It gets across to the leaders why their design was wrong, as I'd been telling them, and they'll have a better web site if they'll let me fix it.
One instance of this happened just last week. The org's web site now has that block of extensive history and purpose in a separate box at the bottom of the page, and the table of coming events is positioned near the top, just below the logo bar, where non-geek users will see it and be able to read at least the first few entries.
Examples like this abound in GUI design. Many of the common widgets are not at all intuitive to most people. Even if they accidentally poke at things and trigger the actions, it's often difficult to grasp what the effect was. You see things change, but the changes don't make sense, and have no obvious relation to the icon that you clicked on. Often the icons don't look like anything that most users can name. The result is that most of the GUI is unusable to most of the users.
I wish I knew good ways around this. But truly making a GUI obvious is very difficult, and takes a lot of time studying the users and learning about their misconceptions. I very rarely have the time to do this, and in many cases the people paying me have expressly forbidden wasting time with dumb users.
And that's something that's very difficult to program around. ;-)
Re:Hypervisor (Score:4, Interesting)
Sure, but the point is with a CLI and no understanding of its syntax and semantics, you're pretty much dead in the water from the get-go. You could have a deep understanding of networking, but if you're unfamiliar with the syntax of iptables, you're not going to be able to configure a Linux firewall.
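For example, even "allow inbound SSH, drop everything else" demands the exact incantation (a minimal sketch using the classic state match; a real ruleset needs more care):

    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT
    iptables -P INPUT DROP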
Your scrollbar example is actually a good one, because it highlights the key differences between a GUI and a CLI. In a GUI, there is a visual indicator that the content is larger than a single page, positive feedback from the UI element if the user tries to interact with it (i.e. it reacts to a mouse click), and secondary feedback that the UI element is important even if it is triggered "accidentally" (i.e. it moves if the user presses page down, space, or in some other way makes the page scroll).
In a CLI, you would simply be presented with a single page of text. Advancing to the next page would require knowing which key(s) to press to do so. If you don't know the key, you're screwed. Some CLIs may present a "press space to continue", or similar, message, but that's starting to blur the line between CLI and GUI, IMHO.
Further, the new knowledge those users have about the scrollbar is now applicable to pretty much any GUI they use in the future, even ones running on completely different OSes (I recognise this doesn't apply to all UI elements, but the fundamentals - buttons, menus, scrollbars, selection boxes, etc. - are pretty consistently implemented in similar ways across the board). The knowledge they have gained about the CLI interaction is probably specific to that CLI only (how many different ways in different CLIs do you know of to trigger a page down?).
Sure, but the point is that there *ARE* things there to "poke at" and there is feedback that something actually happened. A CLI has neither - you need to know the commands in advance to do anything, and often the only feedback from a command is to indicate an error (and frequently said feedback is not useful at all in understanding what the error was).
Human cognition is highly dependent on visualisation, context and feedback. A CLI lacks - or typically has very minimal implementations of - all of those.
Re:Hypervisor (Score:5, Insightful)
That is entirely a matter of opinion.
No, it doesn't. Your comment assumes that an interface should *have* to be learnt, to be easy to use.
No, the CLI appeals primarily to people who like to focus on memorising semantic minutiae and believe that doing so is, in and of itself, a productive endeavour.
The implication that GUIs are only used for "trivial" tasks is ridiculous on its face.
There is nothing unique to Windows, or even computers, about this. Do you know the intricacies of how your car works? How about your blender or oven? Could you fabricate a new bed or sofa from raw materials, and without modern tools? Do you grow your own produce? Could you butcher a cow or chicken? Could you set a complex fracture or create your own painkillers? Can you brew your own beer?
No, it's nothing like that at all. One is an example of financial irresponsibility and the other is simply realising that you do not need a deep and intricate understanding of a given thing to use or take advantage of the services or benefits it provides.
Re:Hypervisor (Score:4, Insightful)
Of course, the reality is that this attitude allows vendors to supply GUIs without any "unnecessary" manuals, but make the nested tree of windows and menus so deep and complex that nobody can ever remember where everything has been hidden, and there are no good tools to help you find something that you know is in there somewhere.
I think it's even worse than that.
If you have a problem to solve with a CLI then you might spend several days trying to make something work the way you need it to, but once it's sorted, it's very easy to document it for next time (where next time might be several years down the line).
With a GUI it's almost impossible to know what you've actually done at the time, let alone several years later.
Need to change a config file? Take a copy, make changes, experiment, etc. Once you've worked out what it is you actually need to do, restore the copy and then make the required changes. (Or just diff the original with the new version: "Hmmm, don't think I should have changed that setting, I'll change it back.")
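In shell terms, something like this (foo.conf stands in for whatever you're editing):

    cp /etc/foo.conf /etc/foo.conf.orig           # pristine copy first
    vi /etc/foo.conf                              # experiment freely
    diff -u /etc/foo.conf.orig /etc/foo.conf      # see exactly what changed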
With a GUI, that "try this, try that" means that you have no idea what you might inadvertently/incorrectly have changed on your way to fixing the issue that you were really interested in.
And five years later when you need to do it again - CLI, all the options have changed subtly but your notes immediately give you a point to google and half an hour later you've worked out the correct set of switches to achieve what you need with the current version.
With the GUI, even if you've got perfect notes on what you did back then, if it's even slightly non-obvious it's very likely that the configuration option you need doesn't even exist any more (but with no way to tell that, of course).
Tim.
Re: (Score:3)
> with a CLI ... it's very easy to document it for next time.
Indeed - just run "script" before you start typing.
Show me the equivalent of that for any GUI too.
And once you've cleaned up your document (changing 'vi filename' to 'sed .... filename') you can usually get to the point where you can just run your documentation with /bin/sh the next time you need it.
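Roughly (a sketch; "sed -i" is the GNU spelling, and the cleaned-up script name is made up):

    script /tmp/worklog            # record everything typed and printed
      # ... do the actual work interactively ...
    exit                           # stop recording
    # later: edit the log into a script, e.g. replace an interactive
    # 'vi foo.conf' with 'sed -i s/old/new/ foo.conf', then next time:
    sh /tmp/fix-foo.sh             # the cleaned-up version of the session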
Re: (Score:2)
Re: (Score:2)
Someone still has to write the programs that average end-users run. That requires real programming skill.
Yet I don't see too many average end-users who are skilled programmers.
Point is, you only need one person with actual sysadmin skill to make and maintain an image. Hundreds of point-and-click types can then use that image. It happens in large organizations all the time. Why pay for a hundred s
Re:Hypervisor (Score:4, Funny)
I always thought it was MCRE..??
Microsoft Certified Reboot Engineer...?
Re: (Score:2)
Re:Hypervisor (Score:5, Insightful)
Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process.
It's like we've lost common sense in favor of our technical ego.
Re: (Score:3)
To my mind restoring from image isn't a replacement for system administration, but it can buy you precious time. Too many times in the past I've had a gun to my head over trying to figure out why this database server or that mail server was barfing, and if I could have just kept things going while I tested on a sandboxed copy, it would have been a lifesaver. VM images and other types of OS images are tools, nothing more and nothing less. At the end of the day you still have to have some skill in troubleshooti
Re:Hypervisor (Score:4, Interesting)
> a clean sweep and restore is perfectly acceptable and reasonable
NNNOOOOOOO!
Often a glitch like that is the only evidence you'll have that a machine had been compromised or that hardware is failing.
If you must do a clean sweep, do that on a standby system, and keep an image of the failed one until you can investigate the exact reason for the failure.
Re: (Score:3)
Why even sacrifice downtime trying to troubleshoot an issue that could be resolved within minutes?! Now, if it happens again the following night, you do have a deeper problem and should investigate it further, because constantly restoring the machine is now the inefficient part in the process. It's like we've lost common sense in favor of our technical ego.
You make a fair point. However, the fundamental question is more complex than you're giving it credit for. There's always the question of tradeoffs between immediate, fast fixes and long-term advantage. That has to be balanced with the situation at hand, of course. But there are times when the initial time/effort investment pays off in the long run. And that trade-off is as much a philosophical question within the admin world as a technical one.
Quite a few years ago, we were migrating our institutiona
Re: (Score:3)
In fact failover provides support for this sort of thing and is hardly a step away from proper administration.
This is actually a very good thing. Far too often people have automatic failover but never test it, and are shocked to find it doesn't actually fail over; the failure remained undetected precisely because failover was never tested.
Re:Hypervisor (Score:4, Insightful)
Someone still has to maintain the machines that are actually running the VMs.
This is true. What's also true is that those admins can be fairly intensely busy running those machines. The summary mixes the concepts of the growing age of virtualization with "marginal admins." The summarizer doesn't really know what's going on, I think. In intensive virtualization operations, the talent pool is shrinking, but growing more concentrated. Cross-training is now becoming more common, with the few critical people one has for the core operation being trained in operating systems (both Windows and Linux), storage administration, and network administration.
These admins are often far too busy to spend a great deal of time on a specific VM. There might be literally thousands of virtual machines in a large operation. For just one VM to draw their attention, it has to be something important and shared. Domain controllers, DNS systems, RADIUS servers, or other shared production systems will often get close attention, but if a quick reboot might resolve things and isn't any more disruptive than the current problem, of course you are going to do that.
What I think the summarizer isn't really grokking is that in this growing age of virtualization, the number of admins per server is going down a lot, and the focus of these admins has changed.
C//
Re:Hypervisor (Score:4, Informative)
It's VMs all the way down.
Could be. One of my favorite cosmological theories is that our universe is a simulation. In the "real" universe, there's a big computer that has a data object for every elementary particle in our universe. The simulation software (probably massively parallel) "steps" through the simulation, by calculating the position and velocity of each particle after the next time quantum. The beings running the simulation can stop it, do a bit of editing, and restart, which explains the religious "miracles" that have been so often reported.
It's hard to imagine how we could test this hypothesis. If we were to do a successful test, the simulation could just be stopped, reloaded from backup, and edited so our test came out inconclusive.
Of course, if this is valid, then we should also consider that the simulation might itself be running in a simulated universe ...
That's really not too far from Hermetic thought, which is quite ancient. What follows is an oversimplification I hope is still useful. The main difference could just be that they didn't have computers thousands of years ago. Rather than imagining that the simulation is running on a highly advanced computer that's basically a machine of the ultimate sophistication, they conceive the simulation (the "software") to be thoughts in the mind of God. It's also an explanation for how God could be transcendental, beyond the Universe, omniscient and omnipresent, but not some old man in the clouds you could shake hands with like the more childish notions of God.
The Matrix is based on some very old ideas.
I also think it's fascinating to wonder ... if you could see the Universe as a whole, in its entirety, all at once, like perhaps from the perspective of another Universe, what would it look like? Would it look like a single living being, recognizable as such? Would it look sort of like a man even, as in the "we are made in the 'image of God'" idea? What fascinates me about that is the notion of galaxies being like cells in its body, which are made of stars, which have planets, which have organisms, which are made of cells, which are made of molecules, which are made of atoms, which are made of subatomic particles, etc, potentially to infinity. It could be infinite both ways, scaling ever smaller and also scaling ever larger. It's like the fractal Universe idea.
That, in turn, reminds me of the holographic Universe idea. It's a notion of such a fractal nature in terms of interrelatedness. It's an analogy for how the "parts contain the whole". Basically, if you take a glass photographic plate and take an ordinary photograph on it, and then break that plate ... you get something like a jigsaw puzzle. Each piece has an incomplete fraction of the total information. If you put a hologram onto a photographic plate and then break that plate into pieces, you get something quite different. You don't get a jigsaw puzzle at all. If it breaks into 10 pieces, then you get 10 complete holograms containing the full information of the original, just with each of them 1/10th the size of the original.
It's like the notion that truly understanding yourself would require truly understanding the Universe. Carl Sagan may or may not have been thinking something like that when he said, "in order to make an apple pie from scratch, you must first invent the Universe."
Clone my car! (Score:2)
Re:Clone my car! (Score:5, Insightful)
Re:Clone my car! (Score:5, Insightful)
The real solution? Reimage the production server to just get it working, then you dig around on the dev server until you find out what's actually going on.
Exactly.
If the machine is in production it needs to be working. You don't have time to dig around and find the root cause. You need it to work. Now. If you've got a virtualized environment it is trivial to bring up a new VM, throw an image at it, and migrate the data.
Then you take your old, malfunctioning VM into a development environment and dig for the root cause, so that you don't see the same problem crop up on your new production machine.
Re: (Score:3)
On reflection, this is a good analogy for modern society in general.
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
Re:Clone my car! (Score:5, Insightful)
Re: (Score:2)
Well, pre-Xbox attention spans it was digging through man pages.
$ man man

NAME
       man - an interface to the on-line reference manuals

DESCRIPTION
       man is the system's manual pager.

SEE ALSO
       The full documentation for man is maintained as a Texinfo manual. If
       the info and man programs are properly installed at your site, the
       command

              info man

       should give you access to the complete manual.
And no, this isn't really what man man says, but I expect it to eventually. I hate info and its hypertext-ified, hiding-
Re:Clone my car! (Score:4, Insightful)
Traditionally? College. Way back when, long before I was born, system admins tended to be graduate students in computer science or other department staff, and those in industry did it in college first. System administration itself wasn't taught, but that's not the point. The point is several technologies grew up together and are generally described in terms of one another: Unix, C, TCP/IP, etc. -- You don't really get what's going on with one without the others in most cases.
C, of course, is the foundational building block. Unix is the cathedral and TCP/IP is the road that connects each building together. Most of the so-called system admins I've seen in the past have been "web developers" who have been put in over their heads and forced to deal with things they don't fully understand. I learned C and Unix concurrently, starting by teaching myself in junior high and high school. Try explaining an mbuf to some kid who only knows PHP some time -- it's painful.
The lack of fundamental understanding which would enable them to be competent admins is the same lack of fundamental understanding which keeps them from writing secure code, debugging network issues, etc. But because there is a large influx of semi-skilled people who think that the fact they installed Ubuntu on their PC at home makes them a server admin, employers are less willing to offer up the salaries necessary to attract competent admins, and frankly the salaries need to be even higher to make dealing with idiots less of a hassle.
I'm so glad I'm not in web hosting anymore I can't possibly overstate it.
Re: (Score:3)
Problem is, companies don't want to PAY for college-educated sysadmins or IT people in general. They want to pay $16-$18 an hour instead of the $32-$42 my BSEE and BSCS deserve. The cert mills churn out useless certified IT people who gladly lap up the low wages.
THAT is the demise of the educated Sysadmin. Companies that want to pay the IT department less than the custodial department.
Sad but smart (Score:5, Interesting)
I’m not a system admin but I don’t see how this is a bad approach.
I see value in finding out what the problem is and why it happened... if you just blindly re-image, the problem might pop up again at a less opportune time.
But if you know what the problem is... and you have an image of the server in a working state, or a documented procedure on how to set up the server in its intended configuration, then why would anyone waste time trying to repair it?
I think you have this kind of problem in most jobs. New approaches that make more sense but require less skill (and imply less e-pene) are always hated by people who have already learnt how to do it “the hard way”.
I see this as a programmer all the time and have been a victim of it. I’ve seen a huge chunk of my chosen industry migrate from meat-and-potatoes problem solving to gluing libraries together and sprinkling in business logic.
I’ve been fortunate to land in a job where there’s still a lot of “from the ground up” work, but these jobs are getting scarcer as even the components that everyone uses are made from other components. And executable UML (or something of its ilk) is probably going to be the next thing to cut the legs off us.
Re: (Score:2)
That's why you have backup servers. Sometimes it simply isn't worth the time or effort to dig deeper. Re-imaging is completely rational from a business perspective.
Doesn't always work (Score:3)
Sometimes a server is gradually degrading due to some issue. During that time, things are being modified. If you learn that the problem started a few months ago, you can't just re-image to an old state and lose everything that has changed since then.
Of course, making app servers as stateless as possible helps against this problem. That's one of the reasons my company enforces that data are kept on physically separate DB servers, and (virtualized) app server instances should be as dedicated to a single app as po
Re:Sad but smart (Score:5, Interesting)
Re:Sad but smart (Score:4, Insightful)
I think the issue here is that the need for a business to get a production system back up and operational with as little downtime as possible can sometimes conflict with the principles that most effectively assure sound system administration.
Unix/Linux systems don't just break for no reason, particularly servers on enterprise hardware. The idea that a system just breaks for no apparent reason and a reboot, reset, or re-image is going to actually fix the cause and somehow prevent a future recurrence is alien to this realm. That's a mentality that comes from running Windows (esp. previous incarnations) on commodity hardware.
Something on that "known working" image is faulty or capable of breaking. Otherwise, normal use would not have led to a state of system breakage.
The ideal course of action would be to do whatever is necessary to get the system back online, which may include re-imaging, and then discover what is wrong with the "known working" image that eventually broke. That could be greatly assisted, of course, by saving the data (at least the logs) from the known-faulty system prior to re-imaging.
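Saving that evidence is cheap compared to losing it (typical paths; adjust to your environment):

    # grab at least the logs before the wipe
    tar czf /mnt/backup/badhost-logs-$(date +%F).tar.gz /var/log
    # or image the whole disk for later forensics
    dd if=/dev/sda of=/mnt/backup/badhost-sda.img bs=1M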
Cost and primary business (Score:2)
Re: (Score:2)
Unfortunately, if your primary business is not IT, it is also the easiest one to cut.
Fortunately, if your primary business is not IT, it is also the easiest one to cut.
FTFY
From personal experience (Score:5, Insightful)
"they punt and rebuild the server from scratch rather than dig deeper."
From personal experience this is normally due to management jumping down our throats to simply "get it done" which unfortunately runs counter to our inquisitive desires to actually solve the problem.
I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.
Re: (Score:3, Insightful)
To a small degree, you are correct. The bigger problem is that in *nix pretty much all the tools you need are available to you, but in the Windows world everything costs money. So often the solution comes down to either spend money to fix it or spend time to rebuild it. Since management thinks computers are simple push-button things, "just reboot" becomes the go-to solution.
Re: (Score:2)
I suspect it's the end result of pressure to get more bang for their bucks in a tight economy, but that's pure speculation. It really could be a trend of the times.
Both? Maybe interaction?
CC.
Re:From personal experience (Score:5, Insightful)
Re:From personal experience (Score:4, Insightful)
There is a benefit, because restoring the VM means downtime, even if only a few minutes. What if the software running in the VM is old and someone has been attacking it? Restoring will result in the same problem a few days or hours later.
If there is a bug in a specific kernel version that's not playing nice with the VM, it will cause stability problem again.
Redeploying and finding the problem is the only real answer. In the long run, it may save work.
Re: (Score:3, Insightful)
deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.
There can be a benefit. I generally try to get the system working first, then figure out what went wrong. And sometimes it takes a few days of poking at it to figure it out, but when a problem like that comes up again, I'm ready for it.
That's the benefit of an experienced system administrator. Anyone can just make it work again, but someone who has been doing that for a few years is going to be used to writing scripts that hunt for said issues and either correct the problem on the fly or send a notificat
Re: (Score:2)
You win a fine cigar!
Re: (Score:2)
Valid point, but it does have its merits if it's a recurring problem. A wise manager will know when to call for deeper inspection.
And to be fair - I'm fine with reimaging a system to fix a problem if it's not recurring as the downtime typically isn't worth it.
I am not on Unix (Score:2)
but know the teams that implement/admin them and I am constantly amazed.
Amazed because all that I read here and elsewhere points to incredibly resilient systems, yet I have never been anywhere where they don't have scheduled downtime on at minimum a quarterly basis, and every major outage relied on a reload. So which is it? They make fun of the Windows guys and just hope the Windows guys don't look at their statistics (and no, I am not on Windows either; think IBM Z and i).
My serious question is, is there a cert
Re: (Score:3)
Sounds like you have poor Unix admins who are exactly the reason this mindset is prevalent. I can tell you from 15+ years as a Unix admin, the only times I have "needed" to reboot were: upgrades (OS or hardware), hardware failure, and testing of init scripts. Real, stable, properly administered systems don't need rebooting. I even think this is fair to say of Windows. The problem is, as already described: there are not many good Windows admins.
Re:From personal experience (Score:5, Insightful)
In an environment of thousands of servers (or even dozens), deep-diving into a problem [generally] is a waste of time. While it is interesting intellectually, there is no other benefit.
Except, of course, finding what the heck was wrong in the first place and fixing it, preventing future outages.
Sometimes, rebuilding is faster than fixing, and in some contexts, it makes sense. Even then, the original machine should still be examined and the "root cause" (if you need a management buzzword) identified. At the very least, a reasonable amount of time should be given towards the attempt. It's true that it is pointless to dig around for days and days - but that is not a reason not to at least start looking, as it might turn out you only need a few hours. And more often than not, finding the real problem tells you something that helps you
a) fix other bugs,
b) avoid the same problem on the next server,
c) avoid a repeat performance,
d) makes you realize what you thought was a random server crash was really a break-in / hardware failure / systematic problem and other, additional steps need to be taken.
All of the above have happened before; you would be far from the first.
A proper incident-management process does allocate resources towards follow-up examination. The right thing to do is not to suppress it with generic blabla about wasted time, but to set the proper amount of resources for your organisation. Maybe it's half an hour and no money, so some sysadmin can check the logs and do a quick check-up. Maybe it's a full-out forensic analysis. That depends on your needs, your resources, your environment and context.
Re:From personal experience (Score:5, Insightful)
....and his was the right answer. With XP, you're almost certainly talking about a client machine. Why bother dicking with it? It's a hundred dollar OS on a four hundred dollar piece of hardware. Wipe, reload, move on to big boy problems. Even if you're talking about a problem that ends up affecting a number of users, and it happens to be a client side problem, you're farther ahead to nuke and reload.
In my last position I was the only end-user support guy for 150 to 200 people. If I sat around and fucked with every little nuance of XP and its associated ills, I'd have ended up even farther behind than I was when I left. I wrote up a quick backup script that grabbed anything the user didn't (against company policy) store on the network drive, grabbed their local e-mail (Notes), then nuked the machine and reloaded. I could take a user who was dead in the water and have them back up and running in 15-20 minutes. If they had a lot of data to restore, maybe 35-45. Spending an hour 'troubleshooting' was a waste of company time, and my time.
Re:From personal experience (Score:5, Insightful)
Reminds me of our Windoze guy in my previous job. I had run into some kind of problem with XP at home and spent a good amount of time banging my head against it. Finally I told him what was going on and asked what he recommended. "Reinstall XP" was the answer. I thought he was joking, nope he was serious. :P
Not only was he completely serious, he probably can't understand why you might have thought he was joking.
The idea that it's a black box and you shouldn't expect to understand how or why something happened is definitely one of the more subtle costs of Microsoft systems. It lends credibility to the (false) notion, so common among average users, that you're either a completely unskilled newbie or a serious expert who can discern the inner workings of the mysterious black box. It discourages middle ground for intermediary skill levels, the kind of thing that would otherwise occur naturally as users gain experience over time.
Most of all, it supports the falsehood that it's unreasonable to expect the most basic competence from non-experts.
Re: (Score:3)
That's the problem I am talking about, yes. You've restated it more succinctly than I did.
The notion that there are no intermediary skill levels between "drooling noob" and "serious expert" is the false part. That's easy to explain: some Windows users are more skilled than others. If you need a concrete example, some
Comment removed (Score:3)
Blah blah blah (Score:2)
What's better is whatever keeps your employer's company making money the most of the time. If re-imaging the server every weekend gives them 100% uptime during the week, do it. If you can inject patches into the app at runtime, bully for you, but I can't, so I'm going with "re-image to working state and roll forward." If that costs my employer less than you cost your employer, I know who's all of a sudden more employable!
Might want to shave off those
Re: (Score:2)
Till the problem occurs in the middle of the work week and you still don't know what the actual problem is. Then you're looking at an hour of downtime during business hours while you re-image yet again, with your boss asking what the hell the problem is and what you were doing on the weekend if you weren't solving the actual problem.
Covering up a problem is not the same as solving it.
Expediency wins! (Score:2)
Seriously, which way gets the job done faster?
Being a sysadmin is not about you and the system and your marvelous detecting and repair skills, it's *always and only* about your users. If VM technology improves the speed of recovery so the users can get back to what they were doing (probably messing up your carefully architected system), then so be it.
The decline of language skills? (Score:2)
Re: (Score:3)
http://en.wikipedia.org/wiki/The_decline_and_fall_of_the_roman_empire [wikipedia.org]
Re: (Score:2)
There was a book by Will Cuppy (1894-1949) titled The Decline and Fall of Practically Everybody (1950; http://www.amazon.com/Decline-Fall-Practically-Everybody-Nonpareil/dp/0879235144 [amazon.com]) that was an absolutely funny take on history. Will Cuppy's style was to write very straightforward articles, but pepper them liberally with very funny footnotes. I remember seeing a paperback version of this as a kid, and got hooked.
Actually, the phrase "decline and fall" describes the shape of a drop-off not unlike the sha
Re: (Score:2)
I just thought it was amusing to post a headline with decline and fall both in the same sentence, when they are clearly the same thing in this instance. Should it actually have been "the rise and fall of ..." or "the decline of ..."?
No - you can "decline" and not "fall", so the headline is fine.
I can't tell you how many times I have heard this. (Score:5, Interesting)
Many times, what I hear as "solutions" are simply variations on the theme: "Why can't we reboot the server?" or "Why can't we reinstall the server from scratch?".
And my answer usually was: "Listen, I don't care how many times you do this on a Windows machine, but this is UNIX - I'll only reboot this machine if I absolutely need to. In the meantime, watch and learn as I kill the offending processes. Oh, and re-installing the machine means 24h of downtime".
These days, I help run a (very) large application, which runs on top of a (very) large "enterprise" SQL database for a (very) large company. The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it. Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.
What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
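For what it's worth, the requested watchdog is maybe five lines of shell (a sketch; the user name, threshold, and address are invented), which is exactly why it's a band-aid and not a fix:

    #!/bin/sh
    # count zombie (defunct) processes owned by the database user
    Z=$(ps -u dbuser -o stat= | grep -c '^Z')
    if [ "$Z" -gt 50 ]; then
        echo "zombie count: $Z - crash likely soon" | mail -s "DB zombie warning" ops@example.com
    fi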
Did I mention the application is considered mission-critical and runs 24x7? And that downtime can cost more than 6 figures to said (nameless) company?
And, since you asked, yes, I am looking for another job. (Clueless admins and pointy-haired bosses: a match made in...)
Re: (Score:2)
Oh, and re-installing the machine means 24h of downtime
I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.
Re:I can't tell you how many times I have heard th (Score:4, Insightful)
Oh, and re-installing the machine means 24h of downtime
I am with you except here. If re-installing a machine incurs 24h of downtime, you do not have a suitable contingency plan. Most environments I deal with are 15-20 minutes from offline to production on reinstall at the long end.
I agree that if the system is as critical as they say, they should have a better failover in place; however, in a lot of companies very little importance is placed on live failover systems. More than likely he's including a lot more than the OS/application build in that 24-hour timeframe.
Probably database reload/recovery time, or file system initialization (inadequate RAID controller to Disk design?).
Re: (Score:2)
The only problem is: enterprise application does not manage database very well, and leaves zombie processes on the database server. After a while, the database server just crashes (hard) and takes down the application server with it.
Did I mention the application ... runs 24x7
So which is it: it crashes "often" enough to be a problem, or it never crashes ever?
The obvious solution is to reload it every day at the least inconvenient time.
If they will not "permit" a controlled reboot, then work around it by running health testing scripts that just happen to knock it out, sort of a euthanasia approach.
The next "solution" is a (caching?) sql proxy server in the middle, no one will notice if the reboot is fast.
Is the upgrade suggested by the admins themselves who have tested it under
Re: (Score:2)
Logical solution (and the one recommended by sysadmins): upgrade application to version X, which is supposed to have a much better database management.
What do you think the PHB/management solution is? Ask the DBAs to write a script that will monitor zombie processes, so the sysadmins will be warned in advance... Like, around 20 minutes before the application crashes. Just enough time to tell all users to save their work, because we need to reboot everything. Just like under Windows.
The new version costs money. And, no matter how important everyone thinks this application is, they obviously don't think it's worth that price. They're willing to deal with a reboot rather than spend the money. I'd recommend the upgrade, too... But I don't write the checks. Nor do I really use the app. I just keep it running. And if you tell me you can live without the app for 10-15 minutes while the server reboots, and you'd rather save $X instead of buying the new version, that's what we're going
Re-imaging != bad administration (Score:2)
Re: (Score:2)
Sure it was cool, back in the day, to spend 72 hours working on "the server" because even rebooting was not an option. Back then I had 3 servers, 10 years later I had 15.
We've got about 30 servers to worry about, and this is a small hospital.
And downtime is basically never an option.
If I can rebuild a server and restore a data backup in 4 hours or I can spend an infinite amount of time "fixing" the existing install, which option do you think my PHB would prefer? It is not bad administration, it is just different.
Yup.
72 hours to dig out a problem on some machine that's being cranky? Yeah, that's not gonna happen. We'll restore a snapshot or provision a new VM and be back up and running within hours. Hell, even if we have to rebuild a physical box and restore from tape we can get it up and running in a day.
It will just get worse (depending on your view) (Score:2)
Obviously employers (if they wake up to this) will realize "Hey, I can pay a kid to restore snapshots" instead of "Hey... I need to hire this super expensive IT veteran."
Re:It will just get worse (depending on your view) (Score:4, Interesting)
As VMs are virtualized and taking snapshots of them becomes so easy, why would you bother troubleshooting anything when you can just restore to a snap that is an hour old?
The security exploit that cracked the old image in less than a second, will crack the "identical" new image in less than a second. Or data sample #1213 which overflowed the buffer and crashed image A will simply overload and crash image B.
What it really brings up is a class distinction among sysadmins. There are the guys who actually fix systems (patching security holes in system libraries to work around app bugs, redesigning firewall ACLs to avoid a new threat, doing scalability assessments before an overload crashes something), and there are the guys who fix individual things like motherboards and hard drives but don't administer systems: basically help-desk people with a fancy sysadmin job title. Virtualization means the help-desk board-swappers with the cool job titles are outta here, but the real sysadmins have little if anything to fear.
To be honest (Score:5, Informative)
Just because you might learn more by spending days chasing down an issue instead of using your available tools to quickly redeploy the server and get the business back up and running, doesn't make that the correct decision. If you really want to dig into the root cause, clone the broken VM off and research it after you get a fresh one deployed from template.
Not surprised at all (Score:2)
It's funny how many admins out there can't even set permissions in *NIX. I was working with a guy who was very well-versed in the VM world. Several certs after his name, in fact. But when he had to actually set permissions on the .vmdk files on the ESX host from the command line, he was clueless. I explained to him the whole rwxX scheme and how each numerical value changes the bit for that permission, and it was a completely wasted effort. I guess Veeam will take care of all that from a GUI.
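For anyone else who missed that day, the mapping is standard chmod fare (file and directory names are illustrative):

    # r=4, w=2, x=1; one octal digit each for owner, group, other
    chmod 600 server.vmdk        # rw------- : owner read/write only
    chmod 640 server.vmdk        # rw-r----- : group may read too
    # capital X: grant execute only on directories, or on files
    # already executable by someone
    chmod -R u+rwX,go+rX /path/to/datastore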
Still, seems like they w
Virtualization != marginalization of skills... (Score:5, Interesting)
Now - having said that - don't get me wrong. I have seen entirely too many *nix sysadmins (full disclosure: I got an RHCE in 2003) who don't know where the network config files are because they only know the GUI, and are hired by a team of people who have never logged into a *nix box. However, I think the ill that is most egregious is not that it sets some moral and ethical imperative for fixing rather than reloading (or in this case, recovering from a VM image) a server, but the fact that it misses the point that there has been a dearth of qualified IT candidates since the dawn of our industry, and that the fixes to this don't have to do with how we fix a server, but with how we hire and, more importantly, who we hire. As with everything in IT, garbage in == garbage out.
Finally - I absolutely agree with the InfoWorld argument. It assumes an unexpected failure within the server, not some external thing that needs to be diagnosed and fixed. If your app crashes because the SQL table isn't there on the SQL server you don't control, rebooting ain't going to do a hill of beans' worth of good.
Re: (Score:3)
The problem is that the new "crop" of developers don't have any real problems to solve. They've all been solved, and solved well. So now we're adding unnecessary abstraction layers that hide what's really going on.
People that spent 3 days figuring out how to burn a CD back in the 90's tend to know how everything works, but the "kids" coming up in recent years only know (and only care to know) the flashy point-and-click abstraction layers, and only program within "frameworks".
Years ago I used to talk to pe
Fun at scale. (Score:2)
You have 1000 servers. You need to upgrade them to RHEL 6. Do you put a DVD in each of 1000 DVD drives?
NO!
You use an image server. Kickstart. Cobbler. Figure out what the new image looks like, and then PXE-boot 1000 servers. That goes much faster. (To the sysadmin above: re-imaging a server should take 25 minutes, most of which is spent surfing Slashdot, not an hour.)
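The per-box mechanics reduce to a loop like this (host list and credentials are invented; BMC details vary by vendor):

    # tell each machine to PXE-boot on next start, then power-cycle it
    while read h; do
        ipmitool -I lanplus -H "$h" -U admin -P "$IPMI_PW" chassis bootdev pxe
        ipmitool -I lanplus -H "$h" -U admin -P "$IPMI_PW" power cycle
    done < servers.txt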
So now, you've got a server that's misbehaving. One of 1000. Out of pure coincidence, honest, the one server you were manually fut
Re: (Score:2)
Maybe if RH bothered to ship rpm-4.6.x to RHEL5, you would only need to reboot once during an upgrade from RHEL5 to RHEL6 ...
Like you can on other distros, including other RPM-based distros.
If you used a VCS or a configuration automation tool (cfengine, Puppet, etc.), then you wouldn't need to re-image or re-install a server to get its config in line ...
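Even the crude do-it-yourself version pays off (a sketch; tools like etckeeper wrap this up properly):

    cd /etc && git init                   # put configs under version control
    git add -A && git commit -m baseline  # record the known-good state
    # after any change, drift is visible and reversible:
    git diff                              # what changed, exactly
    git checkout -- foo.conf              # put one file back in line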
Here we go again (Score:2)
Is this the old geezer versus the new wet diapers yet again? (trying to be as evil on both sides ;) )
There are new technologies and we should embrace them. I am not a proponent of VMs, I don't like them in general, but I do see their uses, and they're very effective. Like in C++, you've got the STL, with very similar and nearly interchangeable std::vector, std::list, std::deque and so on (and not talking about Boost or 3rd parties here). You need to know when to apply them or else you'll get problems. Well, in the '10s,
Time is Money (Score:2)
I used to scoff at reformatting and reinstalling, but today it's a simple calculation. Will the fix take longer than either reverting from a snapshot or cloning from a template? Many may cringe at that as a solution, but the bottom line is time is money. It used to be that reinstalling or restoring from backup simply took too long, and it was better to fix the problem at the console if possible. Today, that isn't so, with automatic snapshots of virtual machines, SAN replication, etc. I don't scoff at it
Faster is nice, but... (Score:2)
Sometimes a one-off mistake happens, and reinstall makes sense. Many other times, the reason you had to reinstall is due to a more persistent problem (program/script systematically messing up or an admin that just needs to not be doing admin work), and skipping root cause analysis means you'll lose more time in the aggregate.
Re: (Score:3)
At seriously large scales, events previously considered so improbable that you'd likely never see one in your career become likely. TCP checksums are weak. Cosmic rays cause bit flips. Sometimes those bit flips mutate data on the way to the disk, so you never notice unless you've also checksummed the data, read it back, and re-checked it after writing.
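Verifying writes end-to-end looks roughly like this (needs root for the cache drop; the file name is invented):

    sha256sum /data/chunk-1213.bin > /tmp/chunk.sha    # checksum at write time
    sync && echo 3 > /proc/sys/vm/drop_caches          # force a real re-read from disk
    sha256sum -c /tmp/chunk.sha                        # does the disk agree?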
At these scales, it's fruitless to try and root cause every problem that happens, because you will hit problems like t
outsourcing (Score:2)
I think part of this phenomenon might be due to outsourcing, which puts a layer of call center personnel armed with loose-leaf binders of procedures between you and the one or two remaining competent sysadmins, who are then relegated to firefighting. In this world, there isn't time to diagnose problems because the level of expertise and the admin/customer ratio are kept purposefully low.
endless cycle (Score:5, Insightful)
I'm not sure I buy everything in TFA, but have to admit that to a certain extent this phenomenon is real. I've noticed, however, a tendency to regenerate an instance, and when it doesn't work, regen it again, and again, and again, because the purposely overextended and/or undertrained admin doesn't have time to figure out that the problem is in his template or due to something external like a duplicate IP. Come to think of it, this type of endless cycle seems to be fairly common in the Windows world. I guess we've caught up.
Sometimes the user has to diagnose the problem themselves, which is a win for the IT manager because the time didn't come out of the IT budget.
I'm hoping that at some point these practices will be recognized as the false economics they are. But I'm not holding my breath.
OK Slashdot - I get it... (Score:2)
...I'm a poor, lowly Windows admin who doesn't know my ass from a hole in the ground. ALL HAIL THE 1337 *NIX H4X0R5!
Seriously... how long is this Windows admin vs. *nix admin comparison going to last? I can't help it that there are apps that absolutely need to run in a Windows environment. The job needs to get done. If I could run my industry-specific software on Linux, I would. I would love to save my company money on licensing.
Now if you'll excuse me, I need to go back to flinging poo all over my ser
It shouldn't be a Windows vs. Unix issue... (Score:2)
It is a cost-comparison issue. When the time cost to "punt and reload" is lower than the time cost of further troubleshooting, that is the correct solution to the problem. Having virtual servers, with a default image on standby, makes it easier and quicker to reload a server, so lengthy troubleshooting is less often worth the time.
That said, only a very poor admin would discard the old image without discovering the root cause of the issues in order to prevent them from happening again. Thus saving future trouble
Reboot frequency = Rebuild frequency (Score:3)
Not surprisingly, most of the reboots are there exactly for installation (aka "rebuild") of an updated OS usually on the next generation of server hardware. Major package upgrades (e.g. MySQL, Apache) almost never require any tinkering with the OS.
I compare that to typical Windows servers in my group, where reboots happen in many cases nightly as a preventative measure, the system is still some crufty old version of Windows (e.g. Windows NT), and the application packages are deeply tied to DLLs and drivers, and I suspect that the statistics and attitudes are apples vs. oranges.
Production environments (Score:5, Insightful)
Some of the comments here remind me of a post on a woodworking board a few months back. Essentially, the poster was lamenting because he had to fire a guy because he couldn't afford to keep him... Not because of the economy, but because the guy was an absolutely inflexible perfectionist. He'd spend $300 worth of time on what should have been a $60 job... The guy was a hell of a woodworker, at home in his own shop, but just couldn't adapt to a production environment.
This isn't about Windows vs. Unix. This is about admins not understanding that their job is to get production rolling again, not to satisfy their obsessive need to understand every problem or their need to satisfy their ego. ("I'm a UNIX admin dammit, I refuse to use habits that make me look like a Windows admin" or its equivalent is a refrain modded up again and again here on Slashdot.) If a reboot or a re-imaging fixes the problem, that's the right solution. If it doesn't, *then* you dig deeper.
Re: (Score:3)
You brought up a similar point to the one I was going to make. In a production environment, downtime costs money. Often, the quickest way to get an application back into production is to restore the machine to a known good state. With virtualization, that is trivial to do. If the problem keeps recurring, then you need to dig deeper to figure out what is going on.
Re:Gee, ya think? (Score:5, Insightful)
Re: (Score:2)
Re: (Score:3)
If you end up rebuilding the server 5 times over the course of the year, at 2 hours per pop,
W T F. If you rebuild a server more than once every 3 years...
..Then your hardware sucks and you need better equipment.
..Then your applications suck and need to quit dicking with the operating system.
..Then your admins suck and need to be fired.
While I've only been doing admin work since '95, I can say that any modern server operating system is not going to fall over and die unless there is an underlying issue th
Re: (Score:2)
Let's see. When I have a security or performance issue I can
A: Pay a bearded guy in suspenders for hours while he incants various arcane phrases like "sudo" and "grep" and hope that he actually manages to clean up the problem at the end, or
B: Press a button and have a factory fresh install in seconds.
Assuming that you have a decent build done first (pay the bearded guy big for that), why on earth would *anyone* pick A? It's hardly just Unix; we're a Windows shop and we're heavily virtualized because it makes sense from so many different angles: security, load balancing/failover, ease of setup, etc.
Well part of it too is that nobody gives a fuck if your Etsy-inspired e-store is working well. The beards are there for critical systems.
Re: (Score:2)
Re: (Score:2)
If it's something that happens more than once in a small enough time period, then of course one would immediately dig deeper. However, if it's a one-off problem or a repeated but reasonably rare issue then either restore from backup, or nuke the server and rebuild.
Most of the admins I know (myself included) will still dig into the issue afterward, even if we've had to restore or rebuild from image. But the first responsibility is to get the system back up and running, not spend hours on bug hunts.
Re: (Score:2)
Re: (Score:2)
I'm going to get flamed for this, but what the hell.
I've always thought that it is more important to get a server back up and operational as quickly as possible than it is to keep the server down until you find the problem. Now don't get me wrong, you still need to find the ultimate problem, or at least find out if the problem is repeatable, and then find the answer to it.
So I'm in favour of any method that helps me get the system back up and running, be it re-imaging or anything else.
Doesn't apply to anything that outputs reports to management. If there's any chance you're giving them provably wrong data, that system gets shut down till it's fixed.
Re: (Score:2)
You better know what went wrong in the first place or it will happen again and you will (appropriately) look like an idiot.
If you work for me, you'd better *want* to know what went wrong, even if I don't give you time and make you re-image, or I will (appropriately) think you're too lazy to learn and can't be trusted to provide the best solution in all cases.
If someone comes to you with a problem or an idea, do you give them a menu of canned solutions or do you say "Tell me exactly what you want and how you want i