Icemaann writes "Pingdom and Network World are reporting that the SE tld dropped off the internet yesterday due to a bug in the script that generates the SE zone file. The SE tld has close to one million domains that all went down due to missing the trailing dot in the SE zone file. Some caching nameservers may still be returning invalid DNS responses for 24 hours."
The actual downtime is no big deal, but the reason it happened is. Evidently, the registry for an entire country's domain likes to roll out changes to the primary zone file without any sort of testing or syntax checking first. Simply having a small network (one or two computers) running a test root server, and running your scripts against that first, would have discovered the bug.
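A rough sketch of that kind of pre-rollout check, in Python. Everything here is hypothetical: `delegation_mismatches`, the sample domains, and the dictionary standing in for the test server are illustrative assumptions, not the registry's actual tooling. A real check would send DNS queries to the test box instead of reading a dict.

```python
def delegation_mismatches(expected, lookup):
    """Return {domain: answers} for every domain whose staged answer is wrong.

    expected -- dict mapping a domain to the name servers it should delegate to
    lookup   -- callable that queries the *test* server and returns server names
    """
    problems = {}
    for domain, wanted in expected.items():
        got = {name.lower() for name in lookup(domain)}
        if got != {name.lower() for name in wanted}:
            problems[domain] = sorted(got)
    return problems


# Stand-in for answers from the test server (assumed data, for illustration).
staged_answers = {
    "example.se.": ["ns1.example.se.se."],   # the missing-dot symptom
    "example2.se.": ["ns1.example2.se."],
}
expected = {
    "example.se.": ["ns1.example.se."],
    "example2.se.": ["ns1.example2.se."],
}

bad = delegation_mismatches(expected, lambda d: staged_answers.get(d, []))
print(bad)  # {'example.se.': ['ns1.example.se.se.']}
```

The point of comparing answers rather than parsing the file is that it catches mistakes that are syntactically valid but semantically wrong.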
The Internet was started as, and always has been, a "best effort" network. If a packet gets through, great. If not, well, it's not the end of the world. People have tried to code more and more resilient protocols on top to be as robust as possible, but in the end it's a very fragile system that can go down quite easily.
It still boggles my mind that anyone thought zone files are a good idea. The file format is so damn brittle, that a single byte can spell disaster. On top of that, the hierarchical naming structure presents an inherent systemic risk for all sub-domains as exhibited by this .se fiasco. Never mind the injection attacks, Pakistan taking out Youtube, and the rest, you have organizations like Verisign which profit immensely off of keeping the system broken. And don't even bother mentioning DNSSEC, as it still doesn't resolve this fundamental issue. The next systemic fuckup will simply be a signed fuckup.
It gets worse. In 2007, Paul Vixie wrote an article in ACM Queue [acm.org] basically praising the vagueness of the DNS protocol specifications:
From this overview, it is possible to conclude that DNS is a poorly specified protocol, but that would be unfair and untrue. DNS was specified loosely, on purpose. This protocol design is a fine example of what M.A. Padlipsky meant by “descriptive rather than prescriptive” in his 1984 thriller, The Elements of Networking Style (Prentice Hall). Functional interoperability and ease of implementation were the goals of the DNS protocol specification, and from the relative ease with which DNS has grown from its petri dish into a world-devouring monster, it’s clear to me that those goals were met. A stronger document set would have eliminated some of the “gotchas” that DNS implementers face, but the essential and intentional looseness of the specification has to be seen as a strength rather than a weakness.
No big deal (Score:2, Informative)
The downtime lasted 30 minutes, and most domains were probably cached by nameservers anyway.
Re:No big deal (Score:4, Informative)
Yeah, been there done that. *My* fumble only brought 10,000 domains down for about 10 minutes, and no one noticed. (I think all the domains hosted only cat pictures anyway.)
Sorry, that's as big a responsibility as any employer has ever deemed suitable for my incompetent ass.
Re: (Score:3, Interesting)
My biggest bug resulted in about a dozen tigers getting tranquilized.
Re:No big deal (Score:4, Funny)
Are you my motherboard?
Re:No big deal (Score:5, Funny)
The downtime lasted 30 minutes, and most domains were probably cached by nameservers anyway.
I once viddied an animated documentary about a small town in Colorado that lost the internet for 22 minutes [wikipedia.org]. It was not pretty. Our hearts and minds go out to you, people of Sweden. I cannot even fathom what that would be like ... I hope the looting and rioting has died down with the restoration of the internet.
Re: (Score:2)
For the poor souls in the .se domain, it was the end of the universe.
Re: (Score:3, Insightful)
While the impact of this is no big deal, it's still kind of scary that the people running a decently-sized ccTLD would make such a novice mistake on their zonefile.
Re: (Score:3, Insightful)
You expect them to be absolutely perfect all the time no matter what, forever and ever? /That's/ unrealistic.
Re: (Score:3, Informative)
Incorrect. The zone file is hosted by Autonomica AB (who own the servers that are authoritative for the "se" domain according to the root servers).
If you were talking about a change to the NS records, you'd - I assume - be correct - Verisign operates a.root-servers.net (which I assume is the root)
Re:No big deal (Score:5, Funny)
The downtime lasted 30 minutes, and most domains were probably cached by nameservers anyway.
I didn't notice the DNS freak out, but I did notice the internet's smug meter had dropped about 30%.
Re:No big deal (Score:5, Funny)
but I did notice the internet's smug meter had dropped about 30%.
Norwegian detected.
Re:No big deal (Score:5, Insightful)
DNS is very simple, but it's just as prone to human error as anything else. If you're responsible for the records of a large number of domains (like, say, an entire country), you probably ought to take some time to develop proper testing and change control procedures before you fiddle with it. It sounds like these guys didn't take it seriously enough and got burned. I hope they'll learn their lesson from this and change their procedures.
Re:No big deal (Score:5, Funny)
DNS is very simple, but it's just as prone to human error as anything else.
Are you kidding? I've been programming DNS for a long time, and if theirs one thing I learned, its that programmers like me don't make errors.
Re: (Score:3, Insightful)
I wish browsers would store the IP address of the page as well as the domain name in bookmarks. That way if the DNS server goes down you could still get to the site. Of course, the primary lookup should still be the domain name, since a site can have its address changed; the browser would only look at the IP if the DNS lookup failed.
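A minimal sketch of that fallback order in Python, assuming a hypothetical `resolve_with_fallback` helper and a stored `last_known_ip` (neither is a real browser API). The live lookup always wins; the stored address is used only when the lookup fails.

```python
import socket


def resolve_with_fallback(hostname, last_known_ip, resolver=socket.gethostbyname):
    """Return (ip, source): the live DNS answer, or the stored IP on failure."""
    try:
        return resolver(hostname), "dns"
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return last_known_ip, "bookmark"


def broken_dns(hostname):
    # Simulates a resolver that cannot answer, as during the outage.
    raise socket.gaierror("name resolution failed")


# 192.0.2.10 and 198.51.100.7 are documentation addresses, used as placeholders.
print(resolve_with_fallback("example.se", "192.0.2.10", resolver=broken_dns))
print(resolve_with_fallback("example.se", "192.0.2.10", resolver=lambda h: "198.51.100.7"))
```

One caveat: a stored IP alone would not be enough for sites on shared hosting, since the server still needs the hostname in the request to pick the right site.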
Re:unless you are swedish (Score:4, Insightful)
its "no big deal" until you need to know something off the internet right now, high stakes
I need to know what a fourteen year old thinks about copyright law and I need to know it NOW [smbc-comics.com] !
Re:unless you are swedish (Score:4, Insightful)
Anything sufficiently "high stakes" shouldn't rely on an unreliable medium.
Re: (Score:3, Funny)
If a packet gets through, great. If not, well, it's not the end of the world.
Sounds like a lot of cities' approaches to freeway systems/traffic control.
Re: (Score:3, Funny)
Cache your porn, folks. Just sayin'.
There goes my favorite web site ! (Score:3, Funny)
Goat.se
Re: (Score:3, Funny)
Goat.se
Huh... that's interesting. I've never heard of that one before... I think, though, that based on your recommendation I'll share the link with the rest of the office. I've seen a lot of your posts here in Slashdot, Anonymous Coward, and all the ones I've seen have been pretty highly rated, so I'm guessing you wouldn't link me to a website that wasn't interesting.
Re: (Score:3, Funny)
Goat.se
Arrgh... the horror... http://goat.se/cx [goat.se] You'll want to claw your eyes out!
change control / management, anyone? (Score:5, Insightful)
I seriously hope someone is fired or loses a contract over this. Where was the validation, change control, etc? I would expect that at the TLD level, a change to a configuration file would have to be inspected by someone AND run through some syntax-checking scripts...
As for the person who was modded up for saying "hey, no big deal, fixed in 30 minutes!", not quite. DNS servers (and individual computers!) cache negative results. Anything anyone did a query on during those 30 minutes will be negatively cached by their system and their local DNS server. Granted, a whole lot of local Swedish ISPs and network providers have probably flushed their DNS server caches, but it's still going to seriously impact traffic to many, many sites, especially for everyone outside Sweden.
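To make the negative-caching point concrete, here is a minimal Python sketch of a resolver-side cache that remembers failed lookups. The class name `NegativeCache`, the injectable `clock`, and the one-hour TTL are illustrative assumptions. In real DNS the negative TTL is derived from the zone's SOA record (RFC 2308), and nothing here reflects the actual .se values.

```python
import time


class NegativeCache:
    """Remembers names that failed to resolve until a negative TTL expires."""

    def __init__(self, negative_ttl, clock=time.monotonic):
        self.negative_ttl = negative_ttl
        self.clock = clock
        self._expires = {}  # lower-cased name -> expiry timestamp

    def remember_failure(self, name):
        self._expires[name.lower()] = self.clock() + self.negative_ttl

    def is_known_bad(self, name):
        key = name.lower()
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[key]  # entry expired, ask upstream again
            return False
        return True


# Simulated timeline in seconds, using a fake clock instead of real time.
now = [0.0]
cache = NegativeCache(negative_ttl=3600, clock=lambda: now[0])

cache.remember_failure("example.se")      # lookup fails during the outage
now[0] = 1800                             # 30 minutes later the zone is fixed...
still_failing = cache.is_known_bad("example.se")   # ...but the cache still says no
now[0] = 3600                             # only once the TTL runs out
recovered = not cache.is_known_bad("example.se")   # does the name get retried
```

Under these assumptions, a name queried at the start of a 30-minute outage stays unreachable for the full hour unless someone flushes the cache.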
Re: (Score:2)
It really isn't a big deal. The mistake was made, the world has the opportunity to learn from it and the economic impact was probably small but scalable enough to take seriously.
Now if it happened again I'd hope action were taken... don't be so vengeful, SuperBanana!
Re: (Score:3, Insightful)
I'll go one better and say we should try him in a military tribunal and sentence him to hard time in ADX. That will send the world a message - NO MISTAKES OR ELSE.
Get real man, this is a human error. Your struggle for perfection baffles my monkey brain.
Re:change control / management, anyone? (Score:4, Funny)
I seriously hope someone is fired or loses a contract over this.
You'll be happy to know that the person responsible has been found. The person in question was described as having unusual bushy eyebrows and speaking in a thick Swedish accent. His last comment about the incident, before being dragged away, was "bork bork bork".
Re:change control / management, anyone? (Score:5, Insightful)
As a DNS admin myself, touching high value zones, let me tell you, missing a stupid dot happens all the time. All the change control in the world doesn't help when you just don't type one little period. Even more helpfully, most tools won't notice and the zone will pass a configuration check because missing the trailing "." is syntactically correct.
Let me add as well that the "change management" that you want is just fantastic... no making changes during core hours. When you run a 24/7 business, non-core hours means something like 2am. At 2am, I, and most mammals, are not at their mental best, so missing a single dot isn't horribly hard.
The only thing I'd suggest they do is use an offline test box for zones, then promote that change to prod. Then, you can load all the mistakes you want, do your digs, and if stuff works, THEN you move it to prod. I never ever make changes on production servers, they are done offline, tested, then put into prod with scripts. It makes it a lot harder for missing periods to make it into production.
Finally, this is a good reason why negative caching should have low TTLs. If you run a DNS server that can't handle low neg-caching TTLs, it's time to upgrade from a 386.
Cheers.
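For anyone wondering why a missing dot is syntactically fine: in the RFC 1035 master file format a name ending in a dot is absolute, and any other name is relative and gets the zone origin appended. A minimal Python sketch of that rule (the `qualify` helper and the example names are made up for illustration):

```python
def qualify(name, origin):
    """Expand a zone-file name the way RFC 1035 master files are read."""
    if name == "@":
        return origin                 # "@" stands for the current origin
    if name.endswith("."):
        return name                   # trailing dot: already absolute
    return f"{name}.{origin}"         # no trailing dot: origin is appended


origin = "se."
print(qualify("ns1.example.se.", origin))  # ns1.example.se.
print(qualify("ns1.example.se", origin))   # ns1.example.se.se.
```

Both inputs are valid, which is why a plain syntax check passes. Only the second one points at a name server that does not exist.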
Re: (Score:3, Insightful)
Not if the configuration check you wrote checks for the trailing "." anyways. And if it doesn't, you need to rewrite it.
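A sketch of what such a check might look like, in Python. This is a heuristic under loud assumptions, not a real zone parser: it assumes one record per line, treats the last field as the target, does not handle `;` inside quoted strings or multi-line records, and the name `suspicious_targets` is invented here. It flags targets that look fully qualified (they end in the zone's own suffix) but lack the trailing dot.

```python
NAME_TARGET_TYPES = {"NS", "CNAME", "MX", "PTR"}


def suspicious_targets(zone_text, origin):
    """Return (line_number, target) pairs that probably miss a trailing dot."""
    suffix = "." + origin.rstrip(".").lower()
    findings = []
    for lineno, raw in enumerate(zone_text.splitlines(), start=1):
        line = raw.split(";", 1)[0]            # drop comments (naive)
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("$"):
            continue                           # blank line or $ORIGIN/$TTL directive
        if not line[0].isspace():
            fields = fields[1:]                # first field is the owner name
        has_name_target = any(f.upper() in NAME_TARGET_TYPES for f in fields[:-1])
        target = fields[-1].lower()
        if has_name_target and not target.endswith(".") and target.endswith(suffix):
            findings.append((lineno, fields[-1]))
    return findings


zone = (
    "example  IN NS ns1.example.se.\n"
    "example  IN NS ns2.example.se\n"
    "www      IN A  192.0.2.1\n"
)
print(suspicious_targets(zone, "se."))  # [(2, 'ns2.example.se')]
```

A relative target such as `ns1` is deliberately left alone, since relative names are legitimate. The check only complains when the name already carries the zone suffix.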
Re:change control / management, anyone? (Score:4, Funny)
Sweden porn?
IKEA instruction manuals?
Re: (Score:2)
Don't you mean "I wrong code" in this context?
Re:change control / management, anyone? (Score:4, Insightful)
Excessive paperwork like 30 min to fill out a change request form to do something like make a 30 second edit to a config file and sighup a daemon is stupid and you'll hear no argument from me on that. Change control per se however, is essential, particularly in a large enterprise. Running part of that kind of infrastructure without change control would be like trying to manage the kernel source tree without cvs (or svn or $REPOS_OF_CHOICE, analogy holds either way.)
The problem is not change control, it's the way it is implemented. Change control methodology is designed by PHBs who haven't actually done the tech work in years, if they ever did. It's then scribbled all over by a "business analyst" who thinks a sigpipe is a plumbing problem, and by the time the guys actually doing the work get hold of it, it has become a nightmare of procedural BS. All you really needed was a way to make sure everything you do to a live production system is documented, and that anything other than emergency break-fix at least got basic testing and a second pair of eyes looking at it before rolling it out.
So I guess it's... (Score:5, Funny)
Re: (Score:2)
I'm chopping up the zone files if that's ok with you (tosses random shyte over shoulder)
We'll scoop up all the trailing dots and put them in the stew
BORKBORKBORK!
Ah, the joy of automated oopsies. (Score:2)
somewhere in sweden: (Score:3, Funny)
DNS is the problem (Score:5, Interesting)
Re: (Score:2)
And your robust solution to a scalable global directory of name-to-ip address mapping is... ?
Re:DNS is the problem (Score:5, Funny)
Regedit32.exe
Re:DNS is the problem (Score:4, Insightful)
Except the Pakistan affair was about the BGP routing protocol. I agree the file format is nutty, though.
I can't think of a better alternative to the hierarchical system; perhaps you have a suggestion. A flat namespace would be an administrative impossibility, not to mention the stress it would put on name servers. Increasing the number of TLDs would lessen the impact of a single failure, though.
Re: (Score:3, Insightful)
Pakistan taking out Youtube had absolutely nothing to do with DNS, they wrongly propagated a BGP announcement for the youtube IPs outside of Pakistan, so about 1/3 of the internet routed traffic into their black hole instead of to Youtube. Pretty effective blocking had they kept it internal, but they didn't.
Re: (Score:3, Informative)
Well, in the 1980s when the RFCs were written for zone files (1034/1035) it probably sounded like a perfectly sound way to configure this sort of thing, same with DNS in general (RFCs for which were also written in the 1980s).
If it were invented from scratch today I'm sure it would resemble something like LDAP.
The fact we haven't had more mass DNS failures like this is actually surprising.
Re:DNS is the problem (Score:5, Informative)
Part of the problem with DNS these days, which your post exemplifies, is that from very early on "BIND's implementation of DNS", and "DNS The Protocol" have been mashed together and confused by the RFC authors (who were involved with the BIND implementation and had motive to encourage the world to think only in BIND terms) and basically everyone who ever used DNS in any capacity. Zonefiles are not implicit in DNS address resolution (neither for authoritative servers nor for recursive caches). They really aren't any part of the wire DNS protocol for resolving names. They *are* part of a wire protocol for secondary servers that slave zonefiles from primary servers, but even in that case it's really more a "BIND convention" than a necessity. Ultimately how you transfer a zone's records from a master server to a slave server is up to however those two servers and their administrators agree to do so. You can skip the AXFR protocol that uses zonefiles and instead do something else that works for both of you. Inventing a new method of slaving zone data is easy and doesn't involve much complicated rollout. Some people just rsync zonefiles for instance instead of using AXFR today.
It's really frustrating (believe me, I've done it) when you try to implement a new DNS server daemon from scratch from the RFCs, and you have to wade through this mess of "what's a BIND convention that doesn't matter and what's important to the actual DNS protocol for resolving names on the wire".
Re: (Score:3, Interesting)
From this overview, it is possible to conclude that DNS is a poorly specified protocol, but that would be unfair and untrue. DNS was specified loosely, on purpose. This protocol design is a fine example of what M.A. Padlipsky meant by “descriptive rather than prescriptive” in his 1984 thriller, The Elements of Networking Style (Prentice Hall). Functional interoperability and ease of implementation were the goals of the DNS protocol specification, and from the relative ease with which DNS has grown from its petri dish into a world-devouring monster, it’s clear to me that those goals were met. A stronger document set would have eliminated some of the “gotchas” that DNS implementers face, but the essential and intentional looseness of the specification has to be seen as a strength rather than a weakness.
Re: (Score:3, Interesting)
The file format is so damn brittle, that a single byte can spell disaster.
You know what, so is ELF. Who said you should write zonefiles by hand, let alone without any kind of syntax verification?
Input syntax is never really an issue. You only ever lack the necessary tools or you are unable to use them properly. It can always be hidden behind a precompiler or whatever necessary.
Hmmm... wait, termcap. I stand corrected.
Why MaraDNS uses a special zone file format (Score:2, Interesting)
This is why MaraDNS [maradns.org] (my open-source DNS server) uses a special zone file format.
MaraDNS uses a zone file format that, for the most part, resembles BIND zone files. However, the zone file format has some minor differences so the common "forgot to put a dot at the end of a hostname" and "forgot to update the SOA serial number" problems do not happen; a domain name without a dot at the end is a syntax error in MaraDNS' zone file parser; if you want to end a hostname with the name of the zone in questio
Re: (Score:3, Insightful)
Can MaraDNS handle IPv6 now? Last time I used it I had to ditch it in the end as IPv6 support was lacking.
There's møre to Sweden than .se (Score:5, Funny)
Wi nøt trei a høliday in Sweden this yer?
See the løveli lakes
The wonderful telephøne system
And mani interesting furry animals
Re:There's møre to Sweden than .se (Score:5, Funny)
We apologise for the fault in the previous post. Those responsible have been sacked.
Re: (Score:3, Interesting)
Uh, it would make no difference.
DNS is hierarchical, and has teh caching.
2 independent groups running DNS would strive to make sure they sync with each other quickly - thus all failures would sync quickly too.
The difference between
- the delay of a correct change propagating across the two firms running DNS
- the delay of an incorrect change propagating within a single DNS
would essentially be zero.
No good things could come from what you propose unless it was specifically designed to have a 24
Re: (Score:3, Funny)
It looks like someone messed up the summary. I'm pretty sure it should be:
Peengdum und Netvurk Vurld ere-a repurteeng thet zee SE tld drupped ooffff zee internet yesterdey dooe-a tu a boog in zee screept thet generetes zee SE zune-a feele-a. Zee SE tld hes cluse-a tu oone-a meelliun dumeeens thet ell vent doon dooe-a tu meessing zee treeeling dut in zee SE zune-a feele-a. Sume-a cecheeng nemeserfers mey steell be-a retoorneeng infeleed DNS respunses fur 24 huoors.