Cisco Routers to Blame for Japan Net Outtage 78
An anonymous reader passed us a link to a Network World article filling in the details behind the massive internet outage Japanese web users experienced earlier this week. According to the site, faulty Cisco routers were to blame for the lapse, which left millions of customers without service from late Tuesday evening until early Wednesday morning. "NTT East and NTT West, both group companies of Japanese telecom giant Nippon Telegraph and Telephone (NTT), are in the process of finalizing their decisions on a core router upgrade, according to the report. The routing table rewrite overflowed the routing tables and caused the routers' forwarding process to fail, the CIBC report states."
Dijkstra (Score:5, Funny)
Eggs in one basket (Score:4, Insightful)
Re:Eggs in one basket (Score:5, Insightful)
I don't agree that the blame lies with Cisco, not until I see more evidence. Cisco has some of the most stable operating systems. The command-line interface can sometimes suck, but their stability is remarkable. My guess is that the fault lies with the ISP for not planning network redundancy and not scaling their networks in time. Cisco might look bad in this article, but their track record for shipping an OS with fewer bugs is much better than Microsoft's, Sun's, and the others'.
Re: (Score:2)
Microsoft, sure, but Solaris is pretty reliable.
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
Effectively, it's a two stage lookup - BGP will tell you that your gran
Re: (Score:2)
This is a terrible analogy. It isn't a two stage lookup- it's a single routing table lookup. BGP populates the routing table with routes it learns from external autonomous systems, and an interior routing protocol like OSPF populates the routing table with routes learned from within the autonomous system itself. Where both protocols know of the same route th
Re: (Score:2)
As I interpret TFA, it was a BGP problem - possibly a failover situation that was not handled correctly. Or they (NTT) did something seriously weird with their BGP design.
Re: (Score:2)
Having worked at Cisco, I strongly disagree (Score:3, Interesting)
Having worked at many of the companies that ship network OSes, I'd say Cisco is, IMHO, the worst. The common theme there is to hire lots of low-paid talent rather than focusing on getting the best and the brightest. And it shows.
Re: (Score:3, Insightful)
Cisco has some of the most stable operating systems.
You must be using some Cisco OS I don't know about. I am in the process of upgrading 120 Cisco boxes thanks to that "stable operating system."
Junipers are a different matter. MUCH more stable.
Cisco might look bad in this article, but their track record for shipping an OS with fewer bugs is much better than Microsoft's, Sun's, and the others'.
Riiiiight. Apparently you have never had to deal with Cisco's inability to produce an IOS release that doesn't have a BGP bug in it. Or an MPLS bug. Or... well, the list is long.
Re: (Score:1)
Ah, yeah. I am not sure, but it seems you have never really worked with Cisco gear in a service-provider world...
I just have one word for you: CEF bug.
Re: (Score:2)
I think you need more time with Cisc
Re: (Score:2, Insightful)
PEBCAK
Re: (Score:3, Funny)
Apparently... (Score:4, Funny)
JunOS (Score:5, Funny)
"A Juniper router is like my girlfriend.. It will never go down on me."
Re: (Score:2)
Yeesh - sorry man!
Re: (Score:2)
Juniper M7i with 1 onboard gigabit port and a quad gigabit card (oversubscribed 4:1) and 16 Mpps forwarding speed: $52k.
Cisco 7604 with Sup32 and 9 gigabit ports and 15 Mpps forwarding speed: $18k.
Juniper definitely makes a better router in many cases - but does it justify paying three times as much?
I love both systems - and I would run Juniper everywhere if I could (for no other reason than the single JunOS software image) - but they are just price-prohibitive sometimes.
And
girlfriend? (Score:1)
CEF and the routers. (Score:5, Informative)
Usually what happens is that the router doesn't have enough memory to store all the CEF (Cisco Express Forwarding) information, causing the router to stop forwarding packets for certain subnets. I've seen it happen often enough to know. While Cisco is right that the problem is caused by a lack of memory, I think the router shouldn't stop forwarding packets altogether - it should fall back to not using CEF if the table gets out of hand.
While I think Cisco is not completely to blame (badly scaled networks, not upgrading routers in time), it sucks that this will hit them. There are better solutions out there, but I have to say that Cisco's support is quite good and they're pretty fast. I work in an all-Cisco environment (for the routers) and they've been fast whenever we needed a router analyzed.
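To make the failure mode concrete: the parent describes a forwarding table that fills up and then silently stops covering some subnets, where the saner behaviour would be to punt the overflow to the slow path. Below is a minimal Python sketch of that idea - a toy model, not Cisco's actual CEF implementation; the class, capacity, and addresses are all made up for illustration.

```python
import ipaddress

class ToyForwardingTable:
    """Toy FIB with a hard capacity limit, loosely analogous to a hardware
    forwarding table (CEF/TCAM) that can run out of space. Not real Cisco
    behaviour - just the fallback the parent comment argues for."""

    def __init__(self, capacity):
        self.capacity = capacity   # max prefixes the "hardware" can hold
        self.hw_routes = {}        # prefixes programmed in hardware
        self.sw_routes = {}        # overflow handled on the CPU (slow path)

    def install(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        if len(self.hw_routes) < self.capacity:
            self.hw_routes[net] = next_hop
        else:
            # Degrade to software switching instead of dropping traffic
            # for the prefixes that no longer fit.
            self.sw_routes[net] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        # Longest-prefix match across both tables.
        candidates = [(net, nh, "hw") for net, nh in self.hw_routes.items() if addr in net]
        candidates += [(net, nh, "sw") for net, nh in self.sw_routes.items() if addr in net]
        if not candidates:
            return None
        net, nh, path = max(candidates, key=lambda c: c[0].prefixlen)
        return nh, path

fib = ToyForwardingTable(capacity=2)
fib.install("10.0.0.0/8", "192.0.2.1")
fib.install("10.1.0.0/16", "192.0.2.2")
fib.install("10.1.1.0/24", "192.0.2.3")   # doesn't fit, goes to the slow path
print(fib.lookup("10.1.1.5"))             # ('192.0.2.3', 'sw') - slower, but not dropped
```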
Re: (Score:1, Interesting)
I have a Cisco with a complex config with tunnels and there really is no way it will work reliably with CEF enabled.
Re: (Score:1)
Re: (Score:1)
Underspec routers (Score:5, Interesting)
Re: (Score:2)
Whose middle management? Cisco's?
Their routers have been perpetually running out of memory for reasonable routing tables since at least 1992.
Properly Filtering Prefixes (Score:5, Informative)
OK... That says to me that their routing tables got really big and the routers ran out of memory... or they had a prefix limit set, and it kept dropping the BGP session(s)...
If either of the above is true, properly designed filtering of the prefixes they send to and receive from their BGP neighbors would have prevented this outage. It sounds like someone may have been incompetent, and they are trying to pawn off "ownership" of this outage on Cisco.
Either that, or it's a major IOS bug and the article's author just sucks and didn't mention it...
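For readers who haven't run BGP: a "prefix limit" tears the session down when a neighbour announces more routes than agreed, and inbound filtering discards announcements outside an agreed prefix list. A rough Python sketch of both ideas - illustrative only, not real router configuration; the class names, limits, and prefixes are invented.

```python
import ipaddress

class PrefixLimitExceeded(Exception):
    pass

class ToyBgpNeighbor:
    """Very rough model of a BGP neighbour with a maximum-prefix limit and an
    inbound prefix filter. Class names, limits, and prefixes are invented."""

    def __init__(self, max_prefixes, allowed_prefixes):
        self.max_prefixes = max_prefixes
        # Only accept routes that fall inside these agreed aggregates.
        self.allowed = [ipaddress.ip_network(p) for p in allowed_prefixes]
        self.rib = set()

    def receive(self, prefix):
        net = ipaddress.ip_network(prefix)
        # Inbound filtering: discard anything outside the agreed prefix list.
        if not any(net.subnet_of(allowed) for allowed in self.allowed):
            return
        self.rib.add(net)
        # Maximum-prefix protection: reset the session if the neighbour
        # floods us with more routes than we planned capacity for.
        if len(self.rib) > self.max_prefixes:
            self.rib.clear()
            raise PrefixLimitExceeded("session reset: prefix limit exceeded")

peer = ToyBgpNeighbor(max_prefixes=3, allowed_prefixes=["203.0.113.0/24"])
peer.receive("203.0.113.0/25")    # accepted
peer.receive("198.51.100.0/24")   # filtered out, never counted
```

Either mechanism keeps the table bounded: the limit protects the box, and the filter keeps the junk out in the first place.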
Should have used Junipers (Score:3, Insightful)
To be fair, Cisco is untouchable in the enterprise class with their CPEs...
Re:Should have used Junipers (Score:5, Interesting)
Re: (Score:1, Insightful)
one train of code to follow and so on.
Juniper has different code loads for M/T series vs. J-series vs. E-series. The nice thing is that there's only 1 load per series.
Unlike Cisco 6500/7600 Sup720-3B/3BXL with hardware limitation of 256K and 512K IPv4 routes respectively
The Cisco 6500 is a Layer 3 switch, not a router. The 7600 was designed by Cisco to be an edge aggregation router, not
Re: (Score:1)
The NW story is too vague to rule out human error (Score:5, Informative)
Re:The NW story is too vague to rule out human err (Score:1, Informative)
*IF* (and I don't know for sure that this is the case, so it's a big IF) the project this person has been working on relates to the problem in this story, then I would say that your guess of 'human error' is likely a very large part of what happened.
I say this because my friend has filled me in on some of the stories relatin
nice work blaming cisco (Score:4, Insightful)
Also.. (Score:4, Insightful)
TCAM exhaustion (Score:5, Informative)
If NTT had been following Cisco mailing lists, or keeping up to date on what their salesmen had been telling them for several years, they would have seen this problem looming and changed their routing structure, or at least upgraded the processors to something with slightly more TCAM. The size of the internet is not going to stop growing just because many companies chose to go with underpowered Cisco kit. The internet will continue to grow by 12,000 to 17,000 routes per month, accelerating over the next few years as IPv4 space becomes exhausted and de-aggregation becomes the norm.
This is one of my long-standing grudges about Cisco design. They always design their core routers to be just slightly ahead of the size of the internet, forcing people to upgrade within a few years. Planned obsolescence is the term. Even their new CRS-1 platform will fail over to CPU near 512,000 routes (0x80000), sometime around the end of 2008 to mid 2009. By then, they'll probably have an expensive upgrade path for customers that will hold for just another year or two.
It's not just Cisco kit that is going to have problems over the next few months. By the end of June the internet will be at 256,000 routes (really 262,144, or 0x40000), which will be a problem for some other manufacturers. Some are starting to fail at 0x3C000 (about 246,000) routes; some already failed at 0x30000 last year.
On the plus side, the OpenBGPD crowd doesn't suffer from this, since their code is all CPU-switched (but uses very clever and efficiently coded routing tables), so their routing table is limited only by memory. But an OpenBGPD machine will never have the raw efficiency of a VLSI-based hardware solution.
A quick look at my local looking glass shows 233,979 routes on the internet this morning.
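Taking the figures in this comment at face value (233,979 routes in the table, 12,000 to 17,000 new routes per month), a quick back-of-the-envelope projection of when the limits mentioned above would be crossed looks like this; it is only as good as those assumed numbers.

```python
# Back-of-the-envelope: months until the global table crosses a given limit,
# using the route count and growth rates quoted above.
current_routes = 233_979                 # this morning's looking-glass figure
limits = {"0x3C000": 0x3C000, "0x40000": 0x40000, "0x80000": 0x80000}

for rate in (12_000, 17_000):            # low and high monthly growth estimates
    for name, limit in limits.items():
        months = (limit - current_routes) / rate
        print(f"at {rate}/month: {name} ({limit:,} routes) in ~{months:.1f} months")
```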
the AC
Re: (Score:1)
Re:TCAM exhaustion (Score:5, Informative)
It appears to be four separate instances of 512K routes; the total is reached by MPLS customers shoving full BGP tables into their mesh. With more than 8 MPLS customers doing screwy things today, the box starts hitting its CPUs. I haven't received a denial from the CRS-1 guys, just some hand-waving and a promise to look into it. Implications that a better config would help haven't actually produced an example of what to do, and the XR code is just different enough to hide underlying architecture deficiencies. The other problem is that every CRS-1 seems to be put into production before engineering has time to play with them and learn their tricks. Given time, all kinds of clever designs for XR code will spread around, just as there are tricks of the trade that the most experienced IOS-based engineers grok.
It should be enough to 2015 or more.
And 640K should be enough for everyone. Seriously, I keep running across 2500s still doing their thing, though not as core BGP routers. So the CRS-1 platforms may very well be running tucked into edges in 2015. Bean counters love kit that has amortised many times over.
the AC
Re: (Score:2, Informative)
Regarding routing table growth, hopefully IPv6 might stifle that a bit, as we're going to be running out of IPv4 space in the next 3-5 years and IPv6 space is allocated in much larger blocks, requiring fewer routes.
Re:TCAM exhaustion (Score:4, Interesting)
Not one of MY designs, but you are right about the mistake part. I know of a carrier with CRS-1s struggling with a poor design coupled with an out of control sales force that will not ever say "NO!" to a customer doing bad things to their MPLS service. That's the origin of the idea of a maximum of four instances of 512K routes in 4 separate TCAMs per chassis (or per line card, or per virtual machine, or something). Not really my job any more, so I learn this over beers next to the data centre and extend my sympathies to those stuck in the Cisco world.
hopefully IPv6 might stifle that a bit
Well, the IPv6 table is ~850 routes right now, growing by 10 to 20 new routes per month. Just like the early days of the internet as BGP rolled out. Now I can toss out the obligatory "You kids get off my LAN".
Problems are already starting to be seen by the RIRs, where speculative companies have started grabbing IPv4 allocations with no intention of using them, betting on a market for buying and selling prefixes and forcing the RIRs out of business. It's exactly what happened to the DNS market when it became apparent that second-level domains could be rented out for yearly fees at a large profit.
If companies start buying and selling prefixes in an unregulated free-market frenzy, aggregation will become a fond memory, and you can expect every router to need several gigabytes of memory to hold the 2 million+ routes of the old IPv4 internet. At RIPE meetings there is hope that this is a worst-case scenario, but it seems to be a business plan for some less altruistic people at ICANN.
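For a sense of scale on the "several gigabytes" claim, here is a crude back-of-the-envelope calculation; the per-route sizes are loose assumptions (full BGP RIB entries carry path attributes and vary a lot), not measured figures.

```python
# Crude memory estimate for a fully de-aggregated IPv4 table.
routes = 2_000_000                          # the "2 million+ routes" above
for bytes_per_route in (512, 1024, 2048):   # assumed per-entry sizes, attributes included
    print(f"{bytes_per_route} B/route -> ~{routes * bytes_per_route / 2**30:.1f} GiB")
```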
the AC
Re: (Score:3, Insightful)
Seriously though - would you try to run a datacentre on a home router from NetGear? If I did, and the network fell over in a fiery mass of routing tables, I wouldn't say NetGear was to blame for building a bad router. I'd blame the network architect who thought they could shove hundreds of servers through a 5-port-with-wifi device.
Exactly (Score:2)
The problem was that the internet had grown beyond the capacity of their core routers, hence the core router upgrade that was "in progress". The headline should actually read:
OLD Cisco Routers to Blame for Japan Net Outage (with only one 't' in "Outage", just as in TFA!)
Hey folks, don't stop CIDR'ing routes just because there seems to be enough routing table space "right now"!
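For anyone fuzzy on what CIDR'ing buys: contiguous prefixes can be collapsed into one covering supernet before they are announced, which is exactly what keeps the global table in check. A tiny illustration with Python's standard ipaddress module (the prefixes are arbitrary documentation addresses):

```python
import ipaddress

# Four contiguous /24s that could each be announced separately...
routes = [ipaddress.ip_network(p) for p in
          ("192.0.2.0/24", "192.0.3.0/24", "192.0.0.0/24", "192.0.1.0/24")]

# ...or collapsed into a single covering supernet before export.
print(list(ipaddress.collapse_addresses(routes)))
# [IPv4Network('192.0.0.0/22')] - one route in the global table instead of four
```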
Re: (Score:1)
Failure to Invest to Blame for Japan Net Outage
Or, more to the point:
Management Idiocy To Blame for Japan Net Outage
Evidently, Japan has its share of PHBs too.
Re: (Score:3, Funny)
It sounds like what we need is legislation to enforce some hard limits on the growth of Internet routing tables in order to avoid these kinds of DoS attacks in the future. If we lobby Congress now we can hopefully a
Re: (Score:3, Insightful)
250k is lame, I just tested 1.1m on a Juniper (Score:2, Informative)
This is no shock: Juniper's first innovation was the use of high-speed RAM rather than TCAM for route lookup, so route-table scaling has never been a problem for them...
Marketing on the other hand... geesh those cartoons still give me nightmares.
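Purely to illustrate why a RAM-based lookup scales with memory rather than with a fixed TCAM size: longest-prefix match can be done in an ordinary trie walked out of DRAM, so capacity is bounded only by how much memory you fit. A deliberately naive Python sketch of the principle - not Juniper's actual data structure:

```python
import ipaddress

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None

class BinaryTrie:
    """Naive one-bit-at-a-time trie for IPv4 longest-prefix match. Real boxes
    use compressed tries and wide memories, but the principle is the same:
    lookups walk RAM, so table size is bounded by memory, not by TCAM depth."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address):
        bits = int(ipaddress.ip_address(address))
        node, best = self.root, None
        for i in range(32):
            if node.next_hop is not None:
                best = node.next_hop              # longest match seen so far
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                return best
        return node.next_hop or best

trie = BinaryTrie()
trie.insert("10.0.0.0/8", "A")
trie.insert("10.1.0.0/16", "B")
print(trie.lookup("10.1.2.3"))   # B - the more specific route wins
```

The point is only structural: adding routes here costs ordinary memory, whereas a TCAM has a fixed number of slots.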
Re: (Score:3, Informative)
As others have already noted, the 6500/7600 is a switch with limited routing capabilities. You use it as a core router at your own risk (and peril).
Re: (Score:1, Interesting)
Re: (Score:2)
If you have a design where a 6500 or 7600 isn't doing core routing, somewhere out on the edge, just buy the chassis and line cards from Cisco, and pick up one of the TCAM-poor routing engines for less than 5% of GPL (global price list).
Juniper has
Re: (Score:2)
The thing is, 4GB would be enough to handle routing even if every single IP address was assigned randomly (that is, if everyone was allocated nothing but /32s, and it was done such that no aggregation at all was possible). 4GB is not THAT huge these days. There's not really a great reason (other than planned obsolescence) not to be fully future-proof.
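The arithmetic behind that 4GB figure, under the most brute-force assumption possible - a flat array with one byte of next-hop index per /32, which is a deliberately naive model rather than how real forwarding hardware works:

```python
# The most brute-force table imaginable: one byte of next-hop index per /32.
entries = 2 ** 32              # 4,294,967,296 possible IPv4 /32s
bytes_per_entry = 1            # naive assumption: a single-byte next-hop index
print(entries * bytes_per_entry / 2 ** 30, "GiB")   # 4.0 GiB, the figure above
```

Anything larger than a one-byte index per entry pushes the total past 4GB, so under that assumption the figure is really a lower bound.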
mmm... yeah (Score:2)
Zawnk, eye neede yore hellpe... (Score:1)
operators to blame for japan net outtage .. (Score:2)
"At this time, Cisco and NTT have not determined the specific cause of the problem"
why all full routes? (Score:1)
Don't blame Cisco... blame the Monitoring (Score:5, Insightful)
1. Local status (Am I alive)
2. Path (can I get from me to you, what is the quality of the path?)
3. End point (are you there?)
If at any time you let the number of paths and interconnects overwhelm you, get a new job. You've lost control. Draw pictures of the network. When you have an outage, start looking immediately at what you have connectivity with and what you don't. Large data centers can get complex in their interconnects. Divide it up into "blocks", verify a block, and move on.
The biggest problem in a situation like this is that I'm willing to bet the techs were wasting their time trying to figure out why the network went down. Who cares why? You need to quickly assess what is down, what you can do, and what you can't do. You need to know what is normal and what is not. If you don't, a situation like this can happen.
The worst thing that can happen is if the network is divided into "territories." Usually in a case like this, people spend more time trying to blame the other guy than they do finding the cause of the problem. Finally, design. Somewhere along the line some pencil pusher decided that a single point of failure was economically feasible. The techs were willing to sheep right along, and the senior admin played politics and didn't rock the boat.
In the end, the techs blew it. The after-action report and follow-up will tell the final tale.
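A hedged sketch of the three checks listed at the top of this comment, driven from Python with ordinary shell tools; the target address is a placeholder, and it assumes ping and traceroute exist on the monitoring box.

```python
import subprocess

def local_status():
    """1. Local status: am I alive? Here, just check that loopback answers."""
    return subprocess.run(["ping", "-c", "1", "127.0.0.1"],
                          capture_output=True).returncode == 0

def path_quality(target):
    """2. Path: can I get from me to you, and what does the path look like?"""
    result = subprocess.run(["traceroute", "-n", target],
                            capture_output=True, text=True)
    return result.stdout              # hop list for a human (or a parser) to judge

def end_point(target):
    """3. End point: are you there?"""
    return subprocess.run(["ping", "-c", "1", target],
                          capture_output=True).returncode == 0

target = "192.0.2.10"                 # placeholder far-end address
print("local ok:", local_status())
print("endpoint ok:", end_point(target))
print(path_quality(target))
```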
Re: (Score:1)