Intel Networking Bug

Intel Gigabit NIC Packet of Death

Posted by Soulskill
from the how-to-break-things dept.
An anonymous reader sends this quote from a blog post about a very odd technical issue and some clever debugging: "Packets of death. I started calling them that because that’s exactly what they are. ... This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network. Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead. Nothing but a power cycle would bring it back. ... While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE. ... With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death — and kill client machines behind firewalls!"
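The debugging pattern in the quote (stop each capture when the interface drops, then compare the final packets) can be sketched roughly as below. This is only an illustration, not the blog author's tooling: captures are modeled as plain lists of raw frame bytes, and `SUSPECT_OFFSET` and `SUSPECT_VALUE` are hypothetical placeholders, since the summary does not give the real offset or byte value.

```python
from collections import Counter
from typing import List, Optional, Sequence

# Hypothetical placeholders: the summary does not state the real offset or value.
SUSPECT_OFFSET = 16
SUSPECT_VALUE = 0x41


def last_frame_lengths(captures: Sequence[List[bytes]]) -> Counter:
    """Count the length of the final frame in each non-empty capture."""
    return Counter(len(frames[-1]) for frames in captures if frames)


def common_last_length(captures: Sequence[List[bytes]]) -> Optional[int]:
    """Return the last-frame length shared by every capture, or None."""
    counts = last_frame_lengths(captures)
    if len(counts) == 1:
        return next(iter(counts))
    return None


def has_byte_at(frame: bytes, offset: int, value: int) -> bool:
    """True if the frame is long enough and holds `value` at `offset`."""
    return offset < len(frame) and frame[offset] == value


if __name__ == "__main__":
    # Two made-up captures, each ending in a 100-byte frame.
    suspect = b"\x00" * 16 + b"A" + b"\x00" * 83
    captures = [
        [b"\x00" * 60, suspect],
        [b"\x00" * 90, b"\x00" * 42, suspect],
    ]
    print(common_last_length(captures))
    print(has_byte_at(suspect, SUSPECT_OFFSET, SUSPECT_VALUE))
```

If every capture ends on a frame of the same length, that frame is a candidate for closer byte-level comparison, which is how a single offending byte position could then be narrowed down.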

  • by LordLimecat (1103839) on Wednesday February 06, 2013 @04:20PM (#42813673)

    One kind of thing makes it a zillion times easier to recognize a problem when it crops up, and makes it so you only ever have to troubleshoot an issue once.

    How much more awful would it be if something similar happened next week on more computers, and he had to troubleshoot it all over again, not even knowing whether the machines had NICs in common?

    "Everything blew up" is a problem. "Everything blew up, I don't know why, and it will take 3 weeks to find a solution" is a huge problem. "Everything blew up AGAIN, and it will take another 3 weeks because our environment is heterogeneous" means you are out of a job.

  • by datapharmer (1099455) on Wednesday February 06, 2013 @04:35PM (#42813815) Homepage
    I'm guessing you didn't buy them with Linux on them... or prove it was a hardware issue. They have no reason to support something they didn't ship. Sure, the support varies, but their pro server support is actually decent if you get the right person on the other end. I had a case where teaming 2 NICs caused Windows to eat crap and die inexplicably, and getting it back up was quite the ordeal. I couldn't even keep it stable long enough to unteam or remove the drivers (even in safe mode). Fortunately they did have documentation on the problem: a Broadcom driver had a problem with a particular firmware set when teaming was used. I managed to flash the firmware update from a USB flash drive, which got me to the point where I could at least boot into safe mode, delete the drivers, get a working older version of the driver from Dell's site up and running, and reconfigure teaming. This was on a PowerEdge R610, btw. I feel bad for the poor sap who ran into this first, and having Dell support saved me unnecessary downtime, especially since there is no mention of this problem anywhere on Broadcom's website. That said, for 99% of the issues I've ever run into, having on-site spares and a good internal KB has been far more effective than paying for Dell's support, but if it is free with the server, why not use it...
  • Nice debugging (Score:5, Interesting)

    by Dishwasha (125561) on Wednesday February 06, 2013 @04:58PM (#42814079)

    I for one definitely appreciate the diligence of Kristian Kielhofner. Many years ago I was supporting a medium-sized hospital whose flat network kept having intermittent issues (and we all know intermittent issues are the worst to hunt down and resolve). Fortunately I was on-site that day and at the top of my game, and after doing some Ethereal sleuthing (what Wireshark was called at the time), I happened to discover a NIC that was spitting out bad LLC frames. By doing some port tracking in the switches we were able to isolate which port it was on, which happened to be at their campus across the street. Of all possible systems, the offending NIC was in their PACS. After we pulled the PACS off the network for a while the problem went away, and we had to get the vendor to replace the hardware.

  • by Zemplar (764598) on Wednesday February 06, 2013 @05:22PM (#42814349) Journal
    I'm glad Mr. Kielhofner contacted Intel about this issue and had Intel confirm the bug.

    Some years ago I had been diagnosing similar server NIC issues, and after many hours digging, Intel was able to determine the fault was due to the four-port server NIC being counterfeit. Damn good looking counterfeit part! I couldn't tell the difference between a real Intel NIC and counterfeit in front of me. Only with Intel's document specifying the very minor outward differences between a real and known counterfeit could I tell them apart.

    Intel NIC debugging step #1 = verify it's a real Intel NIC!
  • by Anonymous Coward on Wednesday February 06, 2013 @08:19PM (#42816117)

    A number of years ago I discovered that you can take down many routers and Windows / Linux hosts by sending an ARP response that says "IP 0.0.0.0 is at MAC FF:FF:FF:FF:FF:FF". When you direct this packet to the access point in a wireless network, it makes the SSID broadcast disappear and the whole device go down. I never posted this until now; I wonder if it still works on modern devices.
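For anyone who wants to watch for the ARP response described above instead of sending one, here is a minimal detection sketch. It assumes "IP 0.0.0.0 is at MAC FF:FF:FF:FF:FF:FF" refers to the sender fields of an Ethernet/IPv4 ARP reply, and it takes the 28-byte ARP payload with the Ethernet header already stripped. The function name is made up for illustration, and the commenter's claim itself is unverified.

```python
import struct

# Ethernet/IPv4 ARP payload: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes, network byte order).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REPLY = 2
BROADCAST_MAC = b"\xff" * 6
ZERO_IP = b"\x00" * 4


def is_suspicious_arp_reply(arp_payload: bytes) -> bool:
    """Flag an ARP reply whose sender claims 0.0.0.0 is at the broadcast MAC."""
    size = struct.calcsize(ARP_FORMAT)
    if len(arp_payload) < size:
        return False  # Too short to be a full Ethernet/IPv4 ARP payload.
    htype, ptype, hlen, plen, oper, sha, spa, _tha, _tpa = struct.unpack(
        ARP_FORMAT, arp_payload[:size]
    )
    # Only consider Ethernet (1) carrying IPv4 (0x0800) addresses.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False
    return oper == ARP_REPLY and sha == BROADCAST_MAC and spa == ZERO_IP
```

A check like this could sit behind a packet capture loop; how the frames are captured is left out here on purpose.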
