Intel Gigabit NIC Packet of Death

Posted by Soulskill on Wednesday February 06, 2013 @05:03PM from the how-to-break-things dept.

An anonymous reader sends this quote from a blog post about a very odd technical issue and some clever debugging: "Packets of death. I started calling them that because that’s exactly what they are. ... This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network. Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead. Nothing but a power cycle would bring it back. ... While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE. ... With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death — and kill client machines behind firewalls!"
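The debugging step described in the quote (stop each capture when the interface drops, then compare the last packet of every capture) could be sketched roughly as follows. This is only an illustration: the function names, the `min_share` threshold, and the sample frames are made up here, and the blog author's actual tooling is not shown in the quote.

```python
from collections import Counter

def last_frame_lengths(captures):
    """Count the length of the final frame in each non-empty capture."""
    return Counter(len(frames[-1]) for frames in captures if frames)

def suspicious_length(captures, min_share=0.9):
    """Return the final-frame length shared by at least min_share of the
    captures, or None if no single length dominates."""
    counts = last_frame_lengths(captures)
    if not counts:
        return None
    length, hits = counts.most_common(1)[0]
    return length if hits / sum(counts.values()) >= min_share else None

# Made-up data: each capture is the list of raw frames seen before the link dropped.
captures = [
    [b"\x00" * 60, b"\x00" * 1200],
    [b"\x00" * 90, b"\x00" * 300, b"\x00" * 1200],
    [b"\x00" * 1200],
]
print(suspicious_length(captures))
```

If the same length keeps showing up as the last frame before every failure, that frame is worth dissecting byte by byte, which is what the quote says eventually pointed at one phone manufacturer's INVITE.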

This discussion has been archived. No new comments can be posted.

Re:This is why the equipment should be heterogeneo (Score:3, Informative)

by Anonymous Coward writes: on Wednesday February 06, 2013 @05:17PM (#42813625)

Agreed. OP clearly has no experience managing large server installations.

Re:Three Strikes... I'll Pass (Score:5, Informative)

by v1 ( 525388 ) writes: on Wednesday February 06, 2013 @05:23PM (#42813695) Homepage Journal

Oh, I think this is at least slightly interesting. I remember the "ping of death" (and pissing off a few Windows heads in my sights) back in the day.
This is basically a DoS attack on hardware. The fact that it can get through someone's firewall makes it a bit more effective. Having your ethernet port check out every five minutes (requiring a reboot to fix) just because someone down the hall (or in Bulgaria) wants to be an ass is definitely annoying and something I'd like to know is a possibility when troubleshooting screwy network problems.
I just got done swapping out a gigabit switch that was being wonky and slow for no obvious reason. I don't mind so much when hardware keels over and dies, but when it throws symptoms that don't immediately suggest where the problem is, those are the real time wasters. And we've come to rely on hardware generally being more reliable than software. So if my ethernet was going out when I VOIP'ed, I might have spent (wasted) a lot of my time troubleshooting the VOIP software.

Re:Ouch (Score:4, Informative)

by chevelleSS ( 594683 ) writes: on Wednesday February 06, 2013 @05:42PM (#42813891) Homepage

If you read further down in the article, you would know that they worked with Intel and were given a patch to fix this issue. Brandon

Re:Ouch (Score:5, Informative)

by el borak ( 263323 ) writes: on Wednesday February 06, 2013 @05:48PM (#42813961)

"What model of Ethernet controller was tested? What firmware version are they using? Has the problem been reported to Intel?"

I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.
- Ethernet controller? 82574L
- Reported? Yes, and Intel supplied an EEPROM fix.

Re:This is why the equipment should be heterogeneo (Score:5, Informative)

by eksith ( 2776419 ) writes: on Wednesday February 06, 2013 @05:48PM (#42813965) Homepage

There's a good reason a lot of our equipment is slightly older. No, we don't use ancient stuff, but it's not 100% top of the line, made yesterday, either. And that's because each time a new mobo, memory and storage combo that looks like it's worth purchasing comes to market, the first thing we do is run a few sample sets under everything we can throw at it. Usually problems are narrowed down within the first couple of weeks or so, but that's why we have separate people just for testing equipment.
Now admittedly, it's getting harder with this economy so we have some people doing double duty on occasion (I've had to do a bit too when the flu came rolling in), but testing goes on for as long as we think is necessary before the combo goes live. We avoid a lot of the headaches that come with large deployments by keeping changes isolated to maybe 10-15 nodes at a time. It's a slow and steady rollout of mostly similar systems (maybe 3-4 identical) that helps us avoid down time.
We're not Google and we don't pretend to be, but common sense goes a long way to avoiding hiccups like "everything blew up". I think the biggest issue was when Hurricane Sandy hit and we weren't sure if the backup generators would come online (this is a big problem with things that need fuel and oil but stay off for a long time), so we brought in a generator truck for that too, just in case. Again, avoiding one of anything.

Re:This is why the equipment should be heterogeneo (Score:2, Informative)

by Anonymous Coward writes: on Wednesday February 06, 2013 @07:06PM (#42814853)

You will have to update the kernel, though. The Linux e1000 and e1000e drivers have a fuckload of hardware bug workarounds, and the ASPM thing did hit some people recently. You *must* have ASPM L0s and L1 disabled on the Intel NIC *and* its parent PCIe bridge, and the kernel driver usually will only be able to disable it on the NIC itself. So the NIC can hang if the BIOS is crap and leaves ASPM L0s or L1 enabled on the bridge, or has a crap NIC EEPROM image that causes issues with the 128b/256b maximum PCIe packet size. (That last one can be fixed by Linux, *if* you give it a specific parameter. No idea why it isn't automatic, since it is major utter braindamage by the BIOS that is known to hang the box hard sometimes.)
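For anyone who wants to check the NIC-plus-bridge condition described above, here is a rough sketch that scans `lspci -vv` style text for the ASPM state on each device's `LnkCtl:` line. It assumes the usual layout (device header at column 0, indented `LnkCtl: ASPM ...;` line underneath). The helper names and the sample text are made up for illustration, so treat it as a starting point rather than a definitive check.

```python
import re

# A device header starts at column 0 with its bus address, e.g. "02:00.0 ".
DEVICE_RE = re.compile(r"^([0-9a-fA-F:.]+)\s")
# Assumed layout of the link control line: "LnkCtl: ASPM <state>; ..."
ASPM_RE = re.compile(r"LnkCtl:\s+ASPM\s+([^;]+);")

def aspm_states(lspci_text):
    """Map each device address to the ASPM state reported on its LnkCtl line."""
    states = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = ASPM_RE.search(line)
        if found and current is not None:
            states[current] = found.group(1).strip()
    return states

def aspm_enabled(states):
    """Return the addresses whose ASPM state is anything other than Disabled."""
    return sorted(dev for dev, state in states.items() if state != "Disabled")

# Made-up sample: ASPM is off on the NIC but still on at the parent bridge.
SAMPLE = (
    "00:1c.0 PCI bridge: Example bridge\n"
    "\t\tLnkCtl:\tASPM L0s L1 Enabled; RCB 64 bytes\n"
    "\n"
    "02:00.0 Ethernet controller: Example NIC\n"
    "\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes\n"
)
print(aspm_enabled(aspm_states(SAMPLE)))
```

In the sample, the bridge at 00:1c.0 is the one still flagged, which is the case the driver usually can't fix on its own.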

Re:Ouch (Score:5, Informative)

by sirsnork ( 530512 ) writes: on Wednesday February 06, 2013 @07:48PM (#42815233)

Intel NICs are held in high regard because a) they are fixed when a problem is found, and b) the bugs are documented.
You should have a look through some of the CPU errata on Intel's site. It'll open your eyes as to just how many bugs a desktop CPU has even once it's shipped.
