Graphics / Open Source

Coding Mistake Made Intel GPUs 100X Slower in Ray Tracing (tomshardware.com) 59

Intel Linux GPU driver developers have released an update that results in a massive 100X boost in ray tracing performance. This is something to be celebrated, of course. However, on the flip side, the driver was 100X slower than it should have been because of a memory allocation oversight. Tom's Hardware reports: Linux-centric news site Phoronix reports that a fix merged into the open-source Intel Mesa Vulkan driver was implemented by Intel Linux graphics driver engineering stalwart Lionel Landwerlin on Thursday. The developer wryly commented that the merge request, which already landed in Mesa 22.2, would deliver "Like a 100x (not joking) improvement." Intel has been working on Vulkan raytracing support since late 2020, but this fix is better late than never.

Usually, the Vulkan driver ensures that temporary memory used for Vulkan ray tracing work lives in local memory, i.e., the very fast graphics memory onboard the discrete GPU. A line of code was missing, so this housekeeping step never happened: without the placement flag set, the Vulkan driver shifted ray tracing data to slower offboard system memory and back, and these repeated transfers dragged ray tracing performance down significantly. It turns out, as per our headline, that setting the "ANV_BO_ALLOC_LOCAL_MEM" flag ensured that VRAM would be used instead, and a 100X performance boost was the result.
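
The fix itself is conceptually tiny. As a rough, hypothetical illustration (this is not the actual Mesa patch; only the ANV_BO_ALLOC_LOCAL_MEM flag name comes from the report, and every other name below is a made-up stand-in), this whole class of bug comes down to forgetting one placement flag at allocation time:

    /* Rough sketch, not Mesa code: ANV_BO_ALLOC_LOCAL_MEM is the real flag
       name mentioned above; the allocator, types and sizes here are invented
       purely for illustration. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ANV_BO_ALLOC_LOCAL_MEM (1u << 0)   /* "place this buffer in VRAM" */

    struct bo { uint64_t size; int in_vram; };

    /* Stand-in for the driver's buffer-object allocator. */
    static struct bo *alloc_bo(uint64_t size, uint32_t flags)
    {
        struct bo *b = calloc(1, sizeof(*b));
        if (!b)
            return NULL;
        b->size = size;
        b->in_vram = (flags & ANV_BO_ALLOC_LOCAL_MEM) != 0;
        return b;
    }

    int main(void)
    {
        /* Before the fix: no placement flag, so ray tracing scratch buffers
           could land in system RAM and every GPU access crossed the bus. */
        struct bo *before = alloc_bo(64 * 1024, 0);

        /* After the fix: one flag requests device-local (VRAM) placement. */
        struct bo *after = alloc_bo(64 * 1024, ANV_BO_ALLOC_LOCAL_MEM);

        printf("before: in_vram=%d, after: in_vram=%d\n",
               before->in_vram, after->in_vram);
        free(before);
        free(after);
        return 0;
    }
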
"Mesa 22.2, which includes the new code, is due to be branched in the coming days and will be included in a bundle of other driver refinements, which should reach end-users by the end of August," adds the report.
  • by quonset ( 4839537 ) on Monday July 25, 2022 @08:59PM (#62733584)

    36 years later, Intel enabled HIMEM.SYS instead of letting things ride.

    • Reminds me of a bug way back at a research lab where I was the junior computer support grunt. The scientist running his application was bogging down the system greatly with constant page faults, and he bitched that the new system was too slow. Turns out, possibly due to Fortran array layout causing confusion, he was accessing a giant array in the wrong order: increment x before y in the loops and you get a page fault on every access in his algorithm, but access y before x and it ran a thousand times faster.
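
      For anyone who hasn't been bitten by this class of bug, here is a small, self-contained C sketch of the same effect (C is row-major, the opposite convention from Fortran, and the array size is arbitrary): swapping the loop order turns unit-stride accesses into large-stride ones that defeat the cache and, for arrays bigger than RAM, hammer the pager.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        /* Illustrative only: same array, same arithmetic, different traversal
           order. With a big enough array this shows up as paging rather than
           just cache misses. */
        #define N 4096

        static double sum_rows_then_cols(const double *a)  /* unit stride: fast */
        {
            double s = 0.0;
            for (size_t y = 0; y < N; y++)
                for (size_t x = 0; x < N; x++)
                    s += a[y * N + x];
            return s;
        }

        static double sum_cols_then_rows(const double *a)  /* stride N: slow */
        {
            double s = 0.0;
            for (size_t x = 0; x < N; x++)
                for (size_t y = 0; y < N; y++)
                    s += a[y * N + x];
            return s;
        }

        int main(void)
        {
            double *a = calloc((size_t)N * N, sizeof *a);
            if (!a)
                return 1;
            clock_t t0 = clock();
            double s1 = sum_rows_then_cols(a);
            clock_t t1 = clock();
            double s2 = sum_cols_then_rows(a);
            clock_t t2 = clock();
            printf("row order %.2fs, column order %.2fs (sums %g %g)\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC,
                   (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
            free(a);
            return 0;
        }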

  • the Vulkan driver would shift ray tracing data to slower offboard system memory and back

    Wouldn't have happened on Apple Silicon's unified memory. :-)

    Going forward, Apple marketing's comparisons will always cite 2021 testing.

    • by Anonymous Coward

      I hope they pay you well, shill.

      • by drnb ( 2434720 )

        I hope they pay you well, shill.

        LOL - offended by computer architecture jokes.

        And since you are a little slow on the reading comprehension I'll point out the second half of the joke where Apple marketing sticks to the buggy 2021 Intel driver for their comparisons.

    • It might not have, but it doesn't mean it is immune to coding issues.

      At the same time, I suspect that for memory to be effective as shared memory, it needs to be faster overall, meaning your RAM is going to cost more.

      • by drnb ( 2434720 )

        I suspect that for memory to be effective as shared memory, it needs to be faster overall, meaning your RAM is going to cost more.

        Yes, it's much faster, and the cost is mitigated somewhat by having the memory in the same package as the SoC (System on a Chip: CPU, RAM, GPU, neural engines, etc.). You do lose options. Apple may manufacture one SoC generation in two RAM sizes; those are it, no other options.

        • Apple uses LPDDR5X.
          Its RAM was quicker for a while (prior to Intel's 12th Gen), but that is no longer the case.
          But fast RAM was never their trick. It was very wide architecture.
          Even then, they made trade-offs. The CPU blocks have vastly narrower access buses than the GPU blocks, and with the GPU loaded to 100% on compute work, the CPU's aggregate bandwidth falls to below half of what it can normally reach (normally around ~200GB/s).
          The bottleneck can be demonstrated on a Max by loading the
          • *The bottleneck can be demonstrated on an Ultra.

            The same applies for a Max, but almost perfectly cut in half (the scaling is impressively linear)
          • by drnb ( 2434720 )
            FWIW, the M2 has a 50% bandwidth improvement over the M1.
    • by K. S. Kyosuke ( 729550 ) on Tuesday July 26, 2022 @07:27AM (#62734460)
      No, you would have just run out of money before getting system memory of usable size.
      • Indeed! Some people do need 1 TB of RAM or more, especially for certain scientific applications. Something that is not viable financially on Apple silicon, but that is trivial in Intel/AMD/IBM CPUs (and quite easy to find in lab machines).

        • Indeed! Some people do need 1 TB of RAM or more, especially for certain scientific applications. Something that is not viable financially on Apple silicon, but that is trivial in Intel/AMD/IBM CPUs (and quite easy to find in lab machines).

          Nice try, but 1TB of RAM is not available on common PCs either. Looking at Intel and AMD based motherboards there seems to be a 32GB per memory slot max, yielding a 128GB limit on the typical 4 slot motherboard. If I look at Intel CPUs, say an i9-12900HX I also see a 128GB limit.

          Like PCs, Apple also tops out at 128GB with its M1 Ultra.

          • For consumer-grade stuff, you're absolutely correct. It's usually 32GBx4, max.
            For Xeons and EPYCs, you can have 1TB (32GBx32). We have many of them at work for VM clusters.

            Obviously, Apple isn't competing in that segment with their own silicon yet (we have a Mac Studio, not a Mac Pro, which is still Xeon).
            • by drnb ( 2434720 )

              For consumer-grade stuff, you're absolutely correct. It's usually 32GBx4, max. For Xeons and EPYCs, you can have 1TB (32GBx32). We have many of them at work for VM clusters.

              Yes, but now we are talking about servers with $7,000-ish 1TB ECC RAM. At least for the motherboards I've seen. Not sure about a non-ECC option.

              • Cheaper than I would have guessed. I know we usually end up spending ~25k/server.
                But yes- it's very expensive. It certainly isn't something someone is doing as a hobby.

                But then again, my boss dumped $10k on his iMac Pro with a Xeon... it might not be too unreasonable to scrounge up 1TB of RAM, a quad-socket motherboard, and 4 Xeons for not much more than that.
                • by drnb ( 2434720 )

                  Cheaper than I would have guessed. I know we usually end up spending ~25k/server.

                  My quick survey showed a motherboard at $500-$1,000, a CPU at $500-$1,000, and 1TB of ECC RAM at $7,500-$10,000. And this is at do-it-yourself parts vendors. Buying a complete, supported and warrantied system is, I imagine, at least double that. I imagine you were not referring to the DIY route. :-)

                  • No- for our production units, we purchase warrantied full servers. DIY proved too problematic at scale. Your Dell suffers a critical failure, they'll deliver you a new one next-day. Which is a silly requirement to have, unless you happen to be in the uptime business.
        • Since the comment was in reply to a comment about Apple Silicon, wouldn't the limit be 64GB, since the Mac Pro isn't Apple Silicon, so doesn't apply to the conversation in any way?

          I could be wrong, but it looks like 64GB of unified memory is the max available (on the Mac Studio).

      • by drnb ( 2434720 )

        No, you would have just run out of money before getting system memory of usable size.

        Nope. Memory is not expandable; it's part of the same SoC package as the CPU. For a particular version of Apple Silicon, Apple may manufacture only a couple of variations with different memory sizes. Those memory sizes are your only options: 8 and 16 GB for M1, 32 and 64 GB for M1 Max, 64 and 128 GB for M1 Ultra, and 8, 16 and 24 GB for M2.

        Apple does not design products for everyone. They design for a certain profile. They recognize that users who only need lower-end CPUs tend to only need smaller amounts of RAM, and user

    • Wouldn't have happened on Apple Silicon's unified memory. :-)

      Indeed, it would not have. Because Apple Silicon doesn't have the option for a discrete GPU, or dedicated ray tracing hardware.

      • by drnb ( 2434720 )

        Wouldn't have happened on Apple Silicon's unified memory. :-)

        Indeed, it would not have. Because Apple Silicon doesn't have the option for a discrete GPU, or dedicated ray tracing hardware.

        No, if Apple's Metal GPU had ray tracing support, it still would not have happened, due to unified memory. Ray tracing is not an integral part of the bug; it's just coincidentally where it happened. Any driver dealing with a discrete GPU and having to manage both CPU and GPU RAM is vulnerable to such a bug.

        • No, if Apple's Metal GPU had ray tracing support, it still would not have happened, due to unified memory.

          Nope. You have to choose the memory model in Metal (and have had to for a long time) just like you have to do in DX12/Vulkan.
          Granted, if you fuck that up on current Apple Silicon, you won't be paying a 100x tax, because you're not fetching that data over a comparatively dog-slow bus, but you will still pay a tax. The GPU on Apple Silicon is a standard integrated GPU with standard unified memory. It cannot touch all blocks of memory; its access is gated by an IOMMU (DART in Apple parlance).
          Metal, like DX12 a

          • There is a minor caveat I will add that makes this problem less likely on Apple devices:
            Official guidance for Metal has been to use the Shared memory model for a long time, because most shipping Apple devices have used unified memory for many years now.
            This is the opposite of the situation on PCs, where the "standard" is to have a discrete GPU.
          • by tlhIngan ( 30335 )

            Any graphics API that allows for the tagging of blocks as local or remote is vulnerable to this class of bug, and that includes Metal on Apple Silicon. The cost of the bug is merely smaller. Just as it is on an Intel IGP.

            It should be noted that the tagging is merely a suggestion. If you tag memory as system use only, the driver will likely keep it in the system memory, but it isn't forced to - it might decide that it has sufficient onboard memory that it can pre-load it into onboard memory. Likewise if you tag everything as onboard memory, and there's not enough of it, the driver will spill the excess to system memory.

            • It should be noted that the tagging is merely a suggestion. If you tag memory as system use only, the driver will likely keep it in the system memory, but it isn't forced to - it might decide that it has sufficient onboard memory that it can pre-load it into onboard memory. Likewise if you tag everything as onboard memory, and there's not enough of it, the driver will spill the excess to system memory.

              Most certainly not. That would be a direct violation of memory protection.
              Private/Local (depending on API parlance) must be non-CPU local. On UMA systems, this is emulated as a block of RAM that is copied back and forth and that you have no direct access to.

              It's a hint to the memory allocator how it should be handled. You might allocate everything in system RAM to have it available, then when you decide on the frame, you have the things in view moved to onboard memory - but the driver is not obligated to perform the move operation immediately - it might simply delay it and let the card pull it in on its own via DMA, then leave it there. Or if it's really idle, it might move it to reduce bandwidth consumption later.

              No.
              Don't know what else to say, but no.

              It's a hint, not a directive. Chances are the allocator will use the chosen memory location but there may be reasons it can't, won't or doesn't. Since error-checking in Vulkan is basically "none" (or minimal) the allocator will generally only fail to allocate memory when it really runs out of options - you want memory, you got it, even if it's in the wrong place.

              This is flatly untrue.

              On Vulkan, the system is a bit better, because you don't explicitly ask for "CPU" or "GPU" memory; you specify access bits, and the library finds an appropriate pool for you. If there isn't
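
              To make that concrete, here is a minimal sketch of the mechanism being described, using the standard Vulkan API (error handling omitted): the application asks for property bits such as DEVICE_LOCAL and picks whichever of the device's advertised memory types satisfies them.

                #include <stdint.h>
                #include <vulkan/vulkan.h>

                /* Return the index of a memory type that is allowed by type_bits
                   (from VkMemoryRequirements::memoryTypeBits) and has all of the
                   wanted property flags, or UINT32_MAX if none matches. */
                static uint32_t find_memory_type(VkPhysicalDevice phys,
                                                 uint32_t type_bits,
                                                 VkMemoryPropertyFlags wanted)
                {
                    VkPhysicalDeviceMemoryProperties props;
                    vkGetPhysicalDeviceMemoryProperties(phys, &props);

                    for (uint32_t i = 0; i < props.memoryTypeCount; i++) {
                        int allowed = (type_bits & (1u << i)) != 0;
                        int matches = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
                        if (allowed && matches)
                            return i;
                    }
                    return UINT32_MAX;
                }

              Typical use is to prefer VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT for GPU-only data (such as ray tracing scratch buffers) and fall back to a HOST_VISIBLE type if nothing device-local fits; at the application level, forgetting that preference is essentially the same mistake the driver made internally here.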

          • by drnb ( 2434720 )

            Granted, if you fuck that up on current Apple Silicon, you won't be paying a 100x tax

            Which is the point of the original joke.

            The GPU on Apple Silicon is a standard integrated GPU with standard unified memory.

            Careful, "unified memory" means different things to different vendors; Apple's usage is different than Intel's. Apple still beats Intel's integrated GPUs. They are not exactly the same.

            Ray tracing is not an integral part of the bug; it's just coincidentally where it happened.

            Correct. But that's circular logic, as the bug only exists because of the implementation of Ray Tracing.

            No, it's a bug that can occur in any driver that must deal with CPU and GPU RAM. Where it happened is just a coincidence; it's not a necessary condition.

            • Careful, "unified memory" means different things to different vendors; Apple's usage is different than Intel's. Apple still beats Intel's integrated GPUs. They are not exactly the same.

              And I'd implore you to be careful to not mistake marketing terms for technical terms.
              Unified Memory in the most technical sense merely means the CPU shares first-class citizenship with other function blocks with regard to access to system RAM.
              Implementation details of this vary *vastly* across vendors (which makes sense).

              However- from a graphics API standpoint, the implementation details just don't matter. That's just the special sauce the hardware vendor provides to make their stuff More Gooder.

              No, it's a bug that can occur in any driver that must deal with CPU and GPU RAM. Where it happened is just a coincidence; it's not a necessary condition.

              Where i

              • by drnb ( 2434720 )

                Unified Memory in the most technical sense merely means the CPU shares first-class citizenship with other function blocks with regard to access to system RAM. Implementation details of this vary *vastly* across vendors (which makes sense).

                Implementations vary, that is precisely my point. That is why simply saying both have unified memory is misleading.

                However- from a graphics API standpoint, the implementation details just don't matter.

                From a performance standpoint they do, which is what I am interested in.

                It matters that RT was a new feature for the causality of the bug.

                No. It matters that there was a new feature. It matters that the feature could access CPU and GPU memory. It did not matter that the feature was ray tracing.

                • Implementations vary, that is precisely my point. That is why simply saying both have unified memory is misleading.

                  Not for this purpose, because the mistake was made based upon something that matters regardless of implementation.
                  It was worse due to the fact that host accesses on discrete cards are really, really bad latency wise, but the "HOST|REMOTE" visible bitmask applies to *all* unified memory implementations, as well as non-unified memory implementations.

                  From a performance standpoint they do, which is what I am interested in.

                  OK- perfectly fair. From a performance standpoint, the implementation details do matter. But this mistake exists regardless of implementation details, the cost o

  • Why did it take this long to find? Surely devs should have realized that their ray tracing jobs were taking 100x longer than they should have.
    • That was my first response as well: if something's running 100x slower than it should and no one even notices, then maybe it's something that no one actually cares about. Years ago we had $mission_critical_product_feature that was accidentally disabled at some point, and for at least ten years no one noticed that $mission_critical_product_feature wasn't actually present in the product. It was eventually caught in a code review rather than anyone noticing. It's remained disabled until we get the first actua
    • Yea, exactly. They got away with sandbagging Linux for years to make Windows look better. Not an isolated incident, either. The real question is what changed to make them decide to finally fix it. I guarantee it wasn't "Durrr... oops, I just noticed this!"

      • Going to market, hired more people, wrote more benchmarks, did more code profiling.

        Everything good should have been sooner and better - it's never not the case, so that's a no-op. The Intel guys should be congratulated for the improvement.

    • Their driver team sucks. Long story short.

      • by DrXym ( 126579 )
        More likely their team was focused on IGPs, where ray tracing is obviously a joke feature never to be used for real, so no one complains or notices if some flag is unset. Now that they are creating discrete graphics cards, issues like this suddenly get shaken out during performance testing and patches start landing.
        • You're dead-on, but it's not because it's a joke feature that it wasn't noticed.
          It's because the IGP has local access to DRAM, just like the CPU.
          There simply was no 100X slowdown.

          That's not to say that you aren't right about the usefulness of that feature on that platform, just giving some additional context.
        • They have a separate team for iGPU. And the iGPU drivers have also sucked for a long time.

      • by Anonymous Coward
        So which H1B visa group at Intel copied the original bad code from StackExchange?
    • by DrXym ( 126579 )
      Probably because adding ray tracing to an IGP is a box ticking exercise rather than something of practical use. Maybe it took this long because nobody even noticed.
      • Maybe it took this long because nobody even noticed.

        I'm thinking that it took this long because everyone thought everything was fine. Intel graphics products are always at least 100x slower than everyone else's, so no one thought anything was out of order.

        • by DrXym ( 126579 )
          Up until now Intel GPUs were integrated things sharing memory with the rest of the system. Now that they're doing discrete graphics cards, any glitches in what kind of memory they use are going to become obvious. To me that's the reason for it; it never mattered in the past because a) nobody would be insane enough to use an IGP for ray tracing, and b) it made no material impact on performance. I assume with the new hardware it now does. Personally I find ray tracing kind of a joke still since it kills performance even o
    • I guess nobody is using an intel GPU to do ray-tracing on linux. Not that surprising if there aren't many of these discrete Intel GPUs in service. Number of Intel discrete GPUs divided by the number of linux desktops divided by the number of people trying to play games with ray-tracing enabled = something pretty small I'd bet.
      • Well, when you play a role-playing game, it might be a nice feature, when you have a projector projecting the picture onto the wall and you go to the kitchen to snatch a beer, that you have not lost too many frames.

    • So naive to assume people test their code :-)

  • by deimios666 ( 1040904 ) on Tuesday July 26, 2022 @12:05AM (#62733876)

    Up until very recently, Intel had no discrete GPUs with dedicated VRAM.

    Since nobody has them, nobody could test them. It's a fair assumption that if you have an Intel GPU, you are using system RAM for its framebuffer.

    • I guess, if "recently" is Jan 2021.

      https://www.tomshardware.com/n... [tomshardware.com]

      • It's all relative.
        Intel has been building GPUs into their chips for a couple of decades now. So yeah, a year and change is recent.
      • But, and this is a genuine question, as I wasn't able to answer it, can you actually buy an Intel Arc GPU yet? I don't think they are actually sold yet, I saw an article about them getting ready to sell them that was written yesterday.

        I have been keeping an eye on them, as I am curious how their pricing compares to nVidia, and haven't seen anywhere selling them yet. I wouldn't mind throwing one in my desktop to see how well it handles VR over my current RTX 3060. I haven't seen one anywhere for sale though.

  • by RightwingNutjob ( 1302813 ) on Tuesday July 26, 2022 @08:30AM (#62734654)

    You get what you pay for.

    About 12 years ago, I had the misfortune of trying to get to the bottom of a (conceptually) similar bug for the Linux driver for a data acquisition card from a well-respected maker of scientific equipment that sells to everyone from life sciences to astronomers.

    The gizmo plugged into a pcie card, and the card was supposed to transfer data into physical memory. The userspace part of the driver was a closed source binary but the kernel module was provided as code to be compiled on the customer end.

    That kernel module had all kinds of silly mistakes, like treating pointers as 32-bit signed ints (necessitating an out-of-the-box setup where the machine memory was capped to 2G) and explicit use of calls to allocate uncached (vs. cached) memory as the target for DMA'ing the device data into.
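
    As an aside, for anyone who hasn't seen the first of those mistakes in the wild, here is a tiny, hypothetical C illustration (not the vendor's code) of why stuffing a pointer into a 32-bit signed int stops working the moment a buffer lands above the 2G mark:

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical illustration, not the vendor's driver code. */
        int main(void)
        {
            char *p = malloc(16);                   /* some buffer address */
            if (!p)
                return 1;

            int32_t stored = (int32_t)(intptr_t)p;  /* the buggy pattern */
            char *back = (char *)(intptr_t)stored;  /* sign-extended garbage
                                                       if p was above 2G */

            printf("original %p, round-tripped %p (%s)\n",
                   (void *)p, (void *)back,
                   p == back ? "survived by luck" : "corrupted");
            free(p);
            return 0;
        }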

    The result was that (as delivered) the nice fancy super duper machine you wanted to run your gizmo on had to have its memory capped at 2G and each data transfer took a good fraction of a second to complete (it should have been under 1 msec).

    Before I got to digging through this mess (whereupon I discovered the aforementioned mistakes), I called the company's support line (it cost many tens of thousands of dollars to buy, so I expected a modicum of competence out of the vendor).

    After holding for however long, I got connected to a gentleman on the other side of the planet who I was told was their Linux driver guru. Dude could barely speak English, just quoted the readme file to me, and quite possibly never handled the piece of equipment in question or the data acquisition card for it ever in his life.

    Given what I found in the source code that shipped with it, I'm inclined to believe he *was* their Linux guru and *did* in fact write their kernel module.

    Probably the same thing here. They hired some code monkey to write the code based on a hardware spec, made sure it compiled and ran remotely, but almost certainly didn't have access to the hardware when things like 100x slowdowns would be noticeable.

  • This is a recent bug, because it's a Vulkan bug, so the bug is about 2 years old. It's in a ray tracing API for a not-so-great GPU, so probably not many people actually use it. I used Intel Graphics for various 3D work, but without ray tracing, and it works fine. Over the last 6 years or so, the performance of Intel integrated GPUs has really improved; it's half-decent hardware now for the power envelope.

  • It would have sucked supremely hard if this had been a hardware bug.
