Intel Technology

Quick and Dirty Penryn Benchmarks (90 comments)

An anonymous reader writes "So Intel has their quad-core Penryn processors all set and ready to launch in November. There are benchmarks for the dual-core Wolfdale all over the place, but this seems to be the first article to put the quad-core Yorkfield to the test. It looks like the Yorkfield is only about 7-8% faster than the Kentsfield with similar clock speeds and front-side bus."
This discussion has been archived. No new comments can be posted.

  • My recent experience with quad-CPU Xeon machines is that multithread performance for a single application is VERY poor, even with great care in coding, presumably because of cache-sloshing between these physically-separate CPUs dropped onto one die.

    (I compare with Niagara and even Core Duo which seem much better for threaded apps.)

    Has anyone else tested threadability of these CPUs, and power efficiency, sleep states, etc?

    Rgds

    Damon
    • They could probably make better use of the die space of the 4th, 3rd, or even 2nd CPU core by putting things like cache there instead. And in another direction, go with SoC (system on a chip) or certain subsets thereof. Combined with serialized bus technologies, this should work while also reducing pin counts.

      • by DamonHD ( 794830 )
        Well, what's nice about my Niagara T1000 box is that everything is on one chip, and the outermost level of cache serves all CPUs, so even a nominal cache flush for volatile/synchronized never need leave the chip and hit real RAM.

        I'm just concerned that threading seems poor when you really do have to go to memory to get data between CPUs, and your idea of giving up some individual cache for some shared cache would be quite right if Intel had the engineering time to do it.

        For my latest nasty performance surpri
      • They could probably make better use of the die space of the 4th, 3rd, or even 2nd CPU core by putting things like cache there instead.

        The benefits of having extra cache drop off very quickly above certain cache sizes (depending on the addressable RAM the cache is indexing). A lot more is involved with improving level-0/1/2 cache performance than just upping the cache size.

        I'd expect greater benefits from moving dedicated (but programmable) VLIW units into the CPU to increase instruction-level parallelism, f
      • They could probably make better use of the die space of the 4th, 3rd, or even 2nd CPU core by putting things like cache there instead.

        Except you won't pay the price.

        They can charge you more for a 4 core CPU with shit amounts of cache than they can with a dual core with shed loads. People are stupid. They assume more megahurts means more fast and more cores means more fast... Whether the additional cores are actually doing anything at all.

        Business CPUs it's a different matter, they actually benchmark their apps and yup, buy CPUs with loads of cache when they're faster.

        And really, the best thing they could do is add an FPGA.

        • by DamonHD ( 794830 )
          Hmm, looked at FPGAs too. Not generally worth the highly-specialised one-off development even when writing your own code.

          Much nicer to have something portable which next year will just run faster without your doing much because of an improved compiler, runtime, CPU, cache, bus, kernel, whatever... Usually...

          Rgds

          Damon
      • by rbanffy ( 584143 )
        For multi-threaded apps, instead of multiple cores (nothing except caches is shared between complete CPU cores), it makes a lot of sense to have an HT-like architecture (multiple context stores, shared elements) that reduces the time it takes to do a context switch. It would also help a lot to have a context-aware cache system where a swapped-in context would not wake up having to read every instruction from main memory.

        Since not all threads will be runnable at any given time, having more cores instead of
    • Re: (Score:2, Informative)

      by bjackson1 ( 953136 )
      Intel's Core Microarchitecture is not currently available in a quad-CPU platform. It is understandable the multithreaded performance would be poor, then.

      The current quad-CPU architecture is based on Tulsa, which is a 65nm shrink of Paxville, which is essentially a Pentium 4 Smithfield, or two Prescotts shoved onto one chip. Basically, it's technology from two years ago. The new Tigerton chip will be Core-based; however, it's not out yet.
      • by Wavicle ( 181176 )
        How did this get modded informative?

        Intel's Core Microarchitecture is not currently available in a quad-CPU platform.

        Incorrect. Intel's "Core Microarchitecture" is marketed under the name "Core 2." The "Core 2 Quad" processors use the Core Microarchitecture. See Intel's product brief [intel.com] on the subject.

        It is understandable the multithreaded performance would be poor, then.

        The single threaded performance of quad core is similar to the single threaded performance of dual core, clock for clock. This should have ti
        • Incorrect. Intel's "Core Microarchitecture" is marketed under the name "Core 2." The "Core 2 Quad" processors use the Core Microarchitecture. See Intel's product brief on the subject.

            I said quad CPU not Quad Core. Socket 771 Core 2 Quads or Quad Xeons can only be used in pairs.

            Basically the answer to all of your arguments is that I said "Quad CPU" not "Quad Core". You should know there is a difference.
    • Your experience isn't shared by me, or by most other benchmarkers. Take a look at multi-threaded SPEC benchmarks for the Xeon 5300 series. SPECint_rate2006, SPECjbb2005, etc., all show the Xeon 5300 as the clear per-socket performance leader for x86 systems. The quad-core Xeons are only bested by the IBM POWER6, and by Niagara in the Java benchmarks.

      See the SPECint_rate 2006 [spec.org] results page, and filter on two-chip systems.

      Perhaps your particular application is a degenerate case for the 5300s cache architectu

      • by DamonHD ( 794830 )
        It's clear that the nature of my app (which is in Java, BTW) is going to make a difference, and I've not seen quite this effect before in Java or C++ threading over 10+ years where I've had to run well short of a thread per CPU to maximise throughput or at least throughput per CPU. The threads are moderately tightly coupled but, as I say, rarely sharing mutable state. Usually running with a few too many threads to allow for a little parallel slackness is a better bet.

        Part of the problem in my particular a
      • by be-fan ( 61476 )
        The SPEC benchmarks are _almost_ perfectly parallelizable. They are just multiple instances of a single-threaded benchmark, and as such don't really test all the things that arise in true multi-threaded programs (cache line bouncing, etc).
        • Take a look at SPECjbb2005 or TPC-C, which resemble "real" applications a lot more than SPECint_rate. The Quad-core Xeons are 70-100% faster than the fastest dual-core Opteron systems.

          As much as I wish it weren't so, AMD has been toasted in the two-socket server space, which is the largest part of the server market. Barcelona probably won't change that, as Penryn will arrive at the same time.

  • "Intel expects SSE4 optimizations to deliver performance improvements in video authoring, imaging, graphics, video search, off-chip accelerators, gaming and physics applications. Early benchmarks with an SSE4 optimized version of DivX 6.6 Alpha yielded a 116 percent performance improvement due to SSE4 optimizations." Not bad...
    • "Intel expects SSE4 optimizations to deliver performance improvements in video authoring, imaging, graphics, video search, off-chip accelerators, gaming and physics applications. Early benchmarks with an SSE4 optimized version of DivX 6.6 Alpha yielded a 116 percent performance improvement due to SSE4 optimizations." Not bad...

      Also, Intel have introduced a new instruction for adding sixteen to fourteen and dividing the result by two (ADDFTNSTNDIV2). This has produced a performance increase of up to 12,000%
  • Seriously (partly, at least): how many penguins will I see during boot-up? 4?
  • by Dachannien ( 617929 ) on Saturday August 25, 2007 @09:33AM (#20353301)
    Penryn? Wolfdale? Yorkfield? Kentsfield? What are they doing here, making processors, or naming streets in a new upscale subdivision?

  • by osewa77 ( 603622 ) <naijasms@NOspaM.gmail.com> on Saturday August 25, 2007 @09:39AM (#20353341) Homepage
    AMD rose to this position primarily because they didn't make Intel's mistakes - trying to force a new CPU architecture on the market (Itanium) instead of incrementally developing the X86 line, and focusing on clock-speed (P4) at the expense of performance per watt. Now that Intel is focused on performance per watt, AMD needs to find a new differentiator for their chips.

    Perhaps they should start thinking about how to integrate a high-quality Vista-capable GPU into their processors (after all, they acquired ATI)? How about sound cards, USB ports, et cetera? If they can fit 90% of a typical motherboard into the processor and usher in a new era of affordable and efficient computers while Intel is busy playing with 64-core chips, why not?
    • They are doing exactly that.

      AMD is going the route of a true native quad core with Barcelona, coming out in September. They have the desktop version of that, Phenom, coming out closer to Christmas. Intel is taking the quick and dirty route to quad core - smash two dual core CPUs onto the same die. AMD is actually doing a proper quad core architecture.

      They have in their roadmap a GPGPU (general purpose graphics processing unit) for late 2008 or early 2009. I'm personally still trying to understand what t
      • Intel is taking the quick and dirty route to quad core - smash two dual core CPUs onto the same die. AMD is actually doing a proper quad core architecture.

        A 'smashed' Xeon runs much better than an AMD CPU that I can't buy. If I said AMD sucked because they took the 'quick and dirty' route with the K10's shared L3 victim cache, limited memory prefetching, and limited incomplete subset of SSE4 you'd probably just say those are buzzwords.
      • by init100 ( 915886 )

        Intel is taking the quick and dirty route to quad core - smash two dual core CPUs onto the same die. AMD is actually doing a proper quad core architecture.

        Do you think that the fact that the Intel method is cheaper due to higher yield is irrelevant? With a single-die quadcore, the entire processor needs to be discarded if just one core is broken. With dual-die quadcores, you only need to discard one half of the processor. This increases yield and lowers costs, and I cannot see what is so bad about that. Performance isn't everything, and it isn't like it suffers greatly from the dual-die design. I'd guess that it suffers more from the shared FSB design.
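The yield argument above can be put into rough numbers. The following is only a back-of-the-envelope sketch, assuming a simple Poisson defect model (yield = exp(-area × defect density)) and assuming a monolithic quad-core die is exactly twice the area of a dual-core die. The area and defect-density values are hypothetical and are not taken from Intel or AMD; packaging and test costs are ignored.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)


# Hypothetical values, chosen only to illustrate the shape of the argument.
DUAL_CORE_AREA_CM2 = 1.0
DEFECT_DENSITY_PER_CM2 = 0.5

# A dual-core die versus a monolithic quad-core die assumed to be twice as large.
dual_die_yield = poisson_yield(DUAL_CORE_AREA_CM2, DEFECT_DENSITY_PER_CM2)
mono_quad_yield = poisson_yield(2 * DUAL_CORE_AREA_CM2, DEFECT_DENSITY_PER_CM2)

# Silicon area consumed per working quad-core part.
# Dual-die package: two good dual-core dies, each costing area / yield.
area_per_good_dual_die_quad = 2 * DUAL_CORE_AREA_CM2 / dual_die_yield
# Monolithic: one large die that must be defect-free as a whole.
area_per_good_mono_quad = 2 * DUAL_CORE_AREA_CM2 / mono_quad_yield

print(f"dual-core die yield:        {dual_die_yield:.3f}")
print(f"monolithic quad die yield:  {mono_quad_yield:.3f}")
print(f"silicon per good quad (two dies):   {area_per_good_dual_die_quad:.2f} cm^2")
print(f"silicon per good quad (monolithic): {area_per_good_mono_quad:.2f} cm^2")
```

Under these assumptions the monolithic part consumes 1 / (dual-core die yield) times as much silicon per working processor, so the gap widens as defect density rises and shrinks toward nothing on a mature process.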

        • It's cheaper to just squish two dual cores together and have two dies, but performance takes a hit if a processor is made that way. I think AMD was going for scalability on the quad core they designed, probably in such a way that they don't have to make a major redesign until they are ready with their Fusion line. It's funny that a lot of people in the IT world say this is AMD's final years because they can't break X clock speed, but you have to remember, Intel has a lot more fabs than AMD and a lot more mon
          • I don't think having two dual-cores in a package instead of four cores combined is necessarily a disadvantage. To compare these properly, you would have to assume same quality of implementation. So Intel could have gone for one unified 12MB L2 cache with four access paths instead of two 6MB L2 caches with two access paths each. With same quality of implementation, the four access paths will be slower because you have to cope with four processors accessing it at the same time instead of two. So each access w
    • AMD did try to play a different game when they announced Fusion and Torrenza. Intel played dirty by turning back the clock to a time when people were addicted to the single-core benchmarks (i.e. framerates) and chose the perfect timing: AMD's cash reserves were the lowest after the ATI merger. Intel are hard-pressed to kill AMD now, before they open their new fabs (Malta, NY?) and are capable of meeting demand. And maybe that anti-trust lawsuit has some real basis, otherwise Dell would certainly not both
    • AMD still seems to be doing good design but their fabbing lags Intel by a year. I think it's Intel's fab technology that carried them through despite their other technology misdirections. I hope that the results of the ATI merger become a long term positive, it seems to be holding them down in the short term. Betting on on-die GPU is quite a serious bet, quite a bit more serious than an on-die memory controller in my opinion, especially when they go into major debt to acquire another large company just t
  • Although Yorkfield uses a 45nm fab process and consumes less power, Intel plans to stick to its existing 95 Watt and 130 Watt thermal design power ratings.
    I don't get it, does it use less power or not? Or does this mean it uses less power per cycle, thus allowing them to ramp up the clock until it's back up to 130 watts?
    • Or does this mean it uses less power per cycle, thus allowing them to ramp up the clock until it's back up to 130 watts?

      Yes, they are increasing the clock to maintain the same TDP.
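A rough way to see the trade-off in this exchange is the textbook approximation for dynamic power, P ≈ C × V² × f. The sketch below assumes that approximation only, ignores leakage, and uses made-up numbers (the 20% capacitance reduction, the 1.2 V supply and the 3.0 GHz starting clock are all hypothetical, not Intel figures).

```python
def dynamic_power_w(switched_capacitance_f: float, voltage_v: float, clock_hz: float) -> float:
    """Textbook dynamic power approximation: P = C * V^2 * f (leakage ignored)."""
    return switched_capacitance_f * voltage_v ** 2 * clock_hz


def clock_at_power_budget_hz(budget_w: float, switched_capacitance_f: float, voltage_v: float) -> float:
    """Highest clock that stays inside a power budget under the same approximation."""
    return budget_w / (switched_capacitance_f * voltage_v ** 2)


# Hypothetical starting point: a part that uses its whole 130 W budget at 3.0 GHz.
TDP_W = 130.0
VOLTAGE_V = 1.2
OLD_CLOCK_HZ = 3.0e9
old_capacitance_f = TDP_W / (VOLTAGE_V ** 2 * OLD_CLOCK_HZ)

# Assume (purely for illustration) the shrink cuts switched capacitance by 20%.
new_capacitance_f = 0.8 * old_capacitance_f

power_at_old_clock_w = dynamic_power_w(new_capacitance_f, VOLTAGE_V, OLD_CLOCK_HZ)
new_clock_hz = clock_at_power_budget_hz(TDP_W, new_capacitance_f, VOLTAGE_V)

print(f"power at the old clock after the shrink: {power_at_old_clock_w:.1f} W")
print(f"clock that fills the same TDP again:     {new_clock_hz / 1e9:.2f} GHz")
```

In this toy model the chip uses less energy per cycle, and the vendor can either ship it at the old clock with a lower power draw or raise the clock until the same thermal budget is used up.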
  • by Terje Mathisen ( 128806 ) on Saturday August 25, 2007 @03:12PM (#20355363)
    When decoding "full HD" H.264, i.e. 40 Mbit/s Blu-ray or 30 Mbit/s HD DVD, with 1080p resolution, current CPUs start to thrash the L2 cache:

    Each 1080p frame consists of approximately 2 M pixels, which means that the luminance info will need 2 MB, right?

    Since the normal way to encode most of the frames is to have two source frames and one target, motion compensation (which can access any 4x4, 8x8 or 16x16 sub-block from either or both of the source frames) will need to have up to 2+2+2=6MB as the working set.

    Terje
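The working-set estimate in this post can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming 8-bit luma samples (one byte per pixel) and counting luminance only, as the post does; the 4 MiB and 6 MiB L2 sizes are example values for comparison, and a real decoder also needs cache space for chroma, the bitstream and code.

```python
MIB = 1024 * 1024


def luma_bytes_per_frame(width: int, height: int, bytes_per_sample: int = 1) -> int:
    """Bytes of luminance data in one frame, assuming one sample per pixel."""
    return width * height * bytes_per_sample


def working_set_bytes(width: int, height: int, frames: int = 3) -> int:
    """Luma working set for motion compensation: two source frames plus one target."""
    return frames * luma_bytes_per_frame(width, height)


frame_bytes = luma_bytes_per_frame(1920, 1080)
working_set = working_set_bytes(1920, 1080)

print(f"luma per 1080p frame: {frame_bytes} bytes ({frame_bytes / MIB:.2f} MiB)")
print(f"three-frame working set: {working_set} bytes ({working_set / MIB:.2f} MiB)")

# Example L2 sizes for comparison (assumed values, luma only, nothing else cached).
for l2_mib in (4, 6):
    fits = working_set <= l2_mib * MIB
    print(f"fits in a {l2_mib} MiB L2: {fits}")
```

Even in the larger case the luma data alone leaves almost no room for anything else, which is consistent with the thrashing described above.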
