Researchers Claim 1,000 Core Chip Created 118
eldavojohn writes "Remember a few months ago when the feasibility was discussed of a thousand core processor? By using FPGAs, Glasgow University researchers have claimed a proof of concept 1,000 core chip that they demonstrated running an MPEG algorithm at a speed of 5Gbps. From one of the researchers: 'This is very early proof-of-concept work where we're trying to demonstrate a convenient way to program FPGAs so that their potential to provide very fast processing power could be used much more widely in future computing and electronics. While many existing technologies currently make use of FPGAs, including plasma and LCD televisions and computer network routers, their use in standard desktop computers is limited. However, we are already seeing some microchips which combine traditional CPUs with FPGA chips being announced by developers, including Intel and ARM. I believe these kinds of processors will only become more common and help to speed up computers even further over the next few years.'"
Programmable CPU's (Score:3, Interesting)
How long will it be before we will see the first motherboards with FPGA emerge?
Then you can download the CPU type of your choice:
-- naah, I don't like this new Intel core, I will try the latest AMD instead...
Re:Programmable CPU's (Score:5, Informative)
A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD. Mind you, I've seen embedded designs where a microcontroller, RAM, ROM and custom logic are implemented in a $10 FPGA - especially where volumes are too low for an ASIC.
On the other hand, I could definitely see programmable logic inside Intel or AMD CPUs, a sort of super SSE. Then again, even there you'd probably be better off using GPU-like custom hardware for the heavy lifting. In fact I can see CPU/GPU hybrids being very common in low end machines. Full custom logic is always going to have a performance per $ advantage over FPGAs unless FPGA technology changes drastically.
Re: (Score:3)
Re: (Score:2)
Re: (Score:3)
I'd like to see an FPGA 1x PCI Express daughter-board and an open, well-defined interface so that software can reconfigure and then use the FPGAs on the daughter-board for useful PC tasks....
Games using it for high-speed calculations, then DVDFab using it to crack Blu-ray encryption faster, video encoding, audio encoding, then the browser using it for encryption, etc....
A nice open standard without greed attached so everyone can use it in their software. Although in the world of many cores not being use
Re: (Score:1)
you'll want >=x8 PCIe, since most interesting applications (especially those that distributed computing is worst at) are I/O bound.
problem: FPGA parts big enough to have PCIe and DDR interfaces and still do interesting stuff with are expensive on their own at ~$600
http://avnetexpress.avnet.com/store/em/EMController/FPGA/Xilinx/XC5VLX50T-1FF1136C/_/R-4696910/A-4696910/An-0?action=part&catalogId=500201&langId=-1&storeId=500201&listIndex=-1 [avnet.com]
http://www.em.avnet.com/evk/home/0,1707,RID%253D0%2526CI [avnet.com]
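The lane-count arithmetic behind wanting more than x1, assuming roughly 500 MB/s of usable bandwidth per lane (a PCIe 2.0-era assumption; the per-lane figure varies by generation):

```python
# Rough usable PCIe bandwidth by lane count (assumed ~500 MB/s per
# lane, i.e. PCIe 2.0 after 8b/10b encoding overhead).
PER_LANE_MB_S = 500

for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: ~{lanes * PER_LANE_MB_S / 1000:.1f} GB/s")
```

At x1 that's only ~0.5 GB/s, easy to saturate with an I/O-bound workload; x8 gives about 4 GB/s.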
Re: (Score:2)
You should jump on some of the newer non-volatile FPGAs that can run a microcontroller core. I found one from Lattice that was on sale for 29.95 with a JTAG programmer; now they are 50. There are other brands, and I'm always looking for cheap dev kits. I think there's a dev kit for OMAP from TI that's open source and not too expensive, but I'm not finding it.
Re: (Score:2)
They're called FPGA accelerators and they already exist. You just won't find them in your general desktop as the entry level cards cost about as much as a high-end workstation.
Re: (Score:2)
given your last comment - I think pre-defined hardware such as AMD/Intel desktop chips will always be faster than FPGA for a pre-specified set of individual operations. It's only when you get to operation combinations not defined at manufacture time, but used frequently, that FPGA has an advantage.
The current CPU design will stay for most of the work, and an FPGA attachment would handle the specialty work that isn't needed most of the time, and can be dropped.
The issue is reprogramming time and multi-thread
Re: (Score:3)
It is, however, a stupid approach: a CPU is built to do general-purpose calculations to allow for all software to exist without specialized hardware. An FPGA, on the other hand, is made to configure into specialized hardware in order to... well, I guess not having to build a lot of prototypes for hardware testing was its original purpose. But its uses go far beyond that in that it could turn into
Re: (Score:2)
Re: (Score:2)
The typical home user rarely needs to do any really heavy number-crunching - the closest they get is physics in games.
For the past 5 to 8 years there has been a "rasterization vs. ray tracing" [google.com] debate in the game-development and graphics community (with real-time ray tracing in games only being a theoretical pipe dream until recently).
If someone were to make ray tracing feasible, cheap, and practical for either a console or desktop PC, then yes... Home users will need that number crunching as Ray Tracing i
Re: (Score:2)
Intel already has a line of Atom processors with an FPGA for I/O operations.
Re: (Score:2)
How long will it be before we will see the first motherboards with FPGA emerge?
A desktop CPU in an FPGA will always cost more and perform worse (i.e. slower clock rate) than a full custom chip from Intel or AMD.
Sure, but no-one's going to do that anyway- if the OP thought that, then he missed the potential of his own idea.
I thought up something similar a few years back, and realised that, yes, the performance would obviously be horribly uncompetitive and pointless if you simply tried to reproduce (e.g.) an x86 chip's circuitry with an FPGA. The obvious idea (or rather, my idea, which I suspect countless other people also figured out independently) is that the FPGA *circuit* implemented in hardware replaces the *
Re: (Score:1)
You can get an AMD motherboard with a Hyptertransport link brought out, and then an FPGA to go into it.
(Just be careful before you look at the prices. They suffer from being a very niche market.)
Re: (Score:2)
Like this?
http://www.xilinx.com/products/devkits/HW-V5-ML510-G.htm [xilinx.com]
Re: (Score:1)
nice idea, but it will be dirt slow and 10x as expensive.
btw: welcome to 2003, when Xilinx released the Virtex-II Pro.
Re: (Score:2)
Re: (Score:1)
How long will it be before we will see the first motherboards with FPGA emerge? Then you can download the CPU type of your choice...
I suspect we shall see them by the year 2002!
Motherboard containing FPGAs combined with dedicated hardware, allows downloadable FPGA cores to emulate CPU models, "chipsets" and architectures: http://en.wikipedia.org/wiki/C-One [wikipedia.org]
1,000 cores (Score:2)
Re: (Score:3)
My bet is 1,000 very simple cores - most decent-sized FPGAs contain tens or hundreds of thousands of 'logic blocks'. The Spartan 6 [xilinx.com] series has between 3,840 and 147,443 logic blocks.
Re: (Score:2)
Re: (Score:1)
"By creating more than 1,000 mini-circuits within the FPGA chip, the researchers effectively turned the chip into a 1,000-core processor - each core working on it's own instructions."
This is entirely feasible, but the 'cores' would have to be very, very simple. Looking at the data sheet for the Xilinx Virtex-6 FPGA, it contains 118,560 Configurable Logic Blocks, each of which is equivalent to four look-up tables and eight flip-flops. If you wanted to create an 8-bit instruction-set processor, it would require at minim
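A back-of-envelope check of how far those resources stretch (the per-core LUT budget below is a made-up illustration, not a measured figure):

```python
# Totals from the Virtex-6 figures quoted above.
CLBS = 118_560
LUTS = CLBS * 4   # four look-up tables per CLB
FFS = CLBS * 8    # eight flip-flops per CLB

# Hypothetical budget: assume a minimal 8-bit soft core needs ~400 LUTs.
LUTS_PER_CORE = 400
print(LUTS, FFS, LUTS // LUTS_PER_CORE)  # → 474240 948480 1185
```

Under that (assumed) budget a thousand-odd very simple cores fit, which squares with the claim, but only barely.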
Re: (Score:2)
I agree, hence the "very simple" in my reply. I bet they are extremely limited, but fast. Other brands/models of FPGAs have different definitions of 'complex' - Altera has some pretty smokin' FPGAs, too.
Re: (Score:2)
Re: (Score:2)
FPGAs can be programmed to emulate any logic hardware (logically, though not usually electrically, so power and timing will not be accurate though the logical results will be identical). Many CPU cores have been rendered as library modules that can be programmed into an FPGA. Put 1,000 of them in your FPGA (or big array of FPGAs in this case) and route them together, and you can claim you have a 1,000-core CPU.
Of course, it takes more than one FPGA chip to do this, so you can't in any sense claim a 1,000-
Star Bridge Systems (Score:2)
Took long enough... (Score:3)
This story was already submitted two times before eldavojohn managed to get it to the front page in a little over an hour...
http://tech.slashdot.org/submission/1432844/University-of-Glasgow-pioneers-1000-core-processor [slashdot.org]
http://tech.slashdot.org/submission/1432512/1000-core-processors- [slashdot.org]
Re: (Score:3)
Re: (Score:1)
There also was some news about a monkey with three asses...
Does anyone have a link... (Score:3)
...to a paper that assumes that the reader already knows what a cpu is? This article is content-free.
Life Cycle (Score:5, Interesting)
I think this is a great development. I've been using FPGAs in medical imaging for about 15 years. The groups that use GPUs are getting great performance--definitely--but seeing as how MRI and CT machines are placed and need to run for 10, 15, 20 years, I don't see how the GPUs will survive that time. One large OEM was pushing GPUs for their architecture, and I can't believe it will be successful if success is measured on the longevity scale. I'm sure the service sales guy will clean up.
Why do GPUs fail? I'm not sure of the exact modes of failure but the amount of heat has got to have something to do with it. FPGAs will run much cooler and in the FLOPS/Watt game, will win.
Re: (Score:2)
If they make the GPU replaceable, it's not such a big deal.
If they underclock the GPU to reduce heat, again, not such a big deal.
A GPU might have an expected 5-10 year lifetime at full throttle, but if you knock it back to 25%, you will probably get a much better survival rate.
Re: (Score:1)
The drawback with using FPGAs compared to commodity processors is that the FPGA market currently does not support using the bleeding-edge processes that CPUs are manufactured with. Typically a competitively priced FPGA will be at least one generation behind a CPU. In HPC, FPGAs are a plausible improvement, but at a smaller scale the development costs for incorporating custom firmware for an FPGA into an application are significant. It all really rests on what demand is out there for a particular algorithm
Re: (Score:1)
Re: (Score:2)
FPGAs are much slower and less efficient and bigger than a dedicated design because even the simplest gate is actually a block that can be controlled to perform many different functions. That block consists of several latches and a complex gate, perhaps a hundred transistors in all, whereas a 2-input nand gate consists of four transistors. So it's 25 times bigger (area), and the distance to the next gate is increased by 5x (linear). The complexity makes the block inherently slower than a simple gate, and th
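The scaling claim above checks out as plain arithmetic:

```python
import math

FPGA_BLOCK_TRANSISTORS = 100  # configurable block, per the estimate above
NAND2_TRANSISTORS = 4         # plain CMOS 2-input NAND gate

area_ratio = FPGA_BLOCK_TRANSISTORS / NAND2_TRANSISTORS
linear_ratio = math.sqrt(area_ratio)  # wire length grows with the square root of area
print(area_ratio, linear_ratio)  # → 25.0 5.0
```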
Re: (Score:2)
FPGAs are more difficult to program.
You don't program FPGAs.
FPGA development is synchronous digital logic design. Verilog and VHDL are hardware description languages; they are not programming languages. Having a software-engineering or programming background does not mean you can simply learn Verilog and start doing FPGA design.
Re:Life Cycle (Score:4, Interesting)
Re: (Score:2)
Two things--if there's a failure, then there's a problem. For years, the machines used military-grade hardware. Machines that were designed in 1992 and sold in 1994 are still running strong today. Then, to cut costs, the OEMs switched to more commodity hardware, and uptime has effectively sucked since. You make it sound like it's no big deal to call tech support. It is a big deal. To put it in dollar terms, we had a machine go down for technically 4 hours. The tech was there, made the diagnos
Re: (Score:2)
Are you serious? There's no way FPGAs beat GPUs in FLOPS/watt.
FPGAs have so much more overhead both in space and power due to programmability, whereas GPUs are pure processing. Further the algorithms necessary for CT and MRI are practically the same algorithms GPUs were designed for, so if you were to use an FPGA, your design would end up with a similar architecture anyway. Further, while low end commercial GPUs (like those you and I use for gaming), may only last 3-4 years, the high end scientific computin
Re: (Score:2)
I am serious and you are wrong. I don't have a clear idea what you mean about space and power due to programmability. FPGAs are soft coded hardware. If by the nature of being able to code it and change it you mean "overhead" then fine. But even with that overhead, they are still more efficient. You might be thinking of raw speed instead of FLOPS/Watt.
From "A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N^2) Gravitational N-body Simulation
"
"In this paper, we describe th
Re: (Score:2)
What's most surprising is that the research was on matrix dot products, something that graphics cards do in 3D operations. The FPGA beat the graphics card at its own game in both performance and performance per watt.
I'm impressed. Perhaps we'll see graphics cards made up of nothing but programmable FPGAs in the future. Instead of loading and running a CUDA kernel we'll be loading and running an FPGA core.
Re: (Score:1)
Re: (Score:3)
Re: (Score:2)
What are the practical differences between targeting an FPGA on a computing platform and targeting more ubiquitous massively-parallel programmable pipelines in modern GPUs? Also, what are the fundamental differences? Could my GPU already contain FPGAs?
The main difference is that you don't program FPGAs. You do synchronous digital logic design which is implemented in the FPGA fabric. Thinking that you can program them like you program a sequential-execution processor is a recipe for failure. And, yeah, C-to-gates tools are a joke.
Disappointment (Score:5, Funny)
YouTube Algorithm (Score:2)
20x speed is getting closer to what I need before I can even ATTEMPT to build my very own Holodeck.
http://en.wikipedia.org/wiki/Holodeck [wikipedia.org]
Re: (Score:2)
FPGAs ... (Score:2)
Yawn. Seriously.
(says the guy who does FPGA design for a living.)
Re: (Score:2)
Indeed. 1,000 simple CPUs will fit in an FPGA, though it might require one near the top of the line. (e.g., PicoBlaze reportedly needs 96 "slices" and 1.5 "block RAMs"; the biggest Virtex-7 FPGA has more than 1400x as many block RAMs and 3100x as many "slices".) There's little doubt that you could program a DCT for a PicoBlaze, if you wanted to.
It's hard to tell what 5.0GBps refers to -- the bitrate of the incoming, uncompressed, RGB video data? If so, that's maybe about 800FPS of 1080P video. In a circa
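Running the figures quoted above (rough numbers from this comment, not datasheet-verified):

```python
# 1) PicoBlaze fit: the big Virtex-7 reportedly has ~3100x the slices
#    and ~1400x the block RAMs that one PicoBlaze needs, so block RAM
#    is the limiting resource.
print(min(3100, 1400))  # → 1400 cores, BRAM-limited

# 2) If 5.0 GB/s really is raw 24-bit RGB 1080p, the implied frame rate is:
bytes_per_frame = 1920 * 1080 * 3
print(round(5.0e9 / bytes_per_frame))  # → 804, i.e. "maybe about 800 FPS"
```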
first (Score:2)
We first need to break the lock of the x86 instruction set and the operating system that requires it. CPUs already try to execute multiple x86 instructions in parallel, but this is severely limited by the sequential instruction-set design. There needs to be a way to express computations A and B using different sets of virtual registers and let the hardware execute them sequentially or in parallel depending on its capabilities, or vectorize/parallelize multiple iterations of a loop. If software, including operating systems
Re: (Score:2)
We first need to break the lock of the x86 instruction set
Yep. All hail ARM.
There's a reason why embedded devices use ARM over x86. The x86 instruction set has a lot of instructions that no compilers (and therefore hardly anyone) ever use. Those unused instructions are just sitting there in the silicon, charged up with electrons, draining power, generating heat, and making it harder to create smaller & faster x86 chips. Some of these "deprecated" instructions are microcoded, but that just means they're slower and even less likely to be used by an optimizin
Re: (Score:3)
Sigh. Multi-way branching was already old when ARM implemented it. What you fail to explain (understand?) is that there is a cost associated with either choice. As with most of engineering there is not a simple proposal that wins. In the case that branch prediction is perfect, the predicted execution is cheaper. In the case that the prediction is terrible the multi-way execution wins. In real life branch prediction is neither perfect, nor is it that terrible, so engineers have to balance the likelihood that
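That balance can be sketched as a toy expected-cost model (every cycle count below is an illustrative assumption, not a measurement):

```python
TAKEN_PATH = 4   # assumed cycles to execute one side of the branch
MISPREDICT = 15  # assumed pipeline-flush penalty on a wrong guess
BOTH_PATHS = 8   # assumed multi-way/predicated cost: always pay both sides

def predicted_cost(accuracy):
    # Expected cycles when predicting: the chosen path, plus the flush
    # penalty weighted by how often the predictor is wrong.
    return TAKEN_PATH + (1 - accuracy) * MISPREDICT

print(predicted_cost(0.99) < BOTH_PATHS)  # → True: good predictor, prediction wins
print(predicted_cost(0.50) < BOTH_PATHS)  # → False: coin-flip branches favor multi-way
```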
Re: (Score:2)
Well.... no. A few percent is a small deal. A larger percentage would be a bigger deal.
You've made a critical failure here: the x86 *instruction* cache stores x86 instructions after they've been decoded into simpler RISCy form
Yes - after they've been transferred across the bottleneck from memory. So at the point where it matters (the cache fetching lines from memory) the code is in a dense form because of the CISC encoding.
It's really quite simple: RISC is an advantage where the cost of decoding dominates because it simplifies the decoder circuitry. CISC is an advantage where the cost of transferring instructions (and the space that they occup
Re: (Score:2)
Admittedly slightly tangential to your discussion of virtual machines ... but part of the point of Intel's IA64 instruction set was to address this kind of thing. The compiler's job was to specify groups of instructions that could be executed safely in parallel, then the CPU would execute these according to its capabilities.
But a higher-level virtual instruction set with just-in-time compilation is admittedly more insulated against future technology and more amenable to the code being run on a variety of a
good grief - maybe rediscover integration as well? (Score:1)
"However, we are already seeing some microchips which combine traditional CPUs with FPGA chips being announced by developers, including Intel and ARM."
welcome to 2004.
Xilinx Virtex-II Pro, includes internal PPC 405 core
Security issues (Score:3)
A programmable hardware platform would provide amazing computing power because of hardware specialization: rather than emulating a proper CPU, you would download core architecture into the FPGA to accelerate tasks such as REGEX processing or H.264 decoding. You could compile the entire logic of a program into a gate array with various logical operators and flip-flop circuits for unlimited (albeit slow) registers (L2 registers) as well as including standard registers and SRAM cache (L1).
Although the FPGA runs slower than a regular CPU, direct programming rather than instructional programming (that is logic blocks that perform programmatic functions, rather than logic blocks that interpret discrete instructions to follow programmatic functions) would shorten the overall hardware logic path. In short, the chip would follow fewer clock cycles and instead just "do things." The CPU would be slow, but optimized for your workload. The main performance bottleneck would be the context switch: replacing the logic gate configuration with a new program every time you switch. Other than that, dynamic program expansion could be utilized: inlining operations like multiplication, addition, etc, or breaking them out if space constraints make it hard to load the whole program onto the FPGA that way.
The obvious major issue is, of course, security. You can now reprogram the CPU, which makes it difficult to prevent a program from bypassing any and all hardware security measures. This is solved by implementing a completely new security design on the chip, by which the CPU itself (the FPGA) is under the control of external security mechanisms (paging etc. handled in the MMU, outside the FPGA space, would largely mitigate most of this); it's not impossible to deal with, it's just an issue that needs to be raised.
In short, this sucks for "download the new Intel CPU into your BIOS/bootloader." This sucks for whatever general purpose CPU you can think of. For an entirely new programmatic platform, however, this would provide some interesting performance possibilities, and some interesting challenges.
Re: (Score:2)
This sounds reminiscent of the hype around reconfigurable computing ten years ago. A lot of the hype has died down now that people have tried and discovered that what you've described is wrong.
First point: a specialised hardware circuit will always be faster than a generic circuit.
Second point: generic circuits require a lot more interconnect than specialised circuits which impacts how many of them you can fit on a die relative to specialised circuits.
Third point: a CPU is a set of specialised circuits bein
Re: (Score:2)
So here is the basic problem. If the target application is made of steps that exist as specialised circuits in the CPU then selecting which of those circuits to apply in sequence will be faster than a generic circuit because the specialised circuit uses the space on the die more effectively and is clocked at a much higher frequency.
If the target application is made of steps which are very unlike the circuits provided on the CPU, then the generic design will win. For everything in between it is a trade-off. Fewer things win as FPGA designs than you might expect, and there are ten years of literature showing marginal improvements.
Encryption involves a lot of things that are faster in hardware, where a single clock cycle can do work that takes 30,000 clock cycles on the CPU.
Regex calculation, faster in a specialized hardware chip.
Codec decoding, we use an off-board CPU that has a microkernel and a small program; it benefits from just not running an OS and being a dedicated RISC processor, but in no other way.
GPU, specialized instruction set. Not dedicated to a specific task, but dedicated to a type of task. WAY faster t
Re: (Score:2)
It's odd that you pick crypto as I've spent a little time implementing crypto primitives on weird and exotic hardware. Sure - division is quite slow, that is why most primitives avoid the need for it, or only perform reductions in a specialised field rather than a full division. Multiplication on the other hand is fast and tends to be used a lot.
AES is quite a bad example for FPGAs. The very latest AES extensions from Intel can compute a round of AES in under three clock cycles. Performing the full cipher t
Re: (Score:2)
AES is quite a bad example for FPGAs. The very latest AES extensions from Intel can compute a round of AES in under three clock cycles. Performing the full cipher takes less than twenty clock cycles (on a processor running in excess of 3Ghz). No FPGA in the world can keep up with that performance.
"AES Extensions" means that Intel put a dedicated instruction pipeline in the processor to compute AES. That means you now have a specialized purpose hardware encryption chipset built into your CPU, tada. Just like an FPU.
Try the same Intel CPU with IA-32 instructions implementing AES, you won't do the whole cipher in 20 cycles. If you implement the exact same instruction architecture on an FPGA, it'll run at the slower clock of the FPGA, but still do it in 20 cycles. This means when you want to run
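If the cycle count really were identical either way, throughput would scale with clock alone; a quick sketch (the 300 MHz fabric clock is an assumed figure, not from the thread):

```python
CYCLES = 20        # cycles per AES block, per the quote above
CPU_HZ = 3.0e9     # hard CPU with the dedicated AES pipeline
FPGA_HZ = 300e6    # assumed clock for the same logic in FPGA fabric

cpu_blocks_per_s = CPU_HZ / CYCLES
fpga_blocks_per_s = FPGA_HZ / CYCLES
print(cpu_blocks_per_s / fpga_blocks_per_s)  # → 10.0: the hard CPU wins by the clock ratio
```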
Re: (Score:2)
Your original point was that a reconfigurable processor would be more efficient at most tasks than a specialised processor, and that the big issue would be handling security. Why resort to car analogies when your entire argument can be summarised so concisely?
I have made a simple enough argument that you seem to keep missing - while it is a nice theory that we can reconfigure chips to be more efficient for a particular task the actual practice doesn't live up to expectations. Reconfigurable architectures ha
Re: (Score:2)
My point is everyone responding when this was first posted had this idea that you can just "reconfigure your FPGA to be a new Intel CPU" by some magic, and it'll work better. This is a dumb and short-sighted idea; if you have reconfigurable hardware, you have the ability to ad-hoc create specialized gate hardware rather than run software on generic instruction set architectures.
As for lard-cycles, somebody pointed out modern FPGAs clock at 1.5GHz; I'm more interested in what someone else said about the l
Re: (Score:2)
Once upon a time, there was a writable control store computer from TI.
Someone wrote a Super Compiler that produced microcode directly instead of producing instructions as usual.
Perhaps something like this can come back.
Re: (Score:2)
I could see this being used by a driver model. A generic driver is present that is able to reprogram the FPGA. Specialized or even derived drivers use the - now static - set of functionality. This could allow you to create general-purpose CPUs that can still be tweaked for certain tasks. It would also allow for upgrades of the algorithms being implemented. Symmetric cryptography and encoding/decoding would be obvious choices.
If updating the FPGA is really slow, I would not try and let applications change t
Re: (Score:1)
Re: (Score:2)
Yes. [uclinux.org]
A New Chip (Score:2)
Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!
Re: (Score:2)
Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!
It's called a "programmer".
Re: (Score:2)
Now we need a chip that can take any given problem and divide it into one thousand parts so we can feed it into these processors. -Gives me a headache!
It's called a "programmer".
It's called a "fpgaprogrammer"
get it right.
Re: (Score:2)
For that you need a 1001-cpu chip.
Wow, this will end useful software! (Score:2)
Software developers have barely figured out how to write single-threaded algorithms without crashing. Now we are seeing more multithreaded algorithms with race conditions, deadlocks and other data-sharing bugs.
Can you imagine what will happen if every desktop machine has one or two FPGAs available for programs to use as needed?
PHB says "Hey, I've heard that you can make the program faster if you program custom hardware on the motherboard's FPGA. Get the new intern to write some FGPA code for our algorithm
Re: (Score:2)
Multi-threaded computing is not rocket science. Most bad multi-threaded programming is bad because a lot of so-called "software developers" just plain suck.
Re: (Score:2)
Sorry, but to make parallelism painless, you have to restrict the language in ways that make a lot of other things painful.
A language where every method call is a perfect closure is easily made parallel, the only question left is what granularity of parallelism will produce a gain when considering the overhead of managing threads. It also introduces a lot of overhead for constantly copying data on methods you are not going to be making parallel and rendering it slower for some to many applications when comp
And, to think... (Score:3)
Ten years ago some young 6-digit ID Slashdotter was getting modded down for suggesting a Beowulf cluster of cores. Who's laughing now, mods?!?!?
huh (Score:2)
What about an all core chip? (Score:3)
The ultimate end to this trend is to build a system that is just core processing logic, with logic and memory all fused as closely as possible. I call it the BitGrid... it consists of 4-bit look-up tables hooked into an orthogonal grid. Because every single table can be used simultaneously, there is no von Neumann bottleneck to worry about.
Petaflops... here we come.... !
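A minimal software sketch of the idea: each cell is a 16-entry truth table addressed by its four input bits, and cells are wired to their neighbors. Here a 1-D chain of such cells computes the parity of a bit-vector:

```python
def make_lut(fn):
    """Precompute the 16-entry truth table for fn(a, b, c, d)."""
    return [fn(a, b, c, d) & 1
            for d in (0, 1) for c in (0, 1) for b in (0, 1) for a in (0, 1)]

def step(lut, a, b, c, d):
    # A cell just indexes its truth table with its four input bits.
    return lut[a + 2 * b + 4 * c + 8 * d]

xor_lut = make_lut(lambda a, b, c, d: a ^ b)  # this cell only uses two inputs

def parity(bits):
    acc = 0
    for bit in bits:  # one grid cell per input bit, chained
        acc = step(xor_lut, acc, bit, 0, 0)
    return acc

print(parity([1, 0, 1, 1]))  # → 1 (odd number of set bits)
```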
Re: (Score:2)
You've just described the FPGA. Large areas of an FPGA are devoted to thousands of almost-identical functional blocks ("slices" in Xilinx parlance). For instance, in one Xilinx family, a slice contains a 4-input LUT, a flip-flop (1 bit of memory, called an FF), and other specific gates that help implement things like carry chains, shift registers, and some 5+-input functions the chip designers thought were commonly encountered.
Other areas contain "block RAMs" and "DSP cores" (basically, dedicated multiplie
But can you play Doom on it? (Score:1)
Re: (Score:2)
Actually, I think the correct overused meme would be, "Imagine a Beowulf cluster of those!"
That's nothing. (Score:2)
Mine goes to 1011.
Amdahl's Law (Score:1)
I hate to be the devil's advocate, but at what point will Amdahl's Law take hold fully, and adding more cores to a processor prove to be a fruitless endeavor?
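For reference, Amdahl's law puts the speedup on n cores at 1 / ((1 - p) + p/n), where p is the fraction of the work that parallelizes:

```python
def amdahl(p, n):
    # Serial fraction (1 - p) runs at full cost; the parallel
    # fraction p is divided across n cores.
    return 1 / ((1 - p) + p / n)

for p in (0.50, 0.95, 0.99):
    print(f"p={p:.2f}: {amdahl(p, 1000):.1f}x")  # roughly 2x, 20x, 91x
```

Even with 99% of the work parallelized, 1,000 cores buy only about a 91x speedup; the serial fraction dominates long before the core count does.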
Re: (Score:1)