Wintel, Universities Team On Parallel Programming 91
kamlapati writes in with a followup to last month's news that Microsoft and Intel are funding a laboratory for research into parallel computing at UC Berkeley. The new development is the imminent delivery of the FPGA-based Berkeley Emulation Engine version 3 (BEE3), which will allow researchers to emulate systems with up to 1,000 cores in order to explore approaches to parallel programming. A Microsoft researcher called BEE3 "a Swiss Army knife of computer research tools."
1000 cores? (Score:1)
Re:1000 cores? (Score:5, Informative)
Fine-Grain Parallelism Is the Future, Not Threads (Score:2, Insightful)
One day soon, the computer industry will realize that, more than 150 years after Charles Babbage conceived his general-purpose sequential computer, it is time to move on and change to a new computing model. The industry will be dragged kicking and screaming into the 21st century. Threads were not originally intended to be the basis of a parallel software model but only a me
Re: (Score:2)
Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.
The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.
Re: (Score:2, Insightful)
This is precisely what is wrong with the current approach. The idea that the problem should be addressed from the top down has been tried for decades and has failed miserably. The idea that we should continue to write implicitly sequential programs and have some tool extract the parallelism by dividin
Re: (Score:2)
Maybe so, but it's certainly not what I was suggesting.
Rather, I'm suggesting that we should have tools which make it easy to write a parallel model, even if individual tasks are sequential -- after all, they are ultimately executed in sequence on each core.
stupid (Score:2)
You only need to "worry" about that if you insist on programming your multi-core machine in low-level C. Better solutions have existed for decades, people just don't use them. How is the BEE3 going to change that?
Re:1000 cores? (Score:4, Insightful)
2) A significant number of applications can and do run on 1000+ cores. Sure, most are scientific apps rather than consumer apps, but there is a market for it nevertheless. Go tell a high performance computing guy that there's no need for 1k cores on a single chip and watch him collapse laughing at you.
Re:1000 cores? (Score:4, Funny)
Re: (Score:2)
I keep wondering when we're going to put processing closer to the memory again. As in, put a couple of SPUs right on the memory chips. Or at least an FPGA with a couple thousand gates; that would be very general purpose.
Re: (Score:2)
Re: (Score:2)
Granted, in 90% of day to day uses that's all you need. But the other 10% would probably love to see RAM running synchronously with the CPU.
Re: (Score:2)
Re: (Score:1)
(just teasing)
Re: (Score:2)
Re: (Score:1)
Except that instead of running SETI@home, they used heavily FPGA-optimized designs. Since most radio astronomy only requires a few bits of precision (2-8), modern CPUs or GPUs are incredibly wasteful for them. So instead they use heavily optimized fixed-point math circuitry. By using FPGAs they
Re: (Score:2)
You could try and have a process running on each core, but even on a university server, you will only have a few hundred processes running, so giving every user a single core is still going to underutilize 80% of those cores. And even then many of those processes are hourly cron jobs or
Re: (Score:2)
OK, so I know there's no "wrong" mod, but don't mod it insightful.
Re: (Score:2)
Re: (Score:2)
Why would you need to? Either your program is multithreaded or it isn't; and if it is, it either is or isn't properly synchronized. The number of cores is completely irrelevant; a broken multithreaded program will fail randomly on a single-core machine too, and a single-threaded program on a 1000-core machine won't run into any issues either.
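To make that concrete, here's a tiny broken example of my own (Java, not from the article): two threads doing unsynchronized increments lose updates whether you run them on one core or a thousand, because ++ is a read-modify-write, not an atomic operation.

public class RaceDemo {
    static int counter = 0; // not volatile, not synchronized -- the bug

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> { for (int i = 0; i < 1_000_000; i++) counter++; };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // often prints less than 2,000,000
    }
}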
cool (Score:2)
Re:cool (Score:5, Funny)
(Okay, the joke would have worked better in the P4 days.)
"stuck with a ...serial programming model" (Score:5, Insightful)
Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially. There is sometimes no way around doing something sequentially.
Now, can we blow a few cycles on a few cores trying to predict which operations will get executed next? Yeah, sure, but that's not a programming problem, it's a hardware design problem.
Re:"stuck with a ...serial programming model" (Score:4, Insightful)
While I'll agree that not all programmers are stuck with the serial programming model, threads aren't exactly a great solution (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html [berkeley.edu]). They're heavyweight and inefficient compared to running most algorithms on e.g. bare hardware or even an FPGA. Plus they deal badly with embarrassing parallelism (http://en.wikipedia.org/wiki/Embarrassingly_parallel [wikipedia.org]). And finally they are HARD to use: the programmer must explicitly manage the parallelism by creating, synchronizing and destroying threads.
Setting aside those problems which exhibit no parallelism (for which there is really no solution but a faster CPU), there are many classes of problems which would benefit enormously from better programming models that are tied more efficiently to the operating system and hardware, rather than going through an OS-level threading package.
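For what it's worth, here's a rough sketch of my own (not from the Berkeley report) of the create/synchronize/destroy dance being complained about, using nothing but raw Java threads to square an array in parallel:

public class ManualThreads {
    public static void main(String[] args) throws InterruptedException {
        double[] in = new double[1_000_000], out = new double[in.length];
        java.util.Arrays.fill(in, 3.0);

        int nThreads = 4, chunk = in.length / nThreads;
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? in.length : lo + chunk;
            workers[t] = new Thread(() -> {                 // create
                for (int i = lo; i < hi; i++) out[i] = in[i] * in[i];
            });
            workers[t].start();
        }
        for (Thread t : workers) t.join();                  // synchronize
        System.out.println(out[0]);                         // 9.0; workers now dead (destroy)
    }
}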
Re: (Score:2, Interesting)
It's the people who really can't program that are having significant trouble with parallelization in modern applications. That's not to say that in the future I won't love to be able to express a solution and have it automatically parallelized, but for the time being creating applications that take advantage of multiple cores well (server apps, not client apps) is not that difficult if you know what you're doing.
Though, like
Re: (Score:1)
Why do Java and
Part of the goal here is to make it so that, like with memory management, someone who knows what they're doing (i.e. a hardcore manage-their-own-memory assembly or C++ programmer) can write a large parallel library, and someone who doesn't (
Re: (Score:1)
Re: (Score:2)
Yes, we're making the same point, though I'm also pointing out that the doom and gloom that is always presented wrt parallelism in current programming languages isn't so. It's only so for those that don't know what they're doing.
Re: (Score:2)
Re: (Score:2)
I find memory management is trivial in most real-life C++ code. And management of non-memory resources is easier than in Java.
Re:"stuck with a ...serial programming model" (Score:4, Interesting)
Current operating systems could run code in parallel if they stopped scheduling threads a timeslice on a single processor and instead scheduled a timeslice across multiple processors. Take an array of 1000 strings and a regex to match them against. If the program is allocated 10 processors it can do a simple interrupt and have them start working on 100 strings each. By having the processors already allocated you avoid the overhead of switching memory spaces and of scheduling, making this kind of fine-grained parallelism feasible.
But the problem here is that most programs will use one or two processors most of the time and all the available processors only occasionally. And if your parallel operation has to synchronize at some point, then you'd have all your other allocated processors doing nothing while waiting for one to finish its current work. So a huge amount of time is wasted by allocating more than one processor to a thread.
A solution to the unused-processor problem is to have a single memory space and, as a consequence, only run typesafe code -- an operating system like JavaOS or Singularity or JXOS. This lets any processor be interrupted quickly to run any process's code in parallel, so CPUs can be dynamically assigned to different threads. Even small loops can be effectively run across many CPUs, and there is no waste from the heavyweight allocations and clunkiness ultimately caused by the separate memory spaces needed to protect C-style programs from each other. This is why it is the operating system, not the programming models, that is the main problem.
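Just to illustrate the workload the parent describes (not the scheduler change itself, which no current OS gives you), here's a plain-threads Java sketch of handing 10 workers 100 of the 1000 strings each:

import java.util.regex.Pattern;

public class SplitRegex {
    public static void main(String[] args) throws InterruptedException {
        String[] strings = new String[1000];
        java.util.Arrays.fill(strings, "hello world");   // stand-in data
        Pattern regex = Pattern.compile("wor.d");
        boolean[] matches = new boolean[1000];

        int workers = 10, slice = strings.length / workers;
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int lo = w * slice, hi = lo + slice;
            pool[w] = new Thread(() -> {
                for (int i = lo; i < hi; i++)            // this worker's 100 strings
                    matches[i] = regex.matcher(strings[i]).find();
            });
            pool[w].start();
        }
        for (Thread t : pool) t.join();
        System.out.println("first string matches: " + matches[0]);
    }
}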
Re: (Score:2)
Re: (Score:1)
There is no reason why 'foreach' or 'collect' cannot use other processors
While it sounds like a good idea, 'foreach' loops often collect a value from each iteration into a single variable to produce a sum, or some similar aggregation of the array's contents. If the iterations were to run at the same time, the second iteration of the loop would not have the data from the first to append to.
I do believe that parallel processing could be used to improve the speed of 3d rendering and particle simulations though, and that is reason enough to be optimistic about it.
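The usual fix for that single-accumulator objection is to give each worker its own partial result and combine them at the end, which works whenever the combining operation is associative (like +). A rough Java sketch of my own, with the sizes made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class PartialSums {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 2);

        int tasks = 8, slice = data.length / tasks;
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        List<Future<Long>> partials = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            final int lo = t * slice, hi = lo + slice;
            partials.add(pool.submit(() -> {
                long s = 0;                               // this task's private accumulator
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get(); // the small serial combine step
        pool.shutdown();
        System.out.println(total);                        // 2,000,000
    }
}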
Re: (Score:2)
Re: (Score:2)
But if it is blocked because the target of the I/O operation, say, a local database server, is busy calculating the data to be returned, throwing more processors at it might indeed cause it to unblock sooner.
Re: (Score:2)
Yup. And as Amdahl's Law [wikipedia.org] (paraphrased) puts it: the amount of speed increase you can achieve with parallelization is always constrained by the parts of the process that can't be parallelized.
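For the curious, the usual statement of it: if a fraction P of the work can be parallelized across N cores, then Speedup(N) = 1 / ((1 - P) + P / N). For example, a program that is 95% parallelizable tops out at about 1 / (0.05 + 0.95/1000) ≈ 19.6x on 1000 cores, and can never exceed 1 / 0.05 = 20x no matter how many cores you throw at it.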
Re:"stuck with a ...serial programming model" (Score:4, Insightful)
In the very near future, we could potentially have systems with hundreds of cores that sit idle all the time because none of the software takes advantage of much more than 5-10 cores. Of course, this would never actually happen, because once the hardware manufacturers notice this to be a problem, they will stop increasing the number of cores and try to make other changes that result in increased performance for the end user. There will always be a bottleneck -- either the software paradigms or the hardware -- and right now it looks like in the near future it will be the software.
Yes, there are some algorithms that, no matter what you do, have to be executed sequentially. However, there is a huge truckload of algorithms that can be rewritten, with little added complexity, to take advantage of parallel computing. Furthermore, there is a slew of algorithms that could be parallelized with a slight loss in efficiency but a net gain in performance. This third type of algorithm is, I think, the most interesting for researchers -- even though parallelizing the algorithm may introduce redundant calculations or added work, the increased number of workers outweighs this.
In other words, what is more efficient: 1 core that performs 20,000 instructions in 1 second, or 5 cores that each perform 7,000 instructions, in parallel, in 0.35 seconds? Perhaps surprisingly to you, the single core is more efficient (20,000 instructions instead of 7,000*5 = 35,000 instructions) -- BUT, if we have the extra 4 cores sitting around doing nothing anyway, we may as well accept the inefficiency and finish the algorithm about 2.9 times faster.
Re: (Score:2)
(1..100).foreach { |e| e.function() }
And turn it into
(1..50).foreach { |e| e.function() }
(51..100).foreach { |e| e.function() }
If these don't run on two or more processors faster than the original runs on one, you'll never get much use out of the extra ones. Our current operating systems can't take a small loop of, say, 100 iterations, divide it up across processors, and have it run faster than just doing it on one. That's the problem. Just a rough guess but I bet
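Taken literally, the split looks like this in plain Java (my own sketch; function() is a stand-in for whatever the loop body does). For 100 trivial iterations, the cost of creating and joining the threads will usually dwarf the loop itself, which is exactly the point:

public class SplitLoop {
    static void function(int e) { /* trivial per-element work */ }

    public static void main(String[] args) throws InterruptedException {
        Thread first  = new Thread(() -> { for (int e = 1;  e <= 50;  e++) function(e); });
        Thread second = new Thread(() -> { for (int e = 51; e <= 100; e++) function(e); });
        first.start(); second.start();   // thread startup alone costs far more than 50 calls
        first.join();  second.join();
    }
}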
Re: (Score:3, Insightful)
Re: (Score:2)
And I don't think it needs to be a bottleneck unless it needs to be sequential. All you do is have multiple LINDAs (or queues, or jars, or whatever), and have multiple sources feeding each.
For example: Suppose you wanted that raytracer to do some simple anti-aliasing which took into account the surrounding pixels. The antialiasing "jar" could be fed by all of
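A toy sketch of the "jar" idea (my own; the PixelJar/antialias names are made up), using a shared blocking queue with two producers and one consumer:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PixelJar {
    public static void main(String[] args) {
        BlockingQueue<Integer> jar = new ArrayBlockingQueue<>(1024);

        // Two renderer threads feeding the same jar -- the "multiple sources" point.
        for (int id = 0; id < 2; id++) {
            final int offset = id * 100;
            new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    try { jar.put(offset + i); }            // a finished pixel index
                    catch (InterruptedException e) { return; }
                }
            }).start();
        }

        // One consumer doing the anti-aliasing pass as pixels show up.
        new Thread(() -> {
            try {
                for (int n = 0; n < 200; n++) antialias(jar.take());
            } catch (InterruptedException e) { /* shutting down */ }
        }).start();
    }

    static void antialias(int pixel) { /* blend with neighbouring pixels here */ }
}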
One step further (Score:2)
Re: (Score:2)
First, how much effort will it take to optimize it, versus throwing another core at it? Or computer? Not always an option, but take, say, Ruby on Rails -- it wouldn't scale to 1000 cores, but it might scale to 1000 separate machines. And yes, it could probably run four or five times faster -- and thus require four or five times less hardware -- but at what cost? Ruby itself is slow, and there are certain aspects of it which will always be slow.
But, you see, the advan
Re: (Score:2)
I believe SMP (when did we decide to start calling it "multicore"?) is mostly useful (in everyday computing!) for multiple-user machines.
Right now I'm running a number of important jobs on a four-CPU Linux server. At the same time, one guy has a process hogging 1
Re: (Score:2)
Additionally, many algorithms cannot be parallelized.
Conventional wisdom, but it's just not true. Maybe you meant to say automatically parallelizable; however, that's much the same as saying that it is necessary to
I work in parallel programming and I have never seen a real world problem/algorithm that was not parallelizable. Maybe there's a few obscure ones out there but I've never seen them. Anybody want to suggest even one?
In any case almost all PC's these days are already highly parallel; display c
Re: (Score:1)
Reading a sector of data from a hard disk.
But that's pretty obscure, I'll grant you that.
Re: (Score:2)
Reading a sector of data from a hard disk.
But that's embarrassingly parallelizable; it's why hard disks have multiple heads and multiple platters, to read/write in parallel and raise throughput. RAIDs make it even more parallel. Within limits, and for normal volumes of hard disk data, which are much larger than a sector (small data is held in memory and caches), this will increase throughput in proportion to the number of heads.
In any case that's a hardware limitation, nothing to do with an algorithm
Re: (Score:1)
Then you admit that there are operations upon which software must wait. There is no way for a program that relies on the read data to act upon it until it arrives in the CPU registers. How are you going to parallelize that?
I'm not aware of any atomic operation in real world programming that takes more than a tiny fraction of a second and thus causes real world impact
These things add up. You would be the first to admit, I'm sure
nonsense (Score:2)
Multi-threaded programming is cumbersome. There have been better ways of doing parallel programming for a long time.
Additionally, many algorithms cannot be parallelized.
Whether algorithms can be parallelized doesn't matter. What matters is whether there are parallel algorithms that solve problems faster than serial algorithms, and in most cases there are.
Even languages like Erlan
Why 1000 ? (Score:2)
Re:Why 1000 ? (Score:4, Informative)
Basically 1000 is the goal, anything over that is a bonus. And yes, we like powers of 2 as much as you.
Linus Torvalds & Dave Patterson discuss it on (Score:4, Interesting)
NOW - Network Of Workstations (Score:1)
Re: (Score:1)
We have "beehappy", "newbee", "beehive" and even "sting". The tradition of using bad puns to name computers lives on!
Real Information (Score:5, Informative)
ParLab (what's being funded): http://parlab.eecs.berkeley.edu/ [berkeley.edu]
RAMP (the people who are building the architectural simulators for ParLab): http://ramp.eecs.berkeley.edu/ [berkeley.edu]
BEE2 (the precursor to the not-quite-so-microsoft BEE3): http://bee2.eecs.berkeley.edu/ [berkeley.edu]
The funding being announced here is for ParLab, whose mission is to "solve the parallel programming problem". Basically they want to design new architectures, operating systems and languages. And before you get all "we tried that and it didn't work", there are some genuinely new ideas here and the wherewithal to make them work. ParLab grew out of the Berkeley View report (http://view.eecs.berkeley.edu/ [berkeley.edu]), which was the work of a very large group of people to standardize on the same language and figure out what the problems in parallel computing were. This included everyone from architecture to applications (e.g. the music department).
RAMP is a multi-university group working to build architectural simulators in FPGAs. In fact you can go download one such system right now, called RAMP Blue (http://ramp.eecs.berkeley.edu/index.php?downloads [berkeley.edu]). With ParLab starting up there will be another project, RAMP Gold, which will build a similar simulator specifically designed for the architectures ParLab will be experimenting with.
As a side note, keep in mind when you read articles like this that statements like the "Microsoft BEE3" are amusing when you take into account that "B.E.E." stands for Berkeley Emulation Engine. Microsoft did a lot of the work and did a good job of it, but still...
Re: (Score:1)
Beowulf Cluster... (Score:1)
Re: (Score:1)
kinda silly (Score:1)
Re: (Score:3, Insightful)
Faith, young grasshopper...
If you want a more technical reas
Cheap Bastards. (Score:4, Interesting)
Rick Merritt, who wrote the lead article, also posted an opinion piece in EE Times [eetimes.com] lambasting Wintel for their lackluster funding efforts in parallel programming. I thoroughly agree with this guy. To quote:
Use your GPU (Score:5, Interesting)
Re: (Score:1)
But try putting that GeForce in your cell phone. And don't come crying to us when your ass catches on fire from the hot cell phone in your back pocket. Or for that matter when your pants fall down from carrying the battery around.
ParLab (http://parlab.eecs.berkeley.edu/ [berkeley.edu]) is interested in MOBILE computing as well as your desktop.
Re: (Score:2)
Super...I'm hoping that GPUs can provide a cheap way for newcomers to learn parallel programming, and it appears that the GPU makers are really waking up to general purpose uses of GPUs.
(I learned parallel programming on a Connection Machine
PLINQ (Score:2)
Re: (Score:1)
Re: (Score:2, Funny)
First you take the
I wish I had more than one boat. Taking all across concurrently would be easier.
Reconfigurable Computing / FPGA Acceleration (Score:3, Informative)
There's a growing community of FPGA programmers making accelerators for supercomputing applications. DRC (www.drccomputing.com) and XtremeData (www.xtremedatainc.com) both make co-processors for Opteron sockets with HyperTransport connections, and Cray uses these FPGA accelerators in their latest machines. There is even an active open standards body (www.openfpga.org).
FPGAs and multicore BOTH suffer from the lack of a good programming model. Any good programming model for multicore chips will also be a good programming model for FPGA devices. The underlying similarity here is the need to place dataflow graphs into a lattice of cells (be they fine-grained cells like FPGA CLBs or coarse-grained cells like the cores of a multicore chip). I can make a convincing argument that spreadsheets will be both the programming model and the killer app for future parallel computers: think scaling with cells.
I've kept a blog on this stuff if you still care: fpgacomputing.blogspot.com
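To make the spreadsheet-as-dataflow point concrete, here's a toy sketch of my own (not from the blog): treat each cell as a future that fires once its inputs are ready, so independent cells compute concurrently on different cores.

import java.util.concurrent.CompletableFuture;

public class CellSheet {
    public static void main(String[] args) {
        // A1 and A2 have no dependencies, so they run in parallel.
        CompletableFuture<Double> a1 = CompletableFuture.supplyAsync(() -> expensive(1));
        CompletableFuture<Double> a2 = CompletableFuture.supplyAsync(() -> expensive(2));
        // B1 = A1 + A2 runs only after both inputs are available.
        CompletableFuture<Double> b1 = a1.thenCombine(a2, Double::sum);
        System.out.println("B1 = " + b1.join());
    }

    static double expensive(double x) { return x * x; } // stand-in for real per-cell work
}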
Parallel Computing is not magic (Score:4, Insightful)
Re: (Score:2)
Re: (Score:1)
Where is Code Pink? (Score:1)
The 250 gigacore Intel processor (Score:1)