Auto-Parallelizing Compiler From Codeplay
Max Romantschuk writes "Parallelization of code can be a very tricky thing. We've all heard of the challenges with Cell, and with dual and quad core processors this is becoming an ever more important issue to deal with. The Inquirer writes about a new auto-parallelizing compiler called Sieve from Codeplay: 'What Sieve is is a C++ compiler that will take a section of code and parallelize it for you with a minimum hassle. All you really need to do is take the code you want to run across multiple CPUs and put beginning and end tags on the parts you want to run in parallel.' There is more info on Sieve available on Codeplay's site."
Reentrant? (Score:5, Interesting)
Re:Reentrant? (Score:5, Informative)
Consider a for loop: for (int i = 0; i < 100; i++) doSomething(i);
Can this be parallelized? Perhaps the author meant it like it's written there: First doSomething(0), then doSomething(1), then
Re:Reentrant? (Score:5, Interesting)
Practically all for loops written are independent of order, so they could be trivially implemented using MapReduce. That one change would parallelize a lot of code, with no tricky compiler optimizations.
Re: (Score:2)
Interesting! I'd never heard of "map and reduce" before, but it seems like just the sort of idiom that would be useful for the program I'm getting ready to write.
Would it make sense to have an interface for it like this in C:
Where the implementation could be either a for loop:
or an actual parallel implementation, such as one using pthreads or r
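A minimal sketch of the kind of interface being described (all names here are made up; the parallel version is only valid when the individual calls are independent of each other):

    #include <pthread.h>
    #include <vector>

    /* Hypothetical interface: apply f to every index 0..n-1. */
    typedef void (*map_fn)(int i, void *ctx);

    /* Serial implementation: just a for loop. */
    void map_serial(int n, map_fn f, void *ctx)
    {
        for (int i = 0; i < n; i++)
            f(i, ctx);
    }

    /* Parallel implementation: split the index range across pthreads.
       Only safe if f(i, ctx) does not depend on f(j, ctx) for i != j. */
    struct Chunk { int begin, end; map_fn f; void *ctx; };

    static void *run_chunk(void *arg)
    {
        Chunk *c = static_cast<Chunk *>(arg);
        for (int i = c->begin; i < c->end; i++)
            c->f(i, c->ctx);
        return 0;
    }

    void map_parallel(int n, map_fn f, void *ctx, int nthreads)
    {
        std::vector<pthread_t> tids(nthreads);
        std::vector<Chunk> chunks(nthreads);
        for (int t = 0; t < nthreads; t++) {
            chunks[t].begin = n * t / nthreads;
            chunks[t].end   = n * (t + 1) / nthreads;
            chunks[t].f     = f;
            chunks[t].ctx   = ctx;
            pthread_create(&tids[t], 0, run_chunk, &chunks[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tids[t], 0);
    }

Either implementation can sit behind the same signature, which is the point: the caller just promises the iterations are independent.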
Re: (Score:3, Informative)
Re: (Score:2)
Well, I meant in the more general sense (i.e., your mention of "MapReduce" eventually led me to find this [wikipedia.org]).
Re: (Score:2)
I would be very surprised if nobody's tried to work around this problem with AOP and annotations. It would be trivial to code parallelization with a method interceptor.
On a multiprocessor machine the JVM will assign threads to individual CPUs. Your interceptor would populate an array of N threads, assign each thread a number, wrap the annotated method in a callback with a finally {lock.notify()} at the end, sy
Re: (Score:2)
Fortress does all loops in parallel by default. You have to explicitly tell it when you want a serial loop.
Re: (Score:2)
Fortress does, but Fortran-90 didn't. It was strictly a SISD language and had to be extended in HPF with javadoc-style hacks as I described above so that the compiler knew how to link against the MPI libraries and produce an executable that could take advantage of data parallelism on SIMD/MIMD architectures. The whole thing was one big hack. I think starting with Fortran 95 they began to incorporate fe
Re: (Score:2)
That is a very large, very unsupported assumption to be making. It certainly doesn't hold true for the code I've seen in the wild and have written myself.
Re: (Score:2)
Re: (Score:2)
Python cares so little about parallelism that it still uses a single monolithic lock around the interpreter, so you can't even make reasonable use of threads except for I/O waits. But no imperative language can just throw in something as drastic as auto-parallelism without rewriting basic
Re: (Score:2)
For what it's worth (Score:2)
Re: (Score:2)
"for each FOO in BAR do STUFF;" [Re:Reentrant?] (Score:2)
Re: (Score:2)
Damn HTML stuffs...
Re:Reentrant? (Score:5, Informative)
More complex version: There are four ways to run a program. These are "Single Instruction, Single Data" (ie: a single-threaded program), "Single Instruction, Multi Data" (SETI@Home would be an example of this), "Multi Instruction, Single Data" (a good way to program genetic algorithms) and "Multi Instruction, Multi Data" (traditional, hard-core parallelism).
SIMD would need to be re-entrant to be parallel, otherwise you can't be running the same instructions. (Duh. :) SIMD is fashionable, but is limited to those cases where you are operating on the data in parallel. If you want to experiment with dynamic methods (heuristics, genetic algorithms, self-learning networks) or want to apply multiple algorithms to the same data (eg: data-mining, using a range of specialist algorithms), then you're going to be running a vast number of completely different routines that may have no components in common. If so, you wouldn't care if they were re-entrant or not.
In practice, you're likely to use a blend of SIMD, MISD and MIMD in any "real-world" program. People who write "pure" code of one type or another usually end up with something that is ugly, hard to maintain and feels wrong for the problem. On the other hand, it usually requires the fewest messaging and other communication libraries, as you're only doing one type of communication. You can also optimize the hell out of the network, which is very likely to saturate with many problems.
FPP (Score:5, Funny)
is arle ot
Sad... (Score:3, Funny)
Re: (Score:2, Informative)
Re: (Score:1)
Re: (Score:1)
(I'm explaining someone else's programming joke on Slashdot... I've reached a new low).
This is Awesome (Score:5, Funny)
Nevermind.
Oh look. A duck.
Re: (Score:1)
For
Auto-Compiling Code from ParallelPlay
or
Auto-Play Parallelizer from Complay
or something...
openMP (Score:2, Informative)
Re: (Score:1)
Re: (Score:2)
And the difference would be?
Re: (Score:2, Informative)
And what is the difference between this and OpenMP?
On page 7 of The Codeplay Sieve C++ Parallel Programming System, 2006 [codeplay.com] you'll find a section that describes the "advantages" of Codeplay over OpenMP, but nothing terribly exciting. Codeplay does indeed let you automate parallelization to a greater degree, but at the same time it is also limited to a narrower set of optimizations compared to OpenMP.
Re: (Score:2)
This sounds like it can do both, as well as determine what parts of the code can be parallelized.
~X~
Interesting, but.. (Score:5, Insightful)
That's great - but do the algorithms involved here naturally lend themselves to the parallelization techniques the compiler uses? Are there algorithms that are very poor choices for parallelization? For example, can you effectively parallelize a sort? Wouldn't each thread have to avoid exchanging data elements any other thread was working on, and therefore cause massive synchronization issues? A solution might be to divide the data set by the number of threads and then after each set was sorted merge them in order - but that requires more code tweaking than the summary implies. So I wonder how different this is from Open/MT?
Re: (Score:2)
Re:Interesting, but.. (Score:5, Interesting)
Yes you can; take a look at merge sort [wikipedia.org] (or quicksort, same idea). You split the large data set into smaller ones, sort those, and recombine. That's perfect for parallelization -- you just need a mechanism for passing out the original elements and then recombining them.
So if you had to sort 1B elements, maybe you get 100 computers and give them each 1/100th of the data set. That's manageable for one computer to sort easily. Then just develop a service that hands you the next element from each machine, and you pull off the lowest one.
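A rough sketch of that split-sort-recombine idea on a single multi-core box, assuming OpenMP and the standard library (the chunk count is arbitrary; on a cluster the chunks would live on different machines instead):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sort each chunk independently (these can run in parallel), then merge the
    // sorted chunks back into one fully sorted range.
    void chunked_sort(std::vector<int>& v, int chunks)
    {
        const std::size_t n = v.size();

        #pragma omp parallel for
        for (int c = 0; c < chunks; ++c) {
            std::size_t lo = n * c / chunks;
            std::size_t hi = n * (c + 1) / chunks;
            std::sort(v.begin() + lo, v.begin() + hi);
        }

        // Serial recombination: merge the already-sorted prefix with the next chunk.
        for (int c = 1; c < chunks; ++c) {
            std::size_t mid = n * c / chunks;
            std::size_t hi  = n * (c + 1) / chunks;
            std::inplace_merge(v.begin(), v.begin() + mid, v.begin() + hi);
        }
    }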
Re:Interesting, but.. (Score:4, Interesting)
I would also ask this of CodePlay: If your compiler is automatic, why do we need to add beginning and end tags?
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Interesting)
Consider two NxN matrices, A and B, multiplied together to make a matrix C. Each element Cij is the dot product of row i of A and column j of B (the sum over k of Aik*Bkj). This is an almost trivial parallelization problem, commonly one of the first coding exercises in a parallel processing class.
IMHO, this is interesting but has a long way to go before it's useful for anything b
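For reference, a sketch of that exercise parallelized over rows of C with OpenMP; each outer iteration writes a distinct row of C, so there are no races:

    #include <vector>

    // C = A * B for N x N matrices stored row-major in flat vectors.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int N)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {          // each thread handles its own rows of C
            for (int j = 0; j < N; ++j) {
                double sum = 0.0;
                for (int k = 0; k < N; ++k)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
        }
    }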
snake oil (Score:5, Insightful)
Re: (Score:2)
The compiler isn't going to know if you're doing something stupid or not.
In other words: use at your own risk.
The old adage of "garbage in, garbage out" still applies.
Re: (Score:3, Insightful)
But how are you supposed to know exactly how something is going to run under this? Even with a good understanding of what you're trying to do and (hopefully) what exactly the compiler is
Re: (Score:2)
The semantics of that construct is well-defined.
Of course from the short description it's not entirely clear to me if the compiler actually implements that semantics, or simply relies on you to honor it (e.g. is it possible to call a non-sieve function from within a sieve function or block? In that case, the compiler cannot reasonably implement the semantics). There's a precedent for the second type of semantics: restrict.
But
Re: (Score:2, Informative)
A sieve is defined as a block of code contained within a sieve {} marker and any functions that are marked with sieve. Inside a sieve, all side effects are delayed until the end of the sieve. Side effects are defined as modifications of data that are declared outside the sieve.
The compiler can use this information to decide what parts of the code can safely be parallelized. Adding the "sieve" keyword can change the semantics of the code, a
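As a rough, purely illustrative sketch (this is pseudo-code, not verbatim Codeplay syntax; N and expensiveFunction are made up), the delayed-side-effect idea looks something like:

    int results[N];            // declared outside the sieve

    sieve {                    // hypothetical marker around the block to parallelize
        for (int i = 0; i < N; i++) {
            // Writes to 'results' touch data declared outside the sieve, so under
            // the semantics described above they are queued and only applied,
            // in order, when the sieve block ends.
            results[i] = expensiveFunction(i);
        }
    }
    // At this point the delayed writes have been committed.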
Re: (Score:2)
TFA claims it's auto-parallelizing and says:
If "sieves" have all the restrictions you mention, then both the claim that it's "auto-parallelizing" and the description in the
It's time for C and C++ to include || proc (Score:2)
It'll even compile on old compilers with a "#define par for" and "#define sync".
It's long past time for this.
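A sketch of how that fallback would work; "par" and "sync" are the hypothetical keywords being proposed here, and HAVE_PAR_KEYWORD is a made-up guard:

    // Fallback for compilers without the proposed parallel keywords:
    // 'par' degrades to an ordinary for loop and 'sync' to a no-op.
    #ifndef HAVE_PAR_KEYWORD
    #define par for
    #define sync
    #endif

    void scale(double* a, const double* b, int n)
    {
        par (int i = 0; i < n; i++)   // iterations may run in parallel on a par-aware compiler
            a[i] = 2.0 * b[i];
        sync;                         // wait for all iterations (a no-op in the serial fallback)
    }

On an old compiler the macros turn this back into plain serial C, which is exactly the point of the proposal.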
Prefer OpenMP (Score:5, Informative)
1. OpenMP is supported by Sun, Intel, IBM, $MS(?), etc., and implemented in gcc 4.2.
2. OpenMP has been used successfully for about 10 years now, and is on a 2.5 release of the spec.
3. It is open: the white paper for Codeplay mentions it being protected by patents. (boo hiss)
4. Did I mention that it is supported in gcc 4.2, which I built on my Powerbook last week, and it is very cool?
So maybe Codeplay is a nice system. Maybe they even have users and can offer support. But if you are looking to make your C++ code run multi-threaded with the least amount of effort I've seen (it is still effort!), take a look at OpenMP. In my simple tests it was pretty easy to make use of OpenMP, and I am looking forward to trying it on a rather more complicated application.
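For the simple cases, a single pragma really is all it takes. A small sketch (the loop and numbers are made up; compile with -fopenmp on gcc 4.2, or the equivalent switch on the other compilers):

    #include <cstdio>

    int main()
    {
        const int n = 1000000;
        double sum = 0.0;

        // Each thread accumulates a private partial sum; OpenMP combines them at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (i + 1);

        std::printf("harmonic(%d) = %f\n", n, sum);
        return 0;
    }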
Re:Prefer OpenMP (Score:5, Informative)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re:Prefer OpenMP (Score:5, Interesting)
On the other hand, OpenMP is a far more solid, robust, established, reputable, reliable solution than Codeplay. The patent in Codeplay is also bothersome - there aren't many ways to produce an auto-parallelizing compiler and they've mostly been done. This means the patent either violates prior art (most likely), or is such "black magic" that no other compiler-writing company could expect to reproduce the results and would be buying the technology anyway. It also means they can't ship to Europe, because Europe doesn't allow software patents and has a reputation of reverse-engineering such code (think "ARC-4") or just pirating/using it anyway (think: pretty good privacy version 2, international version, which had patented code in it)
Word of advice. (Score:2)
mod parent up (Score:2)
OpenMP can support clusters (Score:5, Informative)
Intel's compiler (icc), available for Linux [intel.com], Windows [intel.com], and FreeBSD [freshports.org], extends OpenMP to clusters [intel.com].
You can build your OpenMP code and it will run on clusters automatically. Intel's additional pragmas allow you to control which things you want parallelized over multiple machines vs. multiple CPUs (the former being fairly expensive to set up and keep in sync).
I've also seen messages on gcc's mailing list that talk about extending gcc's OpenMP implementation (moved from GOMP [gnu.org] to mainstream in gcc-4.2 [gnu.org]) to clusters the same way.
Nothing in OpenMP [openmp.org] prevents a particular implementation from offering multi-machine parallelization. Intel's is just the first compiler to get there...
The beauty of it all is that OpenMP is just compiler pragmas [wikipedia.org] — you can always build the same code with them off (or with a non-supporting compiler), and it will still run serially.
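That fallback is easy to verify: the _OPENMP macro is only defined when the compiler is actually honouring the pragmas, so a sketch like this builds and runs either way:

    #include <cstdio>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main()
    {
    #ifdef _OPENMP
        std::printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
    #else
        std::printf("Built without OpenMP: pragmas are ignored, code runs serially\n");
    #endif
        return 0;
    }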
Re: (Score:2)
Won't that require some runtime support, like mpirun in MPI (that takes care of rsh/ssh-ing to each node and starting the processes)?
Re: (Score:3, Insightful)
Well, yes, of course. You also need the actual hardware too :-)
This is beyond the scope of the discussion, really — all clusters require a fair amount of work to set up and maintain. But we are talking about coding for them here...
Re: (Score:2)
At work, I have (I work at a supercomputing center).
Re: (Score:2)
Re: (Score:2)
Regards,
Athanasios
Re: (Score:2)
Mostly I seem to have been lucky that 4.2 compiled as is on my Powerbook, because it did not do so on my Dual G5 which is of course where I would like to use OpenMP. I'll have to figure out what is on my PB that is not on my G5. And I have an 8 core linux box at work that
Re: (Score:2)
best regards
Athanasios
Yup (Score:4, Interesting)
For the majority of apps, OpenMP [wikipedia.org] is enough. That is what this looks like - a proprietary OpenMP. It might make it easier than creating and managing your own threads but calling it "auto" parallelizing when you need to mark what to execute in parallel is a bit of a stretch.
For apps that need more, it is probably a big enough requirement that someone knowledgeable is already on the coding team. Which isn't to say that a compiler/lang/lib lowering the "experience required" bar wouldn't be welcomed, just that I wish these people would work on solving some new problems instead of re-tackling old ones.
The main purpose of these extensions seems to be finding a way to restrict the noob developer enough that they won't be able to abuse threading like some apps love to do. That is a very good thing in my book! (Think Freenet [freenetproject.org], where 200-600 threads is normal.)
How long has Sun Studio had "-xautopar"? (Score:2, Informative)
And it works, too.
Ok (Score:1)
for (i = 0; i < 100; i++) doSomething (i);
doSomething() cannot know or infer anything about iteration i-1. Is that right? So doSomething() really has to regard i as (almost) random. So the loop becomes:
for (i = 0; i < 100; i++) doSomething (uniqueRand (i/RANDMAX*100));
No wonder it's so complicated and hard to debug
Been done... (Score:4, Interesting)
Re:Been done... (Score:5, Interesting)
Oh, are we having a contest for who can name the earliest auto-parallelizing C compiler? If so, I nominate the vc compiler on the Convex [wikipedia.org] computers. The Convex C-1 was released in 1985 and I believe had a vectorizing compiler from the start, which would make sense since it had a single, big-ass vector processor (one instruction, crap loads of operands -- can't remember how many, but it was something like 64 separate values being added to another 64 separate values in one single instruction).
I personally remember watching somebody compile something with it. It was really neat to watch -- required no special pragmas or anything, just plain old regular C code, and it would produce an annotated copy of your file telling you which lines were fully vectorized, partly vectorized, etc. You could, of course, tweak the code to make it easier for the compiler to vectorize it, but even when you did, it was still plain old C code.
Re: (Score:2)
Re: (Score:2)
I have my 'Mips Pro Auto Parallellizing Option 7.2.1' cd sitting right next to my Irix 6.5 machine... and I know it's YEARS old
I was thinking the same thing. I remember playing with the "-xautopar" option in Sun's compiler way back in 1996. I just checked and it's still there. The problem was always making sure loops didn't have dependencies on the loop counter, and only had one exit.
Sun gives away the whole Sun Studio compiler suite nowadays, complete with OpenMP and all the profiling tools. They have a
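To illustrate the kind of thing those restrictions mean in practice, compare these made-up loops: the first has independent iterations, the second carries a dependency from one iteration to the next, and the third has a second exit.

    #include <cmath>

    void examples(double* a, const double* b, double* c, int n)
    {
        // Independent iterations: an auto-parallelizer can split this across CPUs.
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];

        // Loop-carried dependency: iteration i reads the result of iteration i-1,
        // so the loop cannot be parallelized as written.
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];

        // Early exit: a second way out of the loop usually defeats auto-parallelization too.
        for (int i = 0; i < n; i++) {
            if (a[i] < 0.0)
                break;
            c[i] = std::sqrt(a[i]);
        }
    }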
Not a big deal (Score:1)
Could this possibly be for real? (Score:2)
Re: (Score:2)
Nothing new here, move along.
Jokes aside, this is bull. It requires the coder to mark sections that he wants to run in parallel, making the "automatic" part a bit of a stretch. And then, there already is a system that does this, and has done it for 10 years. It's called OpenMP [wikipedia.org], and features wide industry support in compilers such as the upcoming gcc 4.2, MS Visual Studio .NET 2005, the Intel compiler suite and compilers from traditional SMP vendors such as IBM and SGI.
SmartVariables is a good alternative to MPI / PVM (Score:2, Insightful)
Here is a quote from the SmartVariables white-paper:
"The GPL open-source SmartVariables technology works well as a replacement for both MPI and PVM based systems, simplifying such applications. Systems built with SmartVariables don't need to worry about explicit message passing. New tasks can be invoked by using Web-Service modules. Programs always work directly with named-data, in parallel. Tasks are easily sub-divided and farmed out to additional web-ser
Re:SmartVariables is a good alternative to MPI / P (Score:2)
It may or may not be easier to program, but will it perform well enough? Does it support high-speed low-latency interconnects like Myrinet or Infiniband? Will it perform well enough to make up for the high price of such interconnects? Gigabit Ethernet performance is not enough on such systems, as latency is a major factor, and the latency of Ethernet is typically high compared to HPC interconnects.
Re:SmartVariables is a good alternative to MPI / P (Score:2)
Shapiro's 'Concurrent Prolog', published in 1986, uses the concept of named streams to communicate between concurrent processes, which to a large extent are treated as variables in the language. I could find an earlier reference, but it would be more work than turning 60 degrees.
try harder
Sounds like multi-threading AND NOT Parallelizing (Score:4, Insightful)
When I think of parallelizing software, after getting over my humorous mental image of a virus that paralyzes users, what comes to mind is clustering. When I think of clustering, the train of thought leads me to Beowulf and MPI or its predecessor PVM. However, I can find no information that supports the concept of clustering here in any manner.
Again, I did see a reference to "Multiple computers across a network (e.g. a 'grid')", but according to Wikipedia grid computing is defined as: "A grid uses the resources of many separate computers connected by a network (usually the Internet) to solve large-scale computation problems. Most use idle time on many thousands of computers throughout the world."
Well, that sounds like the distributed SETI project and the like, which would seem even more ambitious than a compiler that would help write MPI code for Beowulf clusters.
From all the examples, this looks like a good compiler for writing code that will run more efficiently on multi-core and multi-processor systems, but it would not help you in writing parallel code for clustering.
This brings up a concept that many people forget; even people I would consider rather intelligent on the subject of clustering often forget it. If you have an 8-computer cluster with a dual-core Intel CPU in each node, and you write parallel code for it using MPI, you are benefiting from 8 cores in parallel. Many people who write parallel code forget about multi-threading. To benefit from all 16 cores in the cluster I just described, the code would have to be written both multi-threaded and parallel.

One of the main professors involved in a clustering project at my university told me that in their test environment they were using 8 Dell systems with dual-core Intel CPUs, so in total they had the power of 16 cores. Since he has his Ph.D. and all, I didn't feel the need to correct him and explain that unless his code was both parallel and multi-threaded he was only getting the benefit of 8 cores. I knew he was not multi-threading because they were not even writing the code in MPI; rather, they were using Python and batching processes to the cluster. From my knowledge, Python cannot write multi-threaded applications. Even if it can, I know they were not (from looking at their code).
Sometimes it's the simplest things that confuse the brightest of us....
Nick Powers
are you sure your prof is wrong? (Score:2)
Assuming their cluster management system knows that each node is dual core, can you explain why they couldn't run two processes on each node?
Re:single process uses 1 core unless multi-threade (Score:3, Insightful)
In parallel processing, generally each machine runs one part of a program (i.e., a single process), and unless that program is multi-threaded as well as parallel, it can only use one core per node of a cluster.
That said, someone who writes multi-threaded parallel applications should be held in high esteem! I don't know any such coders.
Have yo
Re: (Score:2)
I think the parent's point was that the cluster management system could simply batch out two individual (single threaded) processes to each node (for a total of 16 processes over 8 nodes), rather than just a single process per node. This may or may not work, depending on how the processes communicate over the network, but it is certainly a substantially easier task than writing a multi-threaded distributed program in most any case.
batching code to a cluster is just silly (Score:2)
1) Write the code using MPI or its predecessor PVM.
2) Have non-parallel code: separate programs that each handle a part of the data, batched out to nodes on the cluster.
Method 2 is either done as an initial step, to help determine how to split up your processing so you can use that information to write an MPI or PVM version, or by people that don't know how to write pa
Re: (Score:3, Insightful)
In defense of the batch system, assuming we define parallel as "at the same time" and not "inter-node communication", one should be fine with the multiple-processes approach. There are a number of commercial and open source schedulers which will do the work for you, including node weighting for systems with different numbers of processors an
Efficiency vs Ease of programming (Score:2)
Re: (Score:2)
Yes, that does sound like a valid use of batching (Score:2)
Nick Powers
Re: (Score:2)
This may or may not work, depending on how the processes communicate over the network, but it is certainly a substantially easier task than writing a multi-threaded distributed program in most any case.
Actually, if the message passing implementation is done right, communication between processes on the same node is done through shared memory, not network communication. Mixing threading and message passing in the application code just means unnecessary complexity in this case.
Re:Sounds like multi-threading AND NOT Parallelizi (Score:2)
Sounds like multi-threading AND NOT Parallelizing
Both multi-threading and message passing systems are parallel systems; they are just different subsets of the parallel computing paradigm. You cannot really claim with any authority that multithreading isn't parallel computing and that only message passing is.
Multithreading is used on shared memory multiprocessors, and message passing is used on distributed memory multiprocessors. They are just two different ways of implementing parallel code, and neither is more parallel than the other.
Well, that sounds like the distributed SETI project and the like, which would seem even more ambitious than a compiler that would help write MPI code for Beowulf clusters.
Actua
Re:Sounds like multi-threading AND NOT Parallelizi (Score:2)
Anything where more than a single thread or process is executing simultaneously is parallel. Anything running on multiple computers at the same time can also be called parallel, but is more precisely (and commonly) referred to as clustered or distributed computing. These are the commonly agreed upon meanings.
Or yo
neither useless nor compelling (Score:1)
WCF - Dotnet 3.0 (Score:2)
Re: (Score:2)
This is a feature of WCF - Windows Communication Foundation in .NET 3.0 (part of Windows Vista). WCF is designed for next-gen CPUs with large numbers of cores. It spawns worker threads for you as needed and synchronises these calls for you automatically. You have the option of manually creating and synchronising threads, but out of the box it does it all for you behind the scenes.
So WCF takes care of parallelizing your compute-intensive tasks for you? Sorry, but I don't believe you. It might spawn threads for communication-related tasks, but those aren't really compute-intensive anyway.
Re: (Score:2)
Wow - Deterministic Concurrency (Score:1, Interesting)
This Sieve programming seems also to make it easier to target the PS3, which has gotten a bad rap as being notoriously diffi
RapidMind (Score:1)
Is this better than OpenMP? (Score:2, Informative)
Nothing new (Score:2, Informative)
Minimum hassle? (Score:2)
What does the compiler do, taunt you with harsh language while it compiles your code?
How about an auto-commenting compiler (Score:2)
Seriously, Sieve sounds so trivial and meaningless, it's the ultimate Silicon Valley startup. How about something more valuable, like an auto-vectorizing compiler that really works?
WTF? (Score:2)
Do you really know the difference between a compiler and a development environment?
What "comments" is a compiler going to derive from your code? Something like "EAX will contain the result of the last add." is NOT going to be a useful comment in most situations. What do you expect the *compiler* to be able to tell you about *your* code?
And I believe it is widely held tha
Modern CPU and existing compilers (Score:2)
Re: (Score:2)
Re: (Score:2, Interesting)
Our SGI compilers at work come with an -apo (automatic parallelization optimization) command line option. That one option cost us a pretty penny. It's nice to see other people getting in on the action.
Snippet from the manpage, highlighting is mine:
Re: (Score:1)