Project Aims For 5x Increase In Python Performance
cocoanaut writes "A new project launched by Google's Python engineers could make the popular programming language five times faster. The project, which is called Unladen Swallow, seeks to replace the Python interpreter's virtual machine with a new just-in-time (JIT) compilation engine that is built on LLVM. The first milestone release, which was announced at PyCon, already offers a 15-25% performance increase over the standard CPython implementation. The source code is available from the Google Code web site."
Kill the GIL! (Score:5, Informative)
The summary misses one of the best bits -- the project will try to get rid of the Global Interpreter Lock that interferes so much with multithreading.
Also, it's based on v2.6, which they hope will make the move to 3.x easy.
Re:This is a very interesting project (Score:3, Informative)
- No Python 3.0 support
They are using v2.6, which has been designated the official migration step toward 3.0, so it should be easy to port over to 3.0. In any case, right now very few projects are using 3.0.
Re:This is a very interesting project (Score:3, Informative)
- No Windows support (apparently a Linux-only VM in the plans)
The article says it's going to be based on LLVM, which is most definitely cross-platform (and is being touted as the logical successor to GCC). Unless they go out of their way to use Linux-only calls while implementing their Python VM on top of LLVM, it should be trivially easy to get it running on Windows.
Re:This is a very interesting project (Score:5, Informative)
Psyco is x86-only and uses a lot of memory. It also requires additional coding: you have to actively use it, so you don't automatically get the speedup that a faster interpreter gives you. You also have to pick and choose what gets compiled with Psyco; the extra overhead isn't always worth it.
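For reference, the "pick and choose" style looks roughly like this (a sketch; `hot_loop` is a made-up stand-in for whatever function dominates your profile, and the try/except keeps it running where Psyco isn't installed):

```python
def hot_loop(n):
    # The inner loop that profiling showed to be the bottleneck.
    total = 0
    for i in range(n):
        total += i * i
    return total

try:
    import psyco
    psyco.bind(hot_loop)   # compile just this function, not the whole program
except ImportError:
    pass                   # Psyco is x86-only; fall back to plain CPython

result = hot_loop(1000)
```

The point is that nothing outside `hot_loop` pays Psyco's memory cost, but nothing else gets sped up either.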
To be fair, I don't know what the memory requirements of this new project are.
Re:How fast is five times faster really? (Score:4, Informative)
I know you're trying to be funny but... If you're talking plain Java vs. Python [debian.org], Java looks to be quite a bit faster. You don't have to look hard to find benchmarks that show Java is faster [timestretch.com].
Jython [jython.org] seems to be about 2-3 times faster than CPython [warwick.ac.uk] according to those tests.
This could give CPython the performance edge over Jython, but it still has a way to go to catch up to Java.
Re:This is a very interesting project (Score:4, Informative)
It might be easy to port over to 3.0, but not because it is using 2.6. Basically, they are planning on ripping out a big chunk of the internals of 2.6 and replacing it with a LLVM based system. To the extent that those internals changed for 3.0 (there wasn't necessarily effort put into making them compatible across 2.6 and 3.0...), the code would need to be updated for 3.0. The python level portability between 2.6 and 3.0 isn't a huge factor for something like this.
They are targeting 2.6 because that is what made sense for Google (who is paying for the work). Or so they say:
http://code.google.com/p/unladen-swallow/wiki/FAQ [google.com]
IronPython speed (Score:3, Informative)
Re:No windows (Score:4, Informative)
Quite to the contrary, the FreeBSD guys have been building with clang [llvm.org]+llvm [llvm.org] for a while now, and they seem to like it [freebsd.org]. The kernel boots, init inits, filesystems mount, the shell runs.
What other platforms, Darwin? Apple employs the largest number of LLVM developers. Windows? Both MinGW and Visual Studio based builds are tested for each release.
It's still not as portable as the python interpreter, but that will come if and when developers who are interested in working on it start to contribute.
Re:It's probably pining for the fiords. (Score:5, Informative)
Not really. Parrot is a much higher-level VM, providing things like closures, multiple dispatch, garbage collection, infrastructure to support multiple object models, and so forth, whereas LLVM really models a basic RISC instruction set with an infinite number of write-once (SSA) registers.
In fact, it would make a fair bit of sense to actually use LLVM as the JIT-compiling backend for Parrot...
Re:Too many levels of translation? (Score:5, Informative)
Wouldn't a more direct compile yield a better result?
No, it wouldn't.
The entire point of LLVM is that it provides an easy-to-target machine (it's basically a RISC instruction set) that you can use as your intermediate representation (the p-code you described). You then use the LLVM backends to compile the IR down to machine code. And because of the way the IR is structured (for example, it has write-once SSA registers, which makes certain classes of optimizations much easier), you can do a really good job of optimizing.
Basically, you "direct compile" to the LLVM IR, and then let LLVM take care of the details of generating the machine code. This gives you better abstraction (no more machine-specific code generation in Python itself), portability (to whatever LLVM targets), and you get all the sophisticated optimization that LLVM provides for free. That's a huge potential win.
Re:Kill the GIL! (Score:5, Informative)
Only because Python uses a refcounting garbage collector. When you get many threads, you need to lock all your data structures, because otherwise you might collect objects while they are still reachable. This project plans to change the garbage collection strategy first. Once that's done, killing the GIL is easy.
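The refcounting in question is visible from Python itself (a minimal illustration; note `sys.getrefcount` counts one extra temporary reference for its own argument):

```python
import sys

data = []
baseline = sys.getrefcount(data)   # includes the temporary ref from the call

alias = data                       # a second name for the same object
assert sys.getrefcount(data) == baseline + 1

del alias                          # dropping the name decrements the count
assert sys.getrefcount(data) == baseline
```

Every one of those increments and decrements would be a data race if many threads touched the object without the GIL, which is why the collection strategy has to change first.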
Re:Kill the GIL! (Score:4, Informative)
That's funny, because os.fork() etc. work fine on my version of python.
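Indeed, process-level parallelism sidesteps the GIL entirely. A minimal sketch (POSIX-only; `parallel_sum` is a made-up helper in which each forked child reports its partial result back through a pipe):

```python
import os

def parallel_sum(chunks):
    """Fan out one child process per chunk; collect results via pipes."""
    pipes = []
    for chunk in chunks:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:               # child: compute, report, exit immediately
            os.close(r)
            os.write(w, str(sum(chunk)).encode())
            os._exit(0)
        os.close(w)                # parent keeps only the read end
        pipes.append((pid, r))
    total = 0
    for pid, r in pipes:
        total += int(os.read(r, 64))
        os.close(r)
        os.waitpid(pid, 0)         # reap the child
    return total
```

Each child runs in its own interpreter with its own GIL, so the work really does proceed in parallel on multiple cores.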
Effort in wrong place (Score:5, Informative)
This is disappointing. Shed Skin [google.com] has shown speed improvements of 2 to 220x over CPython. Going for 5x over CPython is lame. But Shed Skin is a tiny effort, and needs help.
PyPy got a lot of press, but they tried to do an optimizing compiler with "agile programming" and "sprints", and, at six years on with substantial funding, it's still not done.
The fundamental problem with running Python fast is its gratuitous dynamism. In CPython, almost everything is late-bound, and most of the time goes into name lookups. This makes it easy to treat everything as dynamic. You can store into the local variables of a function from outside the function, for example. In order to make Python go fast, the compiler has to be able to detect the 99.99% of the time when that isn't happening and generate pre-bound code accordingly.
Dynamic typing requires similar handling. Most variables never change type. Recognizing int and float variables that will never contain anything else creates a significant speedup. In CPython, all numbers are "boxed", stored in an object structure. This is general but slow.
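The boxing is also observable: even a small int is a full heap object with a type pointer and a refcount, and the name itself carries no type (a hedged illustration; exact sizes vary by CPython build):

```python
import sys

x = 1
print(type(x), sys.getsizeof(x))   # a boxed PyObject, ~28 bytes on 64-bit

x = 1.5                             # the *name* has no type; it rebinds freely
print(type(x), sys.getsizeof(x))

# An unboxed C int is 4 bytes; every Python arithmetic operation instead
# dereferences an object, dispatches on its type, and allocates a new
# object for the result.
```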
CPython is nice and simple, but slow. Serious speedup requires global analysis of the program to detect the hard cases and generate fast code for the easy ones. Shed Skin actually does this, but has to place some limitations on the language to do it. If someone did everything right, Python could probably achieve the speed of C++.
There's also the problem that if you want to be compatible with existing C modules for CPython, you're stuck with CPython's overly general internal representation.
Re:What about Parrot? (Score:2, Informative)
LLVM is stable and in use. The iPhone SDK ARM compilers use GCC with an LLVM backend. OS X uses LLVM in the OpenGL stack to support features the GPU doesn't provide in hardware. They're also using LLVM for OpenCL/Grand Central.
LLVM isn't just another virtual machine, it also optimizes that code (at compile time, link time, and/or runtime) and converts it to native (alpha, arm, cell, ia64, mips, CIL, pic16, ppc, sparc, x86) binaries (or C source code).
Re:It all depends (Score:4, Informative)
I smell bullshit. There is no overhead from using STL containers.
If you used an std::vector, you couldn't have a bottleneck, for the simple reason that the std::vector is an array.
That was my impression, too, but careful timing and profiling suggested otherwise.
In addition, simple reasoning tells us there's gotta be some overhead in any vector implementation. First, vectors know their size, and in constant time; this means they essentially must include a size field and update it whenever the size changes. Also, I can hold a pointer to a vector, and that vector can grow arbitrarily without invalidating the pointer, which means there pretty much has to be an indirection to the vector's storage. It also means the vector's storage must more or less come from a heap, which definitely slows things down. ("More or less" because certain optimizations might be possible if you somehow knew an upper bound on the vector's size over its lifetime.)
All of this stuff costs you in time and space.
Suppose I have a function I'm going to call a million times and it needs a temporary array of ints, of a size I can bound (maybe even small enough to be cache-beneficial). I can allocate that array in the parent function and pass in a pointer each time. Overhead to create and destroy the array in the inner function each time: zero. If you do this with a vector, the implementation has to zero the length, which costs time. Or you can delete and recreate it by letting it go out of scope, but that also costs time.
Most of the time these minor effects don't matter, but if it's in the innermost loop and is going to run billions of times, it can be quite noticeable.
It could conceivably be that gcc's implementation of STL is a little slow. Doesn't matter why, though, because that's my target, and that's where my program has to run.
It's been a while since I went through this exercise, so I don't have the exact scenario. But the code is GPL'ed and available here [greylag.org]. If you can replace any of the arrays with an as-simple, as-fast use of vectors, I'd be happy to have it.
Re:How fast is five times faster really? (Score:4, Informative)
How is Java faster? If it's a trivial program, then it just doesn't matter. Actually, if it's a trivial program for your own use, a Pythoneer will write the script and run the interpreter (no compile!) before you can fire up Eclipse and type "private static void".
You know you can write trivial java programs without using an IDE such as Eclipse. I started out in the late 90's writing Servlets in vi and notepad. The time it takes to compile is meaningless. You only need to do it once. You don't have to recompile every time you run the application.
If we are talking about a non-trivial program, then algorithms, data structures, caching, micro-optimization (like rewriting bits in C) and profiling can improve things by many orders of magnitude. Too bad if the code has so many layers and adapters that any real change is prohibitively expensive.
Or they could use any of the many java libraries available so they don't have to write those parts of the code. Since they've been around for years, they've already been optimized.
The productivity gains of writing fewer lines of code seem stupid to me. Programmers aren't secretaries. I can type maybe 90wpm, but a few lines of code might take an hour to get right. It doesn't matter what the language is.
Re:Effort in wrong place (Score:3, Informative)
# Maintain source-level compatibility with CPython applications.
# Maintain source-level compatibility with CPython extension modules.
vs.
Shed Skin will only ever support a subset of all Python features.
Re:This is a very interesting project (Score:5, Informative)
I think it's Linux-only right now because the developers currently use Linux. But they consider loss of Windows support a "risk", not a design goal:
Re:Too many levels of translation? (Score:4, Informative)
The Python object files are just a more convenient way to store the program compared to text files. No information is lost and no glue is added in that first step.
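This can be checked directly: the code object that a .pyc stores round-trips through `marshal` with nothing lost (a small sketch):

```python
import marshal
import types

def add(a, b):
    return a + b

# .pyc files store code objects serialized with marshal; dumping and
# restoring one yields a function that behaves identically.
blob = marshal.dumps(add.__code__)
restored = types.FunctionType(marshal.loads(blob), globals())
assert restored(2, 3) == add(2, 3) == 5
```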
LLVM is, like its name suggests, really low level. You should think of it as a kind of portable assembly. It's much closer to actual hardware architectures than for example Java byte code. I don't expect much overhead from the LLVM to native step. A while ago I ran some tests with C++ compiled by GCC directly to native and compiled by GCC to LLVM byte code and then by LLVM to native; sometimes one approach was faster and sometimes the other, but they were pretty close.
So that leaves the glue added in the Python object to LLVM step. I expect this to have a significant overhead, but I don't see it becoming a smaller overhead by going directly to native. The advantage of using LLVM is that you only have to write this step once, instead of once for each architecture.
With LLVM it is possible to compile parts of the interpreter to LLVM byte code in advance and then inline that into the program being JIT-compiled. That way, you can be sure that the JIT and the interpreter actually do the same thing. Apple did this for their OpenGL driver, there is a nice presentation (PDF) [llvm.org] about it.
Re:Speed ups for EVE online, perhaps? (Score:5, Informative)