ARM Announces 64-Bit Cortex-A50 Architecture
MojoKid writes "ARM debuted its new 64-bit microarchitecture today and announced the upcoming launch of a new set of Cortex processors, due in 2014. The two new chip architectures, dubbed the Cortex-A53 and Cortex-A57, are the most advanced CPUs the British company has ever built, and are integral to AMD's plans to drive dense server applications beginning in 2014. The new ARMv8 architecture adds 64-bit memory addressing, increases the number of general purpose registers to 31, and increases the size of the vector registers for NEON/SIMD operations. The Cortex-A57 and Cortex-A53 are both aimed at the mobile market. Partners that've already signed on to build ARMv8-based hardware include Samsung, AMD, Broadcom, Calxeda, and STMicro."
The 64-bit ARM ISA is pretty interesting: it's more of a wholesale overhaul than a set of additions to the 32-bit ISA.
Relaunch (Score:2, Informative)
The 64-bit ARMv8 became available over 12 months ago and no one is making any yet.
Re:AMD? (Score:5, Informative)
I work at a tech company, and almost everyone I know there owns an APU-based machine, generally for HTPC use, or so they say. Yes, it is true that the fastest chips are made by Intel, but when you look at the cost of a typical (not high-end) machine, AMD is hard to beat, especially when the graphics in an APU will work fine for you.
Re:ARM builds chips? (Score:4, Informative)
Architectures.
It's the last word in the title.
Re:Relaunch (Score:5, Informative)
The 64-bit ARMv8 became available over 12 months ago and no one is making any yet.
That was the instruction set. These are the chip designs.
Re:Relaunch (Score:5, Informative)
The first drafts of the ARMv8 architecture became available to a few ARM partners about 4-5 years ago. They've since been working closely with these partners to produce their chips before releasing their own design. The aim was to have third-party silicon ready to ship before anyone started shipping ARM-designed parts to encourage more competition.
ARM intentionally delayed releasing their own designs to give the first-mover advantage to the partners that design their own cores. In the first half of next year, there should be three almost totally independent[1] implementations of the ARMv8 architecture, with the Cortex A50 appearing later in the year. This is part of ARM's plan to be more directly competitive with the likes of Intel. Intel is a couple of orders of magnitude bigger than ARM, and can afford to have half a dozen teams designing chips for different market segments, including some that never make it to production because that market segment didn't exist by the time the chip was ready. ARM basically has one design, plus a seriously cut-down variant. By encouraging other implementations, they get to have chips designed for everything from ultra-low-power embedded systems (e.g. the Cortex-M0, which ARM licenses for about one cent per chip), through smartphone and tablet processors, up to server chips. ARM will produce designs for some of these, and their design is quite modular, so it's relatively easy for SoC makers with the slightly more expensive licenses to tweak it a bit more to fit their use case, and companies like nVidia, TSMC and AMD will fill in the gaps.
The fact that ARM is now releasing their own designs for licensing means that their partners are very close to releasing shipping silicon. We've seen a few pre-production chips from a couple of vendors, but it's nice to see that they're about to hit the market.
[1] ARM engineers consulted on the designs, so there may be some common elements.
Re:ARM64 is a mess (Score:5, Informative)
All of the things that make Arm "ARM" are gone, such as conditional execution, having the program counter as general purpose register and more
The advantage of conditional instructions is that you can eliminate branches. The conditional instructions are always executed, but they're only retired if the predicates held. ARMv8 still has predicated select instructions, so you can implement exactly the same functionality, just do an unconditional instruction and then select the result based on the condition. The only case when this doesn't work is for loads and stores, and having predicated loads and stores massively complicates pipeline stage interactions anyway, so isn't such a clear win (you get better code density and fewer branches, but at the cost of a much more complex pipeline).
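To make the equivalence concrete, here is a toy Python model of the two approaches. It is only a sketch: the function names are made up for illustration, Python integers stand in for registers, and the mask trick merely imitates what a conditional select instruction such as CSEL does in hardware.

```python
def select(cond, x, y):
    """Branch-free select: returns x if cond is true, else y.

    mask is all ones (-1) when cond holds and 0 otherwise, so exactly
    one of the two operands survives the AND.
    """
    mask = -int(bool(cond))
    return (x & mask) | (y & ~mask)


def predicated_add(cond, old, a, b):
    """ARMv7 style: the add is always executed, but the result only
    replaces the old register value if the predicate held."""
    result = a + b                  # executed regardless of cond
    return result if cond else old  # write-back gated by the predicate


def add_then_select(cond, old, a, b):
    """ARMv8 style: unconditional add into a temporary, followed by a
    conditional select between the temporary and the old value."""
    tmp = a + b
    return select(cond, tmp, old)
```

Both functions return the same value for every input; the difference is only in where the condition is applied. The load and store case does not fit this pattern because the memory access itself has side effects (it may fault, or be visible to other cores), so it cannot simply be done and then discarded.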
They also have the same set of conditional branches as ARMv7, but because the PC is not a GPR, branch prediction becomes a lot easier. With ARMv7, any instruction can potentially be a branch and you need to know that the destination operand is the pc before you know whether it's a branch. This is great for software. You can do indirect branches with a load instruction, for example. Load with the pc as the target is insanely powerful and fun when writing ARM assembly, but it's a massive pain for branch prediction. This didn't matter on ARM6, because there was no branch predictor (and the pipeline was sufficiently short that it didn't matter), but it's a big problem on a Cortex A8 or newer. Now, the branch predictor only needs to get involved if the instruction has one of a small set of opcodes. This simplifies the interface between the decode unit and the branch predictor a lot. For example, it's easy to differentiate branches with a fixed offset from ones with a register target (which may go through completely different branch prediction mechanisms), just by the opcode. With ARMv7, an add with the pc as the destination takes two operands, a register and a flexible second operand, which may be a register, a register with the value shifted, or an immediate. If both registers are zero, then this is a fixed-destination branch. If one register is the pc, then it's a relative branch. Because pretty much any ARMv7 instruction can be a branch, the branch predictor interface to the decoder has two big disadvantages: it's very complex (not good for power) and it often doesn't get some of the information that it needs until a cycle later than one working just on branch and jump encodings would.
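A rough way to picture the difference in what the decoder has to look at, as a Python sketch. The `Insn` record is hypothetical and is not a real ARM encoding; the mnemonics are real, but grouping them into sets like this is purely illustrative, and cases such as load-multiple with the pc in the register list are not modelled.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record: a mnemonic plus the number of
# the destination register (None if the instruction has none).
Insn = namedtuple("Insn", "opcode dest")

PC = 15  # on ARMv7 the program counter is register r15

# Explicit branch mnemonics for the 32-bit and 64-bit instruction sets.
A32_BRANCH_OPCODES = {"b", "bl", "bx", "blx"}
A64_BRANCH_OPCODES = {"b", "bl", "b.cond", "cbz", "cbnz",
                      "tbz", "tbnz", "br", "blr", "ret"}


def is_branch_armv7(insn):
    """ARMv7: explicit branches, plus anything that writes to the pc.
    The destination operand must be decoded before this can be answered."""
    return insn.opcode in A32_BRANCH_OPCODES or insn.dest == PC


def is_branch_armv8(insn):
    """ARMv8: the opcode alone is enough, since the pc is not a GPR."""
    return insn.opcode in A64_BRANCH_OPCODES
```

With this model, `Insn("ldr", 15)` is a branch on ARMv7 (a load into the pc) but an ordinary load into x15 on ARMv8.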
Load and store multiple are gone as well, but they're replaced with load and store pair. These give slightly lower instruction density, but they have the advantage that they complete in a more predictable amount of time, once again simplifying the pipeline, which reduces power consumption and increases the upper bound on clock frequency (which is related to the complexity of each pipeline stage).
They've also done quite a neat trick with the stack pointer. Register 31 is, like the zero register on most RISC architectures, always 0 in most instructions, but when used as the base address for a load or store, it becomes the stack pointer with ARMv8, so they effectively get stack-relative addressing without having to introduce any extra opcodes (e.g. push and pop on x86) or make the stack pointer a GPR.
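The context-dependent decoding can be sketched like this. The sketch uses register number 31 for the shared encoding, which is how the published A64 encoding does it; the function name and the toy register file are made up for illustration, and only the load/store base-address case described above is modelled.

```python
ZR_OR_SP = 31  # register number shared by the zero register and the stack pointer


def read_reg(regs, sp, num, as_base_address=False):
    """Read register `num` from a toy register file.

    regs: list of 31 values standing in for x0..x30
    sp:   current value of the stack pointer
    as_base_address: True when the register is the base of a load/store
    """
    if num == ZR_OR_SP:
        # Same 5-bit register number, two meanings depending on context.
        return sp if as_base_address else 0
    return regs[num]
```

The point is that no extra opcode bits are spent: the instruction's own context decides which meaning the register number takes.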
ARMv8 also adds a very rich set of memory barriers, which map very cleanly to the C[++]11 memory ordering model. This is a big win when it comes to reducing bus traffic for cache coherency, and so for power efficiency in multithreaded code, because it makes it easy to do exactly the minimum synchronisation that the algorithm requires.
As an assembly programmer, I much prefer ARMv7, but as a compiler writer ARMv8 is a clear win. I spend a lot more time writing compilers than I spend writing assembly (and most people spend a lot more time using compilers than writing assembly). All of the things that they've removed are things that are hard to generate from a compiler (and hard to implement efficiently in silicon) and all of the things that they've added are useful for compilers. It's the first architecture I've seen where it looks like the architecture people actually talked to the compiler people before designing it.
Re:Patent move? (Score:5, Informative)
First, don't conflate the ABI and the ISA. The ABI, the Application Binary Interface, describes things like calling conventions, the sizes of fundamental types, the layout of C++ classes, vtables, RTTI, and so on. It is usually defined on a per-platform (OS-architecture pair) basis. This changes quite infrequently because changing it would break all existing binaries.
The ISA, Instruction Set Architecture, defines the set of operations that a CPU can execute and their encodings. These change quite frequently, but usually in a backwards-compatible way. For example, the latest AMD or Intel chips can still run early versions of MS DOS (although the BIOS and other hardware may be incompatible). ARM maintains backwards compatibility for userspace (unprivileged mode) code. You could run applications written for the ARM2 on a modern Cortex A15 if you had any. ARM does not, however, maintain compatibility for privileged mode operations between architectures. This means that kernels needed porting from ARMv5 to ARMv6, a little bit of porting from ARMv6 to ARMv7 and a fair bit more from ARMv7 to ARMv8. This means that they can fundamentally change the low-level parts of the system (for example, how it does virtual memory) but without breaking application compatibility. You may need a new kernel for the new CPU, but all of the rest of your code will keep working.
Backwards compatible changes happen very frequently. For example, Intel adds new SSE instructions with every CPU generation, ARM added NEON and so on. This is because each new generation adds more transistors, and you may as well use them for something. Once you've identified a set of operations that are commonly used, it's generally a good use of spare silicon to add hardware for them. This is increasingly common now because of the dark silicon problem: as transistor densities increase, you have a smaller proportion of the die area that can actually be used at a time if you want to keep within your heat dissipation limits. This means that it's increasingly sensible to add more obscure hardware (e.g. ARMv8 adds AES instructions) because it's a big power saving when it is used and it's not costing anything when it isn't (you couldn't use the transistors for anything that needs to be constantly powered, or your CPU would catch fire).
Re:Yay Cortex A-15! (Score:4, Informative)
It's interesting: I'm typing this on a netbook. That's got an Atom N570, a 1.6GHz dual-core, in-order, hyperthreaded CPU. My phone has dual-core Cortex A9s, which are 1.2GHz, out of order and single issue.
If you'd said five years ago that Arm would go out of order and Intel would go in order, I'd have thought it was absurd. Then again, you're comparing the (then) slowest Atom with the (then) fastest Arm.
According to this
http://www.7-cpu.com/
An Atom N270 at 1.6GHz with two threads gets a score of 1000 MIPS compressing and 1500 MIPS decompressing.
An Exynos 4210 at 1.2GHz with 4 threads gets 1380 MIPS compressing and 2130 MIPS decompressing.
Unfortunately there's no result for an N570, but judging by the other results, doubling up the number of cores should make it a bit faster than the Exynos 4210. Still, it's probably quite close, which is remarkable: the Exynos uses slow mobile SDRAM and the Atom uses DDR2.
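For what it's worth, the back-of-the-envelope version of that estimate looks like this in Python. The big assumption, labelled in the code, is that an N570 scores exactly twice an N270, i.e. perfect scaling across the second core; real results would likely come in lower, so treat the ratios as an upper bound.

```python
# Figures quoted above from 7-cpu.com, in MIPS.
n270 = {"ghz": 1.6, "compress": 1000, "decompress": 1500}
exynos_4210 = {"ghz": 1.2, "compress": 1380, "decompress": 2130}

# ASSUMPTION: an N570 behaves like two N270s with perfect scaling.
n570_est = {
    "ghz": n270["ghz"],
    "compress": 2 * n270["compress"],
    "decompress": 2 * n270["decompress"],
}


def ratio(a, b, key):
    """How many times faster chip a is than chip b on the given test."""
    return a[key] / b[key]


def per_ghz(chip, key):
    """Score normalised by clock speed."""
    return chip[key] / chip["ghz"]


compress_ratio = ratio(n570_est, exynos_4210, "compress")
decompress_ratio = ratio(n570_est, exynos_4210, "decompress")

print(f"estimated N570 vs Exynos 4210, compressing:   {compress_ratio:.2f}x")
print(f"estimated N570 vs Exynos 4210, decompressing: {decompress_ratio:.2f}x")
print(f"N270 compress MIPS per GHz:        {per_ghz(n270, 'compress'):.0f}")
print(f"Exynos 4210 compress MIPS per GHz: {per_ghz(exynos_4210, 'compress'):.0f}")
```

Under that assumption the Atom's lead comes out at roughly 1.4x on both tests, while per GHz the Exynos is well ahead of a single N270.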