Bulldozer 2

In an AMD/HP slide set available on the web I found two slides: one showing a possible future development of AMD's high-end server platform, and one with a seemingly more detailed performance prediction (with added TDP numbers). According to John Fruehe of AMD, the first slide is not a roadmap and the second is not a presentation of measurements. The first slide was used in a discussion at an event earlier this year, and even the future socket names used there are not based on any plans; the real sockets will be named differently. So better take it as an idea rather than a grand plan of AMD, since the future of the server platform is subject to change depending on market conditions and operational or strategic decisions. Similarly, the performance chart should be treated as hand-drawn, not as something plotted in Matlab.

I added some year numbers and process nodes to it, both of which roughly fit the data of an older slide listing a "2012 processor" and a "2013 processor" as well as GlobalFoundries' current process node roadmap. As we already know, the upcoming Orochi CPU is called "BD Ver 1" in the GCC sources. This fact, together with microarchitecture details in patents that don't seem to fit what is known about Bulldozer but otherwise describe a Bulldozer-like processor architecture, indicates that there will be a microarchitecture update. My guess is that we might see it in the 2013 timeframe, while an even more advanced microarchitecture, which could take the Fusion concept to an even stronger integration, might come around 2015, as already reported elsewhere and indicated by the slide above.

The changes to expect in a "Bulldozer NG" or "Bulldozer 2", as I call it, might be as complex and as effective as the ones we know from the K8 core, which appeared in the first Opteron in 2003 and got its latest update with Llano. They might also be comparable to what has changed between Greyhound+ and Llano's core (also a Hound). We will see.

BTW, some new GCC patches adding support for new ISA extensions appeared recently. They already mention a "bdver2" processor and, of course, the new extensions:

These patches add support for upcoming bdver2 AMD processors:
BMI (Bit Manipulation Instructions)
TBM (Trailing Bit Manipulation)
FMA3 (three operand FMA) instructions

The public specifications for BMI and TBM are in progress (they are today available under NDA). They will appear in one of the AMD64 Architecture Programmer's Manual Volumes 3-6. I can post the mnemonics definitions if needed. The FMA3 specification is documented in http://software.intel.com/en-us/avx/

2010-10-15  Quentin Neill  <quentin.neill.gnu@amd.com>

BMI patch:
BMI mnemonics:
TBM patch:
TBM mnemonics:
FMA3 patch is not ready yet:
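
To give a rough idea of the kind of operations behind the BMI and TBM names in the patch notes, here is a minimal Python sketch of a few of them. It is an illustration only: the semantics follow the instruction definitions as they were later published, not anything stated in the patch announcement, the function names simply mirror the mnemonics, and the 32-bit operand width (`MASK`) is an assumption made for the example.

```python
# Illustrative emulation of a few bit-manipulation operations.
# Assumption: 32-bit operands; the real instructions also exist in 64-bit form.
MASK = 0xFFFFFFFF


def blsr(x):
    """BMI: reset (clear) the lowest set bit."""
    return x & (x - 1) & MASK


def blsi(x):
    """BMI: isolate the lowest set bit."""
    return x & -x & MASK


def blsmsk(x):
    """BMI: mask covering the lowest set bit and all bits below it."""
    return (x ^ (x - 1)) & MASK


def blcfill(x):
    """TBM: clear all trailing one bits (fill from the lowest clear bit)."""
    return x & (x + 1) & MASK


def tzmsk(x):
    """TBM: mask covering the trailing zero bits."""
    return ~x & (x - 1) & MASK


if __name__ == "__main__":
    value = 0b10110100
    for op in (blsr, blsi, blsmsk, blcfill, tzmsk):
        print(f"{op.__name__:8s} {op(value):032b}")
```

Each of these is a two- or three-operation idiom in plain integer code; a dedicated instruction lets the compiler emit it as a single operation once the matching target option is enabled.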

The fact that these patches have already been published could mean that Bulldozer 2 isn't that far off (a few years, as shown above).


This week AMD presented a demo of the Llano APU at its Technical Forum & Exhibition 2010. The processor shown, apparently a lower-TDP part, managed to run 4 threads of HyperPi (not SuperPi!), the DirectX nBody simulation and 1080p video playback in parallel. The computational throughput displayed by the nBody simulation was an estimated 35 GFLOPS on average. But due to the high utilization of the CPU cores, only part of the available power budget was dedicated to the GPU. So the result is not comparable to the Zacate score reported at Anandtech.

These DirectX nBody code examples calculate the forces between a bunch of particles (several thousand, I suppose). To do so, the code has to calculate the distance between each possible pair of particles. This requires multiplications, additions, divisions and square roots to compute the distances and forces. With a higher number of particles the amount of calculation grows quadratically. Because of that, the display resolution is irrelevant: the force calculations for the drawn particles take most of the time.
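
The quadratic growth is easy to see in a minimal direct-sum sketch. The following Python code is not the DirectX sample itself; it assumes a gravity-like inverse-square force, and the names `pairwise_forces`, `g` and `softening` are made up for this illustration.

```python
import math


def pairwise_forces(positions, masses, g=1.0, softening=1e-3):
    """Direct-sum force calculation over all particle pairs.

    positions: list of (x, y, z) tuples; masses: list of floats.
    Returns (forces, pair_count), where forces[i] is the force vector on
    particle i. The softening term (an assumption of this sketch) avoids
    a division by zero when two particles coincide.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    pair_count = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Distance vector from particle i to particle j.
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            dist_sq = sum(d * d for d in dx) + softening * softening
            inv_dist = 1.0 / math.sqrt(dist_sq)
            # Inverse-square magnitude times the unit vector: m_i * m_j / r^3 * dx.
            scale = g * masses[i] * masses[j] * inv_dist ** 3
            for k in range(3):
                f = scale * dx[k]
                forces[i][k] += f  # force on i points towards j
                forces[j][k] -= f  # equal and opposite force on j
            pair_count += 1
    return forces, pair_count


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
    forces, pairs = pairwise_forces(pts, [1.0] * len(pts))
    print(pairs, forces[0])
```

For n particles the inner loop runs n(n-1)/2 times, so doubling the particle count roughly quadruples the work, regardless of how many pixels are drawn afterwards.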

There was also a demo of AVP running on Llano. You can find videos of those demos at AMD's blog, Techspot, SemiAccurate and LegitReviews, for example.

The Brazos platform was shown as well. It reached a much higher DirectX nBody throughput (seemingly with no other tasks in the background) at ~19W (at the wall socket) than the Core i5 processor it was compared against, which drew ~38W.

Planet3DNow! presented some slides, such as the fmax distribution slide for Llano (showing the maximum reachable frequency of parts at a given voltage), which is already known from an earlier presentation. Back then I read a comment (Hi Paul!) saying that the process doesn't look healthy because the sample points stretch so far towards the lower end (on the left). But some time ago I saw a similar curve in a paper or dissertation about fmax distributions achieved on an Intel process node. As soon as I find it, I'll post it here.

Other stuff

I'm thinking about how to link different forums and possibly interesting updates to this blog, since a lot of interesting information first pops up in a forum. So I'm open to ideas. For example, the data for the Hudson southbridges appeared here (in Russian) and was later reposted by me here (in German). If you ask why I didn't post it on my blog: if I'm not sure about the nature of a document or its source, I don't publish it here.