Sitting under the hot Cretan sun might not be the best place for clear thinking, but the shade of some palm trees and the cooling water of the hotel's pool did enough to let me finish this blog article. In short: what a week! While Apple lost Steve Jobs as its charismatic CEO, AMD got what is probably its most enthusiastic and motivating CEO since the days of Jerry Sanders. Time will tell. A Bulldozer-based CPU hasn't been launched yet. My gut feeling gives it a low probability, but this might happen very soon; we'll see. The true Orochi die size has been revealed, and more BD ES benchmark results have trickled out of more or less dubious sources. Many interesting topics to cover.
Bulldozer Engineering Sample Performance
Recently a new wave of Chinese benchmark leaks and some new results by OBR stirred up discussions in several forums. A lot of energy is being put into arguing whether the ES were run under optimal settings or not. The variance seen among benchmark results of different BD ES, or even retail CPUs, on different boards but at the same base clocks speaks for suboptimal configurations. Such a configuration doesn't only include the type of HDD/SSD, RAM or OS, but extends to proper BIOS and driver support, which I assume to be missing or not ready yet. In the case of graphics performance, even the graphics drivers could be a reason for lower than expected results if the driver code (which runs on the CPU) is not fully optimized for the target CPU.
My first PC was a custom-built system with an 80386 CPU. It had some cache on the mainboard. Running Norton System Info on it revealed that raw CPU performance wasn't as high as other results suggested. The reason was that the mainboard cache had been deactivated in the BIOS settings. Enabling it doubled the SI score! So much for the effect of misconfiguration. Now imagine the much more complex settings of Bulldozer-based CPUs, which have likely become the most configurable series of AMD CPUs ever. Some examples of what could go wrong and prevent optimal performance:
- memory controller modes and timings aren't configured correctly
- NB and L3 are still not clocked at retail speed levels
- modes and settings of different caches
- turbo mode settings (besides the specific P-State settings this includes timings, switching rules)
- core configuration (BD allows different configurations to be adapted to specific workloads)
Software Code Paths and Code Optimization
The executed code path also has a significant effect on performance. I mentioned this on Twitter already but it's time to go into more detail. Here are some points where software running on BD might achieve a performance significantly lower than software optimized for BD:
- if software checks for specific CPUIDs (esp. the family) and has several code paths optimized for older CPUs, it might still choose one of the less optimized ones
- most software won't use FMA but SSE1-4 instead, which reaches only half of the max. theoretical FP throughput (this will be the case for all FP code paths not explicitly optimized for Bulldozer or Haswell)
- use of scalar FP code (MMX, x87, scalar SSE ops) will only utilize 1/4 of max. theoretical FP throughput (think of still used benchmarks like SuperPi)
- wrong cache blocking might cause cache line thrashing (e.g. code might group data into blocks of 32kB or 64kB to fit into current L1 cache sizes)
- code alignment/content rules
- streamed writes (using too many streams might drop write throughput to much lower levels than on 10h according to the optimization manual)
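The first point in the list above can be illustrated with a small sketch. It is not taken from any real application: the path names and the lookup table are hypothetical, and only the family numbers (0Fh for K8, 10h for K10, 15h for Bulldozer) are real. It contrasts a dispatcher keyed on the CPUID family with one keyed on feature flags.

```python
# Hypothetical code-path names; a real program would map them to functions.
# Families the (imaginary) software knew about when it was written.
FAMILY_PATHS = {
    0x0F: "sse2_k8",    # K8
    0x10: "sse3_k10",   # K10 / family 10h
}

def pick_by_family(family: int) -> str:
    """Dispatch on CPUID family; unknown families get the slow fallback."""
    return FAMILY_PATHS.get(family, "generic_x87")

def pick_by_features(features: set) -> str:
    """Dispatch on feature flags, trying the most capable path first."""
    for flag, path in (("fma4", "fma"), ("avx", "avx"), ("sse2", "sse2")):
        if flag in features:
            return path
    return "generic_x87"

# Family 15h is not in the table, so the family check falls back to the
# slowest path, while the feature check still finds the FMA path.
bulldozer_family = 0x15
bulldozer_features = {"sse2", "avx", "fma4"}  # illustrative subset only
family_choice = pick_by_family(bulldozer_family)
feature_choice = pick_by_features(bulldozer_features)
```

Here `family_choice` ends up as the generic fallback while `feature_choice` selects the FMA path, which is the kind of gap a benchmark written before Bulldozer's launch might show.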
Power Cap / TDP Limit
Some people on the web have already speculated that AM3/AM3+ board-related issues regarding power supply, Bulldozer's specific requirements and its TDP limit feature could be a reason for lower than expected ES performance. To understand part of these issues, have a look at a voltage graph of Llano while jumping between different P-States. This diagram is from AMD and shows voltage over time:
This behaviour has already been called "dithering" by some. Also remember Asus' AM3+ advantage slides, which covered topics like power/voltage jumps. So if, for stability reasons, a lower peak TDP cap than planned for retail has been applied to BD ES, this might indeed have an effect on performance. The high TDP readings of CPU-Z might just be this peak value, but this is speculation. Some AMD patents describe a kind of TDP-budget-related throttling at different stages during execution. Furthermore, some turbo P-states (esp. All Core Turbo) might become unavailable in such cases.
If the TDP limit isn't too low, then lightly threaded benchmarks should still get the highest turbo clocks, while multithreaded high-throughput benchmarks might suffer.
HyperTransport Flaw in earlier ES Steppings
One site recently reported on Bulldozer in general but also mentioned a reason for lower ES performance: a flaw in the HyperTransport module might cause bad cache and memory performance, thus affecting overall performance significantly.
Bulldozer Engineering Sample OPNs
Fortunately it's not too late for the following: About 2 months ago I started to assemble a list of BD ES and their OPNs as they could be found at different places. The most interesting ones might be the listed Sandra results since I found them mentioned nowhere. One result even made it to #1 worldwide in SiSoft Sandra's crypto benchmark (the same machine as #4 and #5 in the table below).
If you look carefully at the OPN codes, you'll notice that the right part of the full OPN code contains some additional information compared to what has already been leaked about the main part of the OPN (left half). I read it this way:
TC: Max. Turbo Core Clock (GHz)
BC: Base Clock (GHz)
NC: North Bridge and/or L3 Clock (GHz)
?: unknown (always 2)
C#: Number of Cores (physically present, not always fully activated)
Example: 31/21/2_2/16 means 3.1 GHz max. Turbo Core Clock, 2.1 GHz Base Clock, 2.0 GHz NB/L3 Clock, 16 cores
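My reading of the right part of the OPN can be written down as a small parser. This is only a sketch of the interpretation above, not a documented AMD format: the names `OpnSuffix` and `parse_opn_suffix` are made up for illustration, and I assume that the first digit of each clock field is whole GHz and any following digit is tenths (so "31" is 3.1 and "2" is 2.0).

```python
from typing import NamedTuple

class OpnSuffix(NamedTuple):
    turbo_ghz: float   # TC: max. Turbo Core clock
    base_ghz: float    # BC: base clock
    nb_ghz: float      # NC: north bridge and/or L3 clock
    unknown: str       # "?": meaning unknown (always 2 in the list)
    cores: int         # C#: cores physically present

def _ghz(digits: str) -> float:
    # Assumption: first digit = GHz, remaining digits = decimal part.
    return float(f"{digits[0]}.{digits[1:] or '0'}")

def parse_opn_suffix(text: str) -> OpnSuffix:
    """Split a string like '31/21/2_2/16' into its assumed fields."""
    tc, bc, nc_and_unknown, cores = text.split("/")
    nc, unknown = nc_and_unknown.split("_")
    return OpnSuffix(_ghz(tc), _ghz(bc), _ghz(nc), unknown, int(cores))

example = parse_opn_suffix("31/21/2_2/16")
```

For the example string this yields 3.1 GHz turbo, 2.1 GHz base, 2.0 GHz NB/L3 and 16 cores, matching the reading given above.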
I will fix non-working links later.
Some Economic Aspects of Orochi's Die Size
Charlie reported that Orochi's die size is 315 mm² according to an AMD slide. This is 7% more than my latest measurement of 294 mm², which was based on the clean, undistorted Orochi die photo from ISSCC 2011 and the already known area of a Bulldozer module incl. its L2 cache, published at the same conference. So the earlier estimates of 300 to 320 mm² (based on DRAM I/O pads, L1 I-cache sizes etc.) were closer to the truth than the methodically more accurate pixel measurement. The best methods are worth nothing when applied to bad or unclear data; in this case the culprit was the way of counting the power gating ring area, I suppose.
What are the economic aspects of Bulldozer's die size being greater than many expected? First, there is an effect on the pure cost of producing Orochi dies. This value can only be estimated by parties outside of Globalfoundries and AMD, and it depends on many input variables not known to the public. One way is to use known data from other foundries or the semiconductor industry as a whole. In this case I assume a price of $5k for the processed wafer (ignoring the new die-based WSA between AMD and GF) and die yields of 70%.
After roughly 10% losses at the wafer edge and markings, and at the assumed 70% yield, this would mean ca. 150 net (good) dies per wafer at a die size of 294 mm² or ca. 140 net dies at 315 mm². With packaging/test costs of maybe $5 per die, the resulting cost per die would be $38 in the first case and $41 for the actual die size. At a yield of 50% the net die count would be 107 and 100 respectively, and the resulting costs $52 or $55. Such a difference of $3 is certainly smaller than the error margin we can assume for these calculations.
But there is another way in which die size has an effect: if demand is greater than supply (a situation AMD is familiar with) and Globalfoundries is short of 32nm capacity, this might indeed be a somewhat bigger problem than just an increase in variable cost of $3 per die. Just imagine a hypothetical case where AMD could order no more than 5000 wafers for Orochi dies. That doesn't sound like much, but those 5k wafers mean 750k vs. 700k chips. If demand is high enough (say: 800k), then assuming an ASP of $200 for the DT/server processor mix, those missing 50k would result in a reduction of achievable revenue by about $10M or less than 1% in Q3.
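The cost estimate above can be retraced with a few lines of Python. This is a rough sketch under my stated assumptions ($5k per wafer, 10% edge/marking loss, $5 packaging/test) plus one that is not spelled out in the text: a 300 mm wafer whose area is simply divided by the die area. The function names are hypothetical.

```python
import math

def good_dies_per_wafer(die_area_mm2, die_yield,
                        wafer_diameter_mm=300.0,  # assumption: 300 mm wafers
                        edge_loss=0.10):
    """Net die count: wafer area / die area, minus edge loss, times yield."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    gross = wafer_area / die_area_mm2
    return gross * (1.0 - edge_loss) * die_yield

def cost_per_die(die_area_mm2, die_yield,
                 wafer_price_usd=5000.0, package_test_usd=5.0):
    """Wafer price spread over the good dies, plus packaging/test."""
    net = good_dies_per_wafer(die_area_mm2, die_yield)
    return wafer_price_usd / net + package_test_usd

for area in (294.0, 315.0):
    for y in (0.70, 0.50):
        print(f"{area:.0f} mm², yield {y:.0%}: "
              f"{good_dies_per_wafer(area, y):.0f} dies, "
              f"${cost_per_die(area, y):.2f} per die")
```

Because this simple area model does not round the die counts the way I did by hand, the results land within about a dollar of the figures in the text rather than matching them exactly. The gap between the two die sizes stays at roughly $2 to $3 either way.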
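The supply-limited case is simple arithmetic, but a short sketch makes the inputs explicit. All numbers are the hypothetical ones from the paragraph above (5000 wafers, 150 vs. 140 good dies per wafer, demand of 800k units, $200 ASP), and `achievable_revenue` is a made-up helper name.

```python
def achievable_revenue(wafers, good_dies_per_wafer, demand_units, asp_usd):
    """Revenue is capped by whichever is smaller: supply or demand."""
    supply = wafers * good_dies_per_wafer
    return min(supply, demand_units) * asp_usd

# Hypothetical scenario from the text: supply is below demand in both cases.
expected = achievable_revenue(5000, 150, 800_000, 200)  # 294 mm² estimate
actual = achievable_revenue(5000, 140, 800_000, 200)    # 315 mm² die
shortfall = expected - actual

# If demand were below both supply levels, die size would not cost revenue.
no_shortfall = (achievable_revenue(5000, 150, 600_000, 200)
                - achievable_revenue(5000, 140, 600_000, 200))
```

With these inputs `shortfall` comes out at $10M, and it drops to zero as soon as demand falls below the smaller supply figure.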