Unfortunately, I had neither enough time nor enough detail (some things I would have had to guess) to create the promised architecture diagram. However, the missing details can now be found in newly published patent applications, which should help me get back to the task. But for now I'll switch to another topic: will Bulldozer have SMT or not?
AMD's John Fruehe recently said in an AMDZone forum thread that AMD won't do SMT in the next few years. That could be read as meaning that the architecture revealed here won't be able to execute more than one thread per core. However, no such statement has been made: so far, John has only said that AMD wouldn't implement SMT. In my eyes it was a smart move to mention SMT - just to be able to deny it ;). However, this is still speculation.
Instead, we saw the term "Cluster-based Multi-threading" (also known as clustered multi-threading, CMT) years ago in an AMD presentation. If you look at Chuck Moore's slide below, you see that SMT is the least attractive multi-threading variant to AMD. So far they have been working in the CMP part of this diagram, and it seems only logical to move from there to the much greener CMT area - even more so since they explicitly state a 50% area investment for an 80% throughput gain. They already held this view four years ago, with the first patents covering the new architecture being filed just two years later. If Bulldozer was originally meant to be ready in 2009 or 2010, these time frames seem OK to me. And even the four-year gap between the patent filing dates and 2011 fits well with what we know from older architectures.
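To put the slide's numbers into perspective, here is a small back-of-the-envelope calculation of throughput per area. The 50% and 80% figures are the ones quoted above; the CMP comparison is purely my own assumption of an idealized second core (twice the area, perfect scaling) and is not taken from the slide. The function name is just for illustration.

```python
def throughput_per_area(area_increase, throughput_gain):
    """Relative throughput per unit area versus a single baseline core (= 1.0)."""
    return (1.0 + throughput_gain) / (1.0 + area_increase)

# Figures quoted from the slide: +50% area for +80% throughput.
cmt = throughput_per_area(0.50, 0.80)

# Assumption for comparison only: an idealized CMP step that doubles the area
# and scales perfectly (+100% area, +100% throughput).
cmp_ideal = throughput_per_area(1.00, 1.00)

print(f"CMT: {cmt:.2f}x throughput per area")
print(f"CMP (idealized): {cmp_ideal:.2f}x throughput per area")
```

Under these assumptions the CMT step delivers roughly 1.2 times the throughput per area of the baseline, while even a perfectly scaling second core only stays at 1.0. Real CMP scaling is usually below perfect, which would widen the gap further.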
So we find the new arch again in:
- 20090164758 - System and method for performing locked operations
- 20090172359 - Processing pipeline having parallel dispatch and method thereof
- 20090172362 - Processing pipeline having stage specific thread selection and method thereof
- 20090172370 - Eager execution in a processing pipeline having multiple integer execution units
Most of these patent applications now give much more detail on how the threads are executed and the like. Most of it fits well with what Hans de Vries already described in his detailed post on aceshardware.
These patent applications describe ways to execute a single thread on both clusters. This could be done by using a run-ahead thread for early memory prefetches, or by executing both paths of a branch in parallel and scrapping the wrong path after branch resolution. A different variant is the parallel execution of the same code on both clusters to increase the reliability of the results by comparing them afterwards.
Some of the mentioned patent applications also state that the 4-way decoders could decode more than 4 instructions per cycle if there are both a microcoded and a fastpath instruction (of different threads) in one decoding path.
Another interesting, future-oriented topic is how general-purpose and graphics processing units could be combined. This is covered in the following patent applications:
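To make two of these ideas a bit more concrete, here is a toy software model. It is only a sketch of the concepts (eager execution of both branch paths, and redundant execution with a result comparison), not of how the patent applications implement them in hardware. Two worker threads stand in for the two integer clusters, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def eager_branch(resolve_condition, taken_path, not_taken_path):
    """Start both branch paths before the condition is known, keep one result."""
    # Two workers stand in for the two integer clusters (assumption of this toy model).
    with ThreadPoolExecutor(max_workers=2) as clusters:
        taken = clusters.submit(taken_path)
        not_taken = clusters.submit(not_taken_path)
        # "Branch resolution": the condition is evaluated while both paths run.
        condition = resolve_condition()
        winner, loser = (taken, not_taken) if condition else (not_taken, taken)
        # Best effort only: a path that has already started runs to completion
        # and its result is simply ignored.
        loser.cancel()
        return winner.result()

def redundant_execute(work):
    """Run the same code on both 'clusters' and compare the results afterwards."""
    with ThreadPoolExecutor(max_workers=2) as clusters:
        first = clusters.submit(work)
        second = clusters.submit(work)
        a, b = first.result(), second.result()
    if a != b:
        raise RuntimeError("results differ between clusters")
    return a

result = eager_branch(lambda: 3 > 2, lambda: "taken", lambda: "not taken")
checked = redundant_execute(lambda: sum(range(10)))
print(result, checked)
```

Note that this model only works if both paths are free of side effects, since the wrong path is computed and then thrown away. Real hardware has to keep the speculative state of the wrong path from ever becoming architecturally visible.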
- 20090164726 - Programmable address processor for graphics applications
- 20090160863 - Unified processor architecture for processing general and graphics workload