Long ago I mentioned Andy Glew because of his comment on Bulldozer's architecture in comp.arch. He also has a blog, where he posted links to some of his publications: http://andyglew.blogspot.com/2009/12/links-to-mlp-coherent-threading.html

I recommend reading (or at least skimming) his description of the Multi-Star architecture. His Berkeley ParLab talk presentation also depicts an architecture that closely resembles Bulldozer's, as you can see below.

[Image: mcmt]

Among the more interesting things he wrote about are Speculative Multithreading (SpMT), Eager Multithreading and similar variants, which could be implemented by extending this architecture.

In his example architecture, Glew prefers a 2 ALU + 2 mem op (or AGU) configuration for the execution units. But as Hans de Vries pointed out in the comments to the latest Bulldozer diagram blog entry, this configuration may by now have been changed to 4 ALUs and 4 AGUs, or at least to some hybrid variant capable of the same throughput (remember Wireloop's comment a while back).

Hans' argument goes like this: Chuck Moore (the Bulldozer chief architect, who has been reported as missed by cmaier) told the audience in the Q&A session at the last Financial Analyst Day that each integer scheduler, and also the FP scheduler, can issue 4 instructions per cycle. OK, we had also heard about the 4 instruction pipelines per integer core. But what, in this case, is an instruction? It could be an x86 instruction translated into some internal representation (like the MacroOps in K10), or something else. A clue was given by Chekib Akrout in his presentation of the Bobcat core: Bobcat's scheduler issues 2 instructions per cycle, and there we know that the execution hardware consists of 2 ALUs plus a load pipe and a store pipe.

If "instruction" has the same meaning for Bulldozer, there might be 4 ALUs and 4 AGUs in each of the integer cores. Another hint supporting this is the set of restrictions for dispatch groups mentioned in some of the later patents: a dispatch group of 4 instructions can contain at most 2 load ops and 1 store op, because there are only two ports to the load queue and one port to the store queue. If all of these mem ops are to be executed in one cycle, there need to be at least 3 AGUs in one integer core.
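To make the counting explicit, here is a minimal Python sketch of the two arguments above. It is only an illustration of the arithmetic, not a model of the real hardware: the function names, the op labels ('alu', 'load', 'store') and the assumption of one AGU per mem op are all made up for this example, and the linear scaling from Bobcat's 2-issue design is exactly the speculation discussed in the text.

```python
from collections import Counter

# Dispatch group limits as described above (from the patents):
# 4 instructions per group, two load-queue ports, one store-queue port.
GROUP_SIZE = 4
MAX_LOADS = 2
MAX_STORES = 1

# Bobcat reference point: 2 instructions issued per cycle,
# backed by 2 ALUs and 2 memory pipes (one load, one store).
BOBCAT = {"issue": 2, "alu": 2, "mem": 2}


def is_valid_group(ops):
    """Check whether a list of op kinds fits into one dispatch group."""
    counts = Counter(ops)
    return (
        len(ops) <= GROUP_SIZE
        and counts["load"] <= MAX_LOADS
        and counts["store"] <= MAX_STORES
    )


def min_agus_for_single_cycle():
    """AGUs needed to execute the most memory-heavy legal group in one
    cycle, assuming (hypothetically) one AGU per mem op."""
    return MAX_LOADS + MAX_STORES


def scale_from_bobcat(issue_width):
    """Scale Bobcat's unit counts linearly to another issue width.
    This linear scaling is the speculative step of the argument."""
    return {
        "alu": BOBCAT["alu"] * issue_width // BOBCAT["issue"],
        "mem": BOBCAT["mem"] * issue_width // BOBCAT["issue"],
    }


if __name__ == "__main__":
    print(is_valid_group(["alu", "load", "load", "store"]))
    print(is_valid_group(["load", "load", "load", "alu"]))
    print(min_agus_for_single_cycle())
    print(scale_from_bobcat(4))
```

The lower bound of 3 AGUs follows directly from the dispatch restriction, while the 4 ALU + 4 AGU figure only follows if Bulldozer counts instructions the same way Bobcat does.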