David Kanter has published his in-depth article about the microarchitecture of Sandy Bridge. He completes the picture given by Intel at IDF 2010 and compares the architecture to Nehalem and Bulldozer. While he interprets "Bridge" (Hebrew: "gesher") as a metaphor for bringing together several existing microarchitectural concepts with some new ones and a GPU, I also see a similarity to AMD's "Fusion", which likewise stands for bringing things together.

Hiroshige Goto recently wrote no fewer than four different articles about Sandy Bridge. You can find the Japanese articles here and the machine-translated ones here.

In response to some interesting thoughts, I collected some data to help retrace AMD's earlier decision to go for increased clock frequency with Bulldozer (meaning a lower FO4 delay per pipeline stage). So here is a translation of those thoughts and my preliminary conclusions:

Regarding power consumption, the higher-clocking 12 FO4 design of Bulldozer (as indicated on comp.arch) needs a lower voltage to run at 2 GHz than a 17 FO4 design like K8, because the signals simply have to pass through fewer gates during a 0.5 ns clock cycle. A lower voltage also lowers leakage. Whether this leads to higher power efficiency depends on further factors.
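To put numbers on that argument, here is a minimal sketch of the arithmetic. It takes both FO4 figures as the logic depth per pipeline stage, as the paragraph above does, and ignores latch, skew and jitter overhead. The function name is just an illustration.

```python
def fo4_budget_ps(clock_ghz: float, fo4_per_stage: float) -> float:
    """Time one FO4 gate delay may take (in ps) so that a stage of
    `fo4_per_stage` gate delays still fits into one clock cycle."""
    cycle_ps = 1000.0 / clock_ghz  # e.g. 2 GHz -> 500 ps per cycle
    return cycle_ps / fo4_per_stage


for name, depth in (("12 FO4 design", 12), ("17 FO4 design", 17)):
    budget = fo4_budget_ps(2.0, depth)
    print(f"{name}: each FO4 delay may take up to {budget:.1f} ps at 2 GHz")
```

At the same 2 GHz, the 12 FO4 design can tolerate noticeably slower gates than the 17 FO4 design, and slower gates are what you get when you lower the supply voltage.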

Now, as some might have found out by playing with the recently released Scheduler Simulation (an OpenOffice.org variant is in the pipeline), a 4-wide OoO scheduler might have a harder time keeping the EUs busy than a 2-wide scheduler due to dependencies. The narrower design only has to provide the necessary operands (registers or memory data) for up to two instructions per cycle rather than four.
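The effect can be reproduced with a very small toy model. This is not the Scheduler Simulation mentioned above, just a hypothetical sketch under simple assumptions: unit latency for every instruction, an unlimited scheduling window, oldest-first issue, and dependencies that only point to earlier instructions.

```python
def simulate(deps, width):
    """Issue up to `width` ready instructions per cycle (oldest first).

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Returns (cycles needed, average EU utilization).
    """
    issue_cycle = {}                 # instruction index -> cycle it issued in
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        # Ready = all producers issued in an earlier cycle (unit latency).
        ready = [i for i in pending
                 if all(d in issue_cycle and issue_cycle[d] < cycle
                        for d in deps[i])]
        issued = ready[:width]
        for i in issued:
            issue_cycle[i] = cycle
        pending = [i for i in pending if i not in issue_cycle]
        cycle += 1
    return cycle, len(deps) / (cycle * width)


# Two independent dependency chains of four instructions each.
two_chains = [[], [], [0], [1], [2], [3], [4], [5]]

for width in (2, 4):
    cycles, util = simulate(two_chains, width)
    print(f"{width}-wide: {cycles} cycles, {util:.0%} of issue slots used")
```

With only two independent chains in the code, both widths need the same number of cycles, so the 4-wide machine leaves half of its issue slots empty.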

Designing wider OoO execution cores also means complexity growth for many logic components (often quadratic, sometimes cubic growth). This is reflected by Pollack's rule, as depicted by Hiroshige Goto:

(from here, but more on that can be found here or at Intel, where Pollack actually worked)
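Pollack's rule is commonly stated as: single-thread performance grows roughly with the square root of the core's complexity (transistor count or die area). A minimal sketch of what that implies, with function names chosen just for illustration:

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Rough performance gain when complexity grows by `complexity_ratio`,
    using the common square-root statement of Pollack's rule."""
    return math.sqrt(complexity_ratio)


def perf_per_complexity(complexity_ratio: float) -> float:
    """Performance per unit of complexity, relative to the baseline core."""
    return pollack_speedup(complexity_ratio) / complexity_ratio


for ratio in (1, 2, 4):
    print(f"{ratio}x complexity: ~{pollack_speedup(ratio):.2f}x performance, "
          f"{perf_per_complexity(ratio):.2f}x performance per complexity")
```

So under this rule of thumb, a core with four times the complexity delivers only about twice the performance, which is half the performance per transistor spent.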

Leakage is becoming increasingly important when designing chips with smaller structures. In the case of Llano, it is stated as being 29%:

(from here)

According to an IBM research report from 2003, designs with 17 or 18 FO4 (black and green curves) could be the most power-efficient:

Their metric "Total FO4" includes jitter, skew and latches, so it's comparable to the 17 FO4 of Bulldozer (12 FO4 + 5 FO4). The reason for the bumpy curves (compared to many other publications) could simply be that the researchers actually looked at where they could divide the pipeline in a useful way. The Bulldozer architects could have known those results in advance (remember Chuck Moore's job history) or come to similar conclusions after doing their own research.