According to Andreas Stiller of the German c't magazine (in German), Sandy Bridge will have a loop cache with space for over 1500 µOps (1536?), compared to Nehalem's 28 µOp loop stream detector buffer. That size makes it comparable to a trace cache. A Wikipedia article about Sandy Bridge's architecture mentions both a trace cache and a "16 KiB decoder" (decode cache?). These two seem to be the same thing, and both are comparable to the loop cache. A plausibility check: 16 KiB is 131072 bits, which would be enough for ~1500 µOps at more than 80 bits per µOp (about 85 bits each for 1536 µOps). The data in the Wikipedia article seems to have been collected from older Intel presentations, but I didn't track it down to the original sources.
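The plausibility check can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `bits_per_uop` is made up for illustration, and it assumes the whole 16 KiB is used for µOp storage with no tags or other overhead, which is probably not true for real hardware.

```python
def bits_per_uop(cache_bytes, uop_count):
    """Average number of bits available per µOp if the whole cache stores µOps.

    Assumption: no space is spent on tags, valid bits or other bookkeeping.
    """
    return cache_bytes * 8 / uop_count


CACHE_BYTES = 16 * 1024  # the "16 KiB" figure from the Wikipedia article

# 1536 is the guessed exact capacity, 1500 the "over 1500" lower bound.
for uops in (1536, 1500):
    print(f"{uops} µOps -> {bits_per_uop(CACHE_BYTES, uops):.1f} bits per µOp")
```

Both cases come out above 80 bits per µOp, so the two numbers are at least consistent with each other.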

As I wrote in earlier blog entries, Bulldozer might have a trace cache or a "redirect recovery cache". Most of the described variants work in the same way as Intel's loop cache. To me it seems likely that Bulldozer will have something like that too. This is not just a matter of performance but also of power efficiency. A trace cache is a way to reduce power consumption, e.g. by reducing the work to be done by the fetch and decode stages. Additionally, some code optimization or instruction combining could be applied to the traces off the critical path. Such a method has been described in one of the older patents.
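To illustrate why a larger µOp buffer saves fetch and decode work, here is a toy model in Python. It is a rough sketch under loud assumptions, not a description of how Sandy Bridge, Nehalem or Bulldozer actually work: every instruction is assumed to decode into exactly one µOp, the buffer is modelled as a simple LRU cache indexed by instruction address, and the name `count_decodes` is hypothetical.

```python
from collections import OrderedDict


def count_decodes(loop_length, iterations, cache_capacity):
    """Count how often the decoder has to run for a simple loop.

    Toy assumptions: one µOp per instruction, LRU replacement,
    and the loop body is executed strictly in order.
    """
    cache = OrderedDict()  # instruction address -> cached µOp (placeholder)
    decodes = 0
    for _ in range(iterations):
        for address in range(loop_length):
            if address in cache:
                # Hit: the µOp comes from the cache, fetch/decode can stay idle.
                cache.move_to_end(address)
                continue
            # Miss: the instruction goes through fetch and decode.
            decodes += 1
            cache[address] = True
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return decodes


# A 100-instruction loop running 10 times, with the two capacities quoted above.
for capacity in (28, 1536):
    print(capacity, count_decodes(100, 10, capacity))
```

In this model a loop that fits into the buffer is decoded only once, while a loop that is larger than the buffer is decoded again on every iteration. That is the effect the power argument relies on, although real designs will behave less sharply than this sketch.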