A lot of interesting articles and postings have appeared recently. Here are some of them:

John Fruehe published a blog post about Bulldozer's Flex FP unit. It contains details about the execution of 128-bit and 256-bit instructions and other info, such as the throughput of AES instructions and the fact that the FPU can drop to 2% power consumption when idle. And if you missed it, he also posted the fourth part of the 20 Questions blog here.

Hiroshige Goto writes about the Llano demonstration (Google/Bing), AMD's new Barts GPU (Google/Bing) and its tessellator (Google/Bing).

David Kanter from Realworldtech pointed to an interesting GPU article, which contains a deep analysis of the capabilities of the tested Fermi GPU in comparison to AMD's Cypress GPU: http://www.beyond3d.com/content/reviews/55

While on the subject of GPUs: there is a webinar series from AMD covering OpenCL from the basics to more advanced algorithms. You can attend the sessions live or watch the recordings.

The Inquirer has an article about AMD's 2011 outlook, in which Nebojsa Novakovic mentions the good mood among AMDers regarding their future products.

And last but not least: two new patches posted to the GCC patches mailing list bring a lot of instruction latency numbers and other data for Bulldozer. There is a pipeline description patch and a processor costs table patch, which add to the numbers already published in the Open64 compiler source code. Here we can again see some numbers supporting Bulldozer's higher-frequency design: FMUL/FADD latencies went from 4 to 6 cycles (already known), x87 FDIV from 19 to 42, SSE DP division from 20 to 27 and x87 FSQRT from 35 to 52 cycles. Similarly, 32-bit integer multiplies take 4 cycles (vs. 3) and 64-bit integer multiplies take 6 cycles (vs. 4). And most 256-bit AVX ops are double decoded (two 128-bit uops) in Bulldozer.
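To put these latency changes into perspective, here is a small Python sketch that computes the ratio of new to old cycle counts for each instruction listed above. The names LATENCIES and latency_ratios are made up for this illustration and do not come from the GCC patches. The ratio can be read as the clock speed factor at which a single dependent operation would finish in the same wall-clock time; it says nothing about throughput or how well the pipeline hides latency.

```python
# Cycle counts quoted in the paragraph above, as (old, new) pairs.
# The dictionary name and layout are illustrative only.
LATENCIES = {
    "FMUL/FADD": (4, 6),
    "x87 FDIV": (19, 42),
    "SSE DP division": (20, 27),
    "x87 FSQRT": (35, 52),
    "32-bit integer multiply": (3, 4),
    "64-bit integer multiply": (4, 6),
}


def latency_ratios(table):
    """Return new/old cycle ratio per instruction, rounded to two decimals."""
    return {name: round(new / old, 2) for name, (old, new) in table.items()}


if __name__ == "__main__":
    for name, ratio in latency_ratios(LATENCIES).items():
        old, new = LATENCIES[name]
        print(f"{name:26s} {old:3d} -> {new:3d} cycles  (x{ratio})")
```

The x87 FDIV case stands out with more than twice the cycle count, while the other operations grow by roughly a third to a half.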