Some recent digging into the Open64 source code revealed more information about a microarchitecture that appears to be Bulldozer's. This open source code has already proven to be a good source of information, for example on cache sizes and instruction latencies. The file cg_sched.cxx (part of the code generator and responsible for scheduling instructions in compiled code) contains:

static const int num_fu[] = {
0, /* NONE */
4, /* ALU */
3, /* AGU */
4, /* FPU */

In an earlier revision, the same table looked like this (according to the comment, it targets the Opteron CPU):

static const int num_fu[] = {
0, /* NONE */
3, /* ALU */
3, /* AGU */
1, /* FADD */
1, /* FMUL */
1, /* FMISC */

This suggests that each integer core has 4 ALUs and at least 3 AGUs (perhaps there are 4 for easier scheduling, with only 3 usable in a single cycle). The number of AGUs fits well with the already speculated 2 loads and 1 store per cycle. The 4 FP units match the 4-issue FPU mentioned by Chuck Moore. One might think that the available decode bandwidth is not enough to keep two integer cores with these execution capabilities highly utilized. But Bulldozer could be a latency-tolerant design that uses data speculation, checkpointing, replay and runahead execution to cover L1 and L2 misses, and such features could need the extra execution resources.
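To make the comparison between the two tables concrete, here is a minimal sketch that puts the quoted counts side by side. Only the numbers come from the excerpts above; the struct, the constant names and the helper functions are my own illustration and do not appear in Open64.

```cpp
// Unit counts copied from the two num_fu excerpts quoted above.
// All identifiers here are hypothetical, not taken from Open64.
struct UnitCounts {
    int alu;  // integer ALUs
    int agu;  // address generation units
    int fp;   // floating point units of any kind
};

// Earlier revision (Opteron): 3 ALU, 3 AGU, and one each of FADD, FMUL, FMISC.
constexpr UnitCounts kOlderTarget{3, 3, 1 + 1 + 1};

// Newer revision: 4 ALU, 3 AGU, 4 FPU.
constexpr UnitCounts kNewerTarget{4, 3, 4};

// Total number of functional units the scheduler table lists for a target.
constexpr int total_units(const UnitCounts& u) {
    return u.alu + u.agu + u.fp;
}

// How many more units the newer table lists than the older one.
constexpr int extra_units(const UnitCounts& newer, const UnitCounts& older) {
    return total_units(newer) - total_units(older);
}

// Compile-time checks of the arithmetic.
static_assert(total_units(kOlderTarget) == 9, "3 ALU + 3 AGU + 3 FP");
static_assert(total_units(kNewerTarget) == 11, "4 ALU + 3 AGU + 4 FP");
static_assert(extra_units(kNewerTarget, kOlderTarget) == 2,
              "one more ALU and one more FP unit");
```

So the newer table lists one more ALU and one more FP unit than the Opteron table, while the AGU count is unchanged.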

The changes came with the comment "Phase 1 implementation of support for new target work". Some other interesting lines (taken from different places in the file):

static const int load_ops_rate = 2;
static const int store_ops_rate = 1;
static const int issue_rate = 4;
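As a rough illustration of how these rates line up with the 3 AGUs, the sketch below checks whether a given mix of loads and stores would fit into one cycle. The three constants are the ones quoted above; kAguCount, memory_ops_per_cycle and fits_memory_rates are hypothetical names, and the way Open64 actually applies the rates may differ.

```cpp
// Values quoted above from cg_sched.cxx.
static const int load_ops_rate = 2;
static const int store_ops_rate = 1;
static const int issue_rate = 4;

// AGU count from the newer num_fu table (hypothetical name).
constexpr int kAguCount = 3;

// Peak number of memory operations per cycle implied by the two rates.
constexpr int memory_ops_per_cycle() {
    return load_ops_rate + store_ops_rate;
}

// Assumption: a cycle is legal if it stays within both per-type rates.
constexpr bool fits_memory_rates(int loads, int stores) {
    return loads >= 0 && stores >= 0 &&
           loads <= load_ops_rate && stores <= store_ops_rate;
}

// 2 loads + 1 store need exactly the 3 AGUs listed in the table.
static_assert(memory_ops_per_cycle() == kAguCount, "rates match the AGU count");
// The peak memory mix also fits within the issue rate of 4.
static_assert(memory_ops_per_cycle() <= issue_rate, "memory ops fit the issue rate");
```

Under this reading, fits_memory_rates(2, 1) holds, while 3 loads or 2 stores in the same cycle would be rejected even though 3 AGUs exist.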

Further reading reveals more details, such as up to 4 single-decoded ops per dispatch group. Some instructions are decoded as fast double-decoded ops, as in K8 or K10. The number of 64-bit immediates is limited to 2; I assume that there can be up to four 32-bit immediates. So there seems to be what has already been mentioned as an immediate/constant steering unit, described in one patent.
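The two stated limits (at most 4 single-decoded ops and at most 2 64-bit immediates per dispatch group) can be sketched as a simple admission check. This is only my model of the constraint, not the Open64 code: all names are hypothetical, and double-decoded ops and 32-bit immediates are left out because the source gives no firm numbers for them.

```cpp
// Limits described above (identifiers are hypothetical).
constexpr int kMaxOpsPerGroup = 4;    // single-decoded ops per dispatch group
constexpr int kMaxImm64PerGroup = 2;  // 64-bit immediates per dispatch group

// Running totals for one dispatch group.
struct DispatchGroup {
    int ops;    // ops placed in the group so far
    int imm64;  // of those, how many carry a 64-bit immediate
};

constexpr DispatchGroup kEmptyGroup{0, 0};

// True if one more single-decoded op may join the group.
constexpr bool can_add(const DispatchGroup& g, bool has_imm64) {
    return g.ops < kMaxOpsPerGroup &&
           (!has_imm64 || g.imm64 < kMaxImm64PerGroup);
}

// Returns the group with one more op accounted for; caller checks can_add first.
constexpr DispatchGroup with_op(const DispatchGroup& g, bool has_imm64) {
    return DispatchGroup{g.ops + 1, g.imm64 + (has_imm64 ? 1 : 0)};
}

// A third op with a 64-bit immediate does not fit, an op without one still does.
static_assert(!can_add(with_op(with_op(kEmptyGroup, true), true), true),
              "third 64-bit immediate is rejected");
static_assert(can_add(with_op(with_op(kEmptyGroup, true), true), false),
              "op without 64-bit immediate still fits");
```

A scheduler following this model would close the group and start a new one as soon as can_add returns false for the next op.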

There is more, but I think these were the most interesting findings.

P.S.: Since looking for BOINC stats of Ontario and Llano was successful, I tried the same for Sandy Bridge (although there are other benchmark results out there):

Sandy Bridge Stepping 3, 2.2GHz
Sandy Bridge Stepping 2, 2.0GHz
Sandy Bridge Stepping 0, 2.2GHz
Sandy Bridge Stepping 3, 2.4GHz