<?xml version="1.0" encoding="UTF-8"?>
<rss version="0.92"><channel><title>Patent based research regarding AMD's future MPUs</title><link>http://citavia.blog.de/</link><description>This is a blog about processors and microarchitectures like AMD's upcoming Bulldozer architecture. Here I'll gather facts, speculation and thoughts about this and other architectures like Intel's Sandy Bridge. So this blog is mainly x86 related.</description><language>en-EU</language><docs>http://backend.userland.com/rss092</docs><image><title>Patent based research regarding AMD's future MPUs</title><link>http://citavia.blog.de/</link><url>http://data5.blog.de/design/preview/f9/c8bfa70226a7acdaf665a9b951bdc5_160x200.jpg</url></image><item><title>2 GHz AMD Jaguar benchmarks</title><description>	&lt;p&gt;As a &lt;a href="https://www.osadl.org/Profile-of-system-in-rack-9-slot-1.qa-profile-r9s1.0.html"&gt;detailed system report&lt;/a&gt; suggests, there is a lonely quad core engineering sample sporting four Jaguar cores running in a rack slot somewhere at OSADL (Open Source Automation Development Lab). The CPU is identified as family 22 (16h), model 0, stepping 1. This translates to stepping A1. The OPN is "2M201079J4461_00/20/08/06_9830", which suggests a mobile chip ("M"), running between 0.8 and 2.0GHz core clock and likely with a 600MHz GPU clock ("20/08/06").&lt;/p&gt;
	&lt;p&gt;The GPU device ID is 9830, which already appeared in the device string &lt;span&gt;AMD9830.1 = "KB 4C 25W (9830)"&lt;/span&gt;, as reported by &lt;a href="http://www.3dcenter.org/news/ein-erstes-zeichen-zu-amds-radeon-hd-8000-serie-chip-codenamen-mars-oland-und-venus-aufgetaucht"&gt;3DCenter.org&lt;/a&gt;. So this engineering sample might actually be a Kabini part with a TDP of 25W. This also means that the eight Jaguar cores in the &lt;em&gt;PlayStation 4&lt;/em&gt; APU could be clocked at levels like 2GHz, since the total TDP listed above includes the GPU and FCH, leaving roughly 10 to 15W for the compute unit. Two compute units would need about 20 to 30W.&lt;/p&gt;
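	&lt;p&gt;As a small illustration of the identification step above, here is a minimal Python sketch that turns the decimal family into the hex notation (22 becomes 16h) and builds the stepping name. The helper names are hypothetical, and the rule "model picks the letter, stepping picks the digit" is only an assumption derived from the reading used in this post (model 0, stepping 1 gives A1), not something taken from AMD documentation.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch only: the model-to-letter mapping is an assumption taken from the
# post's reading ("model 0, stepping 1" -> "A1"), not from AMD documentation.


def family_hex(family):
    """Write a decimal CPUID family in hex notation, e.g. 22 -> '16h'."""
    return format(family, "x") + "h"


def revision_name(model, stepping):
    """Hypothetical helper: the model index picks the letter, the stepping the digit."""
    letter = chr(ord("A") + model)
    return letter + str(stepping)


print(family_hex(22))       # family reported in /proc/cpuinfo
print(revision_name(0, 1))  # model 0, stepping 1
```
&lt;/pre&gt;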
	&lt;p&gt;Here's the detailed report of core #1 (running at maximum clock frequency):&lt;/p&gt;
&lt;pre&gt;vendor_id	: AuthenticAMD
cpu family	: 22
model		: 0
model name	: AMD Eng Sample: 2M201079J4461_00/20/08/06_9830
stepping	: 1
microcode	: 0x7000105
cpu MHz		: 2000.000
cache size	: 2048 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr
                  sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl
                  nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 cx16 sse4_1
                  sse4_2 movbe popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm
                  sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt topoext arat xsaveopt hw_pstate npt
                  lbrv svm_lock nrip_save tsc_scale flushbyasid decodeassists pausefilter pfthreshold bmi1
bogomips	: 3992.32
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate [11]&lt;/pre&gt;
	&lt;p&gt;Further down, the DMI info block reveals 1.2V core voltage at 1.8GHz. The somewhat lower clocked 40nm Brazos cores run at 1.35 V.&lt;/p&gt;
	&lt;p&gt;The &lt;a href="https://www.osadl.org/CPU-benchmarks.qa-farm-cpu-benchmarks.0.html"&gt;benchmark page&lt;/a&gt; contains some single/multi core UnixBench results. To find the Jaguar results, look for rack #9, slot #1, or "r9s1". It is also possible to sort the columns. There are other interesting CPUs like &lt;a href="https://www.osadl.org/?id=1223"&gt;VIA, PPC, ARM&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;A link on the report page leads to a detailed cache/memory bandwidth plot:&lt;/p&gt;
	&lt;p&gt;&lt;a href="https://www.osadl.org/monitoring/membw/membw-r9s1.gif"&gt;&lt;img src="http://data8.blog.de/media/312/6989312_42e1ed3509_m.gif" alt="membw-r9s1"&gt;&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;Some AMD Richland benchmarks&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;But these were not the only new benchmark results. There is a new Geekbench result of a Richland based HP notebook:&lt;br&gt; &lt;a href="http://browser.primatelabs.com/geekbench2/1852876"&gt;Hewlett-Packard HP ProBook 455 G1&lt;/a&gt;&lt;br&gt; Compare that to the older result of AMD's Bantry platform here:&lt;br&gt; &lt;a href="http://browser.primatelabs.com/geekbench2/1591657"&gt;AMD BANTRY&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2013/04/17/2-ghz-amd-jaguar-benchmarks-15761535/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2013/04/17/2-ghz-amd-jaguar-benchmarks-15761535/</link><pubDate>Wed, 17 Apr 2013 02:12:46 +0200</pubDate></item><item><title>ISSCC 2013 and Next Gen Consoles</title><description>	&lt;p&gt;There is a lot of talk about small cores at &lt;a href="http://isscc.org/"&gt;ISSCC 2013&lt;/a&gt;, which is still going on. So far, some first bits of information have made their way out of it, for example about the &lt;a href="http://translate.google.com/translate?sl=ja&amp;tl=en&amp;js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;u=http://pc.watch.impress.co.jp/docs/column/kaigai/20130219_588233.html"&gt;voltage regulation and power management in Intel's Haswell MPU&lt;/a&gt;. Another presentation gave us many details of AMD's Compute Unit (CU) based on four Jaguar cores. This is especially interesting, as Jaguar cores seem to be an important component in the next gen Xbox and PlayStation rumours. 3DCenter once made a nice overview of these rumours (see link below), which still seem to be changing or popping up on a weekly basis. So if Jaguar is meant to be included in one of the next gen consoles' processing units, this might happen in the form of such a compute unit. Let's have a look at it.&lt;/p&gt;
	&lt;p&gt;To put that into perspective, I show you these two pictures at original scale (as long as your browser is at 100% zoom and has the correct info about the screen's DPI):&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://info.nuje.de/2013_core_sizes_77.jpg" alt=""&gt; &lt;img src="http://info.nuje.de/Jaguar_CU.jpg" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;The left image shows a collection of several well known cores (made by Hans de Vries):&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://chip-architect.com/news/2013_core_sizes_768.jpg" alt=""&gt;&lt;/p&gt;
	&lt;p&gt; The second image is a photoshopped version of the 4 core Jaguar Compute Unit as depicted in AMD's ISSCC presentation:&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://info.nuje.de/Jaguar_CU.jpg" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;Update: This CU with four cores, 2 MB L2 and additional logic measures 26.2 mm² (excl. the logic in the upper left), which is less than the area of a 32 nm Piledriver module. Judging by leaked quad core Kabini models with TDP ratings of 15 to 25 W (which might result in 8 to 15 W SDP according to Intel's definition), these CUs might consume around half of this power with no turbo mode or power distribution engaged. A rumoured next gen console with 8 Jaguar cores would have to include two of these CUs. This could go with additional memory channels to remove any potential bottleneck. Those CUs might be memory channel agnostic to allow their use beyond the planned Kabini/Temash (not Tamesh!) SKUs.&lt;/p&gt;
	&lt;p&gt;Links:&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.techpowerup.com/180394/AMD-quot-Jaguar-quot-Micro-architecture-Takes-the-Fight-to-Atom-with-AVX-SSE4-Quad-Core.html"&gt;AMD "Jaguar" Micro-architecture Takes the Fight to Atom with AVX, SSE4, Quad-Core &lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=de&amp;tl=en&amp;u=http://www.elektroniknet.de/halbleiter/prozessoren/artikel/95059/"&gt;Jaguar - The New Low Power CPU-Core From AMD (translated)&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=de&amp;tl=en&amp;u=http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?id=1361284728"&gt;AMD presents Jaguar Quad Module at ISSCC (translated, with galleries containing ALL slides)&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://semiaccurate.com/forums/showthread.php?p=174760#post174760"&gt;A nice collection of Kabini/Temash information by user vain at SemiAccurateForums&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;At the &lt;a href="http://news.cnet.com/8301-1023_3-57570013-93/amd-chip-touch-controllers-all-head-to-next-playstation-report/"&gt;Sony PlayStation event tomorrow&lt;/a&gt;, we might hear a bit more about what they will actually include in their next console. To be prepared, here are some links covering these rumours:&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.xbitlabs.com/news/multimedia/display/20130218090620_Sony_PlayStation_4_New_Liberal_Approach_to_Development_Multi_Core_x86_Massive_Memory_Bandwidth_Cloud_Technologies_4K_Video.html"&gt;More rumoured details by Xbitlabs &lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=de&amp;tl=en&amp;u=http://www.3dcenter.org/news/genauere-informationen-zum-grafikchip-der-nintendo-wii-u"&gt;Raw GPU performance (submetrics) of current and next gen consoles&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=de&amp;tl=en&amp;u=http://www.3dcenter.org/news/xbox-720-und-playstation-4-beiderseits-mit-jaguar-basierten-achtkern-prozessoren"&gt;Overview of many of the recent next gen console rumours at 3DCenter.org&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.neogaf.com/forum/showpost.php?p=38123390&amp;postcount=1854"&gt;A lot of background information about stacked memory, AMD, etc. at Neogaf forums&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.neogaf.com/forum/showthread.php?t=505495"&gt;More on that regarding AMD, PS4 by the same poster&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;BTW, in my personal opinion it is possible to get enough gaming performance out of 8 Jaguar cores if they are supported by enough GPU compute power and memory bandwidth (stacking and/or huge caches). An unchanged hardware spec over the life cycle of a console also helps developers write optimized code for a fixed platform. This also worked in the past. But if there is a need for high single thread or low thread count performance, other CPU cores might fit better here, even Steamroller. But this should be the topic of a different blog posting.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2013/02/20/isscc-2013-and-next-gen-consoles-15549512/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2013/02/20/isscc-2013-and-next-gen-consoles-15549512/</link><pubDate>Wed, 20 Feb 2013 00:27:00 +0100</pubDate></item><item><title>Trinity/Piledriver Performance</title><description>	&lt;p&gt;Since February I've been regularly searching for appearances of new "family 21 model 16" BOINC results, which belong to AMD's Trinity APU. As I noticed, I'm not the only one doing that. ;-) Some early results of an engineering sample (ZD372058A4451_41/37/16_9901_800, which should clock at 3.7 GHz base and 4.1 GHz turbo clock according to the string) didn't look bad (one day it reached an integer score of over 13K on 64-bit Linux). But to do some halfway accurate (or semiaccurate ;-)) analysis, it is important to look at results achieved on the same OS (here: Win 7, 64 bit) and BOINC client version (6.12.34 here, except for the ES, which ran a 6.12.43 client).&lt;/p&gt;
	&lt;p&gt;The FP benchmark, which is a Whetstone benchmark, seems to run as a multithreaded benchmark according to "informal". At least it fills up all available cores while running. The integer benchmark, a good old Dhrystone benchmark, seems to be single threaded. It is also important to know that both benchmarks have a rather small memory footprint.&lt;/p&gt;
	&lt;p&gt;Since we don't know the exact clock frequencies of the benchmark runs, it is difficult to find the correct value for calculating per GHz results. I estimated those based on turbo clocks, which might lead to skewed results. At least in the case of comparing Trinity with its Piledriver cores to the FX models, I hope that rather similar turbo mode behaviour should reduce the error margin.&lt;/p&gt;
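	&lt;p&gt;To make the uncertainty of such per GHz numbers visible, here is a minimal Python sketch that brackets the result between the base clock and turbo clock assumptions. The function names are hypothetical and the score is a made-up placeholder, not one of the collected BOINC results. Only the 3.7 and 4.1 GHz clocks come from the ES string mentioned above.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch: normalise a benchmark score by an assumed clock frequency.
# The score below is a made-up placeholder, not a collected BOINC result.


def per_ghz(score, clock_ghz):
    """Return the score per GHz for an assumed clock."""
    return score / clock_ghz


def per_ghz_range(score, base_ghz, turbo_ghz):
    """Bracket the per-GHz value: the real clock lies between base and turbo."""
    return per_ghz(score, turbo_ghz), per_ghz(score, base_ghz)


low, high = per_ghz_range(score=12000.0, base_ghz=3.7, turbo_ghz=4.1)
print(round(low), round(high))
```
&lt;/pre&gt;
	&lt;p&gt;Dividing by the turbo clock gives the lower bound, dividing by the base clock the upper bound. The gap between the two is the error margin that comes from not knowing the actual clock during the run.&lt;/p&gt;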
	&lt;p&gt;OK, here comes the table comparing several values I filtered out of my collected BOINC results to have OS and client version the same. As you can see, Piledriver w/o L3 cache seems to perform a bit better than BDver1 based FX models:&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://data7.blog.de/media/945/6295945_8174fda20c_m.png" alt="Trinity BOINC Performance Comparison"&gt;&lt;/p&gt;
	&lt;p&gt;Note: I used "Trinity vs. Bulldozer" to denote the difference between a L3-less Piledriver core and a Bulldozer core, which always had L3 available.&lt;/p&gt;
	&lt;p&gt;Another note (as of 04/10): In the Piledriver vs. Bulldozer columns I divided the Trinity value by the maximum of all FX values. Furthermore, the FP benchmark likely ran at base clock frequency. I'll add more on that in a follow-up article.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2012/04/08/trinity-piledriver-performance-13460109/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2012/04/08/trinity-piledriver-performance-13460109/</link><pubDate>Sun, 08 Apr 2012 16:13:24 +0200</pubDate></item><item><title>AMD FX Processor Launch</title><description>	&lt;p&gt;Today the NDAs for AMD FX processor reviews were lifted. Here is a quick list (updated as time permits):&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?sl=de&amp;tl=en&amp;js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;u=http%3A%2F%2Fwww.planet3dnow.de%2Fvbulletin%2Fshowthread.php%3Ft%3D399114"&gt;Planet3DNow #1&lt;/a&gt; (Googlish, original article is &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?t=399114"&gt;here&lt;/a&gt;. There's also a clock to clock comparison. Stay tuned for my articles looking at certain aspects of the architecture and why there are the performance differences we see today.)&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=de&amp;tl=en&amp;u=http://www.planet3dnow.de/vbulletin/showthread.php?t=399118"&gt;Planet3DNow #2&lt;/a&gt; (Googlish, my first Bulldozer performance analysis article, where I have a look at measured instruction latencies/throughput, original article is &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?t=399118"&gt;here&lt;/a&gt;)&lt;/p&gt;
	&lt;p&gt;&lt;a title="AMD Bulldozer FX-8150" href="http://hexus.net/tech/reviews/cpu/32110-amd-bulldozer-fx-8150/"&gt;Hexus&lt;br&gt; &lt;/a&gt;&lt;br&gt; &lt;a title="AMD FX-8150 8-Core CPU Review: Bulldozer Is Here" href="http://hothardware.com/Reviews/AMD-FX8150-8Core-Processor-Review-Bulldozer-Has-Landed/"&gt;Hot Hardware&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a title="AMD FX-8150 Review: From Bulldozer To Zambezi To FX" href="http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043.html#xtor=RSS-182"&gt;Tom's Hardware&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a title="AMD FX-8150, FX Series Review - Bulldozer makes debut" href="http://www.techspot.com/review/452-amd-bulldozer-fx-cpus/"&gt;Tech Spot&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a title="AMD FX processor brings eight cores to battle, we go eyes-on" href="http://www.engadget.com/2011/10/12/amd-fx-processor-brings-eight-cores-to-battle-we-go-eyes-on-vi/"&gt;Engadget&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;DGLee@XS &lt;a href="http://www.xtremesystems.org/forums/showthread.php?275869-AMD-FX-quot-Bulldozer-quot-Review-%281%29-Gaming"&gt;part 1&lt;/a&gt; &lt;a href="http://www.xtremesystems.org/forums/showthread.php?275871-AMD-FX-quot-Bulldozer-quot-Review-%282%29-Synthetic-benchmarks"&gt;part 2&lt;/a&gt; &lt;a href="http://www.xtremesystems.org/forums/showthread.php?275872-AMD-FX-quot-Bulldozer-quot-Review-%283%29-Multi-media-benchmarks"&gt;part 3&lt;/a&gt; &lt;a href="http://www.xtremesystems.org/forums/showthread.php?275873-AMD-FX-quot-Bulldozer-quot-Review-%284%29-!exclusive!-Excuse-for-1-Threaded-Perf."&gt;part 4&lt;/a&gt; (the 4th part is interesting, where he tests the performance of 4 modules with 1 core disabled each)&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://tech.icrontic.com/articles//amd-bulldozer-review/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+icrontictech+%28Icrontic+Tech%29"&gt;Icrontic&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.maximumpc.com/article/features/bulldozer_benchmarked_and_analyzed_amd_back_game"&gt;Maximum PC&lt;/a&gt; (has a nice result table)&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.hardocp.com/article/2011/10/11/amd_bulldozer_fx8150_gameplay_performance_review/"&gt;[H]ardOCP&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://techreport.com/articles.x/21813?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+techreport%2Fall+%28The+Tech+Report%29"&gt;The Tech Report&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested"&gt;Anand Tech&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.xbitlabs.com/articles/cpu/display/amd-fx-8150.html"&gt;XBit Labs&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.legitreviews.com/article/1741/1/"&gt;Legit Reviews&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.techradar.com/reviews/pc-mac/pc-components/processors/amd-fx-8150-1033315/review?artc_pg=1"&gt;TechRadar&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://benchmarkreviews.com/index.php?option=com_content&amp;task=view&amp;id=831&amp;Itemid=63"&gt;Benchmark Reviews&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://vr-zone.com/articles/amd-fx-8150-memory-scaling-investigation--feeding-the-bulldozer/13704.html"&gt;VR Zone&lt;/a&gt; (test of memory scaling with latest BIOS, BTW I think, each module has a 64b read+write interface, so it's limited at 17.6GB/s per module w/ NB running at 2.2GHz)&lt;/p&gt;
	&lt;p&gt;Also at XS is this &lt;a href="http://www.xtremesystems.org/forums/showthread.php?275867-AMD-Bulldozer-Thread"&gt;long list of reviews&lt;/a&gt;.&lt;/p&gt;
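	&lt;p&gt;The 17.6GB/s per module figure from the VR Zone note above is simple arithmetic: interface width in bytes times clock. A minimal Python sketch, assuming one 64 bit transfer per NB clock and decimal gigabytes (both are assumptions, as is the function name):&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch: peak bandwidth of a fixed-width interface.
# Assumes one transfer per clock and decimal GB (10**9 bytes).


def peak_bandwidth_gb_s(width_bits, clock_ghz):
    """Bytes per transfer times transfers per nanosecond gives GB/s."""
    return width_bits / 8 * clock_ghz


print(peak_bandwidth_gb_s(64, 2.2))  # 64 bit interface, NB at 2.2 GHz
```
&lt;/pre&gt;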
	&lt;p&gt;Since this microarchitecture is a clean break from any existing x86 microarchitecture before, it won't be perfectly suited for legacy software. Software-wise it's a situation like in times of Intel's Pentium 4. Furthermore rumours indicate that there are some things to be fixed (think of the Linux kernel patch to avoid unnecessary cache line thrashing in the instruction cache).&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/10/12/amd-fx-processor-launch-12002215/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/10/12/amd-fx-processor-launch-12002215/</link><pubDate>Wed, 12 Oct 2011 06:54:32 +0200</pubDate></item><item><title>Eight Opteron 6200 models listed and FX slide deck leaked</title><description>	&lt;p&gt;HPC vendor Penguin Computing lists eight Opteron 6200 models in &lt;a href="http://www.penguincomputing.com/hardware/linux_servers/configurator/amd/altus2800i"&gt;their configurator&lt;/a&gt; including base, All Core Turbo and max. Turbo Core clock frequencies:&lt;/p&gt;
	&lt;ul&gt;
	&lt;li&gt;AMD Opteron 6212, 8C, 2.8/3.7/3.7GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6220, 8C, 3.0/3.6GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6234, 12C, 2.3/2.9/3.1GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6238, 12C, 2.5/3.1/3.3GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6272, 16C, 2.1/2.7/3.1GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6274, 16C, 2.2/2.8/3.2GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6276, 16C, 2.3/2.9/3.3GHz&lt;/li&gt;
	&lt;li&gt;AMD Opteron 6282SE, 16C, 2.5/3.1/3.5GHz&lt;/li&gt;
	&lt;/ul&gt;
	&lt;p&gt;Thanks to Daniel Bowers on Twitter. Update: It looks like those SKU listings have been removed.&lt;/p&gt;
	&lt;p&gt;In case you missed that: Donanimhaber leaked AMD's FX slides with many new details and first official benchmark results. The main article is &lt;a href="http://www.donanimhaber.com/islemci/haberleri/AMD-Bulldozer-hakkinda-her-sey-islemciler-teknik-ozellikler-test-sonuclari.htm"&gt;here&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?sl=tr&amp;tl=en&amp;js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;u=http%3A%2F%2Fwww.donanimhaber.com%2Fislemci%2Fhaberleri%2FAMD-Bulldozer-hakkinda-her-sey-islemciler-teknik-ozellikler-test-sonuclari.htm"&gt;translation&lt;/a&gt;) and the gallery is &lt;a href="http://www.donanimhaber.com/islemci/galerileri/AMD-Bulldozer-FX-resmi-test-sonuclari.htm"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;And @foobarjp2010 pointed me to an article about future gaming hardware by Hiroshige Goto: &lt;a href="http://game.watch.impress.co.jp/docs/news/20110909_476229.html"&gt;Japanese &lt;/a&gt;/ &lt;a href="http://translate.google.com/translate?sl=ja&amp;tl=en&amp;js=n&amp;prev=_t&amp;hl=de&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;u=http%3A%2F%2Fgame.watch.impress.co.jp%2Fdocs%2Fnews%2F20110909_476229.html"&gt;Googlish&lt;/a&gt;. While I'm at it, here are his articles on &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20110830_473823.html"&gt;Bulldozer/Hot Chips 23&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=de&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20110830_473823.html"&gt;Googlish&lt;/a&gt;), &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20110907_475560.html"&gt;the inner workings of the Llano APU&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=de&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20110907_475560.html"&gt;Googlish&lt;/a&gt;), &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20110915_477478.html"&gt;Intel's 22nm transistor technology and Haswell&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=de&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20110915_477478.html"&gt;Googlish&lt;/a&gt;), and &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20110916_478013.html"&gt;the microarchitecture of Ivy Bridge&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=de&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20110916_478013.html"&gt;Googlish&lt;/a&gt;)&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/09/24/eight-opteron-6200-models-listed-and-fx-slide-deck-leaked-11911716/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/09/24/eight-opteron-6200-models-listed-and-fx-slide-deck-leaked-11911716/</link><pubDate>Sat, 24 Sep 2011 22:41:39 +0200</pubDate></item><item><title>Bulldozer Engineering Samples and an Analysis of Orochi's Die Size</title><description>	&lt;p&gt;Sitting under the hot Cretan sun might not be the best place to find clear thoughts, but some palm tree shade and water cooling provided by the hotel's pool did enough to finish this blog article. But in short: what a week! While Apple lost Steve Jobs as its charismatic CEO, AMD got what is probably its most enthusiastic and motivating CEO since the days of Jerry Sanders. Time will tell. A Bulldozer based CPU hasn't been launched yet. With what feels like a low probability, this might happen &lt;a href="http://www.bit-tech.net/news/hardware/2011/08/26/amd-bulldozer-to-ship-within-the-next-week/1"&gt;very soon&lt;/a&gt;; we'll see. The true Orochi die size has been revealed and more BD ES benchmark results have trickled out of more or less dubious sources. Many interesting topics to cover.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt; Bulldozer Engineering Sample Performance&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; Recently a new wave of Chinese benchmark leaks and some new results by OBR stirred up discussions in several forums. A lot of energy is being put into reasoning about whether the ES were run under optimal settings or not. The variance seen amongst benchmark results of different BD ES or even retail CPUs on different boards but at the same base clocks speaks for suboptimal configurations. Such a configuration doesn't only include the type of HDD/SSD, RAM or OS, but extends to proper BIOS and driver support, which I assume to be missing or not ready yet. In the case of graphics performance, even the graphics drivers could be a reason for lower than expected results if the driver code (which runs on the CPU) is not fully optimized for the target CPU.&lt;/p&gt;
	&lt;p&gt; &lt;em&gt; CPU Misconfiguration&lt;/em&gt;&lt;/p&gt;
	&lt;p&gt; My first PC was a custom built system with an 80386 CPU. It had some cache on the mainboard. Running Norton System Info on it revealed that raw CPU performance wasn't as high as other results suggested. The reason was that the mainboard cache was deactivated in the BIOS settings. Enabling it doubled the SI score! So much for the effect of misconfiguration. Now imagine the much more complex settings of Bulldozer based CPUs, which likely became the most configurable series of AMD CPUs ever. Some examples of where something could go wrong and prevent optimal performance:&lt;/p&gt;
	&lt;ul&gt;
	&lt;li&gt; memory controller modes and timings aren't configured correctly&lt;/li&gt;
	&lt;li&gt; NB and L3 are still not clocked at retail speed levels &lt;/li&gt;
	&lt;li&gt; modes and settings of different caches&lt;/li&gt;
	&lt;li&gt; turbo mode settings (besides the specific P-State settings this includes timings, switching rules)&lt;/li&gt;
	&lt;li&gt; core configuration (BD allows different configurations to be adapted to specific workloads)&lt;/li&gt;
	&lt;/ul&gt;
	&lt;p&gt;&lt;em&gt; Software Code Paths and Code Optimization&lt;/em&gt;&lt;/p&gt;
	&lt;p&gt; The executed code path also has a significant effect on performance. I mentioned this on Twitter already but it's time to go into more detail. Here are some points where software running on BD might achieve a performance significantly lower than software optimized for BD:&lt;/p&gt;
	&lt;ul&gt;
	&lt;li&gt; if software checks for specific CPUIDs (esp. family) and has several code paths optimized for older CPUs it might still choose one of the worse optimized ones &lt;/li&gt;
	&lt;li&gt; most software won't use FMA but SSE1-4 instead, which reaches only half of the max. theoretical FP throughput (this will be the case for all FP code paths not explicitly optimized for Bulldozer or Haswell)&lt;/li&gt;
	&lt;li&gt; use of scalar FP code (MMX, x87, scalar SSE ops) will only utilize 1/4 of max. theoretical FP throughput (think of still used benchmarks like SuperPi)&lt;/li&gt;
	&lt;li&gt; wrong cache blocking might cause cache line thrashing (e.g. code might group data into blocks of 32kB or 64kB to fit into current L1 cache sizes) &lt;/li&gt;
	&lt;li&gt; code alignment/content rules &lt;/li&gt;
	&lt;li&gt; streamed writes (using too many streams might drop write throughput to much lower levels than on 10h according to the optimization manual)&lt;/li&gt;
	&lt;/ul&gt;
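	&lt;p&gt;The "half" and "1/4" fractions in the list above can be reproduced with a small Python sketch. It rests on assumptions that are not stated in this post: two 128 bit FMAC pipes per module, double precision elements, and an FMA counted as two operations. With single precision the scalar share would come out lower.&lt;/p&gt;
&lt;pre&gt;
```python
# Sketch under stated assumptions (not taken from the post): one module has
# two 128-bit FMAC pipes, and we count double-precision FLOPs per cycle.

PIPES = 2           # assumed FMAC pipes per module
VECTOR_BITS = 128   # assumed pipe width
ELEMENT_BITS = 64   # double precision


def flops_per_cycle(packed, fused):
    """FLOPs per cycle for packed or scalar code, with or without FMA."""
    lanes = VECTOR_BITS // ELEMENT_BITS if packed else 1
    ops_per_lane = 2 if fused else 1  # an FMA counts as multiply plus add
    return PIPES * lanes * ops_per_lane


peak = flops_per_cycle(packed=True, fused=True)
for name, packed, fused in [
    ("packed FMA", True, True),
    ("packed SSE, no FMA", True, False),
    ("scalar, no FMA", False, False),
]:
    print(name, flops_per_cycle(packed, fused) / peak)
```
&lt;/pre&gt;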
	&lt;p&gt;&lt;em&gt;Power Cap / TDP Limit&lt;/em&gt;&lt;/p&gt;
	&lt;p&gt; Some people on the web have already speculated that AM3/AM3+ board related issues regarding power supply, together with Bulldozer's specific requirements and its TDP limit feature, are a reason for lower than expected ES performance. To understand part of these issues, have a look at a voltage graph of Llano while jumping between different P-States. This diagram is from AMD and shows voltage over time:&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://data7.blog.de/media/365/5824365_97153b2305_m.png" alt="Llano_pwr_mgmt"&gt;&lt;/p&gt;
	&lt;p&gt; This behaviour has already been called "dithering" by some. Also remember Asus' AM3+ advantage slides, which covered topics like power/voltage jumps. So if for stability reasons a lower peak TDP cap than planned for retail has been applied to BD ES, this might indeed have an effect on performance. The high TDP readings of CPU-Z might just be this peak value, but this is speculation. Some AMD patents described kind of a TDP-budget related throttling at different stages during execution. Further some turbo P-states (esp. All Core Turbo) might become unavailable in such cases.&lt;/p&gt;
	&lt;p&gt; If the TDP limit isn't too low then low threaded benchmark performance should still get the highest turbo clocks, while multithreaded high throughput benchmarks might suffer. &lt;/p&gt;
	&lt;p&gt; &lt;em&gt; HyperTransport Flaw in earlier ES Steppings&lt;/em&gt;&lt;/p&gt;
	&lt;p&gt; One site &lt;a href="http://donthatethegeek.com/2011/08/18/amd-bulldozer-x8-in-production/"&gt;recently reported&lt;/a&gt; on Bulldozer in general but also mentioned a reason for lower ES performance: a flaw in the HyperTransport module might cause bad cache and memory performance, thus affecting overall performance significantly.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt; Bulldozer Engineering Sample OPNs&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; Fortunately it's not too late for the following: About 2 months ago I started to assemble a list of BD ES and their OPNs as they could be found at different places. The most interesting ones might be the listed Sandra results since I found them mentioned nowhere. One result even made it to &lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffc8ee8feed3e5d0e7d6f082bf8fa9cca994a482f1ccf4&amp;l=en"&gt;#1 worldwide in SiSoft Sandra's crypto benchmark&lt;/a&gt; (the same machine as #4 and #5 in the table below).&lt;/p&gt;
	&lt;p&gt; If you look carefully at the OPN codes, you'll notice that the right part of the full OPN code contains some additional information compared to what has already been leaked about the main part of the OPN (left half). I read it this way:&lt;br&gt; TC/BC/NC_?/C# with&lt;br&gt; TC: Max. Turbo Core Clock (GHz)&lt;br&gt; BC: Base Clock (GHz)&lt;br&gt; NC: North Bridge and/or L3 Clock (GHz)&lt;br&gt; ?: unknown (always 2)&lt;br&gt; C#: Number of Cores (physically present, not always fully activated)  &lt;/p&gt;
	&lt;p&gt; Example: 31/21/2_2/16 means 3.1 GHz max. Turbo Core Clock, 2.1 GHz Base Clock, 2.0 GHz NB/L3 Clock, 16 cores &lt;/p&gt;
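	&lt;p&gt;My reading of the right half of the OPN can be written down as a short Python sketch. The names OpnSuffix and parse_opn_suffix are hypothetical. The rule that two digits mean tenths of a GHz while a single digit means whole GHz is an assumption needed to make the example above come out as 3.1, 2.1 and 2.0 GHz.&lt;/p&gt;
&lt;pre&gt;
```python
from typing import NamedTuple


class OpnSuffix(NamedTuple):
    turbo_ghz: float  # TC: max. Turbo Core clock
    base_ghz: float   # BC: base clock
    nb_ghz: float     # NC: North Bridge and/or L3 clock
    unknown: str      # "?": meaning unknown, kept as text
    cores: int        # C#: number of cores physically present


def _ghz(field):
    """Assumption: '31' means 3.1 GHz, a single digit like '2' means 2.0 GHz."""
    value = int(field)
    return value / 10 if len(field) >= 2 else float(value)


def parse_opn_suffix(suffix):
    """Parse the 'TC/BC/NC_?/C#' pattern, e.g. '31/21/2_2/16'."""
    turbo, base, rest, cores = suffix.split("/")
    nb, unknown = rest.split("_")
    return OpnSuffix(_ghz(turbo), _ghz(base), _ghz(nb), unknown, int(cores))


print(parse_opn_suffix("31/21/2_2/16"))
```
&lt;/pre&gt;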
	&lt;p&gt; And here is my collected list:&lt;br&gt; &lt;img src="http://data7.blog.de/media/294/5824294_a040152a5a_m.png" alt="Bulldozer_Eng_Sample_List"&gt;&lt;/p&gt;
	&lt;table&gt;
	&lt;tr&gt;&lt;th&gt;Source&lt;/th&gt;&lt;th&gt;Link&lt;/th&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[1]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.ubuntu.com/certification/hardware/201108-8333"&gt;http://www.ubuntu.com/certification/hardware/201108-8333&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[2]&lt;/td&gt;&lt;td&gt;&lt;a href="http://openbenchmarking.org/s/AMD%20Eng%20Sample%20ZS182045TGG43_28"&gt;http://openbenchmarking.org/s/AMD%20Eng%20Sample%20ZS182045TGG43_28&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[3]&lt;/td&gt;&lt;td&gt;&lt;a href="http://browse.geekbench.ca/geekbench2/view/412187"&gt;http://browse.geekbench.ca/geekbench2/view/412187&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[4]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e1d4e5d7e5c3b18cbc9aff9aa797b1c2ffce&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e1d4e5d7e5c3b18cbc9aff9aa797b1c2ffce&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[5]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffc9ef8eefd2e3d5e2d4e5c3b18cbc9aff9aa797b1c2ffce&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffc9ef8eefd2e3d5e2d4e5c3b18cbc9aff9aa797b1c2ffce&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[6]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e6d3eadce8cebc81b197f297aa9abccff2c3&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e6d3eadce8cebc81b197f297aa9abccff2c3&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[7]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e1d8ebddeccab885b593f693ae9eb8cbf6c6&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcee889e8d5e1d8ebddeccab885b593f693ae9eb8cbf6c6&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[8]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=1&amp;ved=0CBQQFjAA&amp;url=http%3A%2F%2Fwww.sisoftware.co.uk%2Frank2009%2Ftop_run.php%3Fq%3Dc2ffcee89aa797b1d8e5d7f199a491b7cff2c2e481e4d9e9cfbc81b0%26l%3Den&amp;rct=j&amp;q=amd%20%228x%202MB%22%20site%3Asisoftware.co.uk&amp;ei=MZ47Tse5I4fGtAbZt4y1Ag&amp;usg=AFQjCNGOfbZvWhcdFAtSBqhSv4bd2LVULA&amp;cad=rja"&gt;SiSoft Sandra result in Google Cache&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[9]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.google.com/url?sa=t&amp;source=web&amp;cd=2&amp;ved=0CBoQFjAB&amp;url=http%3A%2F%2Fwww.sisoftware.co.uk%2Frank2009%2Ftop_run.php%3Fq%3Dc2ffcfe99ba697b1d8e5d5f39ba696b0c8f5c5e386e3deeec8bb86b7%26l%3Den&amp;rct=j&amp;q=amd%20%228x%202MB%22%20site%3Asisoftware.co.uk&amp;ei=MZ47Tse5I4fGtAbZt4y1Ag&amp;usg=AFQjCNFp6gPwN_jqLFGjinZxZR1hBWlvWw&amp;cad=rja"&gt;SiSoft Sandra result in Google Cache&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[10]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcfe988e9d4e2d3e5d1e2c4b68bbb9df89da090b6c5f8c9&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcfe988e9d4e2d3e5d1e2c4b68bbb9df89da090b6c5f8c9&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[11]&lt;/td&gt;&lt;td&gt;&lt;a href="http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcfe988e9d4e2d6e5ddfb89b484a2c7a29faf89fac7fe&amp;l=en"&gt;http://www.sisoftware.co.uk/rank2011d/show_run.php?q=c2ffcfe988e9d4e2d6e5ddfb89b484a2c7a29faf89fac7fe&amp;l=en&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[12]&lt;/td&gt;&lt;td&gt;&lt;a href="http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=300392"&gt;http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=300392&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[13]&lt;/td&gt;&lt;td&gt;&lt;a href="http://abcathome.com/show_host_detail.php?hostid=176325"&gt;http://abcathome.com/show_host_detail.php?hostid=176325&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[14]&lt;/td&gt;&lt;td&gt;&lt;a href="http://setiathome.berkeley.edu/sah//show_host_detail.php?hostid=5747954"&gt;http://setiathome.berkeley.edu/sah//show_host_detail.php?hostid=5747954&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;tr&gt;&lt;td&gt;[15]&lt;/td&gt;&lt;td&gt;&lt;a href="http://openbenchmarking.org/global/root-15821-31683-1045"&gt;http://openbenchmarking.org/global/root-15821-31683-1045&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
	&lt;/table&gt;
	&lt;p&gt;&lt;span&gt;I will fix the non-working links later.&lt;/span&gt;&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;Some Economic Aspects of Orochi's Die Size&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; Charlie reported that the Orochi die size is 315 mm² according to an AMD slide. This is 7% more than my latest measurement of 294 mm², which was based on the clean, undistorted Orochi die photo from ISSCC 2011 and the already known area of a Bulldozer module incl. its L2 cache, published at the same conference. So the earlier estimates of 300 to 320 mm² (based on DRAM I/O pads, L1 I-cache sizes etc.) were closer to the truth than the methodically more accurate pixel measurement &lt;img src="/img/smilies/icon_wink.gif" alt=";)" class="middle" border="0"&gt; The best methods are worth nothing when applied to bad or unclear data, in this case the way of counting the power gating ring area, I suppose.&lt;/p&gt;
	&lt;p&gt; What are the economic aspects of Bulldozer's die size being greater than many expected? First, there is an effect on the pure cost of producing Orochi dies. Parties outside of Globalfoundries and AMD can only estimate this value, and it depends on many input variables not known to the public. One way is to use known data from other foundries or the semiconductor industry as a whole. In this case I assume a price of $5k for the processed wafer (ignoring the new die-based WSA between AMD and GF) and die yields of 70%.&lt;/p&gt;
	&lt;p&gt; After roughly 10% losses at the wafer edge and markings this would mean ca. 150 net die at a die size of 294 mm² or ca. 140 net die at 315 mm². With packaging/test costs of maybe $5 per die the resulting cost per die would be $38 for the first case and $41 for the actual die size. At a yield of 50% the net die count would be 107 and 100 respectively and resulting costs $52 or $55. Such a difference of $3 is certainly smaller than the error margin we can assume for these calculations.&lt;/p&gt;
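	&lt;p&gt; The arithmetic behind these numbers can be sketched in a few lines of Python. This is only a rough model, not a real cost calculation from AMD or Globalfoundries: the 300 mm wafer diameter and the function names are assumptions for illustration, while the $5k wafer price, the 10% edge loss, the yields and the $5 for packaging/test are the values used above. Because the text works with rounded die counts, the results agree with the quoted costs only to within about a dollar.&lt;/p&gt;

```python
import math


def good_die(die_area_mm2, yield_fraction, wafer_diameter_mm=300.0, edge_loss=0.10):
    """Estimate good die per wafer from plain area division.

    Assumption: 300 mm wafers and a flat 10% loss for the wafer edge and markings.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    gross_die = wafer_area / die_area_mm2
    return gross_die * (1 - edge_loss) * yield_fraction


def cost_per_die(die_area_mm2, yield_fraction, wafer_price=5000.0, package_test=5.0):
    """Wafer price spread over the good die, plus packaging/test cost per die."""
    return wafer_price / good_die(die_area_mm2, yield_fraction) + package_test


for area in (294, 315):
    for yield_fraction in (0.7, 0.5):
        dies = good_die(area, yield_fraction)
        cost = cost_per_die(area, yield_fraction)
        print(area, yield_fraction, round(dies), round(cost, 2))
```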
	&lt;p&gt; But there is another way in which die size has an effect: If demand is greater than supply (a situation AMD is familiar with) and Globalfoundries is short of 32nm capacity, this might indeed be a somewhat bigger problem than just an increase in variable cost of $3 per die. Just imagine a hypothetical case where AMD could order no more than 5000 wafers for Orochi dies. That doesn't sound like much, but those 5k wafers mean 750k vs. 700k chips. If demand is high enough (say: 800k), and assuming an ASP of $200 for the DT/server processor mix, those missing 50k would reduce the achievable revenue by about $10M, or less than 1% in Q3.&lt;/p&gt;
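	&lt;p&gt; As a quick check of this hypothetical case, here is a small Python sketch. All inputs are the assumed values from the paragraph above (5000 wafers, ca. 150 vs. 140 good die per wafer, demand of 800k, ASP of $200), and the function name is made up for illustration.&lt;/p&gt;

```python
def lost_revenue(wafers, good_die_small, good_die_large, demand, asp):
    """Revenue difference between the smaller and the larger die at a fixed wafer count."""
    supply_small = wafers * good_die_small
    supply_large = wafers * good_die_large
    # Sales are capped by demand, so extra supply beyond demand earns nothing.
    sold_small = min(supply_small, demand)
    sold_large = min(supply_large, demand)
    return (sold_small - sold_large) * asp


print(lost_revenue(5000, 150, 140, 800_000, 200))
```

	&lt;p&gt; With demand below 700k chips the function returns zero, which is the point of the argument: the bigger die only costs revenue when supply is the limit.&lt;/p&gt;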
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/08/28/bulldozer-engineering-samples-llano-and-amd-s-fusion-developer-summit-11338315/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/08/28/bulldozer-engineering-samples-llano-and-amd-s-fusion-developer-summit-11338315/</link><pubDate>Sun, 28 Aug 2011 20:46:36 +0200</pubDate></item><item><title>AMD Bulldozer Engineering Sample Rumours</title><description>	&lt;p&gt;Before Bulldozer-based processors like Zambezi or Interlagos are publicly reviewed, further leaks and fakes can be expected. After observing some interesting developments on the internet during the past weeks, I could post an in-depth analysis of how rumours start, or how estimates and analyses based on rumoured info or just faked screenshots are taken for granted by some writers/posters out there, triggering some bigger sites' authors' "newsworthy" receptors and finally creating a wave of false news. This is what happened with those 3.8GHz base / 4.2GHz max. turbo numbers of the FX-8130P model, based on a likely &lt;a href="http://www.bit-tech.net/news/hardware/2011/05/24/leaked-slide-details-bulldozer-models/1"&gt;faked Asus slide&lt;/a&gt;. It actually prompted me to create my own &lt;em&gt;free&lt;/em&gt; interpretation of that empty space left by some erased numbers ;)&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://info.nuje.de/speculative_fx_clocks.jpg" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;And one can still observe how this kind of wrong information outpaces really new information. Unfortunately there is no filter for this kind of "news".&lt;/p&gt;
	&lt;p&gt;While this pic is of no use, available information on the net indicates 3.2GHz base clock, 3.6GHz all-core turbo and 4.2GHz max turbo for the FX-8130P. To me it also seems plausible that such a model will be available (as 8110, 6110 and 4110). It just seems less plausible to offer rumoured models 8100, 8120, 8150, 8170 while leaving out 8110, 8130, 8140, 8160. However, this is still speculation.&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/07/21/amd-bulldozer-engineering-sample-rumours-11513994/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/07/21/amd-bulldozer-engineering-sample-rumours-11513994/</link><pubDate>Thu, 21 Jul 2011 00:08:22 +0200</pubDate></item><item><title>AMD Bulldozer Software Optimization Guide is online</title><description>	&lt;p&gt;No time for blogging much this evening, but one thing I don't want to let slip through my twitter box: AMD released the Family 15h Software Optimization Guide: &lt;a href="http://support.amd.com/us/Processor_TechDocs/47414.pdf"&gt;http://support.amd.com/us/Processor_TechDocs/47414.pdf&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;[Update] Some quotes:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;&lt;span&gt;1.6.4 Instruction Fetching Improvements&lt;br&gt; While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors have two 32-byte fetch windows, from which four µops can be selected. These fetch windows, when combined with the 128-bit floating-point execution unit, allow the processor to sustain a fetch/dispatch/retire sequence of four instructions per cycle.&lt;br&gt; (page 25)&lt;/p&gt;
	&lt;p&gt; 1.6.6 Notable Performance Improvements&lt;br&gt; Several enhancements to the AMD64 architecture have resulted in significant performance improvements in AMD Family 15h processors, including:&lt;br&gt; • Improved performance of shuffle instructions&lt;br&gt; • Improved data transfer between floating-point registers and general purpose registers&lt;br&gt; • Improved floating-point register to floating-point register moves&lt;br&gt; • Optimization of repeated move instructions&lt;br&gt; • More efficient PUSH/POP stack operations&lt;br&gt; • 1-Gbyte paging&lt;br&gt;(page 26)&lt;br&gt; &lt;/span&gt;&lt;/p&gt;
	&lt;p&gt;&lt;span&gt;2.1 Key Microarchitecture Features&lt;br&gt; AMD Family 15h processors include many features designed to improve software performance. The internal design, or microarchitecture, of these processors provides the following key features:&lt;br&gt; • Integrated DDR3 memory controller with memory prefetcher&lt;br&gt; • 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache&lt;br&gt; • Shared L2 cache between cores of compute unit&lt;br&gt; • Shared L3 cache compute units on chip (for supported platforms)&lt;br&gt; • 32-byte instruction fetch&lt;br&gt; • Instruction predecode and branch prediction during cache-line fills&lt;br&gt; • Decoupled prediction and instruction fetch pipelines&lt;br&gt; • Four-way AMD64 instruction decoding (This is a theoretical limit. See section 2.3 on page 31.) &lt;br&gt; • Dynamic scheduling and speculative execution&lt;br&gt; • Two-way integer execution&lt;br&gt; • Two-way address generation&lt;br&gt; • Two-way 128-bit wide floating-point execution&lt;br&gt; • Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).&lt;br&gt; • Superforwarding&lt;br&gt; • Prefetch into L2 or L1 data cache&lt;br&gt; • Deep out-of-order integer and floating-point execution&lt;br&gt; • HyperTransport™ technology&lt;br&gt;(page 30)&lt;br&gt;&lt;/span&gt;&lt;/p&gt;
	&lt;p&gt;&lt;span&gt;The minimum branch misprediction penalty is 20 cycles in the case of conditional and indirect branches and 15 cycles for unconditional direct branches and returns.&lt;br&gt;(page 34)&lt;br&gt;&lt;/span&gt;&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;Enjoy! BTW this is my 100th blog entry after nearly 2 years!&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/04/07/amd-bulldozer-software-optimization-guide-is-online-10968638/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/04/07/amd-bulldozer-software-optimization-guide-is-online-10968638/</link><pubDate>Thu, 07 Apr 2011 22:36:28 +0200</pubDate></item><item><title>Llano benchmarks and construction work at Globalfoundries</title><description>	&lt;p&gt;&lt;strong&gt;Llano benchmarks&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;Just a few hours before April Fool's day someone uploaded a first &lt;a href="http://browse.geekbench.ca/geekbench2/view/389654"&gt;Geekbench result&lt;/a&gt; of a system equipped with a Llano engineering sample and running Linux. A first quick comparison to Athlon II X4 results showed a much different scaling factor from one to four cores for the different integer and floating point benchmarks. As the reported clock frequency of 800MHz indicates, power management might have been active on that CPU. And while Geekbench sometimes also gets cache sizes wrong, I'm sure that "Family 18 Model 1 Stepping 0" is correct. Three quarters ago the &lt;a href="http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/"&gt;leaked BOINC benchmark results&lt;/a&gt; came from a "Model 0" ES. I guess that 0 means prototype while other numbers are being used for products.&lt;/p&gt;
	&lt;p&gt;Comparing Geekbench results is difficult. Apparently scores for the same CPU vary with the version of the benchmark, the OS (because of different compilers, e.g. MS C++ 8 on Windows) and the ISA (64 bit is usually faster). But I found a &lt;a href="http://browse.geekbench.ca/geekbench2/view/274528"&gt;result of an Athlon II X4&lt;/a&gt; system which should make a first comparison possible. I didn't choose a Phenom II, in order to exclude L3 cache effects.&lt;/p&gt;
	&lt;p&gt;In the following diagrams you see the integer and floating point scores. The green line shows the performance ratio (scale is on the right side). So in single-threaded benchmarks (sorted to the left) Llano sometimes achieves higher scores than the 2.2 GHz Athlon II, sometimes comparable to a 2.4-2.5 GHz chip. Based on this and the reported clock speed of 800 MHz (lowest P-state clock) it's very likely that Turbo Core 2.0 is working.&lt;/p&gt;
	&lt;p&gt;The multi-threaded benchmarks give a different picture. Here we see a drop to 80-90% of the counterpart's performance. It looks like the ES was running at 1.8 GHz (about 82% of 2.2 GHz) like the CeBIT demo system. Any other differences could be explained by the incremental architectural changes (like the hardware divider, as discussed in depth in &lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=de&amp;tl=en&amp;u=http://www.heise.de/ct/artikel/Teile-mit-Eile-1194892.html"&gt;this article&lt;/a&gt; at Heise), the memory speed, the new north bridge, the integrated GPU (which shouldn't have had much influence here) and obviously the &lt;a href="http://blogs.amd.com/fusion/2010/02/08/amd-talks-llano-x86-innovation-isscc/"&gt;new power management&lt;/a&gt;. You could also look at &lt;a href="http://browse.geekbench.ca/geekbench2/compare/389654/274528"&gt;all subscores side by side&lt;/a&gt;, as provided by the Geekbench website.&lt;/p&gt;
	&lt;p&gt;&lt;a title="Llano_ES_vs_AthlonIIX4_Geekbench_Integer2" href="http://www.blog.de/media/photo/llano_es_vs_athloniix4_geekbench_integer2/5478296"&gt;&lt;img src="http://data6.blog.de/media/296/5478296_4f0121d78c_m.png" alt="Llano_ES_vs_AthlonIIX4_Geekbench_Integer2"&gt;&lt;/a&gt;&lt;br&gt; &lt;a title="Llano_ES_vs_AthlonIIX4_Geekbench_FP2" href="http://www.blog.de/media/photo/llano_es_vs_athloniix4_geekbench_fp2/5478297"&gt;&lt;img src="http://data6.blog.de/media/297/5478297_c298454491_m.png" alt="Llano_ES_vs_AthlonIIX4_Geekbench_FP2"&gt;&lt;/a&gt;&lt;/p&gt;
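	&lt;p&gt;The way I translate score ratios into clock speeds can be written down as a tiny Python sketch. It assumes that scores scale linearly with clock and that per-clock performance is about the same on both chips, which is a simplification. The ratios in the loop are illustrative values, not readings from the diagrams, and the function name is made up.&lt;/p&gt;

```python
REFERENCE_CLOCK_GHZ = 2.2  # clock of the Athlon II X4 used for the comparison


def implied_clock_ghz(score_ratio, reference_clock_ghz=REFERENCE_CLOCK_GHZ):
    """Clock at which the reference chip would reach the given score ratio.

    Assumption: linear scaling of score with clock frequency.
    """
    return score_ratio * reference_clock_ghz


# Illustrative ratios: the 80-90% range seen in multi-threaded tests
# and a hypothetical single-threaded result 10% above the Athlon II.
for ratio in (0.80, 0.90, 1.10):
    print(ratio, round(implied_clock_ghz(ratio), 2))

# The other direction: 1.8 GHz as a fraction of the reference clock.
print(round(1.8 / REFERENCE_CLOCK_GHZ, 3))
```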
	&lt;p&gt; Bulldozer will have a further improved Turbo Core implementation, as &lt;a href="http://blogs.amd.com/work/2011/02/22/amd-at-isscc-bulldozer-innovations-target-energy-efficiency/"&gt;this blog&lt;/a&gt; suggests:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;And finally, a next generation AMD Turbo CORE technology implementation that provides maximum compute speed when required, and throttles back to maximum efficiency when appropriate.  Bulldozer implements a significantly more aggressive version of this capability than "Llano” with more details to be disclosed in the future.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;&lt;strong&gt;&lt;br&gt;Construction work at Globalfoundries&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;While driving by the Fab 1 complex near Dresden a few weeks ago I asked my girlfriend (many thanks! &lt;img src="http://www.blog.de/image/smileys/04smile.gif" alt=""&gt;) to take some photos of it. The one below shows some cranes (and probably some Bulldozers behind or right inside the fab). Several tech sites already reported about a planned &lt;a href="http://www.heise.de/newsticker/meldung/Grundstein-fuer-Erweiterung-des-Globalfoundries-Werks-Dresden-gelegt-1080085.html?view=zoom;zoom=1"&gt;fab extension called "Annex"&lt;/a&gt;, which will be located exactly where those cranes are standing. This new 10,000 m² (110,000 square feet) cleanroom will allow Fab 1's total capacity to increase to 60,000 WSPM this year and &lt;a href="http://www.globalfoundries.com/newsroom/2010/20100601.aspx"&gt;to 80,000 WSPM&lt;/a&gt; by the end of next year.&lt;br&gt; &lt;img src="http://data6.blog.de/media/990/5470990_7ed30c7a58_m.jpeg" alt="IMG_4280"&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/04/05/a-llano-benchmark-and-construction-work-at-globalfoundries-10934679/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/04/05/a-llano-benchmark-and-construction-work-at-globalfoundries-10934679/</link><pubDate>Tue, 05 Apr 2011 12:39:30 +0200</pubDate></item><item><title>ISSCC 2011 news and die size of an 8C Bulldozer</title><description>	&lt;p&gt;Before finding the time to do any deeper analysis of the facts presented at ISSCC 2011 last week, I decided to at least let you know the current state of the die size analysis. &lt;a href="http://www.theregister.co.uk/2011/02/24/amd_bulldozer_core_isscc/"&gt;The Reg&lt;/a&gt; published an apparently unphotoshopped die photo last week. Taking this one and &lt;a href="http://www.semiaccurate.com/forums/showthread.php?p=100349#post100349"&gt;doing some pixel count measurements&lt;/a&gt; resulted in a die size of &lt;strong&gt;~294 mm²&lt;/strong&gt; for the Orochi die, which will be used for Zambezi, Valencia and Interlagos processors (obviously two of them for the latter). This is in line with measurements by &lt;a href="http://www.semiaccurate.com/forums/showthread.php?p=100377#post100377"&gt;Hans de Vries&lt;/a&gt; (292 mm²) and also &lt;a href="http://citavia.blog.de/2010/11/08/latest-news-rumors-and-amd-s-upcoming-financial-analyst-day-9933368/"&gt;my earlier estimate&lt;/a&gt; of 300 +/- 20 mm². Hiroshige Goto just published &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20110301_430044.html"&gt;his article on Bulldozer&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ja&amp;tl=en&amp;u=http://pc.watch.impress.co.jp/docs/column/kaigai/20110301_430044.html"&gt;translated&lt;/a&gt;) with an even larger version of the die photo.&lt;/p&gt;
	&lt;p&gt;In the following picture (a little bit resized) you can see, how I counted:&lt;br&gt; &lt;img src="http://info.nuje.de/Orochi_die_size.png" alt="" width="400" height="318"&gt;&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;Some preliminary clock frequency numbers for Interlagos&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;Two weeks ago &lt;a href="http://www.planet3dnow.de/vbulletin/showpost.php?p=4378267&amp;postcount=365"&gt;I looked&lt;/a&gt; for some info on supercomputers and found some upgrade plans, which included some Interlagos details:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;Significant Upgrade in June 2011 – 232 AMD &lt;strong&gt;2.3 GHz&lt;/strong&gt; 16-core Opteron Interlagos processors – 3,712 compute cores, 116 32-core nodes – 7.4 TB DDR3 memory, 64 GB/node, 2.0 GB/core&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;&lt;a href="http://www.esrl.noaa.gov/research/events/espc/8sep2010/Hack_Path_to_Exascale-Boulder.pdf"&gt;http://www.esrl.noaa.gov/research/events/espc/8sep2010/Hack_Path_to_Exascale-Boulder.pdf&lt;/a&gt; (slides 28, 30, 31)&lt;/p&gt;
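	&lt;p&gt;The numbers in that slide quote are consistent with dual-socket nodes, which a few lines of Python can confirm. The variable names are made up for illustration, and I assume decimal units (1 TB = 1000 GB) for the memory total.&lt;/p&gt;

```python
processors = 232
cores_per_processor = 16
nodes = 116
memory_per_node_gb = 64

total_cores = processors * cores_per_processor
sockets_per_node = processors // nodes
cores_per_node = total_cores // nodes
total_memory_tb = nodes * memory_per_node_gb / 1000  # assumption: decimal TB
memory_per_core_gb = memory_per_node_gb / cores_per_node

print(total_cores, sockets_per_node, cores_per_node, total_memory_tb, memory_per_core_gb)
```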
	&lt;blockquote&gt;
	&lt;p&gt;In June 2011, a 720-teraflop Cray XE6 system will be added to Gaea. It will employ the next-generation AMD Interlagos 16-core processor. After the installation of that second system, the original 260-teraflop system will be upgraded with the same AMD Interlagos processor to achieve 386 teraflops.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;&lt;a href="http://www.ncrc.gov/computing-resources/gaea/"&gt;http://www.ncrc.gov/computing-resources/gaea/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;So it looks like there will be an Interlagos model running at a base clock frequency of 2.3 GHz. If this is any indication, and given that supercomputer chips often do not run at the top bin frequencies, there might be Interlagos models up to 2.5 or even 2.6 GHz.&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt; And here is a bunch of links, which I will sort later:&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; A fresh video of swapping two Magny Cours processors for Interlagos processors:&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2011/02/25/filling-the-sockets/"&gt;http://blogs.amd.com/work/2011/02/25/filling-the-sockets/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;News regarding Bulldozer capable AM3 boards:&lt;br&gt; &lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=de&amp;tl=en&amp;u=http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi%3Fid%3D1298916148"&gt;http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=de&amp;tl=en&amp;u=http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi%3Fid%3D1298916148&lt;/a&gt;&lt;br&gt; &lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=de&amp;tl=en&amp;u=http://www.pcgameshardware.de/aid,813945/Cebit-2011-MSI-bringt-AMD-Bulldozer-taugliche-AM3-Mainboards-sowie-BIOS-Update/Mainboard/News/"&gt;http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=de&amp;tl=en&amp;u=http://www.pcgameshardware.de/aid,813945/Cebit-2011-MSI-bringt-AMD-Bulldozer-taugliche-AM3-Mainboards-sowie-BIOS-Update/Mainboard/News/&lt;/a&gt;&lt;br&gt;
Update: This round of news turned out to be based on wrong interpretations, possibly even some facts lost in translation; see &lt;a href="http://www.xtremesystems.org/forums/showpost.php?p=4764028&amp;postcount=231"&gt;John Fruehe's comment&lt;/a&gt; on this and the updates in the articles linked above. &lt;/p&gt;
	&lt;p&gt;John Fruehe's long explanation of how a Bulldozer module could and could not be compared to two cores of Magny Cours and to that hypothetical single-core Bulldozer module:&lt;br&gt; &lt;a href="http://www.xtremesystems.org/forums/showpost.php?p=4755711&amp;postcount=67"&gt;http://www.xtremesystems.org/forums/showpost.php?p=4755711&amp;postcount=67&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;AMD blogs with some info about the ISSCC presentations:&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2011/02/18/what-to-expect-from-amd-at-isscc-2011/"&gt;http://blogs.amd.com/work/2011/02/18/what-to-expect-from-amd-at-isscc-2011/&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2011/02/21/amd-at-isscc-whats-in-a-box/"&gt;http://blogs.amd.com/work/2011/02/21/amd-at-isscc-whats-in-a-box/&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2011/02/21/amd-at-isscc-bulldozer-design-solutions/"&gt;http://blogs.amd.com/work/2011/02/21/amd-at-isscc-bulldozer-design-solutions/&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2011/02/22/amd-at-isscc-bulldozer-innovations-target-energy-efficiency/"&gt;http://blogs.amd.com/work/2011/02/22/amd-at-isscc-bulldozer-innovations-target-energy-efficiency/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;More ISSCC related links:&lt;br&gt; &lt;a href="http://pc.watch.impress.co.jp/docs/news/event/20110223_428720.html"&gt;http://pc.watch.impress.co.jp/docs/news/event/20110223_428720.html&lt;/a&gt;&lt;br&gt; &lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ja&amp;tl=en&amp;u=http://pc.watch.impress.co.jp/docs/news/event/20110223_428720.html"&gt;http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ja&amp;tl=en&amp;u=http://pc.watch.impress.co.jp/docs/news/event/20110223_428720.html&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; &lt;a href="http://www.eetimes.com/electronics-news/4213365/ISSCC-China-eyes-petaflops-IBM-hits-5-GHz"&gt;http://www.eetimes.com/electronics-news/4213365/ISSCC-China-eyes-petaflops-IBM-hits-5-GHz&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;A little bit of info about Bulldozer's 256b AVX performance compared to its 128b AVX performance (no absolute numbers), as found in a recent GCC patch:&lt;br&gt; &lt;a href="http://patchwork.ozlabs.org/patch/82705/"&gt;http://patchwork.ozlabs.org/patch/82705/&lt;/a&gt;&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;Attached is the patch to force gcc to generate 128-bit avx instructions for bdver1. We found that for the current Bulldozer processors, AVX128 performs better than AVX256. For example, AVX128 is 3% faster than AVX256 on CFP2006, and 2~3% faster than AVX256 on polyhedron.&lt;/p&gt;
	&lt;/blockquote&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2011/03/01/isscc-2011-news-and-bulldozer-die-size-10726253/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2011/03/01/isscc-2011-news-and-bulldozer-die-size-10726253/</link><pubDate>Tue, 01 Mar 2011 08:34:15 +0100</pubDate></item><item><title>Brazos, OpenCL, Steamroller and other stuff</title><description>	&lt;p&gt;I would have liked to post my thoughts earlier, but higher priority tasks prevented me from doing so.&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;Brazos&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;Those Brazos previews, which appeared around mid November, caused a lot of discussion on the net. One reason was the inclusion of an SSD instead of an HDD in the Zacate prototype system. This surely had some effect on benchmarks which depend on hard drive performance. It also influenced the measured power consumption. The benchmarks I would expect to be affected most are office benchmarks like SYSMark, where script-controlled applications perform file operations, start up, etc. Other types of benchmarks are mostly or solely CPU/memory bound like Cinebench (surely an interesting benchmark for mobile devices), or CPU/GPU/memory bound like games (they usually load their data into RAM before starting a benchmark).&lt;/p&gt;
	&lt;p&gt;Another question seldom answered by the previews was how well Zacate's power management works. Some reviewers included AMD's statements about idle power likely going further down in final systems. The most helpful remark came with the latest paper issue of the German c't mag, where the reviewer noted that the power management itself was not the final version. This might be supported by the power consumption numbers measured while running different kinds of code. For example the PC Per review lists a power consumption of 19.1W while running CineBench 11 (heavy CPU usage - esp. FPU) and 28.8W while running Left 4 Dead 2 with both heavy CPU and GPU usage. This is a difference of roughly 10W for using 3D graphics. Some of this difference is likely caused by memory operations, which include power for the memory controller. But it's still possible that the available TDP headroom isn't fully exploited by the power management. This might be answered by an &lt;a href="http://www.xbitlabs.com/articles/cpu/display/amd-fusion-interview-2010.html"&gt;interview done by Xbit Labs&lt;/a&gt;, where Godfrey Cheng, director of the client technology unit at AMD, answered a lot of questions about Zacate, Ontario and Fusion in general. He states that power management - esp. frequency boosting - isn't final and is subject to change. So in the case of the previews I wouldn't assume a CPU core clock higher than 1.6GHz or a GPU clock higher than 500MHz for the tested Zacate systems. Further he said that Zacate is capable of delivering more than 90 GFLOPS of compute performance. This compares to the 400-500 GFLOPS stated for Llano.&lt;/p&gt;
	&lt;p&gt;I actually wonder if the Zacate APUs used in the prototype systems are of an earlier stepping than the ones sent out in the first batch shipped to device manufacturers as mentioned by AMD's CEO Dirk Meyer at the Financial Analyst Day. However, here are the (p)reviews:&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://techreport.com/articles.x/19981"&gt;http://techreport.com/articles.x/19981&lt;/a&gt;&lt;br&gt;&lt;a href="http://www.pcper.com/article.php?aid=1039&amp;type=expert&amp;pid=1"&gt;http://www.pcper.com/article.php?aid=1039&amp;type=expert&amp;pid=1&lt;/a&gt;&lt;br&gt;&lt;a href="http://www.anandtech.com/show/4023/the-brazos-performance-preview-amd-e350-benchmarked"&gt;http://www.anandtech.com/show/4023/the-brazos-performance-preview-amd-e350-benchmarked&lt;/a&gt;&lt;br&gt;&lt;a href="http://hothardware.com/Reviews/AMD-Zacate-E350-Processor-Performance-Preview/"&gt;http://hothardware.com/Reviews/AMD-Zacate-E350-Processor-Performance-Preview/&lt;/a&gt;&lt;br&gt;&lt;a href="http://www.legitreviews.com/article/1470/1/"&gt;http://www.legitreviews.com/article/1470/1/&lt;/a&gt;&lt;br&gt;&lt;a href="http://www.notebookreview.com/default.asp?newsID=5940&amp;p=2"&gt;http://www.notebookreview.com/default.asp?newsID=5940&amp;p=2&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;AMD recently also published the open source Ontario graphics driver for Linux: &lt;a href="http://www.phoronix.com/scan.php?page=article&amp;item=amd_ontario_open&amp;num=1"&gt;http://www.phoronix.com/scan.php?page=article&amp;item=amd_ontario_open&amp;num=1&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;OpenCL&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;A &lt;a href="http://realworldtech.com/page.cfm?ArticleID=RWT120710035639"&gt;new article from David Kanter&lt;/a&gt; about OpenCL not only covers technical details but also the history behind it. A related discussion can be found &lt;a href="http://realworldtech.com/forums/index.cfm?action=detail&amp;id=115023&amp;threadid=115023&amp;roomid=2"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;This week news came in about Apple's decision to use Sandy Bridge for their notebook series. There is still discussion going on about whether Sandy Bridge's GPU will support OpenCL or not. According to an &lt;a href="http://arstechnica.com/apple/news/2010/12/apple-may-drop-nvidia-for-sandy-bridges-igp-next-year.ars"&gt;Ars Technica article&lt;/a&gt;, it won't, due to the limited capabilities of the GPU. Well, so far an Intel representative has already said the chip will support OpenCL 1.1. There was no distinction between GPU and CPU in his statement. However, even without OpenCL support in the GPU there would be a way to use OpenCL on it, as Intel &lt;a href="http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf"&gt;has shown at this year's SIGGRAPH&lt;/a&gt;. Intel's OpenCL SDK can be found here (incl. videos of the SIGGRAPH talk): &lt;a href="http://software.intel.com/en-us/articles/intel-opencl-sdk/"&gt;http://software.intel.com/en-us/articles/intel-opencl-sdk/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;Upcoming server socket C2012 with three memory channels&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;As I already wrote, the C2012 socket for the single-die Sepang processor (likely Opteron 4300 series) will provide three memory channels. Some people wondered why the bigger G2012 socket for Terramar (likely Opteron 6300 series) will only have four channels (similar to G34 now) and not six. I think the reasons are simple. Sepang and Terramar will both have about the same max. TDP limits. For Terramar processors this will likely lead to about 70% of the max. clock frequency of Sepang processors. So with half the CPU core count and about 1.4X the clock, a Sepang processor would have about 70% of the theoretical throughput of Terramar. Three memory channels would provide 75% of the memory bandwidth of a four-channel socket. The other way round, Terramar has about 1.4X the throughput of Sepang (0.7X the clock and 2X the cores). Four channels should be enough then. Other considerations for this three/four channel configuration could have been the increase in attachable DIMMs in a C2012 based system and the power consumed by additional memory controllers for G2012. Harvesting dies with one non-functional memory controller for use in a Terramar MCM could be another option.&lt;/p&gt;
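	&lt;p&gt;The ratio argument above can be checked with a few lines of Python. This is only a rough sketch: the core counts (10 and 20) are the ones AMD lists in its codename decoder, the 0.7 clock ratio is my estimate from above, and "throughput" is simply cores times relative clock, which ignores everything else.&lt;/p&gt;

```python
# Rough sanity check of the Sepang vs. Terramar balance argument.
# Core counts are from AMD's codename descriptions; the 0.7 relative
# clock for Terramar is an estimate, not a published specification.

def relative_throughput(cores, clock):
    """Crude throughput proxy: core count times relative clock."""
    return cores * clock

sepang = {"cores": 10, "clock": 1.0, "channels": 3}
terramar = {"cores": 20, "clock": 0.7, "channels": 4}

throughput_ratio = (
    relative_throughput(sepang["cores"], sepang["clock"])
    / relative_throughput(terramar["cores"], terramar["clock"])
)
bandwidth_ratio = sepang["channels"] / terramar["channels"]

# Memory bandwidth per unit of throughput, Sepang relative to Terramar.
balance = bandwidth_ratio / throughput_ratio

print(f"Sepang/Terramar throughput: {throughput_ratio:.2f}")
print(f"Sepang/Terramar bandwidth:  {bandwidth_ratio:.2f}")
print(f"Bandwidth per throughput:   {balance:.2f}")
```

	&lt;p&gt;With these assumptions both sockets end up with nearly the same bandwidth per unit of throughput, which is the point of the three/four channel split.&lt;/p&gt;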
	&lt;p&gt;&lt;strong&gt;Other stuff&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;This week &lt;a href="http://www.intc.com/eventdetail.cfm?EventID=88918"&gt;Intel presented&lt;/a&gt; at the Barclays Capital Global Technology Conference. Apparently Otellini &lt;a href="http://investorshub.advfn.com/boards/read_msg.aspx?message_id=57547296"&gt;said there&lt;/a&gt; (I didn't listen to the webcast) that Ivy Bridge samples are back from the fab and functioning well.&lt;/p&gt;
	&lt;p&gt;AMD recently launched new CPU models including their new flagship processor Phenom II 1100T. You can find a &lt;a href="http://www.xtremesystems.org/forums/showthread.php?t=263285"&gt;collection of review links&lt;/a&gt; at XtremeSystems. If you look closely you might see some strange behaviour in some tests (esp. game tests at computerbase): The 1100T sometimes is more than 3% faster than a 1090T, although both base and turbo clock speed would only suggest a 3% improvement. Maybe there were some differences in the test setups. But if not this could be related to some internal changes to the CPU.&lt;/p&gt;
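	&lt;p&gt;To put a number on the expected gain, here is a small sketch. The clock speeds in it are assumptions taken from the public spec listings (3.2/3.6GHz for the 1090T and 3.3/3.7GHz for the 1100T) and are not stated in the reviews linked above.&lt;/p&gt;

```python
# Expected speedup from clock speed alone.
# Clock values in MHz are assumptions from public spec listings.
clocks = {
    "1090T": {"base": 3200, "turbo": 3600},
    "1100T": {"base": 3300, "turbo": 3700},
}

def gain_percent(new, old):
    """Relative increase of new over old, in percent."""
    return (new / old - 1) * 100

base_gain = gain_percent(clocks["1100T"]["base"], clocks["1090T"]["base"])
turbo_gain = gain_percent(clocks["1100T"]["turbo"], clocks["1090T"]["turbo"])

print(f"base clock gain:  {base_gain:.2f}%")
print(f"turbo clock gain: {turbo_gain:.2f}%")
```

	&lt;p&gt;Any benchmark result clearly above these two values would need another explanation than clock speed.&lt;/p&gt;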
	&lt;p&gt;If you missed it: In the comments to some of the more recent blog postings there were lengthy discussions about BD's IPC, issue width, die size and further attributes.&lt;/p&gt;
	&lt;p&gt;And nearly the last one: A successor to Bulldozer (incl. enhanced BD and BD NG) might be called Steamroller. Sounds even logical &lt;img src="/img/smilies/icon_wink.gif" alt=";)" class="middle" border="0"&gt; This could be the APU coming around 2014.&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/12/10/brazos-opencl-steamroller-and-other-stuff-10160610/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/12/10/brazos-opencl-steamroller-and-other-stuff-10160610/</link><pubDate>Fri, 10 Dec 2010 21:58:34 +0100</pubDate></item><item><title>ISSCC 2011</title><description>	&lt;p&gt;The upcoming ISSCC 2011 conference will again bring us details of future CPUs like Intel's Poulson (12-wide IA-64 MPU, paper 4.8), Sandy Bridge (paper 15.1), AMD's Bulldozer (sessions 4 and 14) and Zacate (paper 15.4). Just yesterday "someone" at RWT posted a &lt;a href="http://isscc.org/doc/2011/isscc2011.advanceprogrambooklet_abstracts.pdf"&gt;link to the abstracts&lt;/a&gt;, which made it possible to provide an extract of the mentioned papers below:&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;4.3 A 32nm Westmere-EX Xeon® Enterprise Processor&lt;/strong&gt;&lt;br&gt; S. Sawant, U. Desai, G. Shamanna, L. Sharma, M. Ranade, A. Agarwal, S. Dakshinamurthy, R. Narayanan, Intel&lt;br&gt; This monolithic 10-core Xeon® Processor is designed in a 32nm 9M process with a shared L3 cache. Low power modes are introduced to cut idle power compared to the previous generation processor. A 2nd order CTLE and temperature compensation are implemented in the I/O receiver to enable link survivability even with low RX margins. Core- and cache-recovery techniques maximize yield.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;4.5 Design Solutions for the Bulldozer 32nm SOI 2-Core Processor Module in an 8-Core CPU&lt;/strong&gt;&lt;br&gt; T. Fischer, S. Arekapudi, E. Busta, C. Dietz, M. Golden, S. Hilker, A. Horiuchi, K. A. Hurd, D. Johnson, H. McIntyre, S. Naffziger, J. Vinh, J. White, K. Wilcox, AMD&lt;br&gt; The Bulldozer 2-core CPU module contains 213M transistors in an 11-metal layer 32nm high-k metal-gate SOI CMOS process and is designed to operate from 0.8 to 1.3V. This micro-architecture improves performance and frequency while reducing area and power over a previous AMD x86-64 CPU in the same process. The design reduces the number of gates/cycle relative to prior designs, achieving 3.5GHz+ operation in an area (including 2MB L2 cache) of 30.9mm².&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;4.6 40-Entry Unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core&lt;/strong&gt;&lt;br&gt; M. Golden, S. Arekapudi, J. Vinh, AMD&lt;br&gt; A 40-instruction out-of-order scheduler issues four operations per cycle and supports single-cycle operation wakeup. The integer execution unit supports single-cycle bypass between four functional units. Critical paths are implemented without exotic circuit techniques or heavy reliance on full-custom design. Architectural choices minimize power consumption.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;4.7 Clock Generation for a 32nm Server Processor with Scalable Cores&lt;/strong&gt;&lt;br&gt; S. Li, A. Krishnakumar, E. Helder, R. Nicholson, V. Jia, Intel&lt;br&gt; This paper describes the clock generation system of a multi-core processor on a 32nm CMOS process, featuring Intel® QuickPath Interconnect, PCI Express and DDR3. The clock system is designed for modularity and scalability, with a unique clock distribution structure for low skew and low power. A dedicated PLL is used for the internal high-speed data link for low data-transport latency.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;4.8 A 32nm 3.1 Billion Transistor 12-Wide-Issue Itanium® Processor for Mission-Critical Servers&lt;/strong&gt;&lt;br&gt; R. J. Riedlinger, R. Bhatia, L. Biro, B. Bowhill, E. Fetzer, P. Gronowski, T. Grutkowski, Intel&lt;br&gt; An Itanium® processor implemented in 32nm CMOS with 9 layers of Cu contains 3.1 billion transistors. The die measures 18.2×29.9mm². The processor has 8 multi-threaded cores, a ring-based system interface and combined cache on the die is 50MB. High speed links allow for peak processor-to-processor bandwidth of up to 128GB/s and memory bandwidth of up to 45GB/s.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;14.3 An 8MB Level-3 Cache in 32nm SOI with Column-Select Aliasing&lt;/strong&gt;&lt;br&gt; D. Weiss, M. Dreesen, M. Ciraula, C. Henrion, C. Helt, R. Freese, T. Miles, A. Karegar, R. Schreiber, B. Schneller, J. Wuu, AMD&lt;br&gt; An 8MB level 3 cache, composed of 4 independent 2MB subcaches, is built on a 32nm SOI process. It features column-select aliasing to improve area efficiency, supply gating and floating bitlines to reduce leakage power, and centralized redundant row and column blocks to improve yield and testability. The cache operates above 2.4GHz at 1.1V.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;15.1 A Fully Integrated Multi-CPU, GPU and Memory Controller 32nm Processor&lt;/strong&gt;&lt;br&gt; M. Yuffe, E. Knoll, M. Mehalel, J. Shor, T. Kurts, Intel&lt;br&gt; This paper describes the 32nm Sandy Bridge processor that integrates up to 4 high-performance Intel Architecture (IA) cores, a power/performance optimized graphic processing unit (GPU) and memory and PCIe controllers in the same die. The paper describes some of the integration methods, power saving techniques and the clock distribution network.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;15.4 A Low-Power Integrated x86-64 and Graphics Processor for Mobile Computing Devices&lt;/strong&gt;&lt;br&gt; S. R. Gutta, D. Foley, A. Naini, R. Wasmuth, D. Cherepacha, AMD&lt;br&gt; Zacate is AMD’s first generation Fusion SoC that combines x86 CPU and Radeon™ GPU on a single 40nm bulk CMOS die. The SoC uses an internal bus architecture and design techniques to optimize performance and memory bandwidth without compromising on power savings. Fine-grain power gating, dynamic voltage/frequency scaling and enhanced display refresh are key enablers for low-power operation.&lt;/p&gt;
	&lt;p&gt; David Kanter wrote &lt;a href="http://realworldtech.com/page.cfm?ArticleID=RWT111710021604"&gt;an article&lt;/a&gt; to collect most of his speculations about Poulson. There is also at least &lt;a href="http://realworldtech.com/forums/index.cfm?action=detail&amp;id=114381&amp;threadid=114381&amp;roomid=2"&gt;one discussion thread over at RWT&lt;/a&gt; about Poulson's capabilities. After seeing patents from both Intel and AMD about reliable computing, which use multiple sets of execution hardware and comparator logic to execute the same code stream twice and check the results of each instruction, I thought that the mention of improved RAS features, the Intel patent and twice the execution resources are related to each other. But talking about Poulson or Itanium means including Intel's IA-64 ex-partner HP. I recently received a pointer to &lt;a href="http://www.freepatentsonline.com/7584405.html"&gt;HP patent no. 7,584,405&lt;/a&gt;, filed in 2003, about using the compiler to create a second code section and comparator code for reliable computing. It was originally meant to make use of the NOPs found in many VLIW instruction bundles. This technique could work even better with a new, widened IA-64 architecture with twice the hardware resources available. And it would certainly fall under the category "improved RAS features".&lt;/p&gt;
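	&lt;p&gt;To illustrate the general idea of redundant execution with a comparator, here is a toy sketch in Python. It is not the method claimed in the HP patent (which works at the level of compiler-generated instruction bundles); the function names are made up for illustration.&lt;/p&gt;

```python
# Toy illustration of redundant execution plus a comparator step.
# Not the patented technique; names are hypothetical.

def checked(compute, *args):
    """Run compute twice and accept the result only if both runs agree."""
    primary = compute(*args)
    shadow = compute(*args)   # second copy of the same code stream
    if primary != shadow:     # comparator
        raise RuntimeError("redundant results differ, possible soft error")
    return primary

def dot(a, b):
    """Simple workload: dot product of two equally long sequences."""
    return sum(x * y for x, y in zip(a, b))

result = checked(dot, [1, 2, 3], [4, 5, 6])
print(result)
```

	&lt;p&gt;On a wide machine with spare issue slots the second copy could run nearly for free, which is what would make such a scheme attractive for a widened IA-64 design.&lt;/p&gt;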
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/11/22/isscc-10026027/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/11/22/isscc-10026027/</link><pubDate>Mon, 22 Nov 2010 16:30:05 +0100</pubDate></item><item><title>AMD's Financial Analyst Day 2010 and Brazos Information</title><description>	&lt;p&gt;A quick roundup so far (will update it later):&lt;/p&gt;
	&lt;p&gt;AMD FAD 2010 press kit:&lt;br&gt; &lt;a href="http://blogs.amd.com/press/2010/11/09/amd-financial-analyst-day-2010-press-kit/"&gt;http://blogs.amd.com/press/2010/11/09/amd-financial-analyst-day-2010-press-kit/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;FAD 2010 slides and webcast:&lt;br&gt; &lt;a href="http://ir.amd.com/phoenix.zhtml?c=74093&amp;p=irol-2010analystday"&gt;http://ir.amd.com/phoenix.zhtml?c=74093&amp;p=irol-2010analystday&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;AMD blogs:&lt;br&gt; &lt;a href="http://blogs.amd.com/work/2010/11/09/server-highlights-from-financial-analyst-day/"&gt;http://blogs.amd.com/work/2010/11/09/server-highlights-from-financial-analyst-day/&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.amd.com/fusion/2010/11/09/simply-put-it%E2%80%99s-all-about-velocity/"&gt;http://blogs.amd.com/fusion/2010/11/09/simply-put-it%E2%80%99s-all-about-velocity/&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.amd.com/fusion/2010/11/09/amd%E2%80%99s-answer-to-the-big-experiencesmall-form-factor-paradox/"&gt;http://blogs.amd.com/fusion/2010/11/09/amd%E2%80%99s-answer-to-the-big-experiencesmall-form-factor-paradox/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;One neat detail: BD's Turbo CORE will boost its base clock frequency by up to 500MHz for most workloads with all cores under load. John comments on this further &lt;a href="http://www.semiaccurate.com/forums/showpost.php?p=81780&amp;postcount=11"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt; In the codename decoder page (&lt;a href="http://blogs.amd.com/work/fadcodenames/"&gt;http://blogs.amd.com/work/fadcodenames/&lt;/a&gt;) we have names for the 2012 sockets: C2012 and G2012. Further, the smaller one will support three DDR3 channels:&lt;/p&gt;
	&lt;blockquote&gt;&lt;p&gt;“Sepang”&lt;br&gt; Market: Server&lt;br&gt; What is it: Server CPU with up to 10 next-generation “Bulldozer” CPU  cores targeting 2-way highly energy efficient and cost optimized Socket  C2012 platforms. Complete with three-channel DDR3 memory and integrated  PCIe Gen3 I/O.&lt;br&gt; Planned for introduction: 2012&lt;/p&gt;
	&lt;p&gt;“Terramar”&lt;br&gt; Market: Server&lt;br&gt; What is it? Server CPU with up to 20 next-generation “Bulldozer” CPU  cores targeting the 2- and 4-way performance-per-watt and expandable  Socket G2012 platforms. Complete with quad-channel DDR3 memory and integrated PCIe Gen3 I/O.&lt;br&gt; Planned for introduction: 2012&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;Brazos details and benchmarks:&lt;br&gt; &lt;a href="http://www.electronista.com/articles/10/11/09/amd.shows.bobcat.apu.chips.says.faster.than.atom/"&gt; Electronista&lt;/a&gt;&lt;br&gt; &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101109100159_AMD_Demonstrates_Highly_Anticipated_Eigh_Core_Bulldozer_Chip_at_Conference.html"&gt; Xbitlabs&lt;/a&gt;&lt;br&gt; &lt;a href="http://blogs.pcmag.com/miller/2010/11/amds_meyer_were_first_with_apu.php"&gt;PC Mag&lt;/a&gt;&lt;br&gt; &lt;a href="http://www.techreport.com/articles.x/19937"&gt;Techreport&lt;/a&gt;&lt;br&gt; &lt;a href="http://www.anandtech.com/show/4003/previewing-amds-brazos-part-1-more-details-on-zacateontario-and-fusion"&gt;Anandtech&lt;/a&gt;&lt;br&gt; &lt;a href="http://hothardware.com/Reviews/AMDs-Low-Power-Fusion-APU--Zacate-Unveiled/"&gt;Hothardware&lt;/a&gt;&lt;br&gt; &lt;a href="http://www.hardocp.com/article/2010/11/08/sneak_peek_at_amds_first_fusion_apu"&gt;[H]ard|OCP&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;AMD's schedule for &lt;a href="http://sc10.supercomputing.org/"&gt;SC10 conference&lt;/a&gt;:&lt;br&gt; &lt;a href="http://sites.amd.com/us/Documents/AMD_at_SC10_Schedule.pdf"&gt;http://sites.amd.com/us/Documents/AMD_at_SC10_Schedule.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/11/09/amd-s-financial-analyst-day-9944971/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/11/09/amd-s-financial-analyst-day-9944971/</link><pubDate>Tue, 09 Nov 2010 23:17:28 +0100</pubDate></item><item><title>Latest News, Rumors and AMD's upcoming Financial Analyst Day</title><description>	&lt;p&gt;A lot happened during the last week, including some rumors about the start of production of different Bulldozer based CPUs. I read it all day after day and always felt the need to blog about it. This was not possible for several reasons. For such cases I'll change my strategy somewhat, as you'll see below. Now let's get back to the rumors. Charlie of SemiAccurate &lt;a href="http://www.semiaccurate.com/2010/11/06/amd-demo-bulldozer-next-week/"&gt;reports&lt;/a&gt; about Orochi (the server die to be used for Interlagos and Valencia), which might be demoed at the &lt;a href="http://blogs.amd.com/fusion/2010/11/03/sneak-a-peek-into-the-future-at-amd%E2%80%99s-financial-analyst-day/"&gt;Financial Analyst Day 2010&lt;/a&gt;. Further, he writes about samples and production dates, also in relation to Llano's dates; at the last CC we heard that Llano will hit the market before Bulldozer.&lt;/p&gt;
	&lt;p&gt;XbitLabs instead &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101105133510_AMD_to_Start_Production_of_Desktop_Bulldozer_Microprocessors_in_April.html"&gt;reports&lt;/a&gt; that desktop Bulldozer chips (Zambezi) will see start of production in April 2011. Further they (actually Anton Shilov) &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101105174003_AMD_s_Llano_Production_to_Initiate_in_July_2011.html"&gt;report&lt;/a&gt; about Llano's start of production in July. The problem with these dates is, first, that AMD officially said that Llano will come before Bulldozer, and second, that comments by John Fruehe indicate that this is far from reality and that we had better wait for new information to be given at the Financial Analyst Day.&lt;/p&gt;
	&lt;p&gt;It's rather unlikely that we'll hear about Orochi die sizes this week. But several people, including me, have already analyzed the photoshopped die photo and found it to be in the ballpark of 300 sqmm (+/-20) in size. My analysis, based on L2 cell size, I/O cell size (as Hans de Vries did) and L1 I$ size relations, also resulted in a module area of ~17-19 sqmm and an L2 area of a little more than 10 sqmm. So this estimate lands at roughly 28 sqmm for a module with cache. Take this with a grain of salt.&lt;/p&gt;
	&lt;p&gt;There will soon be another blog from John, covering instruction set extensions in Bulldozer (ver 1 I suppose). It might cover LWP, AES-NI or some previously unsupported SSE extension. The new instructions for "Bulldozer 2" as found on the gcc mailing list (&lt;a href="http://citavia.blog.de/2010/10/21/signs-of-bulldozer-2-and-llano-9726240/"&gt;I reported&lt;/a&gt;) already made it &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101105230927_AMD_s_Bulldozer_2_Set_to_Support_New_Extensions.html"&gt;into the news&lt;/a&gt;. Just a few days earlier the "what if" outlook became the topic of another &lt;a href="http://www.xbitlabs.com/news/cpu/display/20101103132545_AMD_Starts_to_Talk_About_Bulldozer_2_Micro_Architecture.html"&gt;news story&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;A &lt;a href="http://blogs.amd.com/work/2010/10/29/bulldozer-processor-topology-explained/"&gt;blog by AMD's Elsie Wahlig&lt;/a&gt; features a video, where she's explaining the Bulldozer topology in some detail to help software developers make better use of the new structure found in those processors based on Bulldozer microarchitecture.&lt;/p&gt;
	&lt;p&gt;What we still don't know, and what is the topic of an ongoing discussion at P3DNow! (&lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?p=4312241#post4312241"&gt;sneak peek&lt;/a&gt; in German), is the clocking structure of Bulldozer. So far the clock speeds have been discussed, and also the advanced turbo boost mode. But not much has been written regarding differently clocked parts in a module. There is research with links to AMD, as I also mentioned on &lt;a href="https://groups.google.com/group/comp.arch/msg/13fa97f5e5dc1d3c?hl=en"&gt;comp.arch&lt;/a&gt;. Stay tuned.&lt;/p&gt;
	&lt;p&gt;Last but not least, Hiroshige Goto posted a &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20101104_404182.html"&gt;new article about Llano&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=en&amp;amp;sl=ja&amp;amp;tl=en&amp;amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101104_404182.html"&gt;Google&lt;/a&gt;/&lt;a href="http://www.microsofttranslator.com/bv.aspx?from=&amp;amp;to=en&amp;amp;a=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101104_404182.html"&gt;Bing&lt;/a&gt;) including a nice &lt;a href="http://pc.watch.impress.co.jp/img/pcw/docs/404/182/html/01.jpg.html"&gt;die picture&lt;/a&gt; taken from a wafer photo.&lt;/p&gt;
	&lt;p&gt;Because of the reasons I mentioned above, I decided to start my own twitter feed &lt;a href="http://twitter.com/#!/Dresdenboy"&gt;"Dresdenboy"&lt;/a&gt;. If I find something worth sharing, I'll do it there first.&lt;/p&gt;
	&lt;p&gt;P.S.: Some fun in relation to processors: &lt;a href="http://www.preissuchmaschine.de/preisvergleich/popup.asp?pid=439260017&amp;pnr=1&amp;foto=1"&gt;"Bulldozer" logos from &lt;strong&gt;K-9&lt;/strong&gt;&lt;/a&gt; for dog harnesses.&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/11/08/latest-news-rumors-and-amd-s-upcoming-financial-analyst-day-9933368/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/11/08/latest-news-rumors-and-amd-s-upcoming-financial-analyst-day-9933368/</link><pubDate>Mon, 08 Nov 2010 03:09:37 +0100</pubDate></item><item><title>More on Bulldozer, Llano and GPUs</title><description>	&lt;p&gt;A lot of interesting articles and postings appeared recently. So here are some of them:&lt;/p&gt;
	&lt;p&gt;John Fruehe published a blog about &lt;a href="http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/"&gt;Bulldozer's Flex FP unit&lt;/a&gt;. It contains details about the execution of 128 bit and 256 bit instructions and other info like the throughput of AES instructions or that the FPU can go down to 2% power consumption when idle. And if you missed it, he also posted the fourth part of the 20 Questions blog &lt;a href="http://blogs.amd.com/work/2010/10/04/20-questions-part-4/"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;Hiroshige Goto writes about the &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20101021_401443.html"&gt;Llano demonstration&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101021_401443.html"&gt;Google&lt;/a&gt;/&lt;a href="http://www.microsofttranslator.com/bv.aspx?from=&amp;to=en&amp;a=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101021_401443.html"&gt;Bing&lt;/a&gt;), &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20101022_401811.html"&gt;AMD's new Barts GPU&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101022_401811.html"&gt;Google&lt;/a&gt;/&lt;a href="http://www.microsofttranslator.com/bv.aspx?from=&amp;to=en&amp;a=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101022_401811.html"&gt;Bing&lt;/a&gt;) and &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20101028_402814.html"&gt;its tessellator&lt;/a&gt; (&lt;a href="http://translate.google.com/translate?hl=en&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20101028_402814.html"&gt;Google&lt;/a&gt;/&lt;a href="http://www.microsofttranslator.com/bv.aspx?from=&amp;to=en&amp;a=http%3a%2f%2fpc.watch.impress.co.jp%2fdocs%2fcolumn%2fkaigai%2f20101028_402814.html"&gt;Bing&lt;/a&gt;).&lt;/p&gt;
	&lt;p&gt;David Kanter from Realworldtech pointed to an interesting GPU article, which contains a deep analysis of the capabilities of the tested Fermi GPU in comparison to AMD's Cypress GPU: &lt;a href="http://www.beyond3d.com/content/reviews/55"&gt;http://www.beyond3d.com/content/reviews/55&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;While writing about GPUs: There is a &lt;a href="http://developer.amd.com/zones/OpenCLZone/Events/pages/OpenCLWebinars.aspx"&gt;webinar series&lt;/a&gt; from AMD covering OpenCL from the beginning to more advanced algorithms. You can attend them live or watch the recorded events.&lt;/p&gt;
	&lt;p&gt;The Inquirer has an &lt;a href="http://www.theinquirer.net/inquirer/feature/1810924/amd-looks-comeback-2011"&gt;article about AMD's 2011 outlook&lt;/a&gt; where Nebojsa Novakovic mentions the good mood amongst AMD'ers regarding their future products.&lt;/p&gt;
	&lt;p&gt;And last but not least: two new patches posted at the GCC patches mailing list bring a lot of instruction latency numbers and other data of Bulldozer. So there is a &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01883.html"&gt;pipeline description patch&lt;/a&gt; and a &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01866.html"&gt;processor costs table patch&lt;/a&gt;, which add to the numbers &lt;a href="http://citavia.blog.de/2010/01/21/some-instruction-latency-numbers-of-bulldozer-7850137/"&gt;already published&lt;/a&gt; in the Open64 compiler source code. There we can see again some numbers supporting Bulldozer's higher frequency design like: FMUL/FADD latencies went from 4 to 6 cycles (was known before), x87 FDIV from 19 to 42, SSE DP division from 20 to 27 and x87 FSQRT from 35 to 52 cycles. Similarly 32bit integer muls take 4 cycles (vs. 3) and 64bit integer muls take 6 cycles (vs. 4). And most 256 bit AVX ops are double decoded (two 128bit uops) in Bulldozer.&lt;/p&gt;
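	&lt;p&gt;To make the latency changes easier to compare, here is a small Python sketch that turns the numbers quoted above into ratios. It uses only the cycle counts from this post; the operation labels are just dictionary keys I chose.&lt;/p&gt;

```python
# Latencies in cycles as quoted in the post: (earlier figure, Bulldozer).
latencies = {
    "FMUL/FADD": (4, 6),
    "x87 FDIV": (19, 42),
    "SSE DP division": (20, 27),
    "x87 FSQRT": (35, 52),
    "32-bit integer mul": (3, 4),
    "64-bit integer mul": (4, 6),
}

# Relative latency increase per operation.
ratios = {op: new / old for op, (old, new) in latencies.items()}

# Print the operations sorted by how much their latency grew.
for op, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    old, new = latencies[op]
    print(f"{op:20s} {old:3d} to {new:3d} cycles  ({ratio:.2f}x)")
```

	&lt;p&gt;Most operations land at roughly 1.3X to 1.5X, while x87 FDIV stands out with more than 2X.&lt;/p&gt;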
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/10/26/more-bulldozer-info-and-a-deep-gpu-analysis-9794436/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/10/26/more-bulldozer-info-and-a-deep-gpu-analysis-9794436/</link><pubDate>Tue, 26 Oct 2010 23:33:42 +0200</pubDate></item><item><title>Signs of Bulldozer 2 and Llano</title><description>	&lt;p&gt;&lt;strong&gt;Bulldozer 2&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; In an AMD/HP slideset available on the web I found &lt;a href="http://www.worldhostingdays.com/downloads/2010/hS2a1.pdf"&gt;slides&lt;/a&gt; showing a possible future development of AMD's high end server platform and a seemingly more detailed (added TDP numbers) performance prediction. Neither is the first slide a roadmap nor is the second one a measurement presentation according to John Fruehe of AMD. The first slide was used in a discussion at an event earlier this year and even the future socket names used there are not based on any plans and will be called differently. So better take it as an idea than a grand plan of AMD, since the future of the server platform is subject to change depending on market conditions and operative/strategic decisions. Similarly the performance chart should be treated as hand drawn and not like it was plotted in Matlab.&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://info.nuje.de/AMD_Socket_G42_G44_Bulldozer_NG.png"&gt;&lt;img src="http://info.nuje.de/AMD_Socket_G42_G44_Bulldozer_NG_s.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://info.nuje.de/AMD_Interlagos_performance_estimation_new.png"&gt;&lt;img src="http://info.nuje.de/AMD_Interlagos_performance_estimation_new_s.png" alt="" width="500" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; I added some year numbers and process nodes to it, which both somehow fit the data of an older slide listing a "2012 processor" and a "2013 processor" and GlobalFoundries' current process node roadmap. And &lt;a href="http://citavia.blog.de/2010/04/30/bulldozer-will-be-version-1-and-other-info-8484302/"&gt;as we already know&lt;/a&gt;, the upcoming Orochi CPU is called "BD Ver 1" in GCC sources. This fact and the microarchitecture details in patents, which don't seem to fit what is known about Bulldozer but otherwise describe a Bulldozer-like processor architecture, indicate that there will be a microarchitecture update. I guess we might see it in the 2013 timeframe, while an even more advanced microarchitecture, which could lead the Fusion concept to an even stronger integration, might come around 2015, as already reported &lt;a href="http://www.pcworld.com/article/196108/dell_looking_into_amds_fusion_chips.html?tk=rss_news"&gt;at other places&lt;/a&gt; and indicated by the slide above.&lt;/p&gt;
	&lt;p&gt; The changes to expect in a "Bulldozer NG" or "Bulldozer 2", as I call it, might be as complex and as effective as what we know from the K8 core which appeared in the first Opteron in 2003 and got its latest update with Llano. They also might be comparable to what has changed between Greyhound+ and Llano's core (also a Hound). We will see.&lt;/p&gt;
	&lt;p&gt; BTW, recently some new GCC patches for support of new ISA extensions appeared. They already mention a "bdver2" processor - and of course the new extensions:&lt;/p&gt;
	&lt;blockquote&gt;&lt;p&gt; These patches add support for upcoming bdver2 AMD processors:&lt;br&gt; BMI (Bit Manipulation Instructions)&lt;br&gt; TBM (Trailing Bit Manipulation)&lt;br&gt; FMA3 (three operand FMA) instructions&lt;/p&gt;
	&lt;p&gt; The public specifications for BMI and TBM are in progress (they are today available under NDA).  They will appear in one of the AMD64 Architecture Programmer's Manual Volumes 3-6.   I can post the mnemonics definitions if needed.  The FMA3 specification is documented in &lt;a href="http://software.intel.com/en-us/avx/"&gt;http://software.intel.com/en-us/avx/&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; 2010-10-15  Quentin Neill  &amp;lt;quentin.neill.gnu@amd.com&amp;gt; &lt;/p&gt;&lt;/blockquote&gt;
	&lt;p&gt;BMI patch:&lt;br&gt; &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01356.html"&gt;http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01356.html&lt;/a&gt;&lt;br&gt; BMI mnemonics:&lt;br&gt; &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01766.html"&gt;http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01766.html&lt;/a&gt;&lt;br&gt; TBM patch:&lt;br&gt; &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01357.html"&gt;http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01357.html&lt;/a&gt;&lt;br&gt; TBM mnemonics:&lt;br&gt; &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01767.html"&gt;http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01767.html&lt;/a&gt;&lt;br&gt; FMA3 patch is not ready yet:&lt;br&gt; &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01550.html"&gt;http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01550.html&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; The fact that these patches have already been published could mean that Bulldozer 2 isn't that far away in the future (a few years, as shown above).&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;Llano&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; This week AMD presented a demo of the Llano APU at its Technical Forum Exhibition 2010. The processor shown, with an apparently lower TDP, managed to run 4 threads of HyperPi (not SuperPi!), the DirectX nBody simulation and 1080p video playback in parallel. The computational throughput displayed by the nBody simulation was an estimated 35GFLOPS on average. But due to the high utilization of the CPU cores, only a part of the available power was dedicated to the GPU. So the result is not comparable to the Zacate score reported at &lt;a href="http://www.anandtech.com/show/3933/amds-zacate-apu-performance-update"&gt;Anandtech&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt; These DirectX nBody code examples calculate the forces between a bunch of particles (several thousand, I suppose). For this they have to calculate the distance between &lt;strong&gt;each&lt;/strong&gt; possible pair of particles. This needs multiplications, additions, divisions and square roots to calculate the distances and forces. With a higher number of particles the amount of calculations grows quadratically. Because of that the displayed resolution is irrelevant: the force calculations for the drawn particles take most of the time.&lt;/p&gt;
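	&lt;p&gt; To illustrate the quadratic growth, here is a minimal Python sketch of a brute-force pairwise force calculation in 2D. It is not the code of the DirectX sample; the function names, the unit masses and the softening constant are assumptions made just for this example.&lt;/p&gt;

```python
import math

def pair_count(n):
    """Number of distinct particle pairs among n particles."""
    return n * (n - 1) // 2

def forces(positions, masses, g=1.0, softening=1e-3):
    """Brute-force pairwise forces in 2D (hypothetical helper).

    Every pair costs a few subtractions and multiplications,
    one square root and some divisions.
    """
    n = len(positions)
    result = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            # The softening term avoids a division by zero for
            # coinciding particles (an assumption of this sketch).
            dist_sq = dx * dx + dy * dy + softening * softening
            dist = math.sqrt(dist_sq)
            f = g * masses[i] * masses[j] / dist_sq
            fx, fy = f * dx / dist, f * dy / dist
            # Newton's third law: equal and opposite forces.
            result[i][0] += fx
            result[i][1] += fy
            result[j][0] -= fx
            result[j][1] -= fy
    return result

if __name__ == "__main__":
    for n in (1000, 2000):
        print(n, "particles:", pair_count(n), "pairs")
```

	&lt;p&gt; Doubling the particle count from 1,000 to 2,000 raises the number of pairs from 499,500 to 1,999,000, so the work roughly quadruples, no matter how many pixels are drawn afterwards.&lt;/p&gt;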
	&lt;p&gt; There was also a demo of AVP running on Llano. You can find videos of those demos at &lt;a href="http://blogs.amd.com/fusion/2010/10/18/amd-and-its-partner-ecosystem-showcases-amd-fusion-at-tfe/"&gt;AMD's blog&lt;/a&gt;, &lt;a href="http://www.techspot.com/news/40730-amd-demos-fusionbased-llano-apu-at-press-event.html"&gt;Techspot&lt;/a&gt;, &lt;a href="http://www.semiaccurate.com/2010/10/18/amd-demos-llano/"&gt;SemiAccurate&lt;/a&gt;, &lt;a href="http://www.legitreviews.com/article/1443/1/"&gt;LegitReviews&lt;/a&gt; for example.&lt;/p&gt;
	&lt;p&gt; The Brazos platform has also been shown and managed to reach a much higher DirectX nBody throughput (seemingly with no other tasks in the background) at ~19W (at the wall socket) than the Core i5 processor it was compared with at ~38W.&lt;/p&gt;
	&lt;p&gt; Planet3DNow! presented some slides, like &lt;a href="http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?id=1287474011"&gt;the fmax distribution&lt;/a&gt; slide for Llano (showing the max. reachable frequency of parts at a given voltage), which is already known from another presentation a while back. Back then I read a comment (Hi Paul!), that the process doesn't look healthy because of the sample points stretching that far to the lower end (on the left). But a while back I've seen a similar curve in a paper or dissertation about some fmax distributions achieved using an Intel process node. As soon as I find it I'll post it here.&lt;/p&gt;
	&lt;p&gt; &lt;strong&gt;Other stuff&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; I'm thinking about how to link different forums and maybe interesting updates to this blog, since a lot of interesting information first pops up in a forum. So I'm open to ideas. For example, the data for the Hudson southbridges appeared &lt;a href="http://forum.ixbt.com/topic.cgi?id=8:22153:2425#2425"&gt;here&lt;/a&gt; (in Russian) and was later reposted &lt;a href="http://www.planet3dnow.de/vbulletin/showpost.php?p=4273970&amp;postcount=1499"&gt;here&lt;/a&gt; (in German) by me. If you ask why I didn't post it on my blog: if I'm not sure about the nature of a document or its source, I don't publish it here.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;img src="http://vg06.met.vgwort.de/na/24ba0b8e12784a68a88a368faba0bfa8" alt="" width="1" height="1"&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/10/21/signs-of-bulldozer-2-and-llano-9726240/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/10/21/signs-of-bulldozer-2-and-llano-9726240/</link><pubDate>Thu, 21 Oct 2010 12:57:29 +0200</pubDate></item><item><title>Sandy Bridge articles and some thoughts about optimal pipeline design</title><description>	&lt;p&gt;David Kanter published his &lt;a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937"&gt;in-depth article&lt;/a&gt; about the microarchitecture of Sandy Bridge. He completes the picture given by Intel at the IDF 2010 and compares the architecture to &lt;a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719"&gt;Nehalem&lt;/a&gt; and &lt;a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333"&gt;Bulldozer&lt;/a&gt;. While he interprets "Bridge" (Hebrew: "gesher") as a metaphor for bringing together several existing microarchitectural concepts with some new ones and a GPU, I'd also see a similarity to AMD's "Fusion", which also stands for bringing things together.&lt;/p&gt;
	&lt;p&gt;Hiroshige Goto recently wrote no fewer than four articles about Sandy Bridge. You can find the Japanese articles &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/"&gt;here&lt;/a&gt; and the autotranslated ones &lt;a href="http://translate.googleusercontent.com/translate_c?hl=en&amp;ie=UTF-8&amp;sl=ja&amp;tl=en&amp;u=http://pc.watch.impress.co.jp/docs/column/kaigai/&amp;prev=_t&amp;rurl=translate.google.com&amp;usg=ALkJrhjkfsJQO-x_5tNavBOL74VdT_KVDQ"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;In a &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?p=4295955#post4295955"&gt;response&lt;/a&gt; to some interesting thoughts I collected some data to retrace AMD's decision to go for increased clock frequency with Bulldozer (meaning a lower FO4 delay per pipeline stage). So here is a translation of my thoughts and preliminary conclusions:&lt;/p&gt;
	&lt;p&gt;Regarding power consumption, the higher-clocking 12FO4 design of Bulldozer (as indicated on &lt;a href="http://citavia.blog.de/2010/08/27/a-quick-round-of-links-9265110/"&gt;comp.arch&lt;/a&gt;) conversely needs a lower voltage to run at 2GHz than a 17FO4 design like K8, because the signals simply have to pass through fewer gates during a 0.5ns clock cycle. And a lower voltage also lowers leakage. Whether this leads to higher power efficiency depends on further factors.&lt;/p&gt;
	&lt;p&gt;Now as some might have found out by playing with the recently released &lt;a href="http://citavia.blog.de/2010/09/15/a-scheduler-simulation-and-other-things-9389561/"&gt;Scheduler Simulation&lt;/a&gt; (OpenOffice.org variant is in the pipeline), a 4-wide OoO scheduler might have a harder time keeping the EUs busy than a 2-wide scheduler due to dependencies. The narrower design just has to provide the necessary operands (registers or memory data) for up to two instructions and not four. &lt;/p&gt;
	&lt;p&gt; Designing wider OoO execution cores also means complexity growth for many logic components (often quadratic, sometimes cubic growth). This is reflected by Pollack's rule, as depicted by Hiroshige Goto:&lt;/p&gt;
	&lt;p&gt; &lt;img src="http://pc.watch.impress.co.jp/img/pcw/docs/346/902/05_s.gif" border="0" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;(from &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20100205_346902.html"&gt;here&lt;/a&gt;, but more on that can be found &lt;a href="http://pc.watch.impress.co.jp/docs/2006/0822/kaigai297.htm"&gt;here&lt;/a&gt; or &lt;a href="http://software.intel.com/en-us/articles/the-new-era-of-tera-scale-computing/"&gt;at Intel&lt;/a&gt;, where Pollack actually worked)&lt;/p&gt;
	&lt;p&gt; Leakage is becoming increasingly important when designing chips for smaller structures. In case of Llano it is stated as being 29%:&lt;br&gt; &lt;img src="http://pc.watch.impress.co.jp/img/pcw/docs/348/705/5_s.gif" border="0" alt=""&gt;&lt;br&gt; (from &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20100215_348705.html"&gt;here&lt;/a&gt;)&lt;/p&gt;
	&lt;p&gt; According to an &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.9109&amp;rep=rep1&amp;type=pdf"&gt;IBM research report&lt;/a&gt; from 2003, designs with 17 or 18 FO4 (black and green curves) could be most power efficient:&lt;br&gt; &lt;img src="http://www.planet3dnow.de/vbulletin/attachment.php?attachmentid=20521&amp;stc=1&amp;d=1285239252" border="0" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;Their metric "Total FO4" includes jitter, skew and latches, so it's comparable to the 17FO4 of Bulldozer (12FO4 + 5FO4). The reason for the bumpy curves (compared to many other publications) could simply be that the researchers actually looked at where they could divide the pipeline in a useful way. The Bulldozer architects could have known those results in advance (remember Chuck Moore's job history) or come to similar conclusions after doing their own research.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/09/28/in-depth-sandy-bridge-article-on-rwt-9476659/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/09/28/in-depth-sandy-bridge-article-on-rwt-9476659/</link><pubDate>Tue, 28 Sep 2010 19:48:20 +0200</pubDate></item><item><title>A scheduler simulation and other things</title><description>	&lt;p&gt;Missing anything as simple on the web and not wanting to modify &lt;a href="http://www.ptlsim.org/"&gt;PTL-Sim&lt;/a&gt; this evening, I made a small Excel tool to simulate the behaviour of reservation station based and unified schedulers at different frequency ratios using a very simple "instruction set" with only 1 instruction.&lt;/p&gt;
	&lt;p&gt;However, for simulating dependencies and how K8's lane-oriented execution with reservation stations compares to execution with a unified scheduler, it should be fine. There's also no register file read stage etc. But some things can already be observed. The number of available registers can be changed. Using a low number like 6 or 7 (somewhat resembling the available x86 GPRs in 32 bit mode excluding ESP and maybe also EBP) or a higher number like 15 (for 64 bit mode) results in different behaviour, since fewer registers cause more dependencies. You can download the ZIP containing the XLS file (Excel 2k3, 26kB) &lt;a href="http://info.nuje.de/ScheduleSim.zip"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://data6.blog.de/media/904/4971904_cf210c02df_m.png" alt="ScheduleSim"&gt;&lt;/p&gt;
	&lt;p&gt;On the upper right side you can find some parameters and buttons. The parameters down to clock ratio (clock frequency of the unified scheduler / clock frequency of the RS based scheduler) can be changed and need a click on "Generate Code" to be applied. The values below are updated on the fly. After that step the button "Initialize" sets everything up for execution. "Execute Step" and "Execute All" execute the code stepwise or in one single run. IPC will be calculated after execution has finished.&lt;/p&gt;
	&lt;p&gt;While talking about IPC: Someone did tests to find out the ALU/AGU throughput of Phenom II and &lt;a href="http://www.amdzone.com/phpbb3/viewtopic.php?f=52&amp;t=137913"&gt;posted his results&lt;/a&gt; in the AMDZone forum. He confirms the description found in the optimization manuals that K7, K8 and more recent architectures based on them are able to issue up to 3 ALU and 3 AGU µOps per cycle (plus some µOps in the FPU as well). This behaviour can be verified by using &lt;a href="http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx"&gt;CodeAnalyst's pipeline simulation&lt;/a&gt;, which I did a while back.&lt;/p&gt;
	&lt;p&gt;Zacate's graphics performance seems to be very good as found out by &lt;a href="http://www.anandtech.com/show/3933/amds-zacate-apu-performance-update"&gt;Anand&lt;/a&gt;. Over at &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?p=4291586#post4291586"&gt;Planet3dNow&lt;/a&gt; we're looking for N-Body Simulation GFLOPS numbers of other GPUs. If you have some numbers, please post them in the comments.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/09/15/a-scheduler-simulation-and-other-things-9389561/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/09/15/a-scheduler-simulation-and-other-things-9389561/</link><pubDate>Wed, 15 Sep 2010 20:43:04 +0200</pubDate></item><item><title>News from Intel's IDF, AMD and the resurrection of 3DNow!</title><description>	&lt;p&gt;&lt;strong&gt;Sandy Bridge&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt; Chipper posted a &lt;a href="https://intel.wingateweb.com/us10/scheduler/catalog/catalog.jsp"&gt;link to Intel's IDF presentations&lt;/a&gt; (doesn't work for me right now, but others were successful) in this &lt;a href="http://www.semiaccurate.com/forums/showthread.php?t=3269"&gt;SemiAccurate forum thread&lt;/a&gt;. The thread already contains some extracted information. For more details and the slides there is an &lt;a href="http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed"&gt;article at AnandTech&lt;/a&gt;. It looks like many rumours (e.g. the 1500 µOp buffer, which is no trace cache but a direct mapped i$, clock ranges etc.) were true. In his article Anand mentioned the reuse of integer SIMD datapaths to extend the available resources to 256 bit for processing AVX FP instructions.&lt;/p&gt;
	&lt;p&gt;This sounds interesting and explains the only slight area increase of the FPU after adding AVX. Looking at the execution data paths, it is obvious that similar integer SIMD logic (e.g. MUL sitting next to FMUL) is available in the 128bit datapath in the middle of the schematic. For example, it is possible to use parts of an adder to calculate narrower additions, and the same is true for multipliers and many ALU ops. So it's possible that a Sandy Bridge SIMD unit uses integer add and mul logic to do some parts of FP calculations (working on the mantissa bits). Intel's patent no. &lt;a href="http://www.freepatentsonline.com/7389406.html"&gt;7,389,406&lt;/a&gt; (by Intel's Israel team) might come into play here, but I have to read it in full to see if it fits. A rather similar one is no. &lt;a href="http://www.freepatentsonline.com/7457938.html"&gt;7,457,938&lt;/a&gt; by their U.S. teams.&lt;/p&gt;
	&lt;p&gt;Other nice features include a "borrowing" Turbo boost, which uses a "saved" TDP budget (accumulated during idle times) to boost cores while going over the TDP limit until the budget has been used up.&lt;/p&gt;
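	&lt;p&gt;The "borrowing" idea can be sketched as a simple energy budget that fills up while the chip runs below its TDP and is drained while it runs above it. The following Python snippet is only an illustration of that principle: the function name, the wattages and the budget cap are made-up values, and this is certainly not Intel's actual control algorithm.&lt;/p&gt;

```python
def simulate_turbo(demand_w, tdp_w, boost_w, budget_cap_j, dt_s=1.0):
    """Return the power granted in each time step (hypothetical model).

    demand_w:     requested package power per step, in watts
    tdp_w:        sustained power limit
    boost_w:      highest power allowed while budget remains
    budget_cap_j: upper bound of the saved energy budget, in joules
    """
    budget_j = 0.0
    granted = []
    for want in demand_w:
        if want > tdp_w:
            # Spend saved budget to run above the TDP limit.
            extra = min(want, boost_w) - tdp_w
            extra = min(extra, budget_j / dt_s)
            budget_j -= extra * dt_s
            granted.append(tdp_w + extra)
        else:
            # Below TDP: the unused headroom is saved up.
            saved = (tdp_w - want) * dt_s
            budget_j = min(budget_cap_j, budget_j + saved)
            granted.append(want)
    return granted

if __name__ == "__main__":
    # Two idle steps followed by three steps of heavy load.
    print(simulate_turbo([5, 5, 20, 20, 20], tdp_w=10, boost_w=15,
                         budget_cap_j=8))
```

	&lt;p&gt;In this toy run the two idle steps save up 8 joules, which allow the chip to draw 15W and then 13W before it falls back to the 10W limit once the budget has been used up.&lt;/p&gt;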
	&lt;p&gt;According to &lt;a href="http://www.semiaccurate.com/2010/09/13/intel-no-longer-chip-maker/"&gt;SemiAccurate&lt;/a&gt;, Intel wants to ship Ivy Bridge (22nm tick of Sandy Bridge) in the second half of 2011, which might be the same time frame when Bulldozer will arrive.&lt;/p&gt;
	&lt;p&gt;&lt;strong&gt;3DNow!&lt;/strong&gt;&lt;/p&gt;
	&lt;p&gt;After AMD made their decision public that future MPUs won't support 3DNow! anymore, many concluded that this will be the case for Bulldozer as well. But I found this:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;+  { "CPU_BDVER1_FLAGS",&lt;br&gt; "Cpu186|Cpu286|Cpu386|Cpu486|Cpu586|Cpu686|CpuSYSCALL|CpuRdtscp| Cpu387|Cpu687|CpuFISTTP|CpuMMX|&lt;strong&gt;Cpu3dnow|Cpu3dnowA&lt;/strong&gt;| CpuSSE|CpuSSE2|CpuSSE3|CpuSSE4a|CpuABM| CpuLM|CpuFMA4|CpuXOP|CpuLWP" },&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;(Source: &lt;a href="http://cygwin.ru/ml/binutils/2010-01/msg00576/7007_bdver1.diff"&gt;http://cygwin.ru/ml/binutils/2010-01/msg00576/7007_bdver1.diff&lt;/a&gt;, referring to &lt;span&gt;opcodes/i386-gen.c&lt;/span&gt;)&lt;/p&gt;
	&lt;p&gt;This could either be a copy-and-paste error or BD will still support those extensions. But there are other candidates for dropping them, like a BD Version 2 (which might come sooner rather than later, since someone who should know hinted that we won't have to wait 5 years for it). Bobcat is a candidate as well, since 3DNow! might just have been left out for die area and power saving reasons.&lt;/p&gt;
	&lt;p&gt;There are also some changes in AMD's manuals (thanks, &lt;a href="http://www.planet3dnow.de/vbulletin/showpost.php?p=4290791&amp;postcount=1286"&gt;Alex&lt;/a&gt;):&lt;/p&gt;
	&lt;p&gt;CPUID doc &lt;a href="http://support.amd.com/us/Processor_TechDocs/25481.pdf"&gt;25481.pdf&lt;/a&gt; now contains some CPUID bits for &lt;a href="http://support.amd.com/us/Processor_TechDocs/43724.pdf"&gt;Lightweight Profiling&lt;/a&gt; (LWP), 16bit FP conversion (F16C, similar to  CVT16), an Effective Frequency Interface and more. See also a &lt;a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6932930"&gt;bug report&lt;/a&gt; (more a feature request) at opensolaris.org:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;Description&lt;br&gt; With introduction of core performance boost (see CR6932922) in family  15h processors (as well as 12h and 14h) it may be difficult to know  core's frequency. &lt;strong&gt;Effective Frequency Interface&lt;/strong&gt; allows kernel to find out average frequency of a core over a period of time.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;There also is an AM3 doc: &lt;a href="http://support.amd.com/us/Processor_TechDocs/40523.pdf"&gt;40523.pdf&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;And there is more from AMD:&lt;/p&gt;
	&lt;p&gt;John Fruehe posted his &lt;a href="http://blogs.amd.com/work/2010/08/30/bulldozer-20-questions-%E2%80%93-part-2/"&gt;second&lt;/a&gt; and &lt;a href="http://blogs.amd.com/work/2010/09/13/bulldozer-20-questions-part-3/"&gt;third round&lt;/a&gt; of the 20 questions blog. There also is a &lt;a href="http://www.theinquirer.net/inquirer/news/1732952/amd-sheds-light-bulldozer"&gt;video interview&lt;/a&gt; with him at the Inquirer.&lt;/p&gt;
	&lt;p&gt;Simon Solotko demonstrates the Zacate Fusion processor &lt;a href="http://www.youtube.com/watch?v=NZE2SuJlJCw"&gt;here&lt;/a&gt;. A video showing more of the 3D action can be watched &lt;a href="http://www.youtube.com/watch?v=XH5A4D9qoDQ"&gt;here&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;InsideHW &lt;a href="http://www.insidehw.com/Editorials/Interviews/InsideHW-Interview-Leslie-Sobon-AMD.html"&gt;interviewed&lt;/a&gt; Leslie Sobon about the upcoming Fusion APUs. A quote:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;And let me tell you one thing about Llano, the reaction of all our partners after seeing the demo was, in one word, “whoa”.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;Clock frequency numbers and other specs of embedded Ontario processors ("eOntario") &lt;a href="http://www.xtremesystems.org/forums/showpost.php?p=4547572&amp;postcount=200"&gt;appeared&lt;/a&gt;. It is possible that the frequencies and TDPs will be the same for the consumer variants. So the highest clock frequency is 1.6GHz, which might be the same for the &lt;a href="http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/"&gt;BOINCed&lt;/a&gt; engineering samples.&lt;/p&gt;
	&lt;p&gt;Andy Glew posted quotes of a Microprocessor Report article about Bulldozer on &lt;a href="http://groups.google.de/group/comp.arch/browse_thread/thread/25c8716cb413f56d/3b7b9cf627cb812a?hl=de&amp;lnk=gst&amp;q=#3b7b9cf627cb812a"&gt;comp.arch&lt;/a&gt;. In &lt;a href="http://groups.google.de/group/comp.arch/browse_thread/thread/1fc40fd72b695fe9/a4b3dc4850260b91?hl=de&amp;lnk=gst&amp;q=#a4b3dc4850260b91"&gt;another thread&lt;/a&gt; he discusses the position of Bulldozer's renamer.&lt;/p&gt;
	&lt;p&gt;And someone seemingly erroneously posted SPEC2k results of Atom using GCC:&lt;br&gt;&lt;a href="http://gcc.gnu.org/ml/gcc/2010-09/msg00000.html"&gt;http://gcc.gnu.org/ml/gcc/2010-09/msg00000.html&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/09/15/news-from-intel-s-idf-amd-and-the-resurrection-of-3dnow-9384691/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/09/15/news-from-intel-s-idf-amd-and-the-resurrection-of-3dnow-9384691/</link><pubDate>Wed, 15 Sep 2010 00:00:06 +0200</pubDate></item><item><title>Another round of Bulldozer and Bobcat articles and news</title><description>	&lt;p&gt;David Kanter published &lt;a href="http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&amp;p=1"&gt;his Bulldozer article&lt;/a&gt; at Real World Technologies, with many details you won't find anywhere else.&lt;/p&gt;
	&lt;p&gt;Anandtech published a &lt;a href="http://www.anandtech.com/show/3885/sandy-bridge-graphics-update"&gt;Sandy Bridge preview&lt;/a&gt; of a CPU model with the top bin integrated GPU (containing 12 EUs).&lt;/p&gt;
	&lt;p&gt;Hiroshige Goto published two articles about Bobcat (&lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20100901_390769.html"&gt;Japanese&lt;/a&gt;/&lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20100901_390769.html"&gt;English&lt;/a&gt;) and Bulldozer  (&lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20100827_389491.html"&gt;Japanese&lt;/a&gt;/&lt;a href="http://translate.google.com/translate?js=n&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=2&amp;eotf=1&amp;sl=ja&amp;tl=en&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20100827_389491.html"&gt;English&lt;/a&gt;).&lt;/p&gt;
	&lt;p&gt;AMD indirectly presented a photoshopped Orochi die photo at GlobalFoundries' tech conference:&lt;br&gt; &lt;a title="orochi die shot" href="http://i55.tinypic.com/2nvcpcj.jpg"&gt;&lt;img src="http://data6.blog.de/media/383/4951383_3adf939eba_m.jpeg" alt="orochi die shot"&gt;&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;The lower modules seem to look more like an original layout. The likely 2M L2 caches and L3 cache blocks have a significant area difference. So they might have been scaled during the photoshopping or use different SRAM macro cells. I'll come back to that later.&lt;/p&gt;
	&lt;p&gt;Another die shot, which has been presented, is Bobcat's. Hans de Vries compared it to a Pineview die shot at the same scale:&lt;a href="http://www.chip-architect.com/news/AMD_Ontario_Bobcat_vs_Intel_Pineview_Atom.jpg"&gt;&lt;img src="http://www.chip-architect.com/news/AMD_Ontario_Bobcat_vs_Intel_Pineview_Atom.jpg" alt="" width="400" height="360"&gt;&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;A recently published paper (with some relation to AMD) describes the effect of workload phase prediction in frequency boost techniques. You can download it &lt;a href="http://hal.archives-ouvertes.fr/inria-00492839/"&gt;here&lt;/a&gt;. It shows the potential of knowing in advance, when performance will be needed and where.&lt;/p&gt;
	&lt;p&gt; Recent checks for further engineering sample activities (as &lt;a href="http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/"&gt;I did&lt;/a&gt; in &lt;a href="http://citavia.blog.de/2010/08/05/more-sandy-bridge-performance-numbers-9128712/"&gt;the past&lt;/a&gt;) were successful:&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.ufluids.net/show_host_detail.php?hostid=119914"&gt;Ontario 1&lt;/a&gt;&lt;br&gt; &lt;a href="http://www.ufluids.net/show_host_detail.php?hostid=119621"&gt;Ontario 2&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; &lt;a href="http://en.allprojectstats.com/show.php?projekt=1&amp;id=2550250"&gt; Sandy Bridge 1&lt;/a&gt; (stepping 3, 2.1GHz) &lt;br&gt; &lt;a href="http://en.allprojectstats.com/show.php?projekt=1&amp;id=2541730"&gt;Sandy Bridge 2&lt;/a&gt; (stepping 3, 2.2GHz)  &lt;br&gt; &lt;a href="http://burp.renderfarming.net/show_host_detail.php?hostid=47707"&gt;Sandy Bridge 3&lt;/a&gt; (stepping 3, 2.6GHz)&lt;br&gt; &lt;a href="http://browse.geekbench.ca/geekbench2/view/282900"&gt;Sandy Bridge 4&lt;/a&gt; (stepping 5, 2.2GHz, Geekbench)&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/09/05/more-facts-of-bulldozer-and-bobcat-9317278/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/09/05/more-facts-of-bulldozer-and-bobcat-9317278/</link><pubDate>Sun, 05 Sep 2010 23:00:15 +0200</pubDate></item><item><title>A quick round of links</title><description>	&lt;p&gt;I won't read and post anything for a week. So there is a quick round of links and information I want to share.&lt;/p&gt;
	&lt;p&gt;First there is something I want to make clear: As described in the Bulldozer preview article, AMD doesn't use the term "K10" for their current CPU cores. This is a press/community term. AMD employees use "K10" as a synonym for Bulldozer. K8 is the family of cores which started with Sledgehammer. So the statement about Llano's core being an improved K8 core just means that it's an upgrade to the latest cores as used in Magny Cours and other AMD CPUs. These are also called "Greyhound". There was even a K9, which got cancelled. See &lt;a href="http://groups.google.de/group/comp.arch/browse_thread/thread/45018bf3214f6049?hl=de#"&gt;this comp.arch &lt;/a&gt;discussion for more on that and on Bulldozer's 17 FO4 pipeline. Such a design is aimed at a 20-30% higher clock frequency than K8 with its 22 FO4 (at the same voltage and on the same fab process).&lt;/p&gt;
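	&lt;p&gt;The 20-30% figure can be checked with a bit of arithmetic. The following Python sketch assumes, as a simplification, that the cycle time is proportional to the number of FO4 delays per pipeline stage; the function names are made up for this example.&lt;/p&gt;

```python
def clock_gain(fo4_old, fo4_new):
    """Relative clock gain when the per-stage delay shrinks.

    Assumes cycle time is proportional to the FO4 count per stage,
    so the frequency ratio is the inverse ratio of the FO4 counts.
    """
    return fo4_old / fo4_new - 1.0

def fo4_delay_ps(clock_ghz, fo4_per_cycle):
    """Time available per FO4 delay at a given clock, in picoseconds."""
    period_ps = 1000.0 / clock_ghz
    return period_ps / fo4_per_cycle

if __name__ == "__main__":
    print(f"ideal clock gain, 22 FO4 to 17 FO4: {clock_gain(22, 17):.1%}")
    print(f"2 GHz with 17 FO4: {fo4_delay_ps(2.0, 17):.1f} ps per FO4")
    print(f"2 GHz with 22 FO4: {fo4_delay_ps(2.0, 22):.1f} ps per FO4")
```

	&lt;p&gt;The ratio 22/17 is about 1.29, so the ideal gain of roughly 29% sits at the upper end of the 20-30% range. Read the other way around, at the same 2GHz the shorter pipeline stages leave more time per gate delay, which is why such a design can get away with a lower voltage.&lt;/p&gt;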
	&lt;p&gt;Anandtech has a &lt;a href="http://www.anandtech.com/show/3871/the-sandy-bridge-preview-three-wins-in-a-row/1"&gt;Sandy Bridge Preview&lt;/a&gt; online. Some quick observations: IPC seems to be 10% higher compared to Nehalem and the 1C GPU is about 10-20% faster than a Radeon 5450 (with 80SP, which is the same number rumoured to be in Ontario). I won't go into the Radeon 6870 early benchmark results here, which indicate 30% higher performance than 5870.&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.planet3dnow.de/cgi-bin/newspub/viewnews.cgi?category=1&amp;id=1282840508"&gt;AMD confirmed&lt;/a&gt; (AMD's answer is in English, so don't hesitate) that desktop Bulldozer CPUs (Zambezi being the first) will require an AM3+ socket for performance reasons (I'm sure it's related to power planes, voltages and the like).&lt;/p&gt;
	&lt;p&gt;Finally there is a rough estimate of the single thread performance of Zambezi compared to Phenom II, as I posted &lt;a href="http://www.xtremesystems.org/forums/showthread.php?p=4525800#post4525800"&gt;here&lt;/a&gt;:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;A quick and raw estimation of single threaded performance for Zambezi  based on the 50% number given for Interlagos (just to show, what has to  be counted in at the least):&lt;/p&gt;
	&lt;p&gt; Relative_perf_1_thread_to_AMD_fam_10h = (Perf_Magny_Cours*1.5 * 12 / 16)  * Freq_ratio_of_half_#_of_Cores * Perf_boost_single_core_in_Module *  Perf_boost_single_module_on_chip&lt;/p&gt;
	&lt;p&gt; Freq_ratio_of_half_#_of_Cores = 3.2/2.3 = 1.39&lt;br&gt; Perf_Magny_Cours = 1&lt;br&gt; Perf_boost_single_core_in_Module = 1.11 (while going from 90% back to 100%)&lt;br&gt; Perf_boost_single_module_on_chip = 1.3 (some cheap turbo)&lt;/p&gt;
	&lt;p&gt; Relative_perf_1_thread_to_AMD_fam_10h = (1 * 1.5 * 12/16) * 1.39 * 1.11 * 1.3 = 2.26&lt;/p&gt;
	&lt;p&gt; So with some frequency scaling a Zambezi core will be about 126% faster  than a core running in a 2.3GHz MC without turbo. This would equal a  5.2GHz PhII core.&lt;/p&gt;
	&lt;p&gt; This is just speculation. Anyone is invited to check this. 		&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;It might be off by 10 to 20% but should show how to calculate the performance relations when working with such numbers. The 150%/133% equation for Interlagos gives a result, but this doesn't describe single core performance as we understand it.&lt;/p&gt;
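	&lt;p&gt;For anyone who wants to play with the factors, here is the estimate from the quote as a small Python function. The parameter names are my own shorthand for the terms of the formula; the default values are the ones from the quoted post and just as speculative.&lt;/p&gt;

```python
def single_thread_estimate(interlagos_gain=1.5, mc_cores=12, il_cores=16,
                           zambezi_ghz=3.2, mc_ghz=2.3,
                           single_core_in_module=1.11, turbo=1.3):
    """Relative single thread performance vs. a 2.3GHz Magny Cours core.

    All defaults are the speculative numbers from the quoted post.
    """
    # Throughput per core at equal clocks: 50% more with 16 instead of 12.
    per_core = interlagos_gain * mc_cores / il_cores
    # Higher clock of a chip with half the number of cores.
    freq_ratio = zambezi_ghz / mc_ghz
    return per_core * freq_ratio * single_core_in_module * turbo

if __name__ == "__main__":
    rel = single_thread_estimate()
    print(f"relative single thread performance: {rel:.2f}")
    print(f"equivalent Phenom II clock: {rel * 2.3:.1f} GHz")
```

	&lt;p&gt;With the default values this reproduces the 2.26 and the 5.2GHz from the quote. Setting the turbo factor to 1.0, for example, shows how much of the result hangs on that single assumption.&lt;/p&gt;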
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/08/28/a-quick-round-of-links-9265110/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/08/28/a-quick-round-of-links-9265110/</link><pubDate>Sat, 28 Aug 2010 01:13:31 +0200</pubDate></item><item><title>More on Bulldozer, Bobcat and a first APU from GlobalFoundries</title><description>	&lt;p&gt;I'm sure you all have seen the &lt;a href="http://www.anandtech.com/show/3865/amd-bobcat-bulldozer-hot-chips-presentations-online"&gt;more detailed Hot Chips slides&lt;/a&gt; covering Bulldozer and Bobcat. They caused a lot of discussion, e.g. if per core IPC goes up with less integer units or if BD desktop CPUs will fit into AM3 sockets (although &lt;a href="http://www.planet3dnow.de/vbulletin/attachment.php?attachmentid=16950&amp;stc=1&amp;d=1258014773"&gt;AMD's roadmap&lt;/a&gt; listed "AM3" under Zambezi). There is a lot to say and I'll add to some of these discussions soon. Keep in mind, that some seemingly unchanged details (compared to family 10h cores) in the architecture actually changed, which could mean a lot to IPC. For example the integer scheduler became a unified scheduler. So instructions are no more bound to a certain ALU/AGU but can be send to any unit if it is available. Well, more on that later.&lt;/p&gt;
	&lt;p&gt;Another interesting task will be to go through all the speculations and check, which turned out to be true and which didn't and why. You'll also remember the 4 ALU / 3 AGU / 4 FP op thing found in the Open64 compiler source code. And there was also this "accelerate mode" mentioned in a GCC mailing list posting. To me it seems it's still not clear how exactly the Bulldozer units work together, how their clock frequencies are related (e.g. are they all the same or different) etc. AMD didn't publish a pipeline diagram for Bulldozer as they did for Bobcat - surely for a reason.&lt;/p&gt;
	&lt;p&gt;And then there are future Fusion designs (much more tightly coupled) and also Intel's Haswell architecture to speculate about &lt;img src="/img/smilies/icon_wink.gif" alt=";)" class="middle" border="0"&gt;&lt;/p&gt;
	&lt;p&gt;At Hot Chips Microsoft disclosed their own APU: the heart of the new Xbox 360 250GB, a SoC with CPU and GPU on a single die. There is an article at &lt;a href="http://arstechnica.com/gaming/news/2010/08/microsoft-beats-intel-amd-to-market-with-cpugpu-combo-chip.ars"&gt;Arstechnica&lt;/a&gt; covering it and another one at &lt;a href="http://pc.watch.impress.co.jp/docs/column/kaigai/20100825_389002.html"&gt;PC Watch&lt;/a&gt; (in Japanese, translation &lt;a href="http://translate.google.com/translate?js=y&amp;prev=_t&amp;hl=en&amp;ie=UTF-8&amp;layout=1&amp;eotf=1&amp;u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20100825_389002.html&amp;sl=ja&amp;tl=en"&gt;here&lt;/a&gt;), which contains many presentation slides. This chip will be produced at GlobalFoundries.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/08/25/more-on-bulldozer-bobcat-and-a-first-apu-from-globalfoundries-9247191/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/08/25/more-on-bulldozer-bobcat-and-a-first-apu-from-globalfoundries-9247191/</link><pubDate>Wed, 25 Aug 2010 21:14:57 +0200</pubDate></item><item><title>Bulldozer and Bobcat at Hot Chips #1</title><description>	&lt;p&gt;Today AMD will present more details of their new microarchitectures on the last day of this year's Hot Chips conference. There are already many articles up, mostly covering what AMD presented to journalists last week. Many articles are being collected in &lt;a href="http://www.xtremesystems.org/forums/showthread.php?t=257927"&gt;this XS thread&lt;/a&gt;. I worked with Dr@ from Planet3DNow on this &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?t=384581"&gt;news article&lt;/a&gt; (in German).&lt;/p&gt;
	&lt;p&gt;AMD published the &lt;a href="http://blogs.amd.com/work/2010/08/23/%E2%80%9Dbulldozer%E2%80%9D-20-questions-round-one/"&gt;first round of answers&lt;/a&gt; to the 20 Bulldozer questions, as hinted by John Fruehe. He also mentioned that some journalists got more technical presentations and that their NDAs will lift later today, after AMD's Hot Chips presentations have taken place. AMD also made their &lt;a href="http://blogs.amd.com/press/2010/08/24/amd-hot-chips-press-kit/"&gt;Hot Chips press kit&lt;/a&gt; available.&lt;/p&gt;
	&lt;p&gt;However, Anand Lal Shimpi already published &lt;a href="http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010"&gt;a much more detailed article&lt;/a&gt;, covering the BD units (e.g. 2 ALUs and 2 AGUs plus MUL/DIV per core), the high frequency design (deeper pipeline), improved branch prediction, energy management and much more. He also provides a lot of details regarding Bobcat.&lt;/p&gt;
	&lt;p&gt;I'm waiting for the final wave of articles. So far, some of the speculations seem to be correct, while others are not. That's the unforgiving nature of speculations. Now it's time to cut some branches of speculation and concentrate further on the others.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/08/24/bulldozer-and-bobcat-at-hot-chips-9237288/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/08/24/bulldozer-and-bobcat-at-hot-chips-9237288/</link><pubDate>Tue, 24 Aug 2010 15:04:25 +0200</pubDate></item><item><title>Bulldozer Preview Article online</title><description>	&lt;p&gt;After a lot of work and research, my Bulldozer Preview article written in German (just in time before Hot Chips - which would make many speculations obsolete) &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?t=384394"&gt;went online at Planet3DNow&lt;/a&gt;. Lack of time didn't allow me to translate this article into English by hand, so I can only offer the well known translation services to make the content accessible to non-German-speaking readers:&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://translate.google.com/translate?js=y&amp;prev=_t&amp;hl=de&amp;ie=UTF-8&amp;layout=1&amp;eotf=1&amp;u=http%3A%2F%2Fwww.planet3dnow.de%2Fvbulletin%2Fshowthread.php%3Ft%3D384394&amp;sl=de&amp;tl=en"&gt;Googlish&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://de.babelfish.yahoo.com/translate_url?doit=done&amp;tt=url&amp;intl=1&amp;fr=bf-home&amp;trurl=http%3A%2F%2Fwww.planet3dnow.de%2Fvbulletin%2Fshowthread.php%3Ft%3D384394&amp;lp=de_en&amp;btnTrUrl=%C3%9Cbersetzen"&gt;Babelfish&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://www.microsofttranslator.com/BV.aspx?ref=BVNav&amp;from=&amp;to=en&amp;a=http%3A%2F%2Fwww.planet3dnow.de%2Fvbulletin%2Fshowthread.php%3Ft%3D384394%26garpg%3D3"&gt;Binglish&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;These translations look at least readable &lt;img src="http://www.blog.de/image/smileys/08wink.gif" alt=""&gt;&lt;/p&gt;
	&lt;p&gt;Matt.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/08/18/bulldozer-preview-article-online-9203608/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/08/18/bulldozer-preview-article-online-9203608/</link><pubDate>Wed, 18 Aug 2010 21:51:27 +0200</pubDate></item><item><title>More Sandy Bridge performance numbers</title><description>	&lt;p&gt;Compared to Bulldozer, there already is a nice collection of benchmark numbers for Sandy Bridge. For example, there are those &lt;a href="http://forum.coolaler.com/showthread.php?t=240578&amp;page=1"&gt;posted by Coolaler&lt;/a&gt;, a few &lt;a href="http://citavia.blog.de/2010/07/06/bulldozer-likely-with-4-alus-and-at-least-3-agus-per-core-8927293/"&gt;BOINC benchmark results&lt;/a&gt; and a &lt;a href="http://www.youtube.com/watch?v=TkP4rEV_MTE"&gt;video&lt;/a&gt; with a mobile Sandy Bridge running Cinema 4D. The &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?t=368878&amp;page=9"&gt;video analysis&lt;/a&gt; done in the Planet3DNow forums resulted in a deciphered score of &lt;a href="http://www.planet3dnow.de/vbulletin/showthread.php?p=4253785#post4253785"&gt;19641&lt;/a&gt;, confirmed by the measured run time (44 s). This means that the tested mobile Sandy Bridge processor was as fast as a Core i7-975 Extreme. Another comparison can be made using a recently published Geekbench result of a &lt;a href="http://browse.geekbench.ca/geekbench2/view/273184"&gt;1.6 GHz Sandy Bridge&lt;/a&gt; CPU. So I compared it to a &lt;a href="http://browse.geekbench.ca/geekbench2/view/274637"&gt;Core i7&lt;/a&gt; also running at 1.6 GHz and made the following table with overall results and a diagram showing the differences in detail.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://data6.blog.de/media/417/4863417_081e137fc4_m.png" alt="SBvsi7"&gt;&lt;br&gt;So the average performance increase with those CPUs at the same base clock, but with different Turbo Boost implementations, is about 20%. In the diagram below we can see a significant average difference in multi-threaded benchmarks:&lt;br&gt; &lt;img src="http://data6.blog.de/media/478/4863478_a4aa2e84fe_m.png" alt="SBvsi7diag"&gt;&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/08/06/more-sandy-bridge-performance-numbers-9128712/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/08/06/more-sandy-bridge-performance-numbers-9128712/</link><pubDate>Fri, 06 Aug 2010 00:38:32 +0200</pubDate></item><item><title>GCC scheduler code for Bulldozer</title><description>	&lt;p&gt;Another piece of &lt;a href="http://gcc.gnu.org/ml/gcc-patches/2010-07/msg00717.html"&gt;source code&lt;/a&gt; containing an instruction scheduling algorithm for Bulldozer appeared on the GCC patches mailing list a few weeks ago. This source code sheds some light on an &lt;a href="http://gcc.gnu.org/ml/gcc/2010-06/msg00402.html"&gt;earlier posting&lt;/a&gt; on the (general) GCC mailing list, as &lt;a href="http://citavia.blog.de/2010/06/14/more-signs-of-bulldozer-8805042/"&gt;reported&lt;/a&gt; a couple of weeks ago. In combination with the &lt;a href="http://citavia.blog.de/2010/07/06/bulldozer-likely-with-4-alus-and-at-least-3-agus-per-core-8927293/"&gt;Open64 sources&lt;/a&gt; the picture becomes a bit clearer - but not completely. One of the more interesting parts might be this one:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;&lt;span&gt;+/* The size of the dispatch window is the total number of bytes of&lt;br&gt; +   object code allowed in a window.  */&lt;br&gt; +#define DISPATCH_WINDOW_SIZE 16&lt;br&gt; +&lt;br&gt; +/* Number of dispatch windows considered for scheduling.  */&lt;br&gt; +#define MAX_DISPATCH_WINDOWS 3&lt;br&gt; +&lt;br&gt; +/* Maximum number of instructions in a window.  */&lt;br&gt; +#define MAX_INSN 4&lt;br&gt; +&lt;br&gt; +/* Maximum number of immediate operands in a window.  */&lt;br&gt; +#define MAX_IMM 4&lt;br&gt; +&lt;br&gt; +/* Maximum number of immediate bits allowed in a window.  */&lt;br&gt; +#define MAX_IMM_SIZE 128&lt;br&gt; +&lt;br&gt; +/* Maximum number of 32 bit immediates allowed in a window.  */&lt;br&gt; +#define MAX_IMM_32 4&lt;br&gt; +&lt;br&gt; +/* Maximum number of 64 bit immediates allowed in a window.  */&lt;br&gt; +#define MAX_IMM_64 2&lt;br&gt; +&lt;br&gt; +/* Maximum total of loads or prefetches allowed in a window.  */&lt;br&gt; +#define MAX_LOAD 2&lt;br&gt; +&lt;br&gt; +/* Maximum total of stores allowed in a window.  */&lt;br&gt; +#define MAX_STORE 1&lt;/span&gt;&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;So there are some dispatch window restrictions, which fit very well with what is already known. In addition, it seems to me that "micro op" refers to an operation which could contain both an ALU/FP op and a mem op - similar to the MacroOps found in K7 to K8L (or K10).&lt;/p&gt;
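	&lt;p&gt;To make the restrictions more tangible, here is a small Python sketch that checks a candidate group of instructions against the quoted limits. The constants are taken from the patch, but everything else is an assumption: the Insn record and the fits_window function are hypothetical and do not mirror GCC's data structures, and the sketch only counts 32 and 64 bit immediates.&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass

# Limits copied from the quoted GCC patch.
DISPATCH_WINDOW_SIZE = 16  # bytes of object code per window
MAX_INSN = 4               # instructions per window
MAX_IMM = 4                # immediate operands per window
MAX_IMM_SIZE = 128         # immediate bits per window
MAX_IMM_32 = 4             # 32 bit immediates per window
MAX_IMM_64 = 2             # 64 bit immediates per window
MAX_LOAD = 2               # loads or prefetches per window
MAX_STORE = 1              # stores per window


@dataclass
class Insn:
    """Hypothetical instruction record, not a GCC data structure."""
    length: int            # encoded length in bytes
    imm_bits: int = 0      # 0, 32 or 64 (other sizes are ignored here)
    is_load: bool = False
    is_store: bool = False


def fits_window(insns):
    """Return True if the group respects every quoted per-window limit."""
    imm32 = sum(1 for i in insns if i.imm_bits == 32)
    imm64 = sum(1 for i in insns if i.imm_bits == 64)
    checks = [
        DISPATCH_WINDOW_SIZE >= sum(i.length for i in insns),
        MAX_INSN >= len(insns),
        MAX_IMM >= imm32 + imm64,
        MAX_IMM_SIZE >= 32 * imm32 + 64 * imm64,
        MAX_IMM_32 >= imm32,
        MAX_IMM_64 >= imm64,
        MAX_LOAD >= sum(1 for i in insns if i.is_load),
        MAX_STORE >= sum(1 for i in insns if i.is_store),
    ]
    return all(checks)


# Two loads, one store and one plain op, 4 bytes each: fills the window exactly.
group = [Insn(4, is_load=True), Insn(4, is_load=True),
         Insn(4, is_store=True), Insn(4)]
print(fits_window(group))
```
&lt;/pre&gt;
	&lt;p&gt;Note that four 32 bit immediates or two 64 bit immediates both add up to exactly 128 bits, so the immediate limits are consistent with each other.&lt;/p&gt;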
	&lt;p&gt;I'm still wondering how the mentioned accelerate mode works. The older GCC posting says:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;The new hardware issues two windows of the size N bytes of instructions in every cycle. It goes into accelerate mode if the windows have the right combination of instructions or alignments.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;This looks to me like Bulldozer could issue up to the equivalent of &lt;strong&gt;eight&lt;/strong&gt; x86 ops per cycle if its accelerate mode kicks in (presumably two windows per cycle with up to four instructions each). How does that fit with the 4-wide decoder assumed by many of us? Well, looking into the &lt;a href="http://www.freepatentsonline.com/y2009/0019263.html"&gt;patents&lt;/a&gt; &lt;a href="http://www.freepatentsonline.com/y2009/0019257.html"&gt;again&lt;/a&gt; I found:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;It is noted, however, that in other embodiments the components of processor core 100 may determine the actual start of &lt;strong&gt;two or eight&lt;/strong&gt; variable length instructions per cycle (or other quantities). In other words, the &lt;strong&gt;design may be scalable&lt;/strong&gt; to meet various specifications, as desired.&lt;/p&gt;
	&lt;/blockquote&gt;
	&lt;p&gt;So while in the past - mostly based on described embodiments found in the patents - it looked like there would be a 4-wide decoder and two integer cores/clusters with 2 ALUs and 2 AGUs each, it now looks more like not only the integer cores are wider (4/4), but the decoder as well. This could be achieved by making it this big or by double pumping it. Instead of double pumping, it could be running at a higher clock frequency than the cores. OTOH, instead of wider decoders there could also be a trace cache as already speculated. What we'll hear about at Hot Chips depends on what has won the performance/power/leakage contest. E.g. a trace cache could have too much leakage because of its area, so that in the end refetching and redecoding the predecoded instructions from the instruction cache might be more efficient.&lt;/p&gt;
	&lt;p&gt;P.S.: Maybe you already noticed the Ontario BOINC related news popping up recently. It started on &lt;a href="http://www.heise.de/newsticker/meldung/Erste-Hinweise-auf-die-Performance-des-AMD-Netbookprozessors-Ontario-1047271.html"&gt;Heise News&lt;/a&gt;, which was kind enough to cite and link its &lt;a href="http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/"&gt;source&lt;/a&gt;. &lt;a href="http://www.hardware-infos.com/news.php?news=3644"&gt;Hardwareinfos&lt;/a&gt; was next, making a table based on the article published on Heise News. And this one finally was the base for at least a dozen news bits all around the world. Nice to see how it evolves.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/07/30/gcc-scheduler-code-for-bulldozer-9074571/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/07/30/gcc-scheduler-code-for-bulldozer-9074571/</link><pubDate>Fri, 30 Jul 2010 21:35:51 +0200</pubDate></item><item><title>Bulldozer likely with 4 ALUs and at least 3 AGUs per core</title><description>	&lt;p&gt;Some recent digging into the Open64 source code revealed more information about a microarchitecture, which seems to be Bulldozer's. This open source code looks to be a good source as shown before (&lt;a href="http://citavia.blog.de/2010/01/21/bulldozer-s-cache-sizes-leaked-7846952/"&gt;cache sizes&lt;/a&gt; or &lt;a href="http://citavia.blog.de/2010/01/21/some-instruction-latency-numbers-of-bulldozer-7850137/"&gt;instruction latencies&lt;/a&gt;). For example, the file &lt;a href="http://svn.open64.net/filedetails.php?repname=Open64&amp;path=%2Fbranches%2Fopen64-booster%2Fosprey%2Fbe%2Fcg%2Fx8664%2Fcg_sched.cxx&amp;"&gt;cg_sched.cxx&lt;/a&gt; (part of the code generator and responsible for the scheduling of instructions in compiled code) contains:&lt;/p&gt;
&lt;code&gt;&lt;strong&gt;&lt;span&gt;static&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;const&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;int&lt;/span&gt;&lt;/strong&gt; num_fu[] = {&lt;br&gt;&lt;/code&gt;&lt;code&gt;  0,   &lt;em&gt;&lt;span&gt;/* NONE  */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  4,   &lt;em&gt;&lt;span&gt;/* ALU   */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  3,   &lt;em&gt;&lt;span&gt;/* AGU   */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  4,   &lt;em&gt;&lt;span&gt;/* FPU   */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;};&lt;br&gt;&lt;/code&gt;	&lt;p&gt;This looked like this in an &lt;a href="http://svn.open64.net/filedetails.php?repname=Open64&amp;path=%2Fbranches%2Fopen64-booster%2Fosprey%2Fbe%2Fcg%2Fx8664%2Fcg_sched.cxx&amp;rev=3107"&gt;earlier revision&lt;/a&gt; (according to the comment it is dedicated to the Opteron CPU):&lt;/p&gt;
&lt;code&gt;&lt;strong&gt;&lt;span&gt;static&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;const&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;int&lt;/span&gt;&lt;/strong&gt; num_fu[] = {&lt;br&gt;&lt;/code&gt;&lt;code&gt;  0,   &lt;em&gt;&lt;span&gt;/* NONE */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  3,   &lt;em&gt;&lt;span&gt;/* ALU  */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  3,   &lt;em&gt;&lt;span&gt;/* AGU  */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  1,   &lt;em&gt;&lt;span&gt;/* FADD */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  1,   &lt;em&gt;&lt;span&gt;/* FMUL */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  1,   &lt;em&gt;&lt;span&gt;/* FMISC */&lt;/span&gt;&lt;/em&gt;&lt;br&gt;&lt;/code&gt;&lt;code&gt;};&lt;br&gt;&lt;/code&gt;	&lt;p&gt;So this looks like each integer core has 4 ALUs and at least 3 AGUs (perhaps there are 4 for easier scheduling, but only 3 can be used during a single cycle). The number of AGUs fits well with the already speculated 2 loads and 1 store per cycle. The 4 FP units match the 4-issue FPU mentioned by Chuck Moore. Now one might think that the available decode bandwidth is not enough to keep two integer cores with these execution capabilities highly utilized. But since Bulldozer could be a latency tolerant design with data speculation, checkpointing, replay and runahead execution to cover L1 and L2 misses, the execution resources could be needed for such features.&lt;/p&gt;
	&lt;p&gt;The changes came with a comment "Phase 1 implementation of support for new target work". Some other interesting lines (copied from different lines of the file):&lt;/p&gt;
&lt;code&gt;  &lt;strong&gt;&lt;span&gt;static&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;const&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;int&lt;/span&gt;&lt;/strong&gt; load_ops_rate = 2;&lt;br&gt;&lt;/code&gt;&lt;code&gt;  &lt;strong&gt;&lt;span&gt;static&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;const&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;int&lt;/span&gt;&lt;/strong&gt; store_ops_rate = 1;&lt;/code&gt;&lt;br&gt;or&lt;br&gt;&lt;code&gt;  &lt;strong&gt;&lt;span&gt;static&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;const&lt;/span&gt;&lt;/strong&gt; &lt;strong&gt;&lt;span&gt;int&lt;/span&gt;&lt;/strong&gt; issue_rate = 4;&lt;/code&gt;&lt;br&gt;	&lt;p&gt;Further reading reveals more info, for example that there are up to 4 single decoded ops per dispatch group. Some instructions are decoded as fast double decoded ops like in K8 or K10. The number of 64 bit immediates is limited to 2. I assume that there can be up to four 32 bit immediates. So there seems to exist what has already been mentioned as an immediate/constant steering unit, described in one patent.&lt;/p&gt;
	&lt;p&gt;There is more, but I think these were the most interesting findings.&lt;/p&gt;
	&lt;p&gt;P.S.: Since looking for BOINC stats of Ontario and Llano was successful I tried the same for Sandy Bridge (although there are other benchmark results out there):&lt;/p&gt;
	&lt;p&gt;&lt;a href="http://en.allprojectstats.com/show.php?projekt=1&amp;id=2541730"&gt;Sandy Bridge Stepping 3, 2.2GHz&lt;/a&gt;&lt;br&gt;&lt;a href="http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1285409"&gt;Sandy Bridge Stepping 2, 2.0GHz&lt;/a&gt;&lt;br&gt;&lt;a href="http://allprojectstats.com/show.php?projekt=0&amp;id=4279387743"&gt;Sandy Bridge Stepping 0, 2.2GHz&lt;/a&gt;&lt;br&gt;&lt;a href="http://en.allprojectstats.com/show.php?projekt=1&amp;id=2550250"&gt;Sandy Bridge Stepping 3, 2.4GHz&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/07/07/bulldozer-likely-with-4-alus-and-at-least-3-agus-per-core-8927293/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/07/07/bulldozer-likely-with-4-alus-and-at-least-3-agus-per-core-8927293/</link><pubDate>Wed, 07 Jul 2010 00:56:58 +0200</pubDate></item><item><title>Some updates regarding Ontario, BOINC and Bobcat</title><description>	&lt;p&gt;BOINC is indeed doing double precision floating point calculations during the floating point benchmark as can be seen in the &lt;a href="http://boinc.berkeley.edu/trac/browser/branches/boinc_core_release_6_10/client/whetstone.cpp"&gt;source code&lt;/a&gt;. Thanks, Alex. So the BOINC results of Ontario (Bobcat core) indicate a rather high DP throughput for a mobile CPU core. It is possible that the FPU contains a multiplier like the one described in the paper I mentioned in my last blog entry. You can read it in full &lt;a href="http://mesa.ece.wisc.edu/publications/cp_2009-02.pdf"&gt;here&lt;/a&gt;. Thanks, Hans.&lt;/p&gt;
	&lt;p&gt;After looking at the Whetstone code I found that additions and subtractions play a bigger role than multiplications, and there are also several divisions and even square root, transcendental and trigonometric functions. Unlike DP multiplications, the add/sub instructions shouldn't have a reduced throughput. And I remembered another paper, covering the division algorithm. You can read it &lt;a href="http://mesa.ece.wisc.edu/~mesa/publications/cp_2007-12.pdf"&gt;here&lt;/a&gt;. The described Goldschmidt division algorithm achieves a rather low latency using the rectangular multiplier.&lt;/p&gt;
	&lt;p&gt;Comparing the Bobcat core to an Atom core at 1.66 GHz (such as &lt;a href="http://en.allprojectstats.com/show.php?projekt=0&amp;id=4279616414"&gt;this one&lt;/a&gt;), the benchmarked Bobcat cores deliver about 2x the integer performance and 3x the FP performance of Atom. Thanks, informal.&lt;/p&gt;
	&lt;p&gt;And if you haven't noticed, there is a &lt;a href="http://www.xtremesystems.org/forums/showthread.php?p=4427287"&gt;list of sockets and specs&lt;/a&gt; for AMD's Fusion MPU lineup. Some news sites use this data to produce multiple news bits, but sometimes it's better to go back to the source and get it all at once. In this list you'll find 3 core Llanos like the one tested with BOINC. The TDPs for the FT1 socket (Ontario, family 20 or 14h) indicate a possible TDP between 9 and 20 W for a dual core. This includes the GPU part and some uncore stuff.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/07/02/some-updates-regarding-ontario-boinc-and-bobcat-8899248/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/07/02/some-updates-regarding-ontario-boinc-and-bobcat-8899248/</link><pubDate>Fri, 02 Jul 2010 00:00:19 +0200</pubDate></item><item><title>Llano Tri Core and Ontario Dual Core spotted</title><description>	&lt;p&gt;According to some BOINC stats there are some Fusion APU samples in the labs, running complex calculations for over a month.&lt;/p&gt;
	&lt;p&gt;There we have: &lt;br&gt; &lt;a href="http://de.allprojectstats.com/show.php?projekt=0&amp;id=4279889175"&gt;Ontario 1 ("AMD64 Family 20 Model 0 Stepping 0")&lt;/a&gt;&lt;br&gt; &lt;a href="http://de.allprojectstats.com/show.php?projekt=0&amp;id=4279787263"&gt;Ontario 2 ("AMD64 Family 20 Model 0 Stepping 0")&lt;/a&gt;&lt;br&gt; &lt;a href="http://de.allprojectstats.com/show.php?projekt=0&amp;id=4279752916"&gt;Llano 1 ("AMD64 Family 18 Model 0 Stepping 0")&lt;/a&gt;&lt;br&gt; &lt;a href="http://de.allprojectstats.com/show.php?projekt=0&amp;id=4279733735"&gt;Llano 2 ("AMD64 Family 18 Model 0 Stepping 0")&lt;/a&gt;&lt;/p&gt;
	&lt;p&gt; The operating systems being used are Win 2k8 x64 and Linux. A model number 0 and stepping number 0 for all of them indicate early samples (likely early A0 silicon). These systems could actually be just two systems, running different OSes at different times. The number of cores (2 for Ontario, 3! for Llano) and the RAM size are the same. One strange number is the cache size of the Ontarios. In one case it is listed as 512 kB, which is ok. But in the other case it is 488 kB, 24 kB less than the maximum amount. One explanation could be a power management feature, which dynamically resizes the L2 cache depending on cache usage and power budget. Edit: This seems to be a problem of the BOINC software.&lt;/p&gt;
	&lt;p&gt;The ratio of integer performance to floating point performance, which I &lt;a href="http://citavia.blog.de/2010/03/02/some-more-info-on-thuban-8105413/"&gt;used in the past&lt;/a&gt; to detect a turbo mode in Thuban, is 3.1 for Llano 1 and 3.9 (+26%) for Llano 2 (on Linux, so probably with customized processor driver). In case of Thuban the difference was 3.7 to 3.2 (+16%). This &lt;em&gt;could&lt;/em&gt; indicate an improved turbo mode, but this is what is expected for Llano. But there obviously is a flaw in the measurement method, since we don't know when a frequency boost happens.&lt;/p&gt;
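	&lt;p&gt;The percentages above follow directly from the integer/FP ratios. A minimal sketch, using only the ratios given in this post (the function name ratio_gain_percent is made up for the example):&lt;/p&gt;
&lt;pre&gt;
```python
def ratio_gain_percent(low_ratio, high_ratio):
    """Relative increase of the integer/FP ratio between two runs, in percent."""
    return (high_ratio / low_ratio - 1.0) * 100.0


# Integer/FP benchmark ratios as quoted in the text.
llano_gain = ratio_gain_percent(3.1, 3.9)
thuban_gain = ratio_gain_percent(3.2, 3.7)

print(f"Llano: +{llano_gain:.0f}%  Thuban: +{thuban_gain:.0f}%")
```
&lt;/pre&gt;
	&lt;p&gt;A larger spread between two runs of the same chip hints at a clock boost during one of the benchmark parts, with the caveat already mentioned that we don't know when the boost happens.&lt;/p&gt;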
	&lt;p&gt;Ontario has a ratio of 2.3 which speaks for the reduced number of integer units in its Bobcat cores. Another factor is the throughput of the floating point unit. If I'm correct, BOINC uses single precision arithmetic in some of its projects. If it uses single precision for the floating point benchmark, the throughput could be four times that of double precision calculations according to a &lt;a href="http://www.computer.org/portal/web/csdl/doi/10.1109/TC.2008.203"&gt;paper&lt;/a&gt;, which I see as being related to the Bobcat FPU. I've read it in full, but the linked abstract tells enough:&lt;/p&gt;
	&lt;blockquote&gt;
	&lt;p&gt;"The FPM can perform two parallel single-precision multiplies every cycle with a latency of two cycles, one double-precision multiply every two cycles with a latency of four cycles, or one extended-double-precision multiply every three cycles with a latency of five cycles."&lt;/p&gt;
	&lt;/blockquote&gt;
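	&lt;p&gt;The "four times" figure can be derived from the multiply rates in the quoted abstract. The following sketch only restates those rates as operations per cycle; the dictionary name and keys are chosen for this example, and it says nothing about add/sub or division throughput.&lt;/p&gt;
&lt;pre&gt;
```python
from fractions import Fraction

# Multiplies per cycle according to the quoted abstract.
MULTIPLIES_PER_CYCLE = {
    "single": Fraction(2, 1),    # two parallel SP multiplies every cycle
    "double": Fraction(1, 2),    # one DP multiply every two cycles
    "extended": Fraction(1, 3),  # one extended-DP multiply every three cycles
}

# Ratio of single precision to double precision multiply throughput.
sp_over_dp = MULTIPLIES_PER_CYCLE["single"] / MULTIPLIES_PER_CYCLE["double"]
print(f"SP multiply throughput is {sp_over_dp}x the DP throughput")
```
&lt;/pre&gt;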
	&lt;p&gt;According to my table of BOINC results, the cores of the Llano sample(s) are comparable to 1.5 to 1.9 GHz Phenom II cores. I think it's more likely that the actual clock frequency range is 1.4 to 1.8 GHz due to the changes to the core and the bigger L2. The Ontario core's integer performance is comparable to a 1.3 GHz Phenom II core, while the (single precision) floating point performance even matches that of a 1.6 GHz Phenom II core. The BOINC benchmark only measures single core performance.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/06/29/llano-tri-core-and-ontario-dual-core-spotted-8884456/</link><pubDate>Tue, 29 Jun 2010 12:05:47 +0200</pubDate></item><item><title>Bulldozer and Bobcat presentations at Hot Chips 22</title><description>	&lt;p&gt;On August 24th AMD will present more details of Bulldozer and Bobcat at the next Hot Chips conference according to the &lt;a href="http://www.hotchips.org/program/conference-day-two/"&gt;schedule&lt;/a&gt;. The presenters will be Mike Butler and Brad Burgess. While the latter is known for working on several PowerPC designs, Mike Butler is named as inventor in several patents or patent applications. The described inventions cover many interesting topics like eager execution (or runahead execution), checkpointing and instruction replay (for quick recovery from different kinds of speculation), loop detection and even redundant computing.&lt;/p&gt;
	&lt;p&gt;Thanks to Hans de Vries at SemiAccurate and isigrim at P3DNow! for pointing this out.&lt;/p&gt;
	&lt;p&gt;P.S.: Don't miss the &lt;a href="http://citavia.blog.de/2010/06/14/more-signs-of-bulldozer-8805042/#c13340729"&gt;discussion&lt;/a&gt; going on in the comments to my last blog entry. There we try to find the reasons behind the instruction windows mentioned in the &lt;a href="http://gcc.gnu.org/ml/gcc/2010-06/msg00402.html"&gt;GCC mail&lt;/a&gt;.&lt;/p&gt;
	&lt;p&gt;&lt;img src="http://nuje.de/img/ddb5.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt; &lt;small&gt; &lt;a href="http://citavia.blog.de/2010/06/22/bulldozer-and-bobcat-presentations-at-hot-chips-8844003/#comments"&gt;Comments&lt;/a&gt; &lt;/small&gt; &lt;/p&gt; </description><link>http://citavia.blog.de/2010/06/22/bulldozer-and-bobcat-presentations-at-hot-chips-8844003/</link><pubDate>Tue, 22 Jun 2010 08:15:13 +0200</pubDate></item></channel></rss>
