iXBT Labs - AMD Phenom X4 Bug. The Effect of AMD Patch on Processor and Platform Low-Level Characteristics

Our main review of the new architecture used in AMD Phenom X4 processors already touched upon their bug in L3 Cache TLB, which can freeze the system in experimental conditions (and probably in real conditions, according to the latest info). Our Phenom test results show that the bug does not have to do with L1/L2 D-TLB and/or I-TLB. Perhaps, it really affects some area in the integrated memory controller and/or L3 Cache, which can be called TLB (translation lookaside buffer). But this structure is not officially documented for these processors (and we don't know its characteristics). Anyway, AMD confirms that the bug exists. Motherboard manufacturers are also aware of this bug, so they release new BIOS versions with the patch to fix it. Unfortunately, it cannot be fixed "for free" (to all appearances, the patch disables this "TLB" structure), it reduces system performance by approximately 10-15%. According to some test results that have recently appeared in the Web, this patch can reduce system performance in various real applications (the spread of results is wide, the average performance drop is about 14%). However, the idea that the bug (and the corresponding patch) can affect a processor core, that is its execution units, seems unlikely. System performance is probably reduced because of deteriorated speed characteristics of memory and/or L3 Cache in the integrated memory controller. In order to prove it and to show how exactly the above-mentioned systems change with the patch, we decided to compare low-level characteristics of the AMD Phenom X4 platform with what we already obtained without the patch.

Testbed configurations

Testbed 1 (without the patch)

CPU: AMD Phenom X4 9700 (engineering sample, CPUID 100F22h, Barcelona core rev. B2, 2.4 GHz CPU, 2.0 GHz NB)
Chipset: AMD 790FX
Motherboard: MSI K9A2 Platinum, BIOS V1.1B3 dated 16.11.2007
Memory: 2x1 GB Corsair XMS2-6400 DDR2-800, 5-5-5-18 timings, ganged mode

Testbed 2 (patched)

CPU: AMD Phenom X4 9700 (engineering sample, CPUID 100F22h, Barcelona core rev. B2, 2.4 GHz CPU, 2.4 GHz NB)
Chipset: AMD 790FX
Motherboard: Gigabyte MA790GX-DQ6, BIOS F3c dated 07.12.2007
Memory: 2x1 GB Corsair XMS2-6400 DDR2-800, 5-5-5-18 timings, ganged mode

Real Bandwidth of Data Cache/Memory

First of all I'd like to note that absolute test results of the patched AMD Phenom X4 published in this review may differ from our older results (obtained without the patch) because of different motherboards used in the testbeds (see testbed configurations). In particular, when we tested the platform for the first time (Testbed 1, MSI K9A2 Platinum motherboard), we didn't know which frequency of the north bridge and L3 Cache were set by default. What concerns the second system analyzed in this article (Testbed 2, Gigabyte MA790GX-DQ6 motherboard), AMD Overdrive detects that the default frequency of the memory controller and L3 Cache is 2.0 GHz, while the core clock rate is 2.4 GHz. It will be reasonable to assume that the same default mode is used by the first motherboard. In this article we've increased the frequency of the north bridge to 2.4 GHz manually in BIOS Setup to ensure synchronous operation of processor cores and the integrated memory controller and to compare this mode with the default one (2.4 GHz CPU, 2.0 GHz NB).

Measurements of D-Cache and Memory bandwidths are published in Table 1.

Table 1

Level	Average real bandwidth, bytes/cycle
Level	Phenom X4 without the patch	Patched Phenom X4
L1, reading, MMX L1, reading, SSE2 L1, writing, MMX L1, writing, SSE2	15.69 31.69 7.98 15.67	15.69 31.69 7.98 15.67
L2, reading, MMX L2, reading, SSE2 L2, writing, MMX L2, writing, SSE2	7.66 7.98 4.94 5.10	7.66 7.98 4.94 5.10
L3, reading, MMX L3, reading, SSE2 L3, writing, MMX L3, writing, SSE2	3.69 3.71 3.38 3.38	3.97 4.02 3.74 3.76
RAM^*, reading (SSE2) RAM, writing (SSE2)	6.38 GB/s (49.9%) 3.49 GB/s (27.3%)	5.23 GB/s (40.9%) 3.43 GB/s (26.8%)

^*values relative to theoretical FSB bandwidth limit are in parentheses

As we have expected, the patch does not change performance characteristics of L1- and L2- D-Caches. Performance characteristics of L3 Cache in the second testbed grow a little (by 8-11%), while memory performance drops significantly (especially, read bandwidth, approximately by 18%). Results of this test can be used to draw a preliminary conclusion that the patch does not affect L3 Cache much (its bandwidth grows, because its clock rate is increased from 2.0 GHz to 2.4 GHz, although bandwidth gain is smaller than the 20% frequency growth), but it affects memory bandwidth (despite the increased frequency of the memory controller). The lack of influence on L3 Cache will be confirmed by our other tests.

Maximum Real Memory Bandwidth

Maximum real memory bandwidth values of the patched system, published in Table 2, are also lower: maximum real memory bandwidth for reading drops by 19.5%, for writing - by approximately 4.5%.

Table 2

Operation	Maximum real memory bandwidth, GB/s*
Operation	Phenom X4 without the patch	Patched Phenom X4
Reading, Software Prefetch	7.49 (58.5%)	6.03 (47.1%)
Writing, Non-Temporal Store	4.99 (39.0%)	4.82 (37.7%)

*values relative to theoretical memory bus bandwidth limit are in parentheses

Average Latency of Data Cache/Memory

Let's proceed to tests of D-Cache and RAM latencies, which can reveal the effect of the patch on the qualitative level (Picture 1).

Picture 1. Average D-Cache/Memory Latency

Namely, the effect manifests itself as a surge in random access latency, starting from 2 MB block size, which corresponds to the L2 D-TLB size of the processor core. Thus, this test is an indirect proof that the L2 D-TLB miss penalty grows much, we'll see it in the special D-TLB test.

Table 3

Level, Access Mode	Average latency, cycles (ns)
Level, Access Mode	Phenom X4 without the patch	Patched Phenom X4
L1 Cache, in all cases	3.0	3.0
L2, forward L2, backward L2, pseudo-random L2, random^*	~9.2 ~9.0 ~12.1 ~14.5	~9.3 ~9.3 ~12.2 ~14.6
L3, forward L3, backward L3, pseudo-random L3, random^*	~19.4 ~19.5 ~31.9 ~47.5	~18.2 ~18.7 ~31.7 ~48.3
RAM, forward RAM, backward RAM, pseudo-random RAM, random^*	16.2 ns 17.0 ns 34.4 ns 85.3 ns	18.2 ns 18.6 ns 39.0 ns 225.8 ns

^*32 MB block size

Quantitative characteristics of L1-, L2-, and L3-Cache latencies, published in Table 3, show that practically nothing changes with the patch. It again proves that the patch does not affect L3 Cache. The increased memory latencies (approximately by 13% for pseudo-random access and over 2.6 times(!) for random access) prove that memory performance deteriorates significantly when L2 D-TLB misses (that is when these misses should be masked by the undocumented TLB structure of the integrated memory controller, disabled by the patch).

Minimum D-Cache/Memory Latency

Conclusions on average D-Cache and memory latencies can be applied to minimum latencies as well (Table 4). Note the reduced efficiency of hardware prefetch, which manifests itself in the increased memory latencies for forward and backward walks, although it cannot be a direct result of disabling "TLB" in the integrated memory controller.

Table 4

Level, Access Mode	Minimum latency, cycles (ns)
Level, Access Mode	Phenom X4 without the patch	Patched Phenom X4
L1 Cache, in all cases	3.0	3.0
L2, forward L2, backward L2, pseudo-random L2, random^*	~3.3 (3.0^) ~3.3 (3.0^) ~8.2 (3.0^) ~11.3 (12.0^)	~3.2 (3.0^) ~3.3 (3.0^) ~8.2 (3.0^) ~11.2 (12.0^)
L3, forward L3, backward L3, pseudo-random L3, random^*	~5.0 (3.0^) ~5.5 (3.0^) ~28.2 (3.0^) ~46.7 (48.0^)	~5.0 (3.0^) ~5.6 (3.0^) ~27.6 (3.0^) ~46.9 (48.0^)
RAM, forward RAM, backward RAM, pseudo-random RAM, random^*	4.6 ns 5.4 ns 33.9 ns 84.6 ns	7.0 ns 7.6 ns 38.5 ns 225.8 ns

^*32 MB block size
^**Values in brackets are obtained with Method 2

Data Cache Associativity

L1-, L2-, and L3 D-Cache associativity measurements (Picture 2) do not differ from the point of view of cache associativity values. But they demonstrate a significant associativity miss penalty for all cache levels, if more than 48 cache segments are used. This effect is probably related to the L2 D-TLB miss, when memory access latencies should be masked by TLB of the memory controller.

Picture 2. Data Cache Associativity

L1-L2 Cache Bus Real Bandwidth

Like the average real bandwidth of L2 Cache, the patch does not change real bandwidth of L1-L2 bus (see Table 5) either.

Table 5

Access mode	Real L1-L2 Bandwidth, bytes/cycle
Access mode	Phenom X4 without the patch	Patched Phenom X4
Reading (forward) Reading (backward) Writing (forward) Writing (backward)	7.98 7.99 4.71 4.67	7.99 7.99 4.73 4.67

Real Bandwidth of the L2-L3 Bus

What concerns the real bandwidth of L2(processor core)-L3(memory controller) bus, its speed characteristics (see Table 6), just like the previously measured bandwidth of L3 Cache, are a tad higher in the new testbed (approximately by 8%) because of the higher frequency of the integrated memory controller (2.4 GHz versus 2.0 GHz).

Table 6

Access mode	Real L2-L3 bandwidth, bytes/cycle
Access mode	Phenom X4 without the patch	Patched Phenom X4
Reading (forward) Reading (backward) Writing (forward) Writing (backward)	3.73 3.73 3.46 3.46	4.03 4.03 3.68 3.68

I-Cache, Real Decode/Execute Bandwidth

There is apparently no effect of the patch on the decode/execute speed of instructions from L1-I and L2 Caches (see Table 7). Yet, execution speed of code from L3 Cache grows again (approximately by the same 8% as in case of L3 bandwidth) because its frequency has grown from 2.0 GHz to 2.4 GHz.

Table 7

Instruction type (size, bytes)	Decode/execute bandwidth, bytes/cycle (instructions/cycle)
	Phenom X4 without the patch			Patched Phenom X4
	L1 I-Cache	L2 Cache	L3 Cache	L1 I-Cache	L2 Cache	L3 Cache
NOP (1)	3.00 (3.00)	3.00 (3.00)	1.88 (1.88)	3.00 (3.00)	3.00 (3.00)	1.88 (1.88)
SUB (2) XOR (2) TEST (2) XOR/ADD (2) CMP 1 (2)	6.00 (3.00)	3.78 (1.89)	1.99 (0.99)	6.00 (3.00)	3.78 (1.89)	2.15 (1.08)
CMP 2 (4)	11.99 (3.00)	3.78 (0.95)	1.99 (0.50)	11.99 (3.00)	3.78 (0.95)	2.15 (0.54)
CMP 3-6 (6)	17.97 (3.00)	3.78 (0.63)	1.99 (0.33)	17.97 (3.00)	3.78 (0.63)	2.15 (0.36)
Prefixed CMP 1-4 (8)	23.22 (2.90)	3.78 (0.47)	1.99 (0.25)	23.22 (2.90)	3.78 (0.47)	2.15 (0.27)

I-Cache Associativity

I-Cache Associativity test (Picture 3) shows an interesting situation. To be more exact, the second inflection of the L3 associativity at 50 cache segments that we found in the previous tests (without the patch) has disappeared here. At the same time, an associativity miss penalty of the last cache level grows much, like in the D-Cache Associativity test. So we can draw a conclusion that the effective associativity of L3 Cache is 14 (32 minus 18), and the second inflection at 50 cache segments in our old tests is just an artifact.

Picture 3. Data Cache Associativity

TLB Characteristics

The strongest effect from the patch is expectedly demonstrated in TLB tests. TLB characteristics themselves do not change, of course (because these architectural elements belong to the processor core), but the size/associativity miss penalty of the last TLB level grows significantly.

Picture 4. D-TLB size

Picture 5. L2 D-TLB Associativity

Picture 4 shows results of the D-TLB Size test, and Picture 5 - results of the L2 D-TLB Associativity test. The L2 D-TLB miss penalty grows much in both cases - approximately 290 cycles for size and 400 cycles for associativity.

Picture 6. I-TLB size

Picture 7. L2 I-TLB Associativity

The same results are demonstrated by the I-TLB size (Picture 6) and associativity (Picture 7) tests. The L2 I-TLB miss penalty for size is approximately 300 cycles, and its associativity miss penalty amounts to about 400 cycles, which is close to corresponding values obtained in the D-TLB tests.

Conclusions

What conclusions follow from our analysis? First of all, we can draw an important conclusion that AMD Phenom (K10) and AMD Athlon 64 (K8) processors indeed contain some structure in their integrated memory controllers, which can be called a large TLB. It's efficient both for data calls (D-TLB) and for code execution (I-TLB). Existence of this structure in both processor families can be proved by moderate miss penalties of the last level (L2) of D-TLB and I-TLB, about 20-40 cycles. When it's disabled (it's the most reasonable explanation of what the patch does with AMD Phenom processors), the above mentioned miss penalties grow significantly (up to 300-400 cycles, that is practically by ten times!) We should also mention Energy Efficient AMD Athlon 64 X2 EE processors here, where miss penalties of L2 D-TLB and I-TLB are initially high. We can assume that such processors either lack the TLB structure in the integrated memory controller (which is unlikely, because we cannot say that it's so hard to implement and consumes so much power, that it had to be removed from Energy Efficient modifications of these processors), or... it's initially disabled because of a bug, similar to the bug in the integrated memory controller in Phenom processors (which is much more likely).

The next conclusion, which can be drawn from our test results, is that the above-mentioned TLB structure belongs to the integrated memory controller, not to its L3 Cache (as was mentioned in early reports about the bug in Phenom processors). It's proved by the fact that the patch has practically no negative effect on performance characteristics (bandwidth and latency) of L3 Cache. Thus, a general performance drop in the patched system can be explained solely by the reduced memory performance characteristics, and particularly, by a significantly increased random access latency. Here is a summary table.

Characteristic	No patch	Patched	Difference
Average Memory Bandwidth for Reading	6.38 GB/s	5.23 GB/s	-18.0%
Average Memory Bandwidth for Writing	3.49 GB/s	3.43 GB/s	-1.7%
Maximum Memory Bandwidth for Reading	7.49 GB/s	6.03 GB/s	-19.5%
Maximum Memory Bandwidth for Writing	4.99 GB/s	4.82 GB/s	-3.4%
Average Memory Latency, pseudo-random access	34.4 ns	39.0 ns	+13.4%
Average Memory Latency, random access	85.3 ns	225.8 ns	+164.7%
L2 D-TLB miss penalty size	28 cycles	290 cycles	10.4 times
L2 D-TLB miss penalty associativity	34 cycles	400 cycles	11.8 times
L2 I-TLB miss penalty size	30 cycles	300 cycles	10.0 times
L2 I-TLB miss penalty associativity	36 cycles	400 cycles	11.1 times

It's a bad idea to publish a mean value of the patch effect on so different low-level characteristics - the spread of results varies from 1.7% to 11.8 times. And the effects themselves (for example, the increased TLB miss penalties) are not as strong in real applications, because these characteristics are purely synthetic. However, we can group reduced memory performance characteristics (18-20%), which are close to reality. This reduction is comparable to the reduction of system performance in most real applications, which operate data in a stream rather than in random way. What concerns random data access, we can expect higher performance drops, because the memory access latency in this mode grows significantly.

Dmitri Besedin (dmitri_b@ixbt.com)
February 28, 2008

Write a comment below. No registration needed!