Detailed Platform Analysis in RightMark Memory Analyzer. Part 9 - Dual Core Intel Pentium Extreme Edition 840 (Smithfield)

The first article from the new series.

On April 18, Intel announced its first dual-core processor, the Pentium Extreme Edition 840. It is based on the new core codenamed Smithfield and supports Extended Memory 64-bit Technology (EM64T). Note that the name of the new processor lacks the habitual "Pentium 4": it is shortened to "Pentium", though there is no doubt that the Smithfield core is based on the NetBurst microarchitecture.

The new processor also marks another change in the "extreme" ideology of NetBurst processors. The extreme nature of the first Pentium 4 Extreme Edition processors consisted in the 2 MB L3 Cache. Later, with the relatively recent launch of the Pentium 4 Extreme Edition 3.73 GHz, the notion became associated solely with the 266 MHz FSB (1066 MHz quad-pumped). The extreme nature of the new processor obviously has to do with its two cores, because there is nothing else to tie it to. At the same time, Intel is going to launch a new series of dual-core processors under the Pentium D trademark, which will again require a review of the notion of CPU extremity. Well, these are the manufacturer's problems.

Our task is to examine the new processors and their architectural features in particular. As usual, this article reviews the key low-level characteristics of the new dual-core processor in comparison with its single-core predecessors, the Prescott, Nocona, and Prescott/2M cores, which were reviewed in the articles "Detailed Platform Analysis in RightMark Memory Analyzer. Part 6 - Intel Xeon (Nocona) platform" and "Detailed Platform Analysis in RightMark Memory Analyzer. Part 8 - Intel Pentium 4 and Pentium 4 Extreme Edition processors with a new revision of Prescott core".

Testbed configuration

CPU: Intel Pentium Extreme Edition 840 (3.2 GHz, Smithfield core, Socket 775, FSB 200 MHz)
Motherboard: Intel D955XBK on Intel 955X chipset, BIOS dated 04/04/2005
Memory: 2x512 MB PC2-4300 DDR2-533 Samsung (4-4-4-11 timings)

Software

Windows XP Professional SP2
Intel Chipset Installation Utility 7.0.0.1019
DirectX 9.0c
RightMark Memory Analyzer 3.5 pre-release

CPUID Characteristics

We'll start the review of the new Smithfield core with the analysis of the most important characteristics, provided by the CPUID instruction.

CPUID function	Value	Comments
Processor signature	0F44h	Family 15, Model 4, Stepping 4
Brand ID	00h	Not supported
Cache/TLB descriptors	50h 5Bh 60h 40h 70h 7Ch	I-TLB: full associativity, 64 entries; D-TLB: full associativity, 64 entries; L1 Cache: 16 KB, 8-way assoc., 64-byte line; L3 Cache is not available; Trace Cache: 12K-uops, 8-way associativity; L2 Cache: 1 MB, 8-way assoc., 64-byte line
Number of logical processors	04h	4 logical processors
Basic Features, ECX (the most important characteristics)	0000641Dh	Bit 0, 3: SSE3 support, MONITOR/MWAIT; Bit 2: Unknown; Bit 13: Unknown
Extended Features, EDX	20000000h	Bit 29: Intel (R) EM64T (x86-64) support

So, according to the CPU signature (0F44h), the new Smithfield core can be considered a continuation, a sort of new revision, of the Prescott and Nocona/Irwindale cores (their latest revision, N0, has the signature 0F43h). Nevertheless, the Smithfield core used in the Pentium Extreme Edition 840 (or simply Pentium EE 840) has the official alphanumeric revision designation A0, that is, it is the first core in a series.
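The signature fields quoted in the CPUID table can be recovered from the raw value with a few shifts and masks. The following is a minimal sketch; the function name is ours, and it only decodes the basic family, model, and stepping fields (the extended fields are assumed to be zero, as they are for the 16-bit values quoted here).

```python
def decode_signature(sig):
    """Split a CPUID processor signature into its basic fields."""
    return {
        "stepping": sig & 0xF,        # bits 3..0
        "model": (sig >> 4) & 0xF,    # bits 7..4
        "family": (sig >> 8) & 0xF,   # bits 11..8
    }


# Smithfield A0 (0F44h) versus the latest Prescott/Nocona revision N0 (0F43h)
for name, sig in [("Smithfield A0", 0x0F44), ("Prescott N0", 0x0F43)]:
    print(name, decode_signature(sig))
```

Both signatures share family 15 and model 4 and differ only in the stepping field, which is why Smithfield looks like "one more revision" of the same core from the CPUID point of view.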

The Cache/TLB descriptors look usual, the same as in Prescott/Nocona cores with 1 MB L2 Cache. The total claimed L2 D-Cache size of the Pentium EE 840 is 2 MB, but this figure is rather nominal: the L2 Cache is not shared between the cores but is isolated, 1 MB per core.

Extended features of the processor offer nothing fundamentally new either. The new core supports neither Thermal Monitor 2 nor Enhanced Intel SpeedStep (Demand-Based Switching), nor the Execute Disable Bit. However, it fully supports Extended Memory 64-bit Technology (EM64T) as well as Enhanced Halt State C1E (which is not reported in CPUID data). As usual, two unknown technologies are also present (we assume they are LaGrande and VanderPool), indicated by bits 2 and 13 of the Basic Features ECX register. Thus, according to its key characteristics, the new Smithfield core ranks somewhere between the latest Revision E0 of the Nocona core (if we take it without TM2 and DBS technologies) and the extreme version of the Prescott/2M core, implemented in the Pentium 4 Extreme Edition 3.73 GHz, but with 1 MB L2 Cache and a disabled XD bit.
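The bit numbers mentioned above can be read directly off the raw register values from the CPUID table. This is a small sketch with a helper name of our own choosing; it only lists which bit positions are set and makes no claim about the meaning of bits the article does not discuss.

```python
def set_bits(value):
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


basic_ecx = 0x0000641D      # Basic Features, ECX
extended_edx = 0x20000000   # Extended Features, EDX

print("ECX bits set:", set_bits(basic_ecx))
print("EDX bits set:", set_bits(extended_edx))
```

Bits 0, 2, 3, and 13 of ECX are the ones discussed in the text, and bit 29 of the extended EDX value is the EM64T flag.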

Nevertheless, the main difference at the processor level has to do with the number of logical processors in a single physical package (in the package, not in the core): it has grown to 4, because each physical core is represented by two logical processors thanks to Hyper-Threading support.

Real Bandwidth of Data Cache/Memory

Let's proceed to the results of the new processor in the latest pre-release of RMMA 3.5. At the time of testing, the SysInfo component of the test did not yet properly support this processor and chipset. Consequently, the processor is identified as a regular Intel Pentium 4 540 (which is actually true in terms of the clock rate and the L2 Cache size of each core, if we overlook the dual cores and EM64T support). When the program decodes the timing values from the configuration registers of the i955X chipset according to the official documentation, it gets an erroneous t_RAS that is half the true value (probably because of a mistake in the documentation). These shortcomings are already corrected in the upcoming release (Version 3.5) of the test package.

The general picture of real memory bandwidth (L1/L2/RAM) on Pentium EE 840 is shown below.

Real Bandwidth of Data Cache and Memory
Pentium EE 840

You can see the following key features on the graph. The L2 D-Cache size is indeed 1 MB (that is, a core cannot use the L2 Cache of the neighboring core), but the bandwidth of this level already drops noticeably at 256 KB blocks. This is due to D-TLB exhaustion: its size is still 64 entries, which covers 256 KB of virtual address space. The absence of differences between L1 and L2 Cache write efficiency (there is no typical inflection at 16 KB) indicates the Write-Through organization of the L1 D-Cache, used in all previous cores of Pentium 4/Xeon processors.
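The 256 KB figure follows from the D-TLB size and the page size. A minimal sketch, assuming standard 4 KB pages (the page size is our assumption; the article only gives the entry count and the resulting coverage):

```python
PAGE_SIZE_KB = 4  # assumed: standard 4 KB x86 pages


def tlb_coverage_kb(entries, page_size_kb=PAGE_SIZE_KB):
    """Virtual address space covered by a TLB without misses, in KB."""
    return entries * page_size_kb


print("D-TLB, 64 entries:", tlb_coverage_kb(64), "KB")
```

Any block larger than this coverage starts to incur D-TLB misses, even though it still fits into the 1 MB L2 Cache.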

Level	Average bandwidth, byte/cycle (MB/sec)
Level	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
L1, reading, MMX	7.98	7.96	7.98	7.98
L1, reading, SSE	15.93	15.93	15.93	15.93
L1, writing, MMX	2.91	2.90	2.91	2.91
L1, writing, SSE	3.56	3.54	3.56	3.56
L2, reading, MMX	4.41	4.39	4.53	4.57
L2, reading, SSE	8.02	7.84	8.13	8.21
L2, writing, MMX	2.91	2.89	2.91	2.91
L2, writing, SSE	3.56	3.54	3.56	3.56
RAM, reading, MMX	3901.4 MB/sec	3215.2 MB/sec	5187.3 MB/sec	5361.4 MB/sec
RAM, reading, SSE	4457.4 MB/sec	3620.1 MB/sec	5490.6 MB/sec	5649.6 MB/sec
RAM, writing, MMX	1750.0 MB/sec	1863.0 MB/sec	2061.8 MB/sec	2408.9 MB/sec
RAM, writing, SSE	1760.6 MB/sec	1855.0 MB/sec	2057.8 MB/sec	2430.9 MB/sec

The quantitative characteristics of L1 and L2 D-Cache bandwidth in Smithfield match those previously obtained on the latest Prescott/2M core revision. Note the higher L2 Cache read efficiency compared to the earlier revisions (D0) of the Prescott/Nocona cores.

More prominent changes can be seen in the average real memory bandwidth. While the read bandwidth has grown insignificantly compared to the Pentium 4 660 (by about 3%), the write bandwidth has grown quite noticeably (by roughly 17-18%). As usual, this can be the result of either an improved bus interface unit (BIU) in the core or an improved memory controller in the new i955X chipset.
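The size of these gains can be checked against the table above. A short sketch (the helper name is ours; the numbers are the RAM rows for the Pentium 4 660 and the Pentium EE 840):

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Average real memory bandwidth, MB/sec (from the table above)
p4_660 = {"read MMX": 5187.3, "read SSE": 5490.6,
          "write MMX": 2061.8, "write SSE": 2057.8}
ee_840 = {"read MMX": 5361.4, "read SSE": 5649.6,
          "write MMX": 2408.9, "write SSE": 2430.9}

for mode in p4_660:
    gain = percent_change(p4_660[mode], ee_840[mode])
    print(f"{mode}: {gain:+.1f}%")
```

Reads improve by about 3%, while writes improve by about 17% (MMX) and 18% (SSE).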

Maximum Real Memory Bandwidth

As usual for Pentium 4 processors, the Software Prefetch method yields the maximum memory bandwidth, while the other methods are less efficient.

Maximum Real Memory Bandwidth, Software Prefetch, Pentium EE 840

The curves of real memory bandwidth versus software prefetch distance on the Pentium EE 840 are typical of Prescott cores (they match those we obtained on previous revisions of this core). One thing worth noting is a smoother decline from the maximum as the prefetch distance increases. Most likely this has to do with the improved memory controller in the i955X chipset rather than with the prefetch as such.

Access mode	Maximum memory read bandwidth, MB/sec*
Access mode	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
Reading, MMX	3901.4 (61.0%)	3215.2 (50.2%)	5187.3 (81.1%)	5361.4 (83.8%)
Reading, SSE	4457.4 (69.6%)	3620.1 (56.6%)	5490.6 (85.8%)	5649.6 (88.3%)
Reading, MMX, SW Prefetch	6311.3 (98.6%)	5334.2 (83.3%)	6521.3 (101.9%)	6404.7 (100.1%)
Reading, SSE, SW Prefetch	6334.2 (99.0%)	5329.9 (83.3%)	6679.7 (104.4%)	6438.3 (100.6%)
Reading, MMX, Block Prefetch 1	4191.0 (65.5%)	3302.2 (51.6%)	4630.6 (72.4%)	4729.7 (73.9%)
Reading, SSE, Block Prefetch 1	4614.8 (72.1%)	3524.3 (55.1%)	5046.4 (78.9%)	5245.2 (82.0%)
Reading, MMX, Block Prefetch 2	3948.3 (61.7%)	3392.3 (53.0%)	5146.9 (80.4%)	5350.9 (83.6%)
Reading, SSE, Block Prefetch 2	4517.2 (70.6%)	3784.5 (59.1%)	5492.2 (85.8%)	5681.4 (88.8%)
Reading cache lines, forward	5180.9 (81.0%)	3313.5 (51.8%)	5968.8 (93.3%)	6213.0 (97.1%)
Reading cache lines, backward	5178.7 (80.9%)	3315.8 (51.8%)	5957.7 (93.1%)	6208.1 (97.0%)

^*values relative to the theoretical memory bandwidth limit (6.4 GB/sec for 200 MHz FSB) are in parentheses
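The 6.4 GB/sec limit and the percentages in parentheses can be reproduced as follows. This is a sketch under stated assumptions: the quad-pumped bus (four transfers per clock) is mentioned in the introduction, while the 8-byte (64-bit) bus width is our assumption, and the function names are ours.

```python
def fsb_peak_mb_s(bus_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Theoretical FSB bandwidth in MB/sec (quad-pumped, 64-bit bus assumed)."""
    return bus_clock_mhz * transfers_per_clock * bus_width_bytes


def percent_of_peak(measured_mb_s, peak_mb_s):
    """Measured bandwidth as a percentage of the theoretical limit."""
    return 100.0 * measured_mb_s / peak_mb_s


peak = fsb_peak_mb_s(200)  # 200 MHz FSB of the Pentium EE 840
for label, measured in [("Reading, SSE, SW Prefetch", 6438.3),
                        ("Reading cache lines, forward", 6213.0)]:
    print(f"{label}: {percent_of_peak(measured, peak):.1f}% of {peak} MB/sec")
```

Values slightly above 100% for the software prefetch rows come straight from the measured figures in the table; the sketch only reproduces the arithmetic and does not explain them.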

This assumption is also backed up by the high (compared to the previous reviews) maximum real memory bandwidth obtained with methods that do not involve software prefetch, especially the cache line read method, where the bandwidth is just a tad lower than the theoretical limit (about 97%).

Access mode	Maximum memory write bandwidth, MB/sec^*
Access mode	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
Writing, MMX	1750.0 (27.3%)	1863.0 (29.1%)	2061.8 (32.2%)	2408.9 (37.6%)
Writing, SSE	1760.6 (27.5%)	1855.0 (29.0%)	2057.8 (32.2%)	2430.9 (38.0%)
Writing, MMX, Non-Temporal	4265.9 (66.7%)	4236.2 (66.2%)	4255.0 (66.5%)	4266.0 (66.6%)
Writing, SSE, Non-Temporal	4266.0 (66.7%)	4235.9 (66.2%)	4254.9 (66.5%)	4266.0 (66.6%)
Writing cache lines, forward	2283.9 (35.7%)	2380.2 (37.2%)	2527.7 (39.5%)	3114.4 (48.7%)
Writing cache lines, backward	2254.2 (35.2%)	2386.1 (37.3%)	2435.0 (38.0%)	3112.8 (48.6%)

^*values relative to the theoretical memory bandwidth limit (6.4 GB/sec for 200 MHz FSB) are in parentheses

The most striking changes can be seen in the maximum real memory write bandwidth. Of course, the Non-Temporal Store method reaches 2/3 of the theoretical FSB bandwidth in all cases, no more, no less. The cache line write method, however, has become noticeably more efficient on the new platform: the obtained bandwidth amounts to approximately half of the theoretical maximum. This is again the effect of the improved memory controller in the new i955X chipset.

Data Cache/Memory Latency

Data Cache/Memory Latency, Pentium EE 840

Typical features of Data Cache and D-TLB organization are no less clear in this test: inflections at 16 KB and 1 MB, which correspond to the L1 and L2 Cache sizes, as well as a smooth rise of random L2 Cache access latency starting from a block size of 256 KB.

Level, access	Average latency, cycles (ns)
Level, access	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
L1	4.0	4.0	4.0	4.0
L2	~28.5	~28.5	~28.5	~28.5
RAM, forward	37.3 ns	50.3 ns	32.5 ns	32.3 ns
RAM, backward	41.1 ns	52.6 ns	36.8 ns	35.7 ns
RAM, random^*	126.0 ns	134.1 ns	106.1 ns	100.9 ns
RAM, pseudo-random	56.1 ns	75.8 ns	52.0 ns	51.5 ns

^*4MB block size

The quantitative latency characteristics of L1 and L2 Cache are the same for all processors in the table. The memory latency values were obtained in a separate test, where the data chain is walked in 128-byte steps, i.e. the effective L2 Cache line size. The average memory latencies in all walk modes on the Pentium EE 840 platform are almost the same as those obtained on the Pentium 4 660 platform. A slightly lower random access latency (101 ns versus 106 ns), a mode where the hardware prefetch algorithm is nearly idle, can serve as an additional sign of the improved memory controller in the i955X chipset, which we have mentioned several times above.
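To illustrate what "walking a data chain" means, here is a rough sketch of how such a chain could be built for the forward, backward, and random modes. It is not RMMA's actual code: the function names and structure are hypothetical, the pseudo-random mode is left out because the article does not define it, and Python is used only to show the ordering logic (a real latency test would chase raw pointers in native code).

```python
import random


def build_chain(num_elements, mode, seed=0):
    """Return next_index[i]: the element to visit after element i."""
    order = list(range(num_elements))
    if mode == "backward":
        order.reverse()
    elif mode == "random":
        random.Random(seed).shuffle(order)
    elif mode != "forward":
        raise ValueError(f"unknown walk mode: {mode}")
    next_index = [0] * num_elements
    # Link consecutive elements of the visiting order into one closed cycle.
    for pos, elem in enumerate(order):
        next_index[elem] = order[(pos + 1) % num_elements]
    return next_index


def walk(next_index, start=0):
    """Follow the chain once around and return the visited elements."""
    visited, current = [], start
    for _ in range(len(next_index)):
        visited.append(current)
        current = next_index[current]
    return visited


STRIDE_BYTES = 128                  # step size quoted in the text
BLOCK_BYTES = 4 * 1024 * 1024       # 4 MB block used for the random walk
elements = BLOCK_BYTES // STRIDE_BYTES
chain = build_chain(elements, "random")
print(elements, "elements, all visited:", len(set(walk(chain))) == elements)
```

Because each load address depends on the result of the previous load, the accesses cannot overlap, and the time per step approximates the latency of the level being tested.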

Minimum Latency of Data Cache/Memory

Minimum L2 Cache latency, Pentium EE 840, Method 1

Minimum L2 Cache latency, Pentium EE 840, Method 2

The L1-L2 bus unload curves, which reveal the minimum L2 Cache latency of the Smithfield core, look as usual. They are similar to the curves previously obtained on Prescott cores: the latency of this level obviously does not reach its minimum with the standard L1-L2 bus "unloading" by inserting "empty" operations (Method 1), and it goes down to 22 cycles with the "non-standard" unloading method, specially developed for processors with pronounced speculative data loading (Method 2).

Minimum memory latency, Pentium EE 840

L2-RAM bus unload curves for the Pentium EE 840 processor also look usual.

Level, access	Minimum latency, cycles (ns)
Level, access	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
L1	4.0	4.0	4.0	4.0
L2^*	24.0 (22.0)	24.0 (22.0)	24.0 (22.0)	24.0 (22.0)
RAM, forward	28.7 ns	38.4 ns	27.0 ns	24.6 ns
RAM, backward	31.1 ns	41.2 ns	31.1 ns	27.0 ns
RAM, random^**	125.2 ns	134.0 ns	105.4 ns	98.9 ns
RAM, pseudo-random	55.0 ns	74.5 ns	50.9 ns	50.5 ns

^*Values in brackets are obtained by Method 2
^**4MB block size

Minimum memory latencies, obtained with a sufficiently unloaded L2-RAM bus, turn out somewhat lower than in the Pentium 4 660 test results. Considering that the bus is unloaded in exactly the same way on both processors (the unload curves have the same shape), this should again be attributed to the improved memory controller in the new chipset.
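Since the table mixes units (cycles for the caches, nanoseconds for memory), it can help to convert between them. A minimal sketch, assuming the 3.2 GHz core clock from the testbed configuration; the helper names are ours.

```python
CPU_CLOCK_GHZ = 3.2  # Pentium EE 840 core clock from the testbed section


def ns_to_cycles(latency_ns, clock_ghz=CPU_CLOCK_GHZ):
    """Convert a latency in nanoseconds to CPU cycles."""
    return latency_ns * clock_ghz


def cycles_to_ns(latency_cycles, clock_ghz=CPU_CLOCK_GHZ):
    """Convert a latency in CPU cycles to nanoseconds."""
    return latency_cycles / clock_ghz


# Minimum latencies of the Pentium EE 840 from the table above
print(f"RAM, forward: 24.6 ns = {ns_to_cycles(24.6):.1f} cycles")
print(f"RAM, random:  98.9 ns = {ns_to_cycles(98.9):.1f} cycles")
print(f"L2, Method 2: 22 cycles = {cycles_to_ns(22):.2f} ns")
```

The conversion only applies at this clock rate; the other processors in the table would need their own clock values, which are not all given in this article.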

Data Cache Associativity

D-Cache Associativity, Pentium EE 840

The L1/L2 D-Cache associativity test, whose result is shown in the picture, again indicates no changes in this parameter. As in all the other reviewed Pentium 4/Xeon processors on Prescott and Nocona cores, the "effective" L1 data cache associativity is equal to 1, and the associativity of the unified L2 instruction/data cache is equal to 8.

L1-L2 Cache Bus Real Bandwidth

Access mode	Bandwidth, bytes/cycle*
Access mode	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
Reading (forward)	16.42 (51.3%)	16.42 (51.3%)	14.50 (45.3%)	16.75 (52.3%)
Reading (backward)	16.40 (51.3%)	16.42 (51.3%)	14.53 (45.4%)	16.58 (51.8%)
Writing (forward)	4.76 (14.9%)	4.79 (15.0%)	3.99 (12.5%)	4.89 (15.3%)
Writing (backward)	4.75 (14.9%)	4.78 (14.9%)	4.00 (12.5%)	4.85 (15.2%)

^*values relative to the theoretical limit are in parentheses
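The table does not state the theoretical limit itself, but it can be inferred from any value and its percentage. The sketch below does this arithmetic; the resulting 32 bytes/cycle is our inference from the table, not a figure quoted in the article, and the helper names are ours.

```python
def implied_limit(value, percent):
    """Theoretical limit implied by a measured value and its percentage."""
    return value / (percent / 100.0)


def percent_of_limit(value, limit=32.0):
    """Measured L1-L2 bandwidth as a percentage of the (inferred) limit."""
    return 100.0 * value / limit


# Forward read, Prescott D0: 16.42 bytes/cycle is listed as 51.3%
print(f"Implied limit: {implied_limit(16.42, 51.3):.1f} bytes/cycle")
# Forward read and write, Smithfield A0
print(f"Read:  {percent_of_limit(16.75):.1f}%")
print(f"Write: {percent_of_limit(4.89):.1f}%")
```

With a 32 bytes/cycle limit, the percentages of all four processors in the table are reproduced to within rounding.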

Compared with the previously reviewed Prescott/2M core revision N0, which demonstrated noticeably lower results than the earlier core revisions, the new Smithfield core revision A0 gives us a pleasant surprise in this test. The real L1-L2 D-Cache bus bandwidth has not only returned to the previous values, usual for Prescott cores, but has even grown a tad higher. Thus, the reason for the drop on the Pentium 4 660 becomes clear: it has to do with the L2 D-Cache being doubled in size. As soon as it was reduced to 1 MB per core (in the Pentium EE 840), everything fell back into place, and then some :).

Trace Cache, Decode/Execute Efficiency

Let's examine another interesting component of the NetBurst microarchitecture - its specialized cache for micro-operations (Execution Trace Cache) provided by the predecoder.

Decode/execute efficiency, Pentium EE 840

As always, the overall picture of decode/execute speed for "large" 6-byte CMP instructions is the most illustrative. This test, like all the other tests of this type, demonstrates no qualitative differences between the Smithfield core under review and the previously tested Prescott/Nocona cores. So let's proceed to the quantitative evaluation.

Decode/execute efficiency, Xeon (Nocona D0)

Instruction type	Effective size of Trace Cache, KB (Kuop)	Decode efficiency from Trace Cache, bytes/cycle (instructions/cycle)	Decode efficiency from L2 Cache, bytes/cycle (instructions/cycle)
Independent
NOP	10.5 (10.5)	2.85 (2.85)	0.99 (0.99)
SUB	22.0 (11.0)	5.70 (2.85)	1.99 (0.99)
XOR	22.0 (11.0)	3.97 (1.98)	1.99 (0.99)
TEST	22.0 (11.0)	3.64 (1.82)	1.99 (0.99)
XOR/ADD	22.0 (11.0)	5.70 (2.85)	1.99 (0.99)
CMP 1	22.0 (11.0)	5.40 (2.70)	1.99 (0.99)
CMP 2	44.0 (11.0)	10.29 (2.57)	3.98 (0.99)
CMP 3	63.0 (10.5)	15.50 (2.58)	4.25 (0.71)
CMP 4	63.0 (10.5)	15.50 (2.58)	4.25 (0.71)
CMP 5	63.0 (10.5)	15.50 (2.58)	4.25 (0.71)
CMP 6*	32.0 (10.6)	8.62 (1.44)	4.25 (0.71)
Prefixed CMP 1	63.0 (7.9; 10.5**)	20.66 (2.58)	4.40 (0.55)
Prefixed CMP 2	63.0 (7.9; 10.5**)	20.66 (2.58)	4.40 (0.55)
Prefixed CMP 3	63.0 (7.9; 10.5**)	20.66 (2.58)	4.40 (0.55)
Prefixed CMP 4*	44.0 (11.0; 14.7**)	11.53 (1.44)	4.40 (0.55)
Dependent
LEA	-	1.99 (0.99)	1.99 (0.99)
MOV	-	1.99 (0.99)	1.99 (0.99)
ADD	-	1.99 (0.99)	1.99 (0.99)
OR	-	1.99 (0.99)	1.99 (0.99)
SHL	-	3.00 (1.00)	3.00 (1.00)
ROL	-	3.00 (1.00)	3.00 (1.00)

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

Here is a record of our research tracking an interesting and somewhat mysterious trend in Execution Trace Cache behaviour. First, let's look at the data previously obtained on the Xeon platform (Nocona core, Revision D0). This processor core, equipped with EM64T, was the first to show a decline in decode/execute efficiency for some instructions, in particular simple operations like TEST (test eax, eax) and CMP 1 (cmp eax, eax). For your convenience, the results that change in the later cores are marked in red.

Decode/execute efficiency, Pentium 4 660 (Prescott N0)

Instruction type	Effective size of Trace Cache, KB (Kuop)	Decode efficiency from Trace Cache, bytes/cycle (instructions/cycle)	Decode efficiency from L2 Cache, bytes/cycle (instructions/cycle)
Independent
NOP	10.5 (10.5)	2.87 (2.87)	1.00 (1.00)
SUB	22.0 (11.0)	5.73 (2.87)	2.00 (1.00)
XOR	22.0 (11.0)	3.99 (2.00)	2.00 (1.00)
TEST	22.0 (11.0)	3.42 (1.71)	2.00 (1.00)
XOR/ADD	22.0 (11.0)	5.73 (2.87)	2.00 (1.00)
CMP 1	22.0 (11.0)	5.16 (2.58)	2.00 (1.00)
CMP 2	44.0 (11.0)	10.32 (2.58)	3.99 (1.00)
CMP 3	63.0 (10.5)	15.48 (2.58)	4.00 (0.67)
CMP 4	63.0 (10.5)	15.48 (2.58)	4.00 (0.67)
CMP 5	63.0 (10.5)	15.48 (2.58)	4.00 (0.67)
CMP 6*	32.0 (10.6)	8.67 (1.45)	4.00 (0.67)
Prefixed CMP 1	63.0 (7.9; 10.5**)	20.62 (2.58)	4.14 (0.52)
Prefixed CMP 2	63.0 (7.9; 10.5**)	20.60 (2.58)	4.14 (0.52)
Prefixed CMP 3	63.0 (7.9; 10.5**)	20.60 (2.58)	4.14 (0.52)
Prefixed CMP 4*	44.0 (11.0; 14.7**)	11.56 (1.45)	4.14 (0.52)
Dependent
LEA	-	2.01 (1.00)	2.01 (1.00)
MOV	-	2.01 (1.00)	2.01 (1.00)
ADD	-	2.01 (1.00)	2.01 (1.00)
OR	-	2.00 (1.00)	2.00 (1.00)
SHL	-	3.00 (1.00)	3.00 (1.00)
ROL	-	3.00 (1.00)	3.00 (1.00)

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

The trend of deteriorating execution efficiency for some instructions continues in the Prescott/2M core revision N0 (Pentium 4 660 and Pentium 4 Extreme Edition 3.73 GHz). First of all, we can see a further decrease of TEST and CMP 1 execution efficiency from Trace Cache. The second significant change is the reduction of the maximum execution speed from L2 Cache: for the large CMP operations (CMP 3 to CMP 6) it drops from 4.25 to 4.00 bytes/cycle (0.67 instructions/cycle), and for prefixed CMP from 4.40 to 4.14 bytes/cycle (0.52 instructions/cycle).

Decode/execute efficiency, Pentium EE 840 (Smithfield A0)

Instruction type	Effective size of Trace Cache, KB (Kuop)	Decode efficiency from Trace Cache, bytes/cycle (instructions/cycle)	Decode efficiency from L2 Cache, bytes/cycle (instructions/cycle)
Independent
NOP	10.5 (10.5)	2.87 (2.87)	1.00 (1.00)
SUB	22.0 (11.0)	5.73 (2.87)	2.00 (1.00)
XOR	22.0 (11.0)	3.99 (2.00)	2.00 (1.00)
TEST	22.0 (11.0)	3.42 (1.71)	2.00 (1.00)
XOR/ADD	22.0 (11.0)	5.73 (2.87)	2.00 (1.00)
CMP 1	22.0 (11.0)	5.16 (2.58)	2.00 (1.00)
CMP 2	44.0 (11.0)	10.32 (2.58)	3.99 (1.00)
CMP 3	63.0 (10.5)	15.48 (2.58)	4.26 (0.71)
CMP 4	63.0 (10.5)	15.48 (2.58)	4.26 (0.71)
CMP 5	63.0 (10.5)	15.48 (2.58)	4.26 (0.71)
CMP 6*	32.0 (10.6)	8.67 (1.45)	4.26 (0.71)
Prefixed CMP 1	63.0 (7.9; 10.5**)	20.62 (2.58)	4.45 (0.56)
Prefixed CMP 2	63.0 (7.9; 10.5**)	20.60 (2.58)	4.45 (0.56)
Prefixed CMP 3	63.0 (7.9; 10.5**)	20.60 (2.58)	4.45 (0.56)
Prefixed CMP 4*	44.0 (11.0; 14.7**)	11.56 (1.45)	4.45 (0.56)
Dependent
LEA	-	2.01 (1.00)	2.01 (1.00)
MOV	-	2.01 (1.00)	2.01 (1.00)
ADD	-	2.01 (1.00)	2.01 (1.00)
OR	-	2.00 (1.00)	2.00 (1.00)
SHL	-	3.00 (1.00)	3.00 (1.00)
ROL	-	3.00 (1.00)	3.00 (1.00)

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

Let's proceed to the data obtained on the latest Pentium EE 840. Nothing has changed with TEST and CMP 1: their execution efficiency from Trace Cache is the same as in Prescott/2M based processors. But the execution efficiency of CMP 3 to CMP 6 and of prefixed CMP 1 to CMP 4 from L2 Cache has grown again, even a tad higher than on Nocona. Does it ring any bells? We have just seen a similar picture in the previous test, L1-L2 Cache Bus Real Bandwidth, and the two are obviously related. It is clear now: the drop in execution performance of some large instructions from L2 Cache on processors with the 2 MB Prescott core (Revision N0) is the result of the enlarged unified L2 Cache for instructions/data.
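The two figures in each table cell are related through the instruction length: instructions/cycle is bytes/cycle divided by the size of one instruction. A small sketch follows; the 6-byte length of the large CMP is stated in the text, while the 8-byte length of the prefixed CMP is our inference from the table (63.0 KB per 7.9 Kuop), and the function name is ours.

```python
def instructions_per_cycle(bytes_per_cycle, instr_len_bytes):
    """Convert decode throughput from bytes/cycle to instructions/cycle."""
    return bytes_per_cycle / instr_len_bytes


CMP_LEN = 6            # "large" 6-byte CMP, as stated in the text
PREFIXED_CMP_LEN = 8   # assumed: inferred from the table, not stated explicitly

# Pentium EE 840, execution from L2 Cache
print(f"CMP 3:          {instructions_per_cycle(4.26, CMP_LEN):.2f}")
print(f"Prefixed CMP 1: {instructions_per_cycle(4.45, PREFIXED_CMP_LEN):.2f}")
# Pentium 4 660, execution from L2 Cache
print(f"CMP 3 (660):    {instructions_per_cycle(4.00, CMP_LEN):.2f}")
```

Seen this way, the difference between the N0 and A0 cores for large instructions from L2 Cache is a change in delivered bytes per cycle, which fits the link to L1-L2 bus bandwidth drawn above.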

The second significant deterioration of decoder/pipeline efficiency in Prescott/Nocona cores with EM64T was the decreased efficiency of truncating "meaningless" prefixes in the test that executes instructions of the type [0x66]_nNOP, n = 0..14.

Decode/execute efficiency for prefix instructions, Pentium EE 840

Number of prefixes	Decode/execute efficiency, bytes/cycle (instructions/cycle)
Number of prefixes	Pentium 4 (Prescott D0)	Xeon (Nocona D0)	Pentium 4 660 (Prescott N0)	Pentium EE 840 (Smithfield A0)
0	2.84 (2.84)	2.79 (2.79)	2.80 (2.80)	2.80 (2.80)
1	5.68 (2.84)	5.41 (2.71)	5.43 (2.72)	5.43 (2.72)
2	8.52 (2.84)	8.16 (2.72)	8.13 (2.71)	8.13 (2.71)
3	11.34 (2.84)	10.48 (2.62)	10.42 (2.61)	10.42 (2.61)
4	14.09 (2.82)	12.73 (2.55)	12.74 (2.55)	12.74 (2.55)
5	16.89 (2.82)	14.73 (2.46)	14.74 (2.46)	14.74 (2.46)
6	19.51 (2.79)	16.63 (2.38)	16.64 (2.38)	16.64 (2.38)
7	22.30 (2.79)	18.75 (2.34)	18.76 (2.35)	18.76 (2.35)
8	24.87 (2.76)	20.63 (2.29)	20.23 (2.25)	20.23 (2.25)
9	27.54 (2.75)	21.93 (2.19)	21.96 (2.20)	21.96 (2.20)
10	30.76 (2.80)	23.44 (2.13)	23.45 (2.13)	23.45 (2.13)
11	33.24 (2.77)	25.78 (2.15)	25.17 (2.10)	25.17 (2.10)
12	35.86 (2.76)	27.14 (2.09)	26.46 (2.04)	26.46 (2.04)
13	38.18 (2.73)	28.64 (2.05)	27.89 (1.99)	27.89 (1.99)
14	40.85 (2.72)	30.33 (2.02)	30.35 (2.02)	30.35 (2.02)

Note that the new N0 revision of Prescott cores with EM64T shows very few changes in this respect: there is only an unaccountable but easily reproducible "slump" in the case of exactly 13 prefixes before the NOP instruction. The same peculiarity pertains to the new Smithfield core as well.
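The "slump" can be located directly in the table data. In the sketch below, an instruction with n prefixes is taken to be n + 1 bytes long (n one-byte 0x66 prefixes plus a one-byte NOP), so instructions/cycle is bytes/cycle divided by n + 1. The helper names are ours; the numbers are the Pentium EE 840 column.

```python
# Decode/execute efficiency in bytes/cycle for n = 0..14 prefixes (Pentium EE 840)
EE_840_BYTES_PER_CYCLE = [2.80, 5.43, 8.13, 10.42, 12.74, 14.74, 16.64, 18.76,
                          20.23, 21.96, 23.45, 25.17, 26.46, 27.89, 30.35]


def per_instruction(bytes_per_cycle):
    """Instructions/cycle for each prefix count n (instruction is n + 1 bytes)."""
    return [value / (n + 1) for n, value in enumerate(bytes_per_cycle)]


def find_dips(values):
    """Prefix counts where efficiency is lower than at both neighbouring counts."""
    return [n for n in range(1, len(values) - 1)
            if values[n] < values[n - 1] and values[n] < values[n + 1]]


rates = per_instruction(EE_840_BYTES_PER_CYCLE)
print("Local dips at prefix counts:", find_dips(rates))
```

For this column, the only local dip falls at 13 prefixes, in agreement with the observation above.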

TLB Characteristics

We shall not go into a detailed analysis of D-TLB and I-TLB characteristics, since (according to the CPUID descriptors) they are the same in all processors under review.

D-TLB size, Pentium EE 840

D-TLB associativity, Pentium EE 840

The D-TLB size is 64 page entries (we have already seen that in the other test results), and a miss (when the TLB size is exceeded) costs the processor at least 57 cycles. Associativity is full.

I-TLB size, Pentium EE 840

I-TLB associativity, Pentium EE 840

The I-TLB size is 64 entries, the miss penalty is 45 cycles (forward and backward walk) or higher (random walk), and the associativity is full. It is important to note that in this processor the buffer (which is a shared resource) does not grow to 128 entries when Hyper-Threading is disabled (according to CPUID data as well as to the test results), as it did on all previous Pentium 4 processors. This behaviour is rather strange. At least, it should not be an inevitable consequence of the dual-core design, considering that each core is like a separate, independent physical processor with its own resources such as the I-TLB. Perhaps the current first revision of the Smithfield core had some problems that prevented the full-sized I-TLB from working with Hyper-Threading disabled, or this issue was just... forgotten by the manufacturer :).

Conclusion

Our analysis shows that the new Smithfield core can be considered a full-fledged continuation of the NetBurst microarchitecture as implemented in the Prescott, Nocona, and Irwindale cores. According to the CPUID characteristics as well as the results of today's tests, the new core (to be more exact, its single-core half) can be ranked somewhere between the latest revision of the Nocona core (E0) and the latest revisions of the Prescott/2M and Irwindale (N0) cores. Namely, the new Smithfield A0 core has exactly the same design of Trace Cache, decoder, pipeline, and execution units as the Prescott/2M and Irwindale cores. But thanks to the L2 Cache being half the size (per core), it has better L1-L2 bus performance, which even surpasses that of the Nocona E0 core.

Considering the above as well as the organization of the multi-core architecture (two independent, isolated cores attached to a common system bus), we should expect the new platform to perform at least on a par with a dual-processor system with equally clocked Intel Xeon processors on Nocona E0 cores, or even to outperform it thanks to the improved memory controller integrated into the Intel 955X chipset.

Dmitri Besedin (dmitri_b@ixbt.com)

May 11, 2005.
