iXBT Labs - Computer Hardware in Detail

Detailed Platform Analysis in RightMark Memory Analyzer. Part 9 - Dual Core Intel Pentium Extreme Edition 840 (Smithfield)

May 12, 2005



The first article from the new series.

On April 18, Intel announced its first dual core processor, the Pentium Extreme Edition 840, based on the new core codenamed Smithfield and supporting Extended Memory 64-bit Technology (EM64T). Note that the name of the new processor lacks the habitual "Pentium 4": the phrase is shortened to "Pentium", though there is no doubt that the Smithfield core is based on the NetBurst microarchitecture. The new processor also marks another change in the "extreme" ideology of NetBurst processors. The extreme nature of the first Pentium 4 Extreme Edition processors consisted in their 2MB L3 Cache. Later on, with the relatively recent launch of the Pentium 4 Extreme Edition 3.73 GHz, this notion became associated solely with the 266MHz FSB (1066 MHz Quad-Pumped). The extreme nature of the new processor obviously has to do with its dual cores (there is nothing else we can tie it to). At the same time, Intel is going to launch a new series of dual core processors under the Pentium D trademark, which will again require revisiting the notion of CPU extremity.

Well, these are the manufacturer's problems. Our task is to examine the new processors and their architectural features in particular. As usual, the present article reviews the key low-level characteristics of the new dual core in comparison with its single-core predecessors, the Prescott, Nocona, and Prescott/2M cores, which were reviewed in the articles "Detailed Platform Analysis in RightMark Memory Analyzer. Part 6 - Intel Xeon (Nocona) platform" and "Detailed Platform Analysis in RightMark Memory Analyzer. Part 8 - Intel Pentium 4 and Pentium 4 Extreme Edition processors with a new revision of Prescott core".

Testbed configuration

  • CPU: Intel Pentium Extreme Edition 840 (3.2 GHz, Smithfield core, Socket 775, FSB 200 MHz)
  • Motherboard: Intel D955XBK on Intel 955X chipset, BIOS dated 04/04/2005
  • Memory: 2x512 MB PC2-4300 DDR2-533 Samsung (4-4-4-11 timings)

Software

  • Windows XP Professional SP2
  • Intel Chipset Installation Utility 7.0.0.1019
  • DirectX 9.0c
  • RightMark Memory Analyzer 3.5 pre-release

CPUID Characteristics

We'll start the review of the new Smithfield core with the analysis of the most important characteristics, provided by the CPUID instruction.

| CPUID function | Value | Comments |
|---|---|---|
| Processor signature | 0F44h | Family 15, Model 4, Stepping 4 |
| Brand ID | 00h | Not supported |
| Cache/TLB descriptors | 50h | I-TLB: full associativity, 64 entries |
| | 5Bh | D-TLB: full associativity, 64 entries |
| | 60h | L1 Cache: 16 KB, 8-way assoc., 64-byte line |
| | 40h | L3 Cache is not available |
| | 70h | Trace Cache: 12K-uops, 8-way associativity |
| | 7Ch | L2 Cache: 1 MB, 8-way assoc., 64-byte line |
| Number of logical processors | 04h | 4 logical processors |
| Basic Features, ECX (the most important characteristics) | 0000641Dh | Bit 0, 3: SSE3 support, MONITOR/MWAIT |
| | | Bit 2: Unknown |
| | | Bit 13: Unknown |
| Extended Features, EDX | 20000000h | Bit 29: Intel (R) EM64T (x86-64) support |

So, according to the CPU signature (0F44h), the new Smithfield core can be considered a continuation, a sort of new revision, of the Prescott and Nocona/Irwindale cores (their latest revision, N0, has the signature 0F43h). Nevertheless, the Smithfield core used in the Pentium Extreme Edition 840 (or simply Pentium EE 840) got the official alphanumeric revision designation A0, that is, it is the first core in a series.

Cache/TLB descriptors look usual, as in Prescott/Nocona cores with 1MB L2 Cache. The total claimed L2 Cache size of the Pentium EE 840 is 2 MB, but it is not hard to guess that this figure is quite relative, because the L2 Cache is an isolated rather than a shared resource: each core has its own 1 MB.

Extended features of the processor offer nothing fundamentally new either. You can easily notice that the new core supports neither Thermal Monitor 2, nor Enhanced Intel SpeedStep (Demand-Based Switching), nor Execute Disable Bit. However, it fully supports Extended Memory 64-bit Technology (EM64T) as well as Enhanced Halt State C1E (which is not reported in CPUID data). The two unknown technologies designated by bits 2 and 13 of the Basic Features ECX register are also a usual thing (we have assumed they are LaGrande and VanderPool). Thus, according to its key characteristics, the new Smithfield core ranks somewhere between the latest Revision E0 of the Nocona core (if we take it without TM2 and DBS technologies) and the extreme version of the Prescott/2M core implemented in Pentium 4 Extreme Edition 3.73 GHz, but with 1MB L2 Cache and a disabled XD bit.

Nevertheless, the main difference at the processor level is the number of logical processors in a single physical package (in the package, not in the core): it has grown to 4, because each physical core is represented by two logical processors thanks to Hyper-Threading support.
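The CPUID values quoted in this section can be unpacked with a few bit operations. The following is a minimal Python sketch, not part of RMMA: the function names are illustrative, and the field positions (stepping in bits 0-3, model in bits 4-7, family in bits 8-11, extended model and family above them) are assumed from the standard layout of the CPUID signature.

```python
def decode_signature(sig):
    """Split a CPUID processor signature (EAX of function 1) into fields."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended fields only contribute for certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    model = base_model + (ext_model << 4) if base_family in (0x6, 0xF) else base_model
    return family, model, stepping


def set_bits(value):
    """Return the positions of all set bits in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


print(decode_signature(0x0F44))  # (15, 4, 4): Family 15, Model 4, Stepping 4
print(set_bits(0x0000641D))      # [0, 2, 3, 4, 10, 13, 14]
print(set_bits(0x20000000))      # [29]: the EM64T bit

# Two physical cores with Hyper-Threading give the 4 logical processors.
cores, threads_per_core = 2, 2
print(cores * threads_per_core)  # 4
```

The ECX value has more bits set than the four the table comments on, which is consistent with the table listing only "the most important characteristics".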

Real Bandwidth of Data Cache/Memory

Let's proceed to the results of the new processor in the latest RMMA pre-release 3.5. At the time of testing, the SysInfo component of the test did not yet properly support this processor and chipset. Consequently, the processor is identified as a regular Intel Pentium 4 540 (which is actually true in terms of the clock and the L2 Cache size of each core, if we pretend not to notice the dual core and EM64T support). When the program reads out timing values from the configuration registers of the i955X chipset according to the official documentation, the tRAS value comes out at half its true value (which probably has to do with a mistake in the documentation). Nevertheless, these shortcomings are already corrected in the coming release (Version 3.5) of this test package.

The general picture of real memory bandwidth (L1/L2/RAM) on Pentium EE 840 is shown below.




Real Bandwidth of Data Cache and Memory, Pentium EE 840

You can see the following key features on the graph. The L2 D-Cache size is indeed 1MB (that is, there is no sign of access to the L2 Cache of the neighboring core), but the bandwidth of this level drops noticeably already at 256 KB blocks. This is caused by D-TLB exhaustion: its size is still 64 entries, which covers 256 KB of virtual address space. The absence of any difference between L1 and L2 Cache write efficiency (there is no typical inflection at 16 KB) indicates the Write-Through organization of the L1 Cache, as in all previous cores of Pentium 4/Xeon processors.
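The 256 KB figure follows directly from the D-TLB size. A minimal sketch of the arithmetic, assuming the standard 4 KB x86 page size (the article does not state the page size explicitly; the function name is illustrative):

```python
def tlb_reach_bytes(entries, page_size=4096):
    """Address space covered by a TLB: one page per entry.

    page_size=4096 is an assumption (ordinary 4 KB x86 pages).
    """
    return entries * page_size


reach = tlb_reach_bytes(64)
l2_size = 1024 * 1024  # 1 MB of L2 Cache per core

print(reach // 1024, "KB")  # 256 KB
print(reach / l2_size)      # 0.25: the D-TLB covers a quarter of the L2 Cache
```

So blocks larger than a quarter of the L2 Cache already incur D-TLB misses, which is why the bandwidth drops well before the 1 MB mark.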

Average bandwidth: bytes/cycle for L1 and L2, MB/sec for RAM

| Level | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| L1, reading, MMX | 7.98 | 7.96 | 7.98 | 7.98 |
| L1, reading, SSE | 15.93 | 15.93 | 15.93 | 15.93 |
| L1, writing, MMX | 2.91 | 2.90 | 2.91 | 2.91 |
| L1, writing, SSE | 3.56 | 3.54 | 3.56 | 3.56 |
| L2, reading, MMX | 4.41 | 4.39 | 4.53 | 4.57 |
| L2, reading, SSE | 8.02 | 7.84 | 8.13 | 8.21 |
| L2, writing, MMX | 2.91 | 2.89 | 2.91 | 2.91 |
| L2, writing, SSE | 3.56 | 3.54 | 3.56 | 3.56 |
| RAM, reading, MMX | 3901.4 MB/sec | 3215.2 MB/sec | 5187.3 MB/sec | 5361.4 MB/sec |
| RAM, reading, SSE | 4457.4 MB/sec | 3620.1 MB/sec | 5490.6 MB/sec | 5649.6 MB/sec |
| RAM, writing, MMX | 1750.0 MB/sec | 1863.0 MB/sec | 2061.8 MB/sec | 2408.9 MB/sec |
| RAM, writing, SSE | 1760.6 MB/sec | 1855.0 MB/sec | 2057.8 MB/sec | 2430.9 MB/sec |

Quantitative characteristics of L1 and L2 D-Cache bandwidth in Smithfield match the previously obtained characteristics of the latest Prescott/2M core revision. We should note the higher read efficiency of L2 Cache compared to the earlier revisions (D0) of Prescott/Nocona cores.

More prominent changes can be seen in average real memory bandwidth. While the real memory read bandwidth has grown insignificantly compared to Pentium 4 660, the real memory write bandwidth has grown quite noticeably (by roughly 17-18%). This can be a result of an improved bus interface unit (BIU) of the core or of an improved memory controller in the new i955X chipset.
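The size of these gains can be checked against the RAM rows of the table. A short sketch using the published Pentium 4 660 and Pentium EE 840 figures (the helper name is illustrative):

```python
def gain_percent(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (Pentium 4 660, Pentium EE 840) average RAM bandwidth, MB/sec, from the table
ram = {
    "reading, MMX": (5187.3, 5361.4),
    "reading, SSE": (5490.6, 5649.6),
    "writing, MMX": (2061.8, 2408.9),
    "writing, SSE": (2057.8, 2430.9),
}

for mode, (p4_660, ee_840) in ram.items():
    print(f"RAM, {mode}: {gain_percent(ee_840, p4_660):+.1f}%")
```

Reads improve by about 3%, while writes improve by about 17% (MMX) and 18% (SSE).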

Maximum Real Memory Bandwidth

As usual (for Pentium 4 processors), the Software Prefetch method achieves the maximum memory bandwidth, while the other methods are less efficient.




Maximum Real Memory Bandwidth, Software Prefetch, Pentium EE 840

The curves of real memory bandwidth versus software prefetch distance in Pentium EE 840 are typical of Prescott cores (they match those we obtained on previous revisions of this core). Note the smoother decrease of values from the maximum as the prefetch distance increases. Most likely it has to do with the improved memory controller in the i955X chipset rather than with the prefetch as such.

Maximum memory read bandwidth, MB/sec*

| Access mode | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| Reading, MMX | 3901.4 (61.0%) | 3215.2 (50.2%) | 5187.3 (81.1%) | 5361.4 (83.8%) |
| Reading, SSE | 4457.4 (69.6%) | 3620.1 (56.6%) | 5490.6 (85.8%) | 5649.6 (88.3%) |
| Reading, MMX, SW Prefetch | 6311.3 (98.6%) | 5334.2 (83.3%) | 6521.3 (101.9%) | 6404.7 (100.1%) |
| Reading, SSE, SW Prefetch | 6334.2 (99.0%) | 5329.9 (83.3%) | 6679.7 (104.4%) | 6438.3 (100.6%) |
| Reading, MMX, Block Prefetch 1 | 4191.0 (65.5%) | 3302.2 (51.6%) | 4630.6 (72.4%) | 4729.7 (73.9%) |
| Reading, SSE, Block Prefetch 1 | 4614.8 (72.1%) | 3524.3 (55.1%) | 5046.4 (78.9%) | 5245.2 (82.0%) |
| Reading, MMX, Block Prefetch 2 | 3948.3 (61.7%) | 3392.3 (53.0%) | 5146.9 (80.4%) | 5350.9 (83.6%) |
| Reading, SSE, Block Prefetch 2 | 4517.2 (70.6%) | 3784.5 (59.1%) | 5492.2 (85.8%) | 5681.4 (88.8%) |
| Reading cache lines, forward | 5180.9 (81.0%) | 3313.5 (51.8%) | 5968.8 (93.3%) | 6213.0 (97.1%) |
| Reading cache lines, backward | 5178.7 (80.9%) | 3315.8 (51.8%) | 5957.7 (93.1%) | 6208.1 (97.0%) |

*values relative to the theoretical memory bandwidth limit (6.4 GB/sec for 200 MHz FSB) are in parentheses

This assumption is also backed up by high (compared to the previous reviews) maximum real memory bandwidth, obtained with other methods that do not involve software prefetch, especially the cache line read method (memory bandwidth values in this case are just a tad lower than the theoretical limit).

Maximum memory write bandwidth, MB/sec*

| Access mode | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| Writing, MMX | 1750.0 (27.3%) | 1863.0 (29.1%) | 2061.8 (32.2%) | 2408.9 (37.6%) |
| Writing, SSE | 1760.6 (27.5%) | 1855.0 (29.0%) | 2057.8 (32.2%) | 2430.9 (38.0%) |
| Writing, MMX, Non-Temporal | 4265.9 (66.7%) | 4236.2 (66.2%) | 4255.0 (66.5%) | 4266.0 (66.6%) |
| Writing, SSE, Non-Temporal | 4266.0 (66.7%) | 4235.9 (66.2%) | 4254.9 (66.5%) | 4266.0 (66.6%) |
| Writing cache lines, forward | 2283.9 (35.7%) | 2380.2 (37.2%) | 2527.7 (39.5%) | 3114.4 (48.7%) |
| Writing cache lines, backward | 2254.2 (35.2%) | 2386.1 (37.3%) | 2435.0 (38.0%) | 3112.8 (48.6%) |

*values relative to the theoretical memory bandwidth limit (6.4 GB/sec for 200 MHz FSB) are in parentheses

The most striking changes can be seen in maximum real memory write bandwidth. Of course, the Non-Temporal Store method reaches 2/3 of the theoretical FSB bandwidth in all cases, no more, no less. The cache line write method has grown noticeably more efficient on the new platform: the obtained memory bandwidth values amount to approximately half of the theoretical maximum. This is again the effect of the improved memory controller in the new i955X chipset.
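The theoretical limit in the footnotes and the fractions quoted in the text can be reproduced as follows. This is a sketch under stated assumptions: the quad-pumped 200 MHz FSB comes from the article, while the 64-bit (8-byte) data width of the bus is supplied here and is not stated in the text. Function names are illustrative.

```python
def fsb_peak_mb_s(clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Peak FSB bandwidth in MB/sec (10^6 bytes per second).

    transfers_per_clock=4 models the quad-pumped bus;
    bus_width_bytes=8 is an assumption (64-bit data bus).
    """
    return clock_mhz * transfers_per_clock * bus_width_bytes


def percent_of_peak(measured_mb_s, peak_mb_s):
    """Measured bandwidth as a percentage of the theoretical peak."""
    return 100.0 * measured_mb_s / peak_mb_s


peak = fsb_peak_mb_s(200)
print(peak)                                     # 6400, i.e. 6.4 GB/sec
print(round(peak * 2 / 3, 1))                   # 4266.7, the Non-Temporal Store level
print(round(percent_of_peak(3114.4, peak), 1))  # 48.7, cache line writes on Pentium EE 840
```

The measured 4266.0 MB/sec for Non-Temporal stores sits almost exactly at two thirds of the 6400 MB/sec limit.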

Data Cache/Memory Latency




Data Cache/Memory Latency, Pentium EE 840

Typical features of Data Cache and D-TLB organization are no less clear in this test: inflections at 16 KB and 1 MB, which correspond to the L1 and L2 Cache sizes, as well as a smooth rise of random L2 Cache access latency starting from a block size of 256 KB.

Average latency: cycles for L1 and L2, ns for RAM

| Level, access | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| L1 | 4.0 | 4.0 | 4.0 | 4.0 |
| L2 | ~28.5 | ~28.5 | ~28.5 | ~28.5 |
| RAM, forward | 37.3 ns | 50.3 ns | 32.5 ns | 32.3 ns |
| RAM, backward | 41.1 ns | 52.6 ns | 36.8 ns | 35.7 ns |
| RAM, random* | 126.0 ns | 134.1 ns | 106.1 ns | 100.9 ns |
| RAM, pseudo-random | 56.1 ns | 75.8 ns | 52.0 ns | 51.5 ns |

*4MB block size

Quantitative latency characteristics of L1 and L2 Cache are the same for all processors included in the table. As for memory latency, the table values have been obtained in a separate test, where the data chain is walked in 128-byte steps, i.e. the effective L2 Cache line size. It's not hard to notice that the average memory latencies in all walk modes on the Pentium EE 840 platform are almost no different from those obtained on the Pentium 4 660 platform. The slightly lower random access latency (101 ns versus 106 ns), measured when the hardware prefetch algorithm is nearly idle, can serve as an additional sign of the improved memory controller in the i955X chipset, which we have mentioned above several times.
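The idea of walking a data chain in 128-byte steps can be illustrated with a small sketch. This is not RMMA's code: the function names, the seed, and the way the random order is generated are assumptions made for illustration, and the pseudo-random mode of the test is not modeled. Each slot stands for one 128-byte step and stores the index of the slot to visit next, so every access depends on the previous one.

```python
import random


def build_chain(block_bytes, stride=128, mode="forward", seed=0):
    """Build a cyclic dependent chain over block_bytes // stride slots.

    Slot i corresponds to byte offset i * stride inside the block.
    Returns a list where nxt[i] is the slot visited after slot i.
    """
    n = block_bytes // stride
    order = list(range(n))
    if mode == "backward":
        order.reverse()
    elif mode == "random":
        random.Random(seed).shuffle(order)
    elif mode != "forward":
        raise ValueError(f"unknown mode: {mode}")
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]  # link each slot to its successor
    return nxt


def walk(nxt, start=0):
    """Follow the chain for one full lap and return the visited slots."""
    visited, pos = [], start
    for _ in range(len(nxt)):
        visited.append(pos)
        pos = nxt[pos]
    return visited


print(walk(build_chain(1024, mode="forward")))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(walk(build_chain(1024, mode="backward")))  # [0, 7, 6, 5, 4, 3, 2, 1]
print(len(build_chain(4 * 1024 * 1024)))         # 32768 slots in a 4 MB block
```

A real latency test would time such a walk over actual memory in a compiled language; the sketch only shows how the three walk orders differ.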

Minimum Latency of Data Cache/Memory




Minimum L2 Cache latency, Pentium EE 840, Method 1



Minimum L2 Cache latency, Pentium EE 840, Method 2

The L1-L2 bus unload curves, which show the minimum L2 Cache latency of the Smithfield core, look as usual. They are similar to the curves previously obtained on Prescott cores: the latency of this level obviously does not reach its minimum with standard L1-L2 bus "unloading" by inserting "empty" operations (Method 1), and it goes down to 22 cycles with the "non-standard" unloading, specially developed for processors with pronounced speculative data loading (Method 2).




Minimum memory latency, Pentium EE 840

L2-RAM bus unload curves for the Pentium EE 840 processor also look usual.

Minimum latency: cycles for L1 and L2, ns for RAM

| Level, access | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| L1 | 4.0 | 4.0 | 4.0 | 4.0 |
| L2* | 24.0 (22.0) | 24.0 (22.0) | 24.0 (22.0) | 24.0 (22.0) |
| RAM, forward | 28.7 ns | 38.4 ns | 27.0 ns | 24.6 ns |
| RAM, backward | 31.1 ns | 41.2 ns | 31.1 ns | 27.0 ns |
| RAM, random** | 125.2 ns | 134.0 ns | 105.4 ns | 98.9 ns |
| RAM, pseudo-random | 55.0 ns | 74.5 ns | 50.9 ns | 50.5 ns |

*Values in brackets are obtained by Method 2
**4MB block size

Minimum memory latencies, obtained with a sufficiently unloaded L2-RAM bus, turn out somewhat lower than the Pentium 4 660 results. Considering that the buses on these processors unload in exactly the same way (judging by the unload curves), this should again be attributed to the improved memory controller in the new chipset.

Data Cache Associativity




D-Cache Associativity, Pentium EE 840

The L1/L2 D-Cache associativity test, whose result is shown in the picture, again indicates no changes in this parameter. As in all the other reviewed Pentium 4/Xeon processors on Prescott and Nocona cores, the "effective" L1 data cache associativity is equal to 1, and the associativity of the integrated L2 instruction/data cache is equal to 8.

L1-L2 Cache Bus Real Bandwidth

Bandwidth, bytes/cycle*

| Access mode | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| Reading (forward) | 16.42 (51.3%) | 16.42 (51.3%) | 14.50 (45.3%) | 16.75 (52.3%) |
| Reading (backward) | 16.40 (51.3%) | 16.42 (51.3%) | 14.53 (45.4%) | 16.58 (51.8%) |
| Writing (forward) | 4.76 (14.9%) | 4.79 (15.0%) | 3.99 (12.5%) | 4.89 (15.3%) |
| Writing (backward) | 4.75 (14.9%) | 4.78 (14.9%) | 4.00 (12.5%) | 4.85 (15.2%) |

*values relative to the theoretical limit are in parentheses

Compared with the previously reviewed Prescott/2M core revision N0, which demonstrated noticeably lower results in this test than the earlier core revisions, the new Smithfield core revision A0 gives us a pleasant surprise. The real L1-L2 D-Cache bus bandwidth has not only returned to the previous values, usual for Prescott cores, but even grown a tad higher. Thus, the reason for the slumped values on Pentium 4 660 has become obvious: it has to do with the doubled L2 Cache size. As soon as it was reduced to 1MB per core (in Pentium EE 840), everything fell into place... and even a tad higher :).
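The table footnote does not state the theoretical limit the percentages refer to. It can be estimated by dividing each measured value by its percentage. The sketch below does this for the Pentium EE 840 column; the result, about 32 bytes/cycle, is an inference from the published numbers, not a figure given in the article.

```python
def implied_limit(value, percent):
    """Theoretical limit implied by a measured value and its percentage."""
    return value / (percent / 100.0)


# (bytes/cycle, percent of limit) for Pentium EE 840, from the table
measured = [
    (16.75, 52.3),  # reading, forward
    (16.58, 51.8),  # reading, backward
    (4.89, 15.3),   # writing, forward
    (4.85, 15.2),   # writing, backward
]

limits = [implied_limit(value, percent) for value, percent in measured]
for (value, percent), limit in zip(measured, limits):
    print(f"{value:5.2f} at {percent}% implies a limit of {limit:.1f} bytes/cycle")
```

All four rows land within rounding error of the same limit, which suggests that reads and writes are rated against a single figure.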

Trace Cache, Decode/Execute Efficiency

Let's examine another interesting component of the NetBurst microarchitecture - its specialized cache for micro-operations (Execution Trace Cache) provided by the predecoder.




Decode/execute efficiency, Pentium EE 840

As always, the overall picture of decode/execute speed for "large" 6-byte CMP instructions is the most illustrative. This test, as well as all the other tests of this type, demonstrates no qualitative differences between the behaviour of the Smithfield core under review and that of the previously tested Prescott/Nocona cores. So, let's proceed to the quantitative evaluation.

Decode/execute efficiency, Xeon (Nocona D0)

Effective size of Trace Cache in KB (Kuop); decode efficiency in bytes/cycle (instructions/cycle)

| Instruction type | Effective size of Trace Cache | Decode efficiency, Trace Cache | Decode efficiency, L2 Cache |
|---|---|---|---|
| Independent | | | |
| NOP | 10.5 (10.5) | 2.85 (2.85) | 0.99 (0.99) |
| SUB | 22.0 (11.0) | 5.70 (2.85) | 1.99 (0.99) |
| XOR | 22.0 (11.0) | 3.97 (1.98) | 1.99 (0.99) |
| TEST | 22.0 (11.0) | 3.64 (1.82) | 1.99 (0.99) |
| XOR/ADD | 22.0 (11.0) | 5.70 (2.85) | 1.99 (0.99) |
| CMP 1 | 22.0 (11.0) | 5.40 (2.70) | 1.99 (0.99) |
| CMP 2 | 44.0 (11.0) | 10.29 (2.57) | 3.98 (0.99) |
| CMP 3 | 63.0 (10.5) | 15.50 (2.58) | 4.25 (0.71) |
| CMP 4 | 63.0 (10.5) | 15.50 (2.58) | 4.25 (0.71) |
| CMP 5 | 63.0 (10.5) | 15.50 (2.58) | 4.25 (0.71) |
| CMP 6* | 32.0 (10.6) | 8.62 (1.44) | 4.25 (0.71) |
| Prefixed CMP 1 | 63.0 (7.9; 10.5**) | 20.66 (2.58) | 4.40 (0.55) |
| Prefixed CMP 2 | 63.0 (7.9; 10.5**) | 20.66 (2.58) | 4.40 (0.55) |
| Prefixed CMP 3 | 63.0 (7.9; 10.5**) | 20.66 (2.58) | 4.40 (0.55) |
| Prefixed CMP 4* | 44.0 (11.0; 14.7**) | 11.53 (1.44) | 4.40 (0.55) |
| Dependent | | | |
| LEA | - | 1.99 (0.99) | 1.99 (0.99) |
| MOV | - | 1.99 (0.99) | 1.99 (0.99) |
| ADD | - | 1.99 (0.99) | 1.99 (0.99) |
| OR | - | 1.99 (0.99) | 1.99 (0.99) |
| SHL | - | 3.00 (1.00) | 3.00 (1.00) |
| ROL | - | 3.00 (1.00) | 3.00 (1.00) |

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

Here is a log of our research into the mysterious and interesting tendency of Execution Trace Cache changes. First, let's have a look at the data previously obtained on the Xeon platform (Nocona core, Revision D0). This processor core, equipped with EM64T, was the first to show the deterioration of decode/execute efficiency for some instructions, in particular simple operations like TEST (test eax, eax) and CMP 1 (cmp eax, eax). The results that change in the later cores are the Trace Cache rates for TEST and CMP 1 and the L2 Cache rates for CMP 3 to CMP 6 and prefixed CMP.

Decode/execute efficiency, Pentium 4 660 (Prescott N0)

Effective size of Trace Cache in KB (Kuop); decode efficiency in bytes/cycle (instructions/cycle)

| Instruction type | Effective size of Trace Cache | Decode efficiency, Trace Cache | Decode efficiency, L2 Cache |
|---|---|---|---|
| Independent | | | |
| NOP | 10.5 (10.5) | 2.87 (2.87) | 1.00 (1.00) |
| SUB | 22.0 (11.0) | 5.73 (2.87) | 2.00 (1.00) |
| XOR | 22.0 (11.0) | 3.99 (2.00) | 2.00 (1.00) |
| TEST | 22.0 (11.0) | 3.42 (1.71) | 2.00 (1.00) |
| XOR/ADD | 22.0 (11.0) | 5.73 (2.87) | 2.00 (1.00) |
| CMP 1 | 22.0 (11.0) | 5.16 (2.58) | 2.00 (1.00) |
| CMP 2 | 44.0 (11.0) | 10.32 (2.58) | 3.99 (1.00) |
| CMP 3 | 63.0 (10.5) | 15.48 (2.58) | 4.00 (0.67) |
| CMP 4 | 63.0 (10.5) | 15.48 (2.58) | 4.00 (0.67) |
| CMP 5 | 63.0 (10.5) | 15.48 (2.58) | 4.00 (0.67) |
| CMP 6* | 32.0 (10.6) | 8.67 (1.45) | 4.00 (0.67) |
| Prefixed CMP 1 | 63.0 (7.9; 10.5**) | 20.62 (2.58) | 4.14 (0.52) |
| Prefixed CMP 2 | 63.0 (7.9; 10.5**) | 20.60 (2.58) | 4.14 (0.52) |
| Prefixed CMP 3 | 63.0 (7.9; 10.5**) | 20.60 (2.58) | 4.14 (0.52) |
| Prefixed CMP 4* | 44.0 (11.0; 14.7**) | 11.56 (1.45) | 4.14 (0.52) |
| Dependent | | | |
| LEA | - | 2.01 (1.00) | 2.01 (1.00) |
| MOV | - | 2.01 (1.00) | 2.01 (1.00) |
| ADD | - | 2.01 (1.00) | 2.01 (1.00) |
| OR | - | 2.00 (1.00) | 2.00 (1.00) |
| SHL | - | 3.00 (1.00) | 3.00 (1.00) |
| ROL | - | 3.00 (1.00) | 3.00 (1.00) |

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

The outlined tendency of deteriorating execution of some instructions continues in Prescott/2M core revision N0 (Pentium 4 660 and Pentium 4 Extreme Edition 3.73 GHz). First of all, we can see a further decrease of TEST and CMP 1 execution efficiency. The second significant modification is the reduction of maximum execute speed from L2 Cache to 4.0 bytes/cycle for CMP operations (1.0 or 0.67 instructions/cycle, depending on the instruction length) and to 4.14 bytes/cycle for prefixed CMP (0.52 instructions/cycle).

Decode/execute efficiency, Pentium EE 840 (Smithfield A0)

Effective size of Trace Cache in KB (Kuop); decode efficiency in bytes/cycle (instructions/cycle)

| Instruction type | Effective size of Trace Cache | Decode efficiency, Trace Cache | Decode efficiency, L2 Cache |
|---|---|---|---|
| Independent | | | |
| NOP | 10.5 (10.5) | 2.87 (2.87) | 1.00 (1.00) |
| SUB | 22.0 (11.0) | 5.73 (2.87) | 2.00 (1.00) |
| XOR | 22.0 (11.0) | 3.99 (2.00) | 2.00 (1.00) |
| TEST | 22.0 (11.0) | 3.42 (1.71) | 2.00 (1.00) |
| XOR/ADD | 22.0 (11.0) | 5.73 (2.87) | 2.00 (1.00) |
| CMP 1 | 22.0 (11.0) | 5.16 (2.58) | 2.00 (1.00) |
| CMP 2 | 44.0 (11.0) | 10.32 (2.58) | 3.99 (1.00) |
| CMP 3 | 63.0 (10.5) | 15.48 (2.58) | 4.26 (0.71) |
| CMP 4 | 63.0 (10.5) | 15.48 (2.58) | 4.26 (0.71) |
| CMP 5 | 63.0 (10.5) | 15.48 (2.58) | 4.26 (0.71) |
| CMP 6* | 32.0 (10.6) | 8.67 (1.45) | 4.26 (0.71) |
| Prefixed CMP 1 | 63.0 (7.9; 10.5**) | 20.62 (2.58) | 4.45 (0.56) |
| Prefixed CMP 2 | 63.0 (7.9; 10.5**) | 20.60 (2.58) | 4.45 (0.56) |
| Prefixed CMP 3 | 63.0 (7.9; 10.5**) | 20.60 (2.58) | 4.45 (0.56) |
| Prefixed CMP 4* | 44.0 (11.0; 14.7**) | 11.56 (1.45) | 4.45 (0.56) |
| Dependent | | | |
| LEA | - | 2.01 (1.00) | 2.01 (1.00) |
| MOV | - | 2.01 (1.00) | 2.01 (1.00) |
| ADD | - | 2.01 (1.00) | 2.01 (1.00) |
| OR | - | 2.00 (1.00) | 2.00 (1.00) |
| SHL | - | 3.00 (1.00) | 3.00 (1.00) |
| ROL | - | 3.00 (1.00) | 3.00 (1.00) |

*2 micro-operations
**in the assumption that prefixes are truncated before they are placed into Trace Cache

Let's proceed to the data obtained on the latest Pentium EE 840. Nothing has changed with TEST and CMP 1: their execute efficiency from Trace Cache is the same as in Prescott/2M based processors. But the execute efficiency of CMP 3 to CMP 6 as well as prefixed CMP 1 to CMP 4 operations from L2 Cache has grown again, even a tad above Nocona. Does it ring any bells? We have just seen a similar picture in the previous test, L1-L2 Cache Bus Real Bandwidth, and the two are obviously related. It's clear now: the drop in execution performance of some large instructions from L2 Cache on processors with 2MB Prescott cores (Revision N0) is the result of the enlarged combined L2 Cache for instructions/data.
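The size of the drop on Prescott N0 and the recovery on Smithfield A0 can be read off the L2 Cache columns of the three tables. A short sketch using only the published values (names are illustrative):

```python
def relative_change(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Decode/execute rate from L2 Cache, bytes/cycle, from the three tables
l2_rate = {
    "CMP 3 to CMP 6": {"Nocona D0": 4.25, "Prescott N0": 4.00, "Smithfield A0": 4.26},
    "Prefixed CMP 1 to 4": {"Nocona D0": 4.40, "Prescott N0": 4.14, "Smithfield A0": 4.45},
}

for group, rate in l2_rate.items():
    drop = relative_change(rate["Prescott N0"], rate["Nocona D0"])
    recovery = relative_change(rate["Smithfield A0"], rate["Prescott N0"])
    print(f"{group}: Nocona to Prescott N0 {drop:+.1f}%, "
          f"Prescott N0 to Smithfield {recovery:+.1f}%")
```

Both groups lose about 6% going to the 2MB core and gain back about 7% on Smithfield, which ends up marginally above Nocona.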

The second significant deterioration of decoder/pipeline efficiency in Prescott/Nocona cores with EM64T was the decreased efficiency of truncating "meaningless" prefixes in the test that executes instructions of the form [0x66]^n NOP (a NOP preceded by n 0x66 prefixes), n = 0..14.




Decode/execute efficiency for prefix instructions, Pentium EE 840


Decode/execute efficiency, bytes/cycle (instructions/cycle)

| Number of prefixes | Pentium 4 (Prescott D0) | Xeon (Nocona D0) | Pentium 4 660 (Prescott N0) | Pentium EE 840 (Smithfield A0) |
|---|---|---|---|---|
| 0 | 2.84 (2.84) | 2.79 (2.79) | 2.80 (2.80) | 2.80 (2.80) |
| 1 | 5.68 (2.84) | 5.41 (2.71) | 5.43 (2.72) | 5.43 (2.72) |
| 2 | 8.52 (2.84) | 8.16 (2.72) | 8.13 (2.71) | 8.13 (2.71) |
| 3 | 11.34 (2.84) | 10.48 (2.62) | 10.42 (2.61) | 10.42 (2.61) |
| 4 | 14.09 (2.82) | 12.73 (2.55) | 12.74 (2.55) | 12.74 (2.55) |
| 5 | 16.89 (2.82) | 14.73 (2.46) | 14.74 (2.46) | 14.74 (2.46) |
| 6 | 19.51 (2.79) | 16.63 (2.38) | 16.64 (2.38) | 16.64 (2.38) |
| 7 | 22.30 (2.79) | 18.75 (2.34) | 18.76 (2.35) | 18.76 (2.35) |
| 8 | 24.87 (2.76) | 20.63 (2.29) | 20.23 (2.25) | 20.23 (2.25) |
| 9 | 27.54 (2.75) | 21.93 (2.19) | 21.96 (2.20) | 21.96 (2.20) |
| 10 | 30.76 (2.80) | 23.44 (2.13) | 23.45 (2.13) | 23.45 (2.13) |
| 11 | 33.24 (2.77) | 25.78 (2.15) | 25.17 (2.10) | 25.17 (2.10) |
| 12 | 35.86 (2.76) | 27.14 (2.09) | 26.46 (2.04) | 26.46 (2.04) |
| 13 | 38.18 (2.73) | 28.64 (2.05) | 27.89 (1.99) | 27.89 (1.99) |
| 14 | 40.85 (2.72) | 30.33 (2.02) | 30.35 (2.02) | 30.35 (2.02) |

Note that the new revision N0 of Prescott cores with EM64T shows very few changes in this respect: there is an unaccountable but easily reproducible "slump" in the case of 13 prefixes directly before the NOP instruction. You can easily see that the same peculiarity pertains to the new Smithfield core as well.
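The two numbers in each cell of the prefix table are linked by the instruction length: a NOP with n 0x66 prefixes occupies n + 1 bytes. The sketch below builds such an instruction and converts bytes/cycle into instructions/cycle. It assumes the single-byte NOP opcode 0x90; the helper names are illustrative and not taken from RMMA.

```python
def prefixed_nop(n):
    """Machine code of a NOP preceded by n operand-size (0x66) prefixes.

    0x90 is the single-byte x86 NOP opcode (assumed to be the one used).
    """
    return bytes([0x66] * n + [0x90])


def instructions_per_cycle(bytes_per_cycle, n):
    """Convert a decode rate in bytes/cycle for an (n + 1)-byte instruction."""
    return bytes_per_cycle / (n + 1)


print(prefixed_nop(2).hex())                         # 666690
print(round(instructions_per_cycle(40.85, 14), 2))   # 2.72, Prescott D0 at 14 prefixes
print(round(instructions_per_cycle(27.89, 13), 2))   # 1.99, Smithfield A0 at 13 prefixes
print(round(instructions_per_cycle(30.35, 14), 2))   # 2.02, Smithfield A0 at 14 prefixes
```

On Prescott D0 the instruction rate stays nearly flat as prefixes are added, while on the EM64T cores it falls from about 2.8 to about 2.0, with the extra dip at 13 prefixes visible as 1.99.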

TLB Characteristics

We shall not analyze D-TLB and I-TLB characteristics in detail, considering that they match (by CPUID descriptors) in all processors under review.




D-TLB size, Pentium EE 840





D-TLB associativity, Pentium EE 840

D-TLB size is 64 page entries (we have already seen that in the other test results); a miss (when the TLB is exhausted) costs the processor at least 57 cycles. Associativity is full.




I-TLB size, Pentium EE 840





I-TLB associativity, Pentium EE 840

I-TLB size is 64 entries, the miss penalty is 45 cycles (forward and backward walk) or higher (random walk), and associativity is full. It's important to note that the buffer size (which is a shared resource) in this processor does not grow to 128 entries when Hyper-Threading is disabled (according to CPUID data as well as the test results), as it did in all previous Pentium 4 processors. This behaviour is rather strange. At least it should not be an inevitable consequence of the dual core design, considering that each core is like a separate, independent physical processor with its own resources such as the I-TLB. There is a chance that the current first revision of the Smithfield core had some problems that did not allow the full-sized I-TLB with Hyper-Threading disabled, or this issue was just... forgotten by the manufacturer :).

Conclusion

Our analysis shows that the new Smithfield core can be considered a full-fledged continuation of the NetBurst microarchitecture as implemented in Prescott, Nocona, and Irwindale cores. According to CPUID characteristics as well as the results of today's tests, the new core (to be more exact, its single-core half) can be ranked somewhere between the latest revision of the Nocona core (E0) and the latest revisions of the Prescott/2M and Irwindale (N0) cores. Namely, the new Smithfield A0 core has exactly the same Trace Cache, decoder, pipeline, and execution units as the Prescott/2M and Irwindale cores. But because its L2 Cache (per core) is half the size, it has improved L1-L2 bus performance characteristics, which even surpass those of the Nocona E0 core.

Considering the above as well as the organization of the multi-core architecture (two independent, isolated cores sitting on a common system bus), we should expect that in terms of performance the new platform will be at least on a par with a dual-processor system with equally-clocked Intel Xeon processors on Nocona E0 cores, or even outperform it due to the improved memory controller integrated into the Intel 955X chipset.

Dmitri Besedin (dmitri_b@ixbt.com)

May 11, 2005.



Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.