Just two days ago we published an article comparing the performance of the new dual core Intel platform (Pentium Extreme Edition 840 processor, Intel 955X chipset) with traditional dual processor platforms: SMP systems based on equally clocked Intel Xeon processors (Nocona and Irwindale cores) and the Intel E7525 workstation chipset. Some tests gave really interesting results: the system based on the 3.2 GHz dual core Pentium Extreme Edition 840 (a strict analog of a dual processor system built on 3.2 GHz Xeon (Nocona) processors) outperformed not only that platform, but also the platform built on Intel Xeon processors with the Irwindale core, which have twice as much L2 cache per core (2 MB in each processor, compared to 2 MB for the entire Pentium Extreme Edition 840, that is 1 MB per core).
Such a result could be explained by the faster DDR2-533 memory on the desktop dual core platform compared to the Registered ECC DDR2-400 used in the server platforms. Clearly, the reason is not the higher bandwidth of DDR2-533 as such: in dual channel mode its potential cannot be fully realized because of the 200 MHz (800 MHz effective) FSB. Registered modules are partly to blame, but the most likely reason is that the memory controller in the new i955X chipset is better than the one in the older E7525. Well, enough guessing. In this short article we shall compare the main memory characteristics of the two platforms quantitatively, with the help of the recently released RightMark Memory Analyzer 3.55.
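The FSB argument above comes down to simple arithmetic. The sketch below assumes a 64-bit (8-byte) data path for both the FSB and each DDR2 channel, and uses the nominal transfer rates (800, 533 and 400 MT/s); the function and variable names are ours, chosen for illustration only.

```python
# Theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes).

def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bytes, channels=1):
    """Peak bandwidth = transfer rate (MT/s) x bus width (bytes) x channels."""
    return mega_transfers_per_s * bus_width_bytes * channels

# 200 MHz quad-pumped FSB: 800 MT/s over an assumed 8-byte data bus.
fsb_800 = peak_bandwidth_mb_s(800, 8)

# Dual channel DDR2, assumed 8 bytes per channel, nominal transfer rates.
ddr2_533_dual = peak_bandwidth_mb_s(533, 8, channels=2)
ddr2_400_dual = peak_bandwidth_mb_s(400, 8, channels=2)

# The processor cannot see more than the slower of the two links delivers.
usable_i955x = min(fsb_800, ddr2_533_dual)
usable_e7525 = min(fsb_800, ddr2_400_dual)

print("FSB:", fsb_800)
print("DDR2-533 dual channel:", ddr2_533_dual)
print("DDR2-400 dual channel:", ddr2_400_dual)
print("Usable on i955X / E7525:", usable_i955x, usable_e7525)
```

Under these assumptions both platforms end up with the same 6400 MB/s ceiling, which is why the extra raw bandwidth of DDR2-533 alone cannot explain the difference.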
Testbed configurations
Testbed 1
- CPU: Intel Pentium Extreme Edition 840 (Smithfield core, 2 x 1 MB L2, 800 MHz FSB, 2 x 3.2 GHz core)
- Motherboard: ASUS P5WD2-Premium (Intel 955X chipset, BIOS 0205 dated 04/22/2005)
- Memory: 2x512 MB PC2-5400 Corsair XMS2 PRO DDR2-533, 3-3-3-8
- Video card: ATI Radeon X800 (256 MB)
- HDD: Samsung SP1614C (SATA), 7200 rpm, 8 MB Cache
- Power supply: FSP 550-60PLN (500-550W)
Testbed 2
- Processors: 2 x Intel Xeon 3.2 GHz (Irwindale core, 2 MB L2, 800 MHz FSB)
- Motherboard: ASUS NCT-D (Intel E7525 chipset, BIOS 1006 dated 02/23/2005)
- Memory: 2x512 MB PC2-3200 Samsung DDR2-400, ECC, 3-3-3-8
- Video card: ATI Radeon X800 (256 MB)
- HDD: Samsung SP1614C (SATA), 7200 rpm, 8 MB Cache
- Power supply: FSP 550-60PLN (500-550W)
Software
Real Memory Bandwidth
Real memory read and write bandwidth was measured in two modes: with hardware prefetch enabled, which is the normal processor mode, and with hardware prefetch disabled. In each mode, the real memory read/write bandwidth figures were obtained without software prefetch, the maximum real memory read bandwidth was obtained with software prefetch (PREFETCHNTA instructions with the optimal prefetch distance), and the maximum real memory write bandwidth was obtained with the Non-Temporal Store method (MOVNTPS/MOVNTDQ instructions).
To avoid confusion in interpreting relative percentages, the tables below give a value in parentheses for the slower platform: the ratio between the two platforms' results, showing how many times worse the slower platform is for the given parameter.
| Characteristic | Pentium XE 840 (Smithfield) | Xeon (Irwindale) |
|---|---|---|
| Real Memory Read Bandwidth, MB/s | 5747 | 4345 (1.32) |
| Real Memory Write Bandwidth, MB/s | 2153 | 1878 (1.15) |
| Real Memory Read Bandwidth without Hardware Prefetch, MB/s | 3605 | 2422 (1.49) |
| Real Memory Write Bandwidth without Hardware Prefetch, MB/s | 2229 | 1725 (1.29) |
| Maximum Real Memory Read Bandwidth, MB/s | 6501 | 5641 (1.15) |
| Maximum Real Memory Write Bandwidth, MB/s | 4281 | 4232 (1.01) |
| Maximum Real Memory Read Bandwidth without Hardware Prefetch, MB/s | 6532 | 5614 (1.16) |
| Maximum Real Memory Write Bandwidth without Hardware Prefetch, MB/s | 4281 | 4233 (1.01) |
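The parenthetic values can be reproduced directly from the measured figures. The following minimal sketch takes the numbers from the table and divides the Pentium XE 840 result by the Xeon result; the dictionary keys are shorthand labels of our own.

```python
# (Pentium XE 840, Xeon Irwindale) bandwidth in MB/s, copied from the table.
bandwidth_mb_s = {
    "read": (5747, 4345),
    "write": (2153, 1878),
    "read, no HW prefetch": (3605, 2422),
    "write, no HW prefetch": (2229, 1725),
    "max read": (6501, 5641),
    "max write": (4281, 4232),
    "max read, no HW prefetch": (6532, 5614),
    "max write, no HW prefetch": (4281, 4233),
}

# Ratio > 1 means the Xeon platform is that many times slower.
ratios = {
    name: round(pentium / xeon, 2)
    for name, (pentium, xeon) in bandwidth_mb_s.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```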
The absolute results of the Pentium Extreme Edition 840 desktop platform are impressive: its real memory read bandwidth (5747 MB/s) is higher (!) than the maximum real memory read bandwidth obtained on the Xeon (Irwindale) platform, 5641 MB/s. By the way, the latter is only 88% of the theoretical bandwidth of the FSB and of dual channel DDR2-400 (6400 MB/s each). According to our many reviews of Intel Pentium 4 platforms, tests with software prefetch practically always reach 100% of the theoretical memory bandwidth, regardless of the chipset type and its operating mode (sometimes even more, thanks to a higher FSB frequency or a relatively large L2 or L3 cache). We can therefore conclude that the loss of approximately 15% of memory performance on dual processor Intel Xeon platforms (a factor of 1.15 relative to the Pentium XE 840 platform) is due solely to registered modules and error correction code (ECC).
As mentioned above, another important factor in memory performance is the chipset itself, or more precisely its built-in memory controller. The losses in the older E7525 chipset are most prominent in the real memory read bandwidth tests. The excellent hardware prefetch algorithm partially hides the gap between the i955X and the E7525 (here the memory bandwidth of the Xeon platform is 1.32 times lower than that of the Pentium XE 840 platform), while disabling hardware prefetch exposes the full advantage of the latest desktop chipset over the older workstation chipset: in this case the Xeon platform is almost 1.5 times slower than the dual core platform.
Results of the maximum real memory write bandwidth tests are the least interesting: here everything is limited to about 2/3 of the theoretical memory bandwidth, a ceiling that lies below the maximum real memory bandwidth even of registered DDR2-400. That is why the differences between the platforms in this parameter are negligible.
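The percentages quoted in this section can be checked against the theoretical peak. This sketch assumes a peak of 6400 MB/s for both platforms (800 MT/s FSB with an 8-byte data path); the helper name is ours.

```python
# Assumed theoretical peak: 800 MT/s FSB x 8 bytes = 6400 MB/s.
PEAK_MB_S = 6400

def percent_of_peak(measured_mb_s):
    """Measured bandwidth as a percentage of the assumed theoretical peak."""
    return round(100 * measured_mb_s / PEAK_MB_S, 1)

max_read_pentium = percent_of_peak(6501)   # software prefetch, i955X
max_read_xeon = percent_of_peak(5641)      # software prefetch, E7525
max_write_pentium = percent_of_peak(4281)  # non-temporal store, i955X
max_write_xeon = percent_of_peak(4232)     # non-temporal store, E7525

print("Max read, % of peak:", max_read_pentium, max_read_xeon)
print("Max write, % of peak:", max_write_pentium, max_write_xeon)
```

The read figures correspond to the roughly 100% and 88% mentioned in the text, and both write figures sit close to 2/3 of the peak.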
Memory Latency
Memory latency for pseudo-random access (random within one page, but sequential at the level of full pages) and for random access was also measured in two modes, with hardware prefetch enabled and disabled. Remember that the first mode gives the "real" memory latency, while the second gives a sort of ideal latency that depends only on the memory modules and the chipset, not on the CPU.
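To make the difference between the two access modes concrete, here is a minimal sketch of how such access orders could be generated. It is not the code of RightMark Memory Analyzer; the 4 KB page size, the 64-byte access granularity and all names are assumptions made for illustration.

```python
import random

PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 64    # assumed access granularity (cache line) in bytes

def pseudo_random_walk(buffer_size, rng):
    """Pages are visited in order; lines inside each page are shuffled."""
    lines_per_page = PAGE_SIZE // LINE_SIZE
    order = []
    for page_start in range(0, buffer_size, PAGE_SIZE):
        offsets = [page_start + i * LINE_SIZE for i in range(lines_per_page)]
        rng.shuffle(offsets)
        order.extend(offsets)
    return order

def random_walk(buffer_size, rng):
    """All lines of the buffer are shuffled with no regard to pages."""
    order = list(range(0, buffer_size, LINE_SIZE))
    rng.shuffle(order)
    return order

rng = random.Random(0)
buffer_size = 4 * PAGE_SIZE  # buffer size must be a multiple of PAGE_SIZE
pseudo = pseudo_random_walk(buffer_size, rng)
full = random_walk(buffer_size, rng)

# Page index of each access: non-decreasing for the pseudo-random walk.
pseudo_pages = [offset // PAGE_SIZE for offset in pseudo]
print(pseudo_pages[:3], pseudo_pages[-3:])
```

Both walks touch every line exactly once; only the pseudo-random walk stays inside one page until that page is exhausted.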
| Characteristic | Pentium XE 840 (Smithfield) | Xeon (Irwindale) |
|---|---|---|
| Pseudo Random Access Latency (min - max), ns | 47.4 - 55.3 | 77.7 - 86.1 (1.56 - 1.64) |
| Pseudo Random Access Latency (min - max) without Hardware Prefetch, ns | 72.8 - 95.2 | 125.8 - 149.5 (1.57 - 1.73) |
| Random Access Latency (min - max), ns | 93.7 - 114.9 | 137.4 - 159.5 (1.39 - 1.46) |
| Random Access Latency (min - max) without Hardware Prefetch, ns | 94.7 - 118.0 | 138.7 - 163.3 (1.38 - 1.46) |
While the memory bandwidth disadvantage of the Xeon (Irwindale) platform reaches a factor of 1.5 at most, the situation with memory latency is even worse. Interestingly, the gap hardly depends on whether hardware prefetch is enabled or disabled (prefetch naturally affects the absolute values, but the relative standing of the platforms stays the same). On average, the Xeon platform loses to the Pentium XE 840 desktop platform by a factor of 1.4 in random access latency. For the pseudo-random walk, the gap grows to 1.55 - 1.7 times.
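The latency gaps quoted above follow from the table in the same way as the bandwidth ratios. The sketch below divides the Xeon latency by the Pentium XE 840 latency for the minimum and maximum values separately; the labels and helper name are ours.

```python
# (min, max) latency in ns, copied from the table: Pentium XE 840, then Xeon.
latency_ns = {
    "pseudo-random": ((47.4, 55.3), (77.7, 86.1)),
    "pseudo-random, no HW prefetch": ((72.8, 95.2), (125.8, 149.5)),
    "random": ((93.7, 114.9), (137.4, 159.5)),
    "random, no HW prefetch": ((94.7, 118.0), (138.7, 163.3)),
}

def latency_ratios(pentium, xeon):
    """Xeon / Pentium ratio for the min and the max latency."""
    return tuple(x / p for p, x in zip(pentium, xeon))

ratios = {name: latency_ratios(p, x) for name, (p, x) in latency_ns.items()}

# Average gap over the random access rows (with and without HW prefetch).
random_values = ratios["random"] + ratios["random, no HW prefetch"]
random_mean = sum(random_values) / len(random_values)

# Spread of the gap over the pseudo-random rows.
pseudo_values = ratios["pseudo-random"] + ratios["pseudo-random, no HW prefetch"]

print("Random access, mean ratio:", round(random_mean, 2))
print("Pseudo-random, min and max ratio:",
      round(min(pseudo_values), 2), round(max(pseudo_values), 2))
```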
Conclusion
Thus, the reason why server Intel Xeon dual processor platforms (with Irwindale as the example) perform worse than the desktop dual core Intel Pentium Extreme Edition is now clear. The weak spot of Intel's server platforms is their memory system. Firstly, it requires registered DDR2-400 modules with ECC. Secondly, it is based on the older E7525 chipset, whose memory controller is noticeably inferior to the one in the new desktop i955X chipset.
Memory bandwidth losses due to registered memory modules amount to a factor of 1.15 (relative to the maximum theoretical value, which can actually be reached on Pentium XE 840/i955X). The memory controller in the E7525 chipset has a noticeably stronger influence: the average memory performance drop due to the chipset amounts to a factor of 1.3 (regardless of whether the modules are registered or not), and in some cases it reaches 1.5.
In conclusion I want to note that despite the significant differences in the low-level memory characteristics of these platforms, the performance differences in real tests are much smaller. This is because real applications and tests are far from 100% sensitive to memory bandwidth and latency.