Just two days ago we published an article comparing the performance of the new dual core Intel platform (Pentium Extreme Edition 840 processor, Intel 955X chipset) with traditional dual processor platforms: SMP systems based on equally clocked Intel Xeon processors (Nocona and Irwindale cores) and the Intel E7525 workstation chipset. Some tests gave really interesting results. The system based on the 3.2 GHz dual core Pentium Extreme Edition 840 (a strict analogue of a dual processor system based on the 3.2 GHz Xeon (Nocona)) outperformed not only that platform, but also the platform built on Intel Xeon processors with the Irwindale core, which have twice as much L2 cache (2 MB in each processor, compared to 2 MB of L2 cache for the entire Pentium Extreme Edition 840, that is 1 MB per core).
Such a result could be explained by the faster DDR2-533 memory on the desktop dual core platform compared to the Registered ECC DDR2-400 used in server platforms. It's quite clear that the reason is not the higher DDR2-533 bandwidth, whose potential cannot be fully revealed in this case (dual channel mode) because of the 200 MHz (quad-pumped, 800 MHz effective) FSB. Registered modules are partly to blame, but the most likely reason is that the memory controller in the new i955X chipset has better characteristics than the one in the older E7525. Well, enough guessing: in this short article we shall compare the main memory characteristics of the platforms quantitatively. The recently released RightMark Memory Analyzer 3.55 will help us in the matter.
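To make the FSB bottleneck argument concrete, here is a minimal sketch of the peak bandwidth arithmetic. It assumes figures the article does not spell out: a 64-bit (8-byte) data path for both the FSB and each DDR2 channel, and an effective FSB rate of 800 million transfers per second (200 MHz, quad-pumped). The function name is purely illustrative.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_width_bytes=8, channels=1):
    """Theoretical peak bandwidth in MB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * bus_width_bytes * channels

# Assumption: 200 MHz quad-pumped FSB = 800 MT/s over a 64-bit bus.
fsb = peak_bandwidth_mb_s(800)

# Dual channel DDR2-400 (400 MT/s per channel).
ddr2_400_dual = peak_bandwidth_mb_s(400, channels=2)

# Dual channel DDR2-533 (nominally 533.33 MT/s per channel).
ddr2_533_dual = peak_bandwidth_mb_s(1600 / 3, channels=2)

# The processor can never see more than the narrower of the two links.
usable_with_ddr2_533 = min(fsb, ddr2_533_dual)

print(f"FSB:                  {fsb:.0f} MB/s")
print(f"DDR2-400 dual:        {ddr2_400_dual:.0f} MB/s")
print(f"DDR2-533 dual:        {ddr2_533_dual:.0f} MB/s")
print(f"Usable with DDR2-533: {usable_with_ddr2_533:.0f} MB/s")
```

Under these assumptions the extra headroom of dual channel DDR2-533 sits behind the same FSB ceiling that dual channel DDR2-400 already matches, which is why raw memory bandwidth alone cannot explain the difference between the platforms.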
Real Memory Bandwidth
Real read and write memory bandwidth was tested in two modes: with hardware prefetch enabled, which is the normal processor mode, and with hardware prefetch disabled. In both modes, the real memory read/write bandwidth results were obtained without software prefetch, while the maximum real memory read bandwidth was obtained with software prefetch (using PREFETCHNTA instructions with the optimal prefetch distance). Finally, the maximum real memory write bandwidth was obtained by the Non-Temporal Store method (using instructions such as MOVNTPS/MOVNTDQ).
To avoid confusion in interpreting relative percentages, the tables below give a value in parentheses for the lower-performing platform, showing how much worse a given parameter is on that platform than on the other one.
The absolute results of the Pentium Extreme Edition 840 desktop platform are impressive: its real memory read bandwidth (5747 MB/s) is higher (!) than the maximum real memory read bandwidth obtained on the Xeon (Irwindale) platform, 5641 MB/s. By the way, the latter is only 88% of the theoretical FSB bandwidth and of the theoretical DDR2-400 bandwidth. According to our many reviews of Intel Pentium 4 platforms, tests with software prefetch practically always reach 100% of the theoretical memory bandwidth, regardless of the chipset type and its operating mode (sometimes even more, thanks to a higher FSB frequency as well as a relatively large L2 or L3 cache). Thus, we can conclude that a memory performance loss of approximately 15% on dual processor Intel Xeon platforms is due solely to registered modules and the error correction code (ECC).
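The percentages in this paragraph can be checked with a few lines of arithmetic. The sketch below uses the two measured values quoted above and assumes a theoretical peak of 6400 MB/s (800 MT/s over a 64-bit bus, which is also the dual channel DDR2-400 peak); that constant is an assumption derived from the platform specifications, not a number printed in the article.

```python
# Assumption: theoretical FSB / dual channel DDR2-400 peak, in MB/s.
THEORETICAL_MB_S = 6400

# Measured values quoted in the text, in MB/s.
XEON_MAX_READ_MB_S = 5641   # Xeon (Irwindale), maximum real read bandwidth
PXE_REAL_READ_MB_S = 5747   # Pentium XE 840, real read bandwidth (no software prefetch)

# Fraction of the theoretical peak that each result reaches.
xeon_efficiency = XEON_MAX_READ_MB_S / THEORETICAL_MB_S
pxe_efficiency = PXE_REAL_READ_MB_S / THEORETICAL_MB_S

# The same Xeon loss expressed as a "times lower" factor.
xeon_loss_factor = THEORETICAL_MB_S / XEON_MAX_READ_MB_S

print(f"Xeon max read:    {xeon_efficiency:.1%} of peak")
print(f"Pentium XE read:  {pxe_efficiency:.1%} of peak")
print(f"Xeon loss factor: {xeon_loss_factor:.2f}x")
```

With these inputs the Xeon result comes out at about 88% of peak, or a factor of roughly 1.13, which is in the same range as the rounded 15% figure used in the text.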
As mentioned above, another important factor that influences memory performance is the chipset itself (to be more exact, its built-in memory controller). The performance losses of the older E7525 chipset are more prominent in the real memory read bandwidth tests. The excellent hardware prefetch algorithm partially hides the gap between the i955X and the E7525 (in this case the memory bandwidth of the Xeon platforms is lower than that of the Pentium XE 840 platform by a factor of 1.32), but disabling hardware prefetch exposes the advantage of the latest desktop chipset over the older workstation chipset (E7525). In this case the Xeon platform falls behind the dual core platform by a factor of almost 1.5.
The results of the maximum real memory write bandwidth tests are the least interesting. Here everything is limited to 2/3 of the theoretical memory bandwidth, a ceiling that is always lower than the maximum real memory bandwidth, even for registered DDR2-400. That's why the differences between the platforms in this parameter are negligibly small.
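A quick calculation shows why the write test cannot separate the platforms. This is a sketch only, assuming the same 6400 MB/s theoretical peak for both systems (an assumption based on an 800 MT/s, 64-bit FSB) and using the Xeon maximum read figure quoted earlier in the article.

```python
# Assumption: theoretical peak shared by both platforms, in MB/s.
THEORETICAL_MB_S = 6400

# Ceiling described in the text: 2/3 of the theoretical bandwidth.
write_ceiling_mb_s = THEORETICAL_MB_S * 2 / 3

# Lowest "maximum real" figure quoted in the article (Xeon, registered DDR2-400).
XEON_MAX_READ_MB_S = 5641

# If the ceiling is below even the slower platform's best result,
# both platforms hit the same limit and the test cannot tell them apart.
ceiling_is_the_bottleneck = write_ceiling_mb_s < XEON_MAX_READ_MB_S

print(f"Write ceiling: {write_ceiling_mb_s:.0f} MB/s")
print(f"Ceiling below Xeon max read: {ceiling_is_the_bottleneck}")
```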
Memory Latency
Memory latency for pseudo-random access (random within one page, but sequential at the level of full pages) and for random access was also measured in two modes, with hardware prefetch enabled and disabled. Remember that the first mode provides the "real" memory latency, while the second provides a sort of ideal latency that depends only on the memory modules and the chipset, not on the CPU.
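The difference between the two access modes is easier to see in code. The sketch below only generates the visiting orders and links them into a dependent chain, the usual way latency tests force each load to wait for the previous one. It does not reproduce RightMark Memory Analyzer's implementation; the function names and the sizes (4 pages of 64 lines each) are illustrative assumptions.

```python
import random

def pseudo_random_order(num_pages, lines_per_page, rng):
    """Pages are visited sequentially; lines inside each page in random order."""
    order = []
    for page in range(num_pages):
        lines = list(range(lines_per_page))
        rng.shuffle(lines)
        order.extend(page * lines_per_page + line for line in lines)
    return order

def random_order(num_pages, lines_per_page, rng):
    """Every line in the whole block is visited in fully random order."""
    order = list(range(num_pages * lines_per_page))
    rng.shuffle(order)
    return order

def build_chain(order):
    """chain[i] is the line to visit after line i, so each access depends on the last."""
    chain = [0] * len(order)
    for current, following in zip(order, order[1:] + order[:1]):
        chain[current] = following
    return chain

def walk(chain, start):
    """Follow the chain for one full lap and return the lines visited."""
    visited, index = [], start
    for _ in range(len(chain)):
        visited.append(index)
        index = chain[index]
    return visited

rng = random.Random(0)  # fixed seed so the example is repeatable
pseudo = pseudo_random_order(num_pages=4, lines_per_page=64, rng=rng)
full = random_order(num_pages=4, lines_per_page=64, rng=rng)

pseudo_pages = [line // 64 for line in pseudo]  # page index of each access
pseudo_walk = walk(build_chain(pseudo), pseudo[0])
full_walk = walk(build_chain(full), full[0])
```

In the pseudo-random order the page index never decreases, so page-level costs are paid once per page, while the fully random order can jump to a different page on every access.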
While the memory bandwidth disadvantage of the Xeon (Irwindale) platform reaches a factor of 1.5 at most, the situation with memory latency is even worse. Interestingly, it hardly depends on whether hardware prefetch is enabled or disabled (prefetch quite naturally influences the absolute values, but the relative standing of the platforms remains the same with it disabled). On average, the Xeon platform trails the Pentium XE 840 desktop platform by a factor of 1.4 in random access latency. For the pseudo-random walk, the gap grows to a factor of 1.55 to 1.7.
Conclusion
Thus, the reason for the lower performance of Intel Xeon dual processor server platforms (with Irwindale as the example) compared to the desktop dual core Intel Pentium Extreme Edition is now established. The weak spot of Intel's server platforms is their memory system. Firstly, it requires registered DDR2-400 modules with ECC. Secondly, it's based on the older E7525 chipset, whose memory controller is noticeably inferior to the one in the new desktop i955X chipset.
Memory bandwidth losses due to registered memory modules amount to a factor of 1.15 (relative to the maximum theoretical value, which can actually be obtained on the Pentium XE 840/i955X). The memory controller in the E7525 chipset has a noticeably stronger influence: the average memory performance drop due to the chipset amounts to a factor of 1.3 (regardless of whether the modules are registered or not), and in some cases it even reaches a factor of 1.5.
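Since "N times lower" and "N percent lower" are easy to confuse, the following sketch converts the factors quoted above into percentage drops. The helper name and the labels are illustrative; the only inputs are the three factors from the text.

```python
def percent_drop(factor):
    """Convert a 'times lower' factor into a percentage drop from the better value."""
    return (1 - 1 / factor) * 100

# Factors quoted in the text.
factors = {
    "registered modules": 1.15,
    "E7525 chipset, average": 1.3,
    "E7525 chipset, worst case": 1.5,
}

drops = {label: percent_drop(factor) for label, factor in factors.items()}

for label, drop in drops.items():
    print(f"{label}: {factors[label]}x lower = {drop:.1f}% drop")
```

Read this way, a factor of 1.5 means the slower platform delivers about two thirds of the faster platform's figure, not half of it.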
In conclusion, I want to note that despite the significant differences in the low-level memory characteristics of these platforms, the performance differences in real tests are much smaller. This can be explained by the fact that real applications and tests are far from 100% sensitive to memory bandwidth and latency.
Dmitri Besedin (firstname.lastname@example.org)
June 22, 2005.