We have touched upon the real potential of dual-channel DDR2 memory many times, and the reason lies in the nature of DDR2 memory itself. Its bandwidth in dual-channel mode is very high, starting from the earliest modifications such as DDR2-533. As a rule, it exceeds the bandwidth of the other components in the CPU registers - RAM path, and those components limit the throughput of the memory subsystem as a whole.
The most frequent bottleneck is the FSB, which links the processor's BIU (in effect, its L2 cache) with the memory controller in the Northbridge. This is the case with Intel platforms, which are known for such memory organization. The FSB frequency of these platforms does not exceed 200 MHz (most Pentium 4 systems, including the dual-core Pentium D) or 266 MHz (Pentium 4 Extreme Edition and the dual-core Pentium Extreme Edition, as well as the new platforms with dual-core Intel Core processors). In bandwidth terms, this corresponds only to dual-channel DDR2-400 and DDR2-533 respectively. Thus faster memory types, such as DDR2-667, DDR2-800, and especially the unofficial DDR2-1066, are not justified on this class of platforms, at least from the point of view of memory bandwidth. On the other hand, a higher memory frequency reduces memory latencies. That is definitely a plus, and it serves as an argument for using fast DDR2 memory on the Intel platform.
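The correspondence between FSB clocks and DDR2 ratings quoted above follows from simple peak-bandwidth arithmetic. The sketch below assumes a quad-pumped, 64-bit (8-byte) FSB and 64-bit DDR2 channels, with GB meaning 10^9 bytes. The function and constant names are ours and purely illustrative.

```python
from fractions import Fraction

def fsb_peak_gbs(base_clock_mhz, transfers_per_clock=4, width_bytes=8):
    """Peak FSB bandwidth in GB/s, assuming a quad-pumped 64-bit bus."""
    return float(base_clock_mhz * transfers_per_clock * width_bytes / 1000)

def ddr2_peak_gbs(rating_mts, channels=2, width_bytes=8):
    """Peak DDR2 bandwidth in GB/s for a given rating (e.g. 800 for DDR2-800)."""
    return float(rating_mts * channels * width_bytes / 1000)

# "266 MHz" and "DDR2-533" are nominal names for 266.67 MHz and 533.33 MT/s.
FSB_266 = Fraction(800, 3)
DDR2_533 = Fraction(1600, 3)

for label, fsb, ddr2 in [("200 MHz FSB vs DDR2-400", 200, 400),
                         ("266 MHz FSB vs DDR2-533", FSB_266, DDR2_533)]:
    print(f"{label}: {fsb_peak_gbs(fsb):.2f} GB/s vs {ddr2_peak_gbs(ddr2):.2f} GB/s")

print(f"Dual-channel DDR2-800: {ddr2_peak_gbs(800):.2f} GB/s")
```

Under these assumptions each FSB speed matches exactly one dual-channel DDR2 rating, so any faster memory has bandwidth the bus cannot carry.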
Anyway, many hopes of revealing the real potential of DDR2-667 and faster memory were pinned on the new AMD AM2 platform (in official terms, AMD K8 New Platform Technology, NPT). It has a dual-channel memory controller integrated into the processor and overhauled for DDR2 memory (from DDR2-400 to DDR2-800 inclusive) instead of the old DDR, which never passed the DDR-400 frequency limit. Indeed, in this case the memory controller is located right in the processor, it operates at full CPU frequency, and it has its own memory interface, whose frequency may range from 200 MHz to 400 MHz and is set by an integer divider of the full CPU clock. Nothing seems to stand in the way of "pumping" data from memory to CPU registers at the maximum possible speed... but no. According to our analysis, the bottleneck this time is the L1-L2 bus of the processor, which has a relatively low capacity (64 bits in each direction) and a complex (exclusive) organization. Theoretically, its peak bandwidth is 8.0 bytes/cycle. But according to our tests, its real bandwidth is about half that (about 4.0 bytes/cycle), which is not enough to provide a non-stop data flow from dual-channel DDR2-800 into CPU registers even on the Athlon 64 FX-62 at 2.8 GHz.
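To see why about 4.0 bytes/cycle is not enough, it helps to convert the per-cycle figures into GB/s at the 2.8 GHz clock mentioned above. The following is a minimal sketch. The helper names are hypothetical, and the divider values are picked only to illustrate how an integer divider of the CPU clock yields a 400 MHz memory clock.

```python
def bus_bandwidth_gbs(bytes_per_cycle, cpu_clock_ghz):
    """Bandwidth of an on-die bus that moves bytes_per_cycle at the CPU clock."""
    return bytes_per_cycle * cpu_clock_ghz

def memory_clock_mhz(cpu_clock_mhz, divider):
    """Memory clock derived from the CPU clock through an integer divider."""
    return cpu_clock_mhz / divider

CPU_CLOCK_GHZ = 2.8          # Athlon 64 FX-62
DDR2_800_DUAL_GBS = 12.8     # theoretical dual-channel DDR2-800 bandwidth

peak = bus_bandwidth_gbs(8.0, CPU_CLOCK_GHZ)   # theoretical L1-L2 figure
real = bus_bandwidth_gbs(4.0, CPU_CLOCK_GHZ)   # approximate measured figure

print(f"L1-L2 peak: {peak:.1f} GB/s, real: {real:.1f} GB/s")
print(f"Real L1-L2 bandwidth covers DDR2-800: {real >= DDR2_800_DUAL_GBS}")

# Illustrative dividers: 2800 MHz / 7 and 2400 MHz / 6 both give 400 MHz.
print(memory_clock_mhz(2800, 7), memory_clock_mhz(2400, 6))
```

At the theoretical 8.0 bytes/cycle the bus would comfortably exceed 12.8 GB/s. At the measured rate it falls short, which is the bottleneck described above.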
As the bottleneck of the AMD AM2 system lies right inside the processor core, we could only conclude that analysis with the assumption that two cores are better than one. That's because the memory bus in AMD Athlon 64 X2 processors is shared by both cores, while each core has its own internal components. Thus memory bandwidth can be utilized better when both cores of a processor access memory simultaneously. That's true at least theoretically, and the test results published in this article will tell us whether it's true in practice.
So, our objective here is to compare how efficiently dual-channel DDR2-800 memory bandwidth is used when memory is accessed by both CPU cores versus a single core. In our tests we used RightMark Multi-Threaded Memory Test, a part of RightMark Memory Analyzer 3.7 and higher. This utility is intended to measure bandwidth (either L2 cache bandwidth, analyzed in our previous article, or system memory bandwidth, analyzed here) under single- or multi-threaded access with the following operations: Read, Write, Read PF (software prefetch), and Write NT (non-temporal store). We also decided to run the tests not only on AMD AM2 platforms, but also on modern platforms based on Intel Core 2 Duo processors for comparison. The first reviews of these processors showed that they do not fully utilize even the 266 MHz FSB bandwidth under single-core access.
Intel Core 2 Duo
We start the analysis of our test results with this platform because it is a less likely candidate for revealing the real potential of dual-channel DDR2-800 than the AMD AM2 platform, which is analyzed next.
We evaluated the efficiency of memory access on Intel Core 2 Duo in two configurations: with the new Intel 965 chipset, designed for the new platform (represented by Intel 965G on the Gigabyte 965G-DS3 motherboard, Testbed 1), and with the older Intel 975X chipset (ASUS P5W DH Deluxe motherboard, Testbed 2), which presumably offers higher performance.
Picture 1. Memory bandwidth (GB/s), Intel Core 2 Duo E6600, Intel 965G
Test results for the Intel 965G chipset are shown in Picture 1. The data block is always 32 MB in this article (32 MB for single-threaded access, 16+16 MB for two-threaded access). Single-threaded memory access on this processor demonstrated a bandwidth of 5.96 GB/s (69.8% of the theoretical FSB bandwidth) for reading and 2.39 GB/s (28.0%) for writing. Software prefetch (the prefetch distance was adjusted to 1024 bytes to get maximum memory bandwidth) increases the read bandwidth to 6.76 GB/s (79.2%), and non-temporal store increases the write bandwidth to 4.84 GB/s (56.7%). Thus, real memory bandwidth is in all cases far from the theoretical maximum of FSB bandwidth, 8.53 GB/s. At best (software prefetch), it is about 80% of this limit.
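The percentages in this paragraph are the measured values divided by the theoretical FSB bandwidth. A short sketch that reproduces them from the single-threaded Intel 965G figures, assuming a quad-pumped 64-bit FSB at a nominal 266.67 MHz (variable names are illustrative):

```python
# Theoretical FSB bandwidth: 266.67 MHz x 4 transfers/clock x 8 bytes, in GB/s.
FSB_PEAK_GBS = (800 / 3) * 4 * 8 / 1000

# Single-threaded results for Intel 965G (GB/s).
single_thread_gbs = {
    "Read": 5.96,
    "Write": 2.39,
    "Read PF": 6.76,
    "Write NT": 4.84,
}

def efficiency_percent(measured_gbs, peak_gbs):
    """Measured bandwidth as a percentage of the theoretical peak."""
    return 100.0 * measured_gbs / peak_gbs

efficiency = {op: round(efficiency_percent(gbs, FSB_PEAK_GBS), 1)
              for op, gbs in single_thread_gbs.items()}

for op, pct in efficiency.items():
    print(f"{op:8s} {single_thread_gbs[op]:.2f} GB/s  {pct:.1f}% of FSB peak")
```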
Starting the second thread (that is, engaging the second CPU core) to access data in system memory has practically no effect on write bandwidth, for both regular and non-temporal stores. At the same time, memory bandwidth grows to 7.38 GB/s (by 23.8%) for regular reading and to 8.33 GB/s (by a similar 23.2%) for reading with software prefetch. This result looks much better, because the maximum memory bandwidth demonstrated in this case is nearly 98% of the theoretical FSB maximum. Nevertheless, we still cannot say that the DDR2-800 potential has been revealed: its theoretical bandwidth is 12.8 GB/s. However, considering that the FSB bandwidth is on the level of DDR2-533, we could not expect that on this platform.
Picture 2. Memory bandwidth (GB/s), Intel Core 2 Duo E6600, Intel 975X
Intel 975X (Pic. 2) demonstrates slightly different absolute values of memory bandwidth. Comparing the results of this chipset with those of the Intel 965G above, we cannot name a single winner among Intel chipsets: sometimes Intel 965G is better, sometimes Intel 975X. Namely, Intel 975X (other things being equal) shows higher single-threaded read bandwidth, for regular reading as well as reading with software prefetch: 6.15 GB/s and 6.96 GB/s versus 5.96 GB/s and 6.76 GB/s. At the same time, Intel 975X does not fare well when writing data into memory, with regular writing as well as non-temporal store: 1.81 GB/s and 4.72 GB/s versus 2.39 GB/s and 4.84 GB/s. This chipset again demonstrates its best single-threaded result for reading with software prefetch, about 81.6% of the maximum theoretical FSB bandwidth.
As for dual-core access, the gain from the second memory access thread on this chipset is less pronounced than on the Intel 965G, in both absolute and relative terms. Read bandwidth reaches 7.12 GB/s (15.8% gain) for regular reading and 8.00 GB/s (14.9% gain) for reading with software prefetch. It should also be noted that this chipset's gain from two-threaded writing is somewhat more noticeable (memory bandwidth grows from 1.81 GB/s to 2.28 GB/s). So the maximum attainable memory bandwidth when using both cores of the processor with the Intel 975X chipset is 8.00 GB/s, which corresponds to approximately 93.8% of the theoretical FSB bandwidth. Considering these results, we can assume that the new Intel chipset (Intel 965G) is better optimized for dual-core memory access than for single-core memory access.
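The relative gains quoted for the two chipsets can be lined up side by side. The sketch below uses only the single- and two-threaded figures given in the text. The two-threaded write value for Intel 965G is not quoted, so it is left out, and the 975X write gain is our own arithmetic on the quoted 1.81 and 2.28 GB/s.

```python
# (single-threaded, two-threaded) bandwidth in GB/s, as quoted in the text.
results = {
    "Intel 965G": {"Read": (5.96, 7.38), "Read PF": (6.76, 8.33)},
    "Intel 975X": {"Read": (6.15, 7.12), "Read PF": (6.96, 8.00),
                   "Write": (1.81, 2.28)},
}

def gain_percent(single_gbs, dual_gbs):
    """Relative bandwidth gain from adding the second thread, in percent."""
    return 100.0 * (dual_gbs / single_gbs - 1.0)

gains = {
    chipset: {op: round(gain_percent(one, two), 1) for op, (one, two) in ops.items()}
    for chipset, ops in results.items()
}

for chipset, ops in gains.items():
    for op, pct in ops.items():
        one, two = results[chipset][op]
        print(f"{chipset}  {op:8s} {one:.2f} -> {two:.2f} GB/s  (+{pct:.1f}%)")
```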
AMD Athlon 64 X2
We chose a fast Athlon 64 X2 4800+ (2.4 GHz) processor for our analysis of the AMD AM2 platform. As AMD processors allow the CMD rate to be adjusted directly, we took readings in 2T mode (typical for two 1 GB DDR2-800 memory modules) and in 1T mode. It must be noted that even the latest DDR2-800 memory modules from Corsair (XMS2-6400C3), which offer extremely low latencies (their timings can be set to 3-4-3-9), do not provide 100% operating stability at the reduced 1T CMD rate, although stability is sufficient to carry out our tests. That is why our 1T readings, published in this article, are of theoretical rather than practical interest. Nevertheless, they may be more relevant for smaller memory configurations, for example two 512 MB DDR2-800 modules.
Picture 3. Memory bandwidth (GB/s), AMD Athlon 64 X2 4800+, 2T CMD rate
Our readings in 2T mode are shown in Picture 3. Compared to the Intel platform analyzed above, AMD processors are characterized by higher write bandwidth (for both regular writing and non-temporal store), as our readers know well from our previous articles. Read and write bandwidth values for single-threaded access are rather close: 3.97 GB/s and 3.17 GB/s respectively. That is just 31.0% and 24.8% of the theoretical dual-channel DDR2-800 bandwidth. The optimizations, software prefetch for reading (PF distance of 1024 bytes) and non-temporal store for writing, raise these results to 7.59 GB/s (59.3%) and 6.90 GB/s (53.9%) respectively.
The most significant increase in memory bandwidth with two-threaded access is demonstrated for "regular" reading: the result grows to 6.76 GB/s, which is 70.3% higher than with single-threaded access. The increase is less noticeable for two-threaded writing: memory bandwidth grows to 4.11 GB/s, that is, by approximately 29.7%. For reading with software prefetch, the increase is just 15.9% (from 7.59 GB/s to 8.80 GB/s). The most unexpected result, however, is demonstrated by non-temporal store: the real memory bandwidth for two-threaded access is lower (by approximately 4.5%) than for single-threaded access.
Theoretically, "dual-core" memory access, unlike "single-core" access, should not be limited by the data exchange rate inside a processor core (as each core possesses dedicated L1 and L2 D-caches). In other words, in theory we can expect a nearly two-fold increase in memory bandwidth when single-threaded memory access is replaced with two-threaded access. In this respect, the last two results (the relatively low increase in memory bandwidth for reading with software prefetch and its decrease for non-temporal store) seem to stem from limitations of the DDR2 memory controller integrated into AMD AM2 processors, either at the level of requests to the controller from both cores or in the memory bus as such. Anyway, the best result, demonstrated for reading data with software prefetch, is still far from the theoretical bandwidth of dual-channel DDR2-800 memory: just 68.8%.
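The "nearly two-fold" expectation can be compared with the 2T measurements directly. The sketch below assumes that ideal two-threaded bandwidth is twice the single-threaded figure, but never more than the 12.8 GB/s the memory itself can deliver. That cap, like the helper names, is our own modelling assumption and not part of the test utility.

```python
DDR2_800_DUAL_GBS = 12.8  # theoretical dual-channel DDR2-800 bandwidth

# (single-threaded, two-threaded) bandwidth in GB/s, 2T CMD rate.
results_2t = {
    "Read": (3.97, 6.76),
    "Write": (3.17, 4.11),
    "Read PF": (7.59, 8.80),
}

def scaling_factor(single_gbs, dual_gbs):
    """How many times faster two threads are than one."""
    return dual_gbs / single_gbs

def ideal_two_thread_gbs(single_gbs, memory_peak_gbs=DDR2_800_DUAL_GBS):
    """Ideal doubling, capped by what the memory can theoretically supply."""
    return min(2.0 * single_gbs, memory_peak_gbs)

for op, (one, two) in results_2t.items():
    ideal = ideal_two_thread_gbs(one)
    print(f"{op:8s} x{scaling_factor(one, two):.2f} "
          f"(ideal {ideal:.2f} GB/s, achieved {100.0 * two / ideal:.1f}% of it)")
```

Note that for reading with software prefetch even perfect doubling (15.18 GB/s) would exceed what dual-channel DDR2-800 can supply, so a gain below two-fold is unavoidable there. The measured 8.80 GB/s still falls well short of the 12.8 GB/s cap.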
Picture 4. Memory bandwidth (GB/s), AMD Athlon 64 X2 4800+, 1T CMD rate
Let's see the effect of reducing the CMD rate to 1T (Picture 4). Qualitatively the picture is the same; only the absolute values differ (1T mode has the advantage, of course). Again, two-threaded memory access has the strongest effect on read bandwidth, which grows by 68.9%. Write bandwidth does not grow that much (28.1%). Reading with software prefetch gains approximately 17.0%. And finally, two-threaded non-temporal store decreases memory bandwidth by 4.6% compared to single-threaded non-temporal store. Still, the best result is just 9.30 GB/s, which is worse than the theoretical bandwidth of even dual-channel DDR2-667 (10.67 GB/s), to say nothing of dual-channel DDR2-800 (12.8 GB/s).
Having analyzed the efficiency of simultaneous memory access from both cores in modern dual-core processors Intel Core 2 Duo and AMD Athlon 64 X2, we found out that "dual-core" memory access indeed increases the real memory bandwidth. But we cannot say that the increase is very noticeable.
For example, the real memory bandwidth of the Intel Core 2 Duo platform with 266 MHz FSB is limited to the level of FSB bandwidth (8.53 GB/s), which is certainly lower than the theoretical bandwidth of dual-channel DDR2-800 memory (12.8 GB/s). So we should pin all our hopes for revealing the real potential of high-speed DDR2 memory on the future generation of processors/chipsets from Intel, designed for 333 MHz FSB (it will increase its bandwidth to the level of theoretical DDR2-667 bandwidth).
It seems that the AMD AM2 platform shouldn't have any limitations to reveal the real DDR2-800 potential, as the initially-detected limitations in single-core access mode had to do with the data exchange rate inside the core. In case of two cores, data could have been transferred in two threads at double speed, but for a new limitation. This time it seems like it has to do with the implementation of the shared system request interface and/or the memory bus of the integrated memory controller.
It's sad, but true: the real potential of dual-channel DDR2-800 (in terms of its high bandwidth) is still unexposed. The "two cores are better than one" approach failed this time. In some cases the results are limited by FSB bandwidth, in others by the efficiency of the memory controller. This is to say nothing of the currently unofficial DDR2-1066 and the future DDR3 memory, whose promised ratings go up to DDR3-1600. We can only repeat our old conclusion: system memory has stopped being the bottleneck, and CPU/chipset manufacturers must take this into account. Let's hope this experience will be reflected in new processors and chipsets, designed for faster FSB or equipped with better integrated memory controllers.
Dmitri Besedin (firstname.lastname@example.org)
October 25, 2006.