DDR2 memory has been around for quite a while: from the first DDR2-400 and DDR2-533 modules, which appeared approximately two years ago and had rather high latencies (both intrinsic, stemming from the properties of the memory chips, and system-level, stemming from the initially slow timings of 3-3-3 and 4-4-4 respectively), to the modern modifications that offer high performance (DDR2-667, DDR2-800, and unofficial DDR2-1000/1066) together with relatively low latencies (timings as low as 4-4-4 for DDR2-800). Nevertheless, the real potential of this memory type in dual-channel mode has yet to be revealed. (Using relatively expensive high-speed DDR2 memory in single-channel mode makes little sense, if only for economic reasons: much cheaper DDR-400 memory in dual-channel mode performs at least as well and very often better.) The reason is that DDR2 memory has been supported by, and was in fact intended for, only one platform: Intel Pentium 4/Pentium D with Intel chipsets, from the 915 series to the modern 975 series (or NVIDIA nForce 4 Intel Edition, which does not essentially change the status quo).
The main obstacle to revealing the full potential of DDR2 memory on this platform lies in its traditional bus architecture: the processor uses a Front-Side Bus (FSB) to connect to the Northbridge, whose key component is the memory controller proper, which can operate in dual-channel mode. Even though this memory controller can provide an internal memory bus bandwidth that matches the theoretical bandwidth of dual-channel DDR2 (from 6.4 GB/s for DDR2-400 to 12.8 GB/s for DDR2-800), the real data exchange rate between processor and memory is limited by the bandwidth of the FSB, which operates at either 200 MHz or 266 MHz (the latter in "extreme" CPU modifications). Its throughput is just 6.4 GB/s or 8.53 GB/s, which at best equals the theoretical bandwidth of dual-channel DDR2-533 memory.
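The arithmetic behind these figures can be sketched as follows. This is only an illustration, not part of our measurements: it assumes a 64-bit (8-byte) data path for the FSB and for each memory channel, and a quad-pumped FSB (four transfers per bus clock). The function names are ours.

```python
# Peak (theoretical) bandwidth = transfers per second * bytes per transfer.
# Assumption: 64-bit (8-byte) data path for the FSB and for each memory channel.
BUS_WIDTH_BYTES = 8


def fsb_bandwidth_gbs(bus_clock_mhz, transfers_per_clock=4):
    """Peak FSB bandwidth in GB/s (assumes a quad-pumped bus by default)."""
    return bus_clock_mhz * transfers_per_clock * BUS_WIDTH_BYTES / 1000.0


def ddr2_bandwidth_gbs(data_rate_mts, channels=2):
    """Peak DDR2 bandwidth in GB/s for a data rate given in MT/s."""
    return data_rate_mts * BUS_WIDTH_BYTES * channels / 1000.0


if __name__ == "__main__":
    # 800/3 MHz is the exact value behind the rounded "266 MHz" figure.
    for clock in (200.0, 800.0 / 3):
        print(f"FSB {clock:.0f} MHz: {fsb_bandwidth_gbs(clock):.2f} GB/s")
    # Exact data rates behind the DDR2-400/533/667/800 labels.
    for label, rate in (("DDR2-400", 400.0), ("DDR2-533", 1600.0 / 3),
                        ("DDR2-667", 2000.0 / 3), ("DDR2-800", 800.0)):
        print(f"Dual-channel {label}: {ddr2_bandwidth_gbs(rate):.2f} GB/s")
```

With these assumptions the 266 MHz FSB and dual-channel DDR2-533 come out at the same peak figure, which is why anything faster than DDR2-533 cannot be fully used over the FSB.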
Proceeding from this simple arithmetic, we keep coming to the same main conclusion: dual-channel DDR2-533 memory is still more than enough for Intel platforms even today. Of course, in practice faster memory (DDR2-667 or DDR2-800) still has an advantage on this class of platforms, but it consists solely in lower latencies for random memory access. And you must agree that constant random memory access is hardly a typical mode for most real tasks.
So DDR2-533 memory might have remained both necessary and sufficient for a long time, at least until Intel processors and chipsets with a higher FSB clock appear (by the way, the future Intel Core 2 processors (Conroe/Merom/Woodcrest cores) and the corresponding chipsets will qualify). That is, if it were not for the competing solution, which offers potentially better characteristics of data exchange with memory, at least "on paper".
As is well known, AMD Athlon 64 processors and platforms (including the dual-core Athlon 64 X2) are notable for an integrated memory controller (dual-channel, or single-channel in the budget sector) that operates at the full CPU core frequency and has a bus for direct exchange of commands and data with RAM. This bus operates at a frequency that is as close as possible to the nominal memory frequency (as close as possible rather than equal, because the bus frequency is the CPU clock divided by an integer). The role of the FSB (in the usual sense of the word) is played by the HyperTransport bus, which has nothing to do with the memory bus. It is used for data exchange with peripheral devices via the chipset (we cannot speak of a strict division of the chipset into Northbridge and Southbridge, as some Northbridge functions are performed by the processor). In theory at least, this setup makes it possible to obtain the full memory bandwidth even in dual-channel mode: data are exchanged with memory directly, without passing through bottlenecks (unless there are some in the processor core itself, which we discuss below). Our numerous reviews confirm that this holds for present-day AMD Athlon 64 processors (from the first Core Revision C to the latest Revision E), whose integrated memory controllers are designed for DDR-400 memory (and in the latest revision even faster memory, up to unofficial DDR-533): the maximum real memory bandwidth is very close to the theoretical value.
The new solution from AMD, dual-core processors under the same Athlon 64 X2 name (top models are Athlon 64 FX) but with core/memory controller Revision F and a slightly different processor socket (Socket AM2), is new only to a degree. As you might have already guessed, the integrated memory controller in these processors now supports DDR2 memory (DDR2 only, from DDR2-400 up to DDR2-800, so far). That is probably the only serious difference: cursory microarchitectural tests of this processor in RightMark Memory Analyzer did not reveal any noticeable differences from previous revisions of AMD64 processors (which cannot be said about the constantly changing revisions of Intel cores!). Well, it's high time we tested whether the new memory controller from AMD can reveal the real potential of dual-channel DDR2-800 memory. Or, to paraphrase, "Is it time for truly fast DDR2?" :)
Let's start with test results of the new platform as such (Testbed 1) and then let's compare them with the existing results of DDR2-800 memory tests on the Intel platform (Testbed 2) as well as of DDR-400 memory tests on the current generation of AMD Athlon 64 platforms (Testbed 3).
As with all previous revisions of AMD64 cores, the memory controller settings in the BIOS of the MSI K9N SLI Platinum motherboard allow specifying the maximum memory frequency (MemCLK Limit). In this case it can take the following values: DDR2-400, DDR2-533, DDR2-667, and DDR2-800, which correspond to the four DDR2 memory standards. Let us stress again that this is the maximum memory frequency rather than the actual one, since the actual frequency is obtained by dividing the CPU clock (the memory controller frequency) by an integer. Thus, the real frequency of the memory bus may be less than or equal to the specified MemCLK Limit.
Unfortunately, the real memory bus frequency cannot be read directly. This parameter is simply not available in the configuration registers of the processor's Northbridge (they store only the above-mentioned MemCLK Limit), so it always has to be estimated empirically. But the logic the processor uses to set the memory bus frequency may differ from the logic in system utilities (including our RMMA). We should also note that retargeting the integrated AMD64 memory controller to DDR2 memory in the new Core Revision F has inevitably changed many configuration register parameters. As a result, the processor itself and its memory controller settings are not detected correctly. For example, the latest RMMA 3.65 detects this processor as an AMD Athlon 64 FX-39 with a 100 MHz memory bus clock (regardless of MemCLK Limit), and it fails to detect most memory timings. We are looking forward to new AMD documentation to fix the problem. For now, let's return to our tests.
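As a rough illustration of this rule, the sketch below picks the smallest integer divider that keeps the memory clock at or below the MemCLK Limit. Both the divider selection rule and the 2000 MHz CPU clock are assumptions made for the example; the processor's actual logic may differ.

```python
import math
from fractions import Fraction

# Nominal memory bus clocks in MHz for each MemCLK Limit setting
# (exact fractions avoid floating-point rounding around 266.67 and 333.33).
MEMCLK_LIMITS = {
    "DDR2-400": Fraction(200),
    "DDR2-533": Fraction(800, 3),
    "DDR2-667": Fraction(1000, 3),
    "DDR2-800": Fraction(400),
}


def real_memory_clock_mhz(cpu_clock_mhz, memclk_limit_mhz):
    """Return (memory clock, divider) under the assumed rule:
    the smallest integer divider such that cpu_clock / divider <= limit."""
    cpu = Fraction(cpu_clock_mhz)
    divider = math.ceil(cpu / Fraction(memclk_limit_mhz))
    return cpu / divider, divider


if __name__ == "__main__":
    CPU_CLOCK_MHZ = 2000  # hypothetical CPU clock used only for illustration
    for label, limit in MEMCLK_LIMITS.items():
        clock, divider = real_memory_clock_mhz(CPU_CLOCK_MHZ, limit)
        print(f"{label}: divider {divider}, memory clock {float(clock):.2f} MHz")
```

Under these assumptions only the DDR2-533 setting ends up below its nominal clock (250 MHz instead of 266.67 MHz), because 2000 MHz is not an integer multiple of 266.67 MHz.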
The table below contains the results of testing DDR2-800 memory (with default timings in BIOS) in all four speed modes supported by the new memory controller. For each mode we give the theoretical memory bandwidth, calculated from the most probable real memory bus clock (it differs from the nominal value only for DDR2-533: 250 (500) MHz), as well as the real memory bandwidth and the latencies for pseudo-random (random within a single memory page, forward walk of pages) and random walks of a 32-MB memory block.
The results are not impressive, to put it mildly. Even in the slowest DDR2-400 mode, the maximum real memory bandwidth reaches only 5.2-5.6 GB/s (note that it is lower for reading data with hardware prefetch than for writing data with non-temporal stores). That is clearly less than the typical 6.2-6.4 GB/s demonstrated by the current generation of AMD64 platforms with DDR-400 memory.
The memory bus clock obviously does change when we go from the slowest DDR2-400 to faster modes, as the increasing memory bandwidth values show, but the gain is hardly praiseworthy. In DDR2-533 mode, memory read bandwidth starts to exceed memory write bandwidth (and stays ahead as the memory bus clock is raised further), but it still fails to reach the values typical of... DDR-400. Parity is reached approximately in DDR2-667 mode: memory read bandwidth here starts to slightly exceed DDR-400 memory bandwidth, but it reaches only about 62% of its theoretical maximum. The fastest mode fares even worse in relative terms: memory bandwidth reaches only 6.8 GB/s, which is approximately 53% of the theoretical maximum. The situation is regrettable. According to our recent tests, much better results can be obtained with the current generation of AMD64 memory controllers (Revision E) by using non-standard DDR-533 memory (for overclockers).
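For reference, the efficiency figures quoted above are simply the measured bandwidth divided by the theoretical peak. A minimal sketch, assuming two 64-bit channels at the nominal data rate (the helper names are ours):

```python
def dual_channel_peak_gbs(data_rate_mts):
    """Theoretical peak in GB/s for two 64-bit (8-byte) channels."""
    return data_rate_mts * 8 * 2 / 1000.0


def efficiency(measured_gbs, theoretical_gbs):
    """Fraction of the theoretical peak that was actually measured."""
    return measured_gbs / theoretical_gbs


if __name__ == "__main__":
    peak_800 = dual_channel_peak_gbs(800.0)  # 12.8 GB/s
    print(f"DDR2-800: {efficiency(6.8, peak_800):.0%} of {peak_800:.1f} GB/s")
```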
The only advantage of moving to the higher-speed DDR2 modes is steadily decreasing latency, most noticeable for random access (from 142 ns to 85 ns). But this fact has a prosaic explanation: the same default timings (5-5-5-12) were used in all modes, and the same number of cycles corresponds to different absolute times in DDR2-400 and DDR2-800 modes (to be exact, each delay is half as long in DDR2-800 mode).
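To make the last point concrete, the sketch below converts the 5-5-5-12 timings from bus cycles to nanoseconds. It assumes the nominal memory bus clocks of 200 MHz for DDR2-400 and 400 MHz for DDR2-800; the function name is ours.

```python
def cycles_to_ns(cycles, bus_clock_mhz):
    """Convert a delay expressed in memory bus cycles to nanoseconds."""
    return cycles * 1000.0 / bus_clock_mhz


TIMINGS = (5, 5, 5, 12)  # default timings used in all modes
MODES = {"DDR2-400": 200.0, "DDR2-800": 400.0}  # nominal bus clocks, MHz

if __name__ == "__main__":
    for name, clock in MODES.items():
        delays = ", ".join(f"{cycles_to_ns(t, clock):g} ns" for t in TIMINGS)
        print(f"{name}: {delays}")
```

A 5-cycle delay takes 25 ns at 200 MHz but only 12.5 ns at 400 MHz, so identical timing settings are twice as fast in absolute terms in DDR2-800 mode.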
And now let's compare these results with typical performance results of DDR2-800 memory on Intel platforms and DDR-400 memory on AMD platforms, taken from test results of Corsair XMS2 PC2-8500 (in DDR2-800 mode) and Corsair XMS 3500LLPRO (in DDR-400 mode).
We have already touched upon how the memory bandwidth values obtained in this review compare with the results of other platforms: even in the fastest DDR2-800 mode they are only insignificantly higher than the typical results of DDR-400 memory on the current generation of AMD platforms. Even the Intel platform, which evidently limits the real potential of DDR2-800, demonstrates much higher results; at least they correspond to nearly 100%-efficient utilization of the FSB, which acts as the bottleneck. The integrated DDR2 controller demonstrates good results only in terms of latencies. They are lower than those on the Intel platform with an external memory controller (the results are obtained with hardware prefetch enabled in both cases), so the integrated memory controller does have some advantage over the "traditional" memory design. DDR2-800 memory latencies during pseudo-random walks on the AM2 platform are also no worse than DDR-400 pseudo-random walk latencies. But in the case of random memory access, the new integrated DDR2 memory controller is actually slightly outperformed by its previous DDR counterpart.
Instead of a conclusion
What can be the reasons for such results? In our opinion, there are at least two. The first one is quite obvious: the built-in DDR2 memory controller is still "buggy" and evidently cannot sustain such high-speed modes as DDR2-800. To all appearances, it sends NOPs on the memory bus most of the time :). The second reason is less obvious, and it can explain only part of the facts mentioned above. It is the narrow L1-L2 cache bus of the processor (a bidirectional bus with an effective width of just 64 bits in each direction, versus the 256-bit L1-L2 bus in Intel Pentium 4/Pentium D processors, which also feature an inclusive cache organization that does not require "extra traffic" on the L1-L2 bus). Its peak throughput is 8 bytes/cycle, which makes 16.0 GB/s for a 2 GHz processor (much higher than the bandwidth of dual-channel DDR2-800 memory, but comparable to the bandwidth of faster dual-channel DDR2-1066 memory). But the situation in practice is quite different.
Real Bandwidth of L1/L2 Cache and RAM
So, the real L2 Cache bandwidth is just 4 bytes/cycle, that is, exactly 8.0 GB/s, which is evidently lower than the theoretical bandwidth of dual-channel DDR2-800 memory. As we have already demonstrated, the real bandwidth of memory as such is lower still. Nevertheless, even though the L1-L2 bus really is narrow, it cannot explain the extremely low memory write bandwidth with non-temporal stores (which bypass the entire hierarchy of processor caches): it is always limited to 5.6-5.7 GB/s (that is, it is outperformed even by the previous memory controller, designed for DDR-400!). We don't know yet what causes this last limitation. But we have an impression that the procedure of writing into memory through write-combining buffers, implemented in the first Athlon 64 processors and not changed since, is out of date and does not keep up with modern dual-channel DDR2. The latter possesses a serious bandwidth potential (as we can see, it is comparable to the speed of data exchange inside the processor!).
Thus, the real potential of top DDR2 models (DDR2-667 and higher) has not been revealed this time either; the epoch of high-speed DDR2 has not come yet :(. In our opinion, this time the one to blame is AMD itself, which tried to combine outdated technologies (nearly five years old) implemented in the first Athlon 64 processors (like the relatively narrow bus between the exclusive L1 and L2 caches) with the latest memory technologies, whose bandwidth is rapidly approaching the data exchange rate inside a processor. Memory is definitely ceasing to be the bottleneck of a system, and processor manufacturers should take this fact into account. Well, let's hope that manufacturers will heed our words and fix such problems in the next Revision G of AMD64 processor cores (or in an entirely new core from AMD).
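The cache bus figures follow from multiplying bytes per cycle by the core clock. A short sketch, assuming the 2 GHz clock used in the text (the function name is ours):

```python
def cache_bus_bandwidth_gbs(bytes_per_cycle, cpu_clock_ghz):
    """Bandwidth in GB/s of a bus that moves bytes_per_cycle every CPU cycle."""
    return bytes_per_cycle * cpu_clock_ghz


CPU_CLOCK_GHZ = 2.0           # clock assumed in the text
DDR2_800_DUAL_PEAK_GBS = 12.8  # theoretical dual-channel DDR2-800 bandwidth

if __name__ == "__main__":
    peak = cache_bus_bandwidth_gbs(8, CPU_CLOCK_GHZ)  # theoretical L1-L2 peak
    real = cache_bus_bandwidth_gbs(4, CPU_CLOCK_GHZ)  # measured L2 rate
    print(f"L1-L2 peak: {peak:.1f} GB/s, real: {real:.1f} GB/s, "
          f"DDR2-800 peak: {DDR2_800_DUAL_PEAK_GBS:.1f} GB/s")
```

The theoretical memory bandwidth sits between the two cache figures, above the real L2 rate and below its theoretical peak.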
Dmitri Besedin (firstname.lastname@example.org)
May 23, 2006