A significant microarchitectural difference between modern dual-core processors from Intel, based on the improved P6+ microarchitecture and the new Intel Core microarchitecture (Intel Core Duo/Core Solo and Intel Core 2), and competing solutions (AMD Athlon 64 X2 dual-core processors) is the Shared L2 Cache. In Intel processors it is dynamically distributed between the cores depending on how much cache space each of them needs. Dual-core processors from AMD, by contrast, have an individual L2 Cache of a fixed size for each core.
We can assume that the shared L2 Cache architecture may sometimes be less advantageous than the traditional architecture with an L2 Cache dedicated to each core (given the same total size, for example 2 MB shared versus 1+1 MB dedicated), because the data bus and the cache access logic are shared as well. If this is true, the most efficient way to expose the drawback is to put maximum load on the L2 Cache from both cores. That is much easier to do with a special test application than with real tasks, whose L2 Cache requirements are not known directly.
To check this assumption and to compare how efficiently two cores access Dedicated (AMD) and Shared (Intel) L2 Cache at the same time, we did exactly that. We used the recently developed RightMark Multi-Threaded Memory Test utility, which is included in the latest official version (3.70) of RightMark Memory Analyzer.
Here is the idea: we create two threads, each "tied" to its own core by default (so that the operating system does not migrate the threads from core to core). Each thread allocates its own memory block of a specified size and can perform the following operations: reading, writing, reading with software prefetch (the user can vary the prefetch distance), and non-temporal store. The data size to be read or written is specified by the user separately for each thread. The program can start and stop each thread at any time, as well as start and stop both threads simultaneously. Test results are output on the fly as instant (averaged over one second) and average (averaged over the entire test duration) bandwidth in MB/s. Depending on the selected data size, the application can analyze shared (or exclusive, in case of a single thread) access to CPU cache as well as to system memory. The first two access modes (reading and writing) are the ones that make sense for analyzing the L2 Cache of a processor, while the last two (software prefetch and non-temporal store) are mainly useful for analyzing memory characteristics.
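The structure of such a test can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not the code of RightMark Multi-Threaded Memory Test: the names pin_to_core, worker and run_test are hypothetical, pinning relies on os.sched_setaffinity (available on Linux only, so the sketch runs unpinned elsewhere), and only the overall average is reported, not the per-second figure. Because of the interpreter lock in CPython the two threads do not really stress the cache in parallel, so the numbers it prints say nothing about real cache bandwidth.

```python
import os
import threading
import time

def pin_to_core(core):
    """Try to pin the calling thread to one core; return True on success."""
    # Assumption: Linux-style affinity call. On other systems we skip pinning.
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core})  # 0 means "the calling thread"
        return True
    except OSError:
        return False  # e.g. the requested core does not exist

def worker(core, block_size, mode, passes, results):
    """Read or write a private block `passes` times and record the bandwidth."""
    pin_to_core(core)
    block = bytearray(block_size)   # each thread owns its own memory block
    pattern = bytes(block_size)
    start = time.perf_counter()
    for _ in range(passes):
        if mode == "read":
            block.count(1)          # scans every byte of the block
        else:
            block[:] = pattern      # overwrites every byte of the block
    elapsed = max(time.perf_counter() - start, 1e-9)
    moved = block_size * passes
    results[core] = {"bytes": moved, "mb_per_s": moved / elapsed / 2**20}

def run_test(jobs, passes=20):
    """jobs is a list of (core, block_size, mode); all threads start together."""
    results = {}
    threads = [
        threading.Thread(target=worker, args=(core, size, mode, passes, results))
        for core, size, mode in jobs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    # Two threads, 512 KB each: one reads, the other writes.
    for core, r in sorted(run_test([(0, 512 * 1024, "read"),
                                    (1, 512 * 1024, "write")]).items()):
        print(f"core {core}: {r['mb_per_s']:.0f} MB/s")
```

Passing a single job to run_test gives the single-thread reference; passing two jobs gives the shared-access case.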
Dedicated L2 Cache, AMD Athlon 64 X2 3800+
We developed and debugged RightMark Multi-Threaded Memory Test on an AMD Athlon 64 X2 3800+ processor (Testbed 1) with 512 KB of dedicated L2 Cache in each core, so let's start the analysis with the L2 Cache of this processor.
Table 1. AMD Athlon 64 X2, Data in L2 Cache
Test results for a 512 KB block (the full L2 Cache size) are published in Table 1. We took the straightforward approach and ran the tests in five modes: reading in a single thread, writing in a single thread, reading in two threads, writing in two threads, and finally, reading and writing simultaneously in two threads. Table 1 lacks data on the total bandwidth of a shared data interface, because there is none in this case: each L2 Cache has its own independent interface.
Note that absolute bandwidth values (GB/s) are not relevant to our analysis, because they depend on the particular processor sample (first of all, on its clock frequency). We are interested in the results of shared data access relative to the corresponding single-thread access. You can easily see that these relative results equal 100% in all three cases of shared access for this processor (AMD Athlon 64 X2) and this data size (512 KB). There is nothing surprising about it: it indicates that the L2 Caches of the two cores are fully independent.
Now let's see how the shared data interface of this processor behaves, that is the interface of the integrated memory controller shared by both cores, together with the memory interface (Dual Channel DDR-400). In this article we publish only preliminary information on shared access of both cores to system memory. The next article will be devoted to a thorough analysis of this problem.
Table 2. AMD Athlon 64 X2, Data in System Memory
In this case (see Table 2), the total number of tests grows to seven because more "reference" single-thread tests are needed: the first two were carried out with a 32 MB data block, the other two with a 64 MB block. The reason is that the total size of processed data doubles when both cores access the data bus at once. To get relative values for each core (a slowdown factor, so to speak), individual absolute values should be compared with the results of single-thread access in the same conditions (32 MB data block). Relative evaluation of the total interface bandwidth requires comparing it with single-thread access to a doubled data volume (64 MB data block). In this case, however, the results of single-thread reading/writing of 32 MB and 64 MB blocks are identical.
As you can see in Table 2, simultaneous reading of data from different memory areas reduces the per-core bandwidth to 76.2% of the single-thread value. Nevertheless, the total bandwidth of the memory bus during simultaneous access from both cores reaches 152.4% of the single-thread value.
Simultaneous writing is accompanied by a much more noticeable slowdown in each core: the corresponding relative values go down to 51.1% of single-thread writing. At the same time, the total writing bandwidth is again no worse, and even a tad higher, than in single-thread access (102.2%).
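The relation between the per-core and the total percentages can be checked with a short calculation. The sketch below is our own illustration (the function name total_relative is hypothetical); it relies on the observation made above that the single-thread results for 32 MB and 64 MB blocks are identical, so both cores and the total share the same reference value and the total relative bandwidth is simply the sum of the per-core ones.

```python
def total_relative(per_core_percents):
    """Total relative bandwidth, in percent of single-thread access.

    Valid only when every core and the total are compared with the same
    single-thread reference, as is the case for the memory tests above.
    """
    return sum(per_core_percents)

# Per-core values quoted in the text for AMD Athlon 64 X2.
reading = total_relative([76.2, 76.2])
writing = total_relative([51.1, 51.1])
print(f"reading: {reading:.1f}%, writing: {writing:.1f}%")
# prints: reading: 152.4%, writing: 102.2%
```

A total above 100% means that two threads together move more data per second than one thread alone, even though each of them is slower.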
The most interesting case is simultaneous reading by Core 1 and writing by Core 2, because the resulting memory bandwidths for reading and writing converge to practically the same level (~1.55 GB/s). As single-threaded reading is characterized by a noticeably higher memory bandwidth (3.28 GB/s) than single-threaded writing (2.23 GB/s), the slowdown is much more pronounced for the first core, which reads data from memory (down to 47.6%), than for the second core, which writes data into memory (down to 69.1%). The total bandwidth in this mode is 3.1 GB/s, which is 112.5% of an "average" read/write operation (3.28 / 2 + 2.23 / 2 = 2.755 GB/s). That is, the shared memory bus is still used more efficiently in this mode than in single-threaded access.
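The arithmetic behind the 112.5% figure can be reproduced as follows. The bandwidth values are the ones quoted in the paragraph above; the helper name mixed_reference is ours and serves only this illustration.

```python
def mixed_reference(read_single, write_single):
    """Bandwidth of an "average" read/write operation, in GB/s."""
    return read_single / 2 + write_single / 2

READ_SINGLE = 3.28   # GB/s, single-threaded reading
WRITE_SINGLE = 2.23  # GB/s, single-threaded writing
TOTAL_MIXED = 3.1    # GB/s, Core 1 reading + Core 2 writing

reference = mixed_reference(READ_SINGLE, WRITE_SINGLE)
efficiency = 100 * TOTAL_MIXED / reference
print(f"{reference:.3f} GB/s reference, {efficiency:.1f}% efficiency")
# prints: 2.755 GB/s reference, 112.5% efficiency
```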
Shared L2 Cache, Intel Core 2 Extreme X6800
The second processor under review is the Intel Core 2 Extreme X6800 (Testbed 2) with a relatively large shared 4 MB L2 D-Cache. We'll try to evaluate the efficiency of simultaneous data access from both cores in three cases: when the L2 Cache is sufficient (the total data size does not exceed 4 MB), when it is partially overflowed (total data size of 5 MB, with a different data size for each core), and when data are accessed in system memory (data size of 32 MB).
Table 3. Intel Core 2 Extreme, Data in L2 Cache
Testing the shared interface of the shared L2 Cache in Intel Core 2 Extreme cores is similar to testing the shared interface of the integrated memory controller in AMD Athlon 64 X2, reviewed above.
Table 3 contains the results obtained when the L2 Cache is used to the maximum without exceeding its limits (the total data size is 4 MB, 2 MB for each core). Simultaneous reading from L2 Cache reduces the bandwidth available to each core, but the reduction is not very large (down to 82.9% of the level that corresponds to exclusive access), especially considering how high the absolute L2 Cache bandwidth remains even in these conditions (about 19 GB/s per core). At the same time, the total L2 Cache bandwidth in this mode grows considerably, to 173.8% of single-core access to a 4 MB data block.
The reduction in L2 Cache efficiency is more prominent for simultaneous writing: in this case, bandwidth values go down to 58% on average relative to writing 2 MB of data by a single core. Nevertheless, the total bandwidth for writing is again 1.26 times as high as for writing the same volume of data into L2 Cache by a single core.
The "hardest" mode of accessing the shared L2 Cache is simultaneous reading and writing. As you can see in Table 3, the shared L2 Cache in the Intel Core 2 Extreme processor copes with this task quite efficiently: its bandwidth for reading goes down only to 86.2% of the nominal value, and its bandwidth for writing to 63.9% (that is, it stays even higher than for simultaneous writing in two threads). The total bandwidth of the L2 Cache interface relative to a single average read/write operation is 161.4%.
Table 4. Intel Core 2 Extreme, Data in System Memory
And now let's analyze shared access solely to system memory through the memory controller in Northbridge of the chipset with a relatively large data block (32-64 MB).
Simultaneous access of two cores to system memory through the shared FSB noticeably reduces the per-core bandwidth, to about 55% of single-core access (see Table 4). Note that in this case memory bandwidth is again distributed evenly between the cores, and its total is 110% of the nominal value obtained with single-core access. That is, the overall gain is very small.
Interestingly, writing data into memory demonstrates better results: in this case, per-core bandwidth goes down to about 63-65%, and the total bandwidth of the interface reaches 129% of the value for writing by a single core.
Simultaneous reading and writing show a different picture from the one we saw for the AMD Athlon 64 X2 processor. Namely, the bandwidths for reading and writing do not even out here, though the reduction in bandwidth for reading is still more pronounced (to 40.9%) than the reduction in bandwidth for writing (to 90.2% of the nominal value).
And finally, let's analyze the most interesting Shared L2 Cache mode of Intel Core 2 processors, which can be called "competition" of the cores for this resource when the cache is partially overflowed. The tests in this mode should answer two questions: how much is the bandwidth reduced during shared access of two cores to L2 Cache when it is partially "depleted", and how efficiently is the Shared L2 Cache distributed between the cores depending on their real needs? Answering both questions takes more than a single test in which both threads load the cache equally (for example, when each thread requires 2.5 MB of L2 Cache), as we did before. It takes a series of tests in which the total data size stays the same (for example, 5 MB), while the share of this volume given to one of the cores gradually grows.
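Such a series of test points is easy to generate. The sketch below is an illustration only: the function name split_series is hypothetical, sizes are expressed in KB, and the 256 KB (0.25 MB) step is our assumption, since the step used in the actual tests is not stated here.

```python
def split_series(total_kb, step_kb):
    """Return (core0_kb, core1_kb) pairs whose sum is always total_kb.

    The share of the first core grows by step_kb from one test to the next.
    """
    return [(a, total_kb - a) for a in range(step_kb, total_kb, step_kb)]

# 5 MB in total, first core's share growing in assumed 0.25 MB steps.
series = split_series(5 * 1024, 256)
for core0_kb, core1_kb in series:
    print(f"Core0: {core0_kb / 1024:.2f} MB, Core1: {core1_kb / 1024:.2f} MB")
```

Each pair is then one run of the two-thread test, with the first value used as the block size of the first thread and the second value as the block size of the second thread.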
Picture 1. Intel Core 2 Extreme, Reading Bandwidth, 5 MB data block
The results of this experiment for simultaneous reading are published in Picture 1. The reading bandwidth curve of the second core (Core1 Read) is almost a mirror image of the reading bandwidth curve of the first core (Core0 Read), while the total bandwidth curve (Full Read) shows the sum of the bandwidths of both cores. Thus, the efficiency of distributing L2 Cache between CPU cores and of caching data can be easily tracked by any of these curves, for example by the Core0 Read curve. It stays at the typical L2 Cache level of this processor while the first core works with 1.0-1.25 MB of data. The bandwidth drops when the data volume reaches 1.5 MB. It then stays at the minimal level up to 3.5 MB of data, the point where the second core is left with 1.5 MB. As the data volume for the first core grows further, its bandwidth rises to typical RAM bandwidth values.
Picture 2. Intel Core 2 Extreme, Writing Bandwidth, 5 MB data block
Practically the same curves are obtained when both cores simultaneously write 5 MB of data, the only difference being lower absolute bandwidth values (Picture 2).
Picture 3. Intel Core 2 Extreme, Reading+Writing Bandwidth, 5 MB data block
And finally, the situation in which the first core reads data while the second core writes data is no exception (Picture 3). The only peculiarity is that the curves are asymmetric along the vertical axis, which has to do with the differences between reading and writing bandwidths of L2 Cache and RAM.
Thus, when two cores compete for the Shared 4 MB L2 D-Cache in Intel Core 2 processors, for any access type (reading only, writing only, or simultaneous reading and writing), the cache can efficiently hold only data whose size does not exceed 1.25 MB, that is approximately a quarter of the L2 Cache (!). In these conditions (when one of the cores uses this cache volume, while the other, larger part of the cache is used by the second core) we can see maximum efficiency of using the L2 Cache data bus by one of the cores as well as of using the memory data bus by the second core. As a result, maximum total bandwidth is demonstrated in this area. But when data that do not fit into L2 Cache are distributed equally between the cores, per-core bandwidth turns out to be even lower than RAM bandwidth in single-thread access. Thus, the efficiency of caching data in this "zone of conflict" is minimal; we can say that data are not cached at all.
This problem pertains solely to the Shared L2 Cache architecture, or at least to its present implementation in Intel Core 2 processors. The ideal case might look like Picture 4: two threads access 5 MB of data either when 2 MB of L2 Cache is allocated to each core, or when a 4 MB Shared L2 Cache is distributed more efficiently.
Picture 4. Ideal processor with 2+2 MB L2 Cache, Reading Bandwidth, 5 MB data block
In this ideal case "the problem zone" is minimized: in fact, its size is 1 MB, which is exactly the amount of data that does not fit into the L2 Cache of the processor. Minimal L2 Cache efficiency is demonstrated here only when both cores operate with data whose size exceeds 2 MB (either half of the shared cache, or the size of a dedicated cache). But when the first core accesses a 2 MB data block while the second core accesses a 3 MB data block, the data exchange rate of the first core will equal L2 Cache bandwidth, while the second core will go beyond the cache limits and will deservedly get an exchange rate similar to system memory bandwidth. At the same time, you can see in Picture 1 that the current implementation of Shared L2 Cache in Intel Core 2 processors demonstrates quite a different situation: both the first and the second core reach similar mediocre data exchange rates, which are even lower than system memory bandwidth.
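The ideal behaviour described above can be written down as a toy model. This is a sketch under simplifying assumptions, not a model of the real cache: a core is assumed to run at full L2 speed whenever its block fits into its share of the cache and at RAM speed otherwise. The function names, the sizes in KB and the 256 KB step are ours.

```python
def ideal_rate(data_kb, per_core_cache_kb=2048):
    """Ideal model: L2 speed if the block fits into the core's cache share."""
    return "L2" if data_kb <= per_core_cache_kb else "RAM"

def problem_zone(total_kb, per_core_cache_kb=2048, step_kb=256):
    """Block sizes of the first core for which neither core's data is cached."""
    return [
        a for a in range(step_kb, total_kb, step_kb)
        if ideal_rate(a, per_core_cache_kb) == "RAM"
        and ideal_rate(total_kb - a, per_core_cache_kb) == "RAM"
    ]

# Ideal 2+2 MB case with 5 MB of data in total.
print(problem_zone(5 * 1024))
# prints: [2304, 2560, 2816]
```

With 2 MB per core the problem zone lies strictly between 2 MB and 3 MB for the first core, that is, it is 1 MB wide. If the same function is called with an effective per-core limit of 1280 KB (the 1.25 MB observed above for the real processor), it returns sizes from 1536 KB to 3584 KB, which matches the 1.5 to 3.5 MB range where the Core0 Read curve in Picture 1 stays at its minimum.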
Our analysis revealed complete independence of isolated L2 Caches in AMD Athlon 64 X2 cores. This processor shows no reductions in L2 Cache bandwidth in case of shared data access by both cores.
When both cores simultaneously access an L2 Cache of sufficient size, the Shared L2 Cache in Intel Core 2 Extreme processors reduces its per-core bandwidth to 57-83% of the initial value, depending on the access type (the largest reductions are demonstrated for writing, the smallest ones for reading). Though such a reduction may seem significant, absolute L2 Cache bandwidth values of this processor in these conditions remain at a high level of 10-19 GB/s. That is, running two real single-thread applications simultaneously (whose data fit into the L2 Cache of the processor) may result in some performance drop, but only if these applications are very critical to L2 Cache bandwidth (like our synthetic test).
The situation is much worse when processor cores have to compete for the Shared L2 Cache, that is when the total size of data processed by both threads (or by two single-thread applications) exceeds the size of the Shared L2 Cache. The data exchange rate of a core depends heavily on the data volume accessed by this core. When this volume is relatively small and does not exceed 1/4 of the total L2 Cache size (1.0-1.25 MB in our experiment), the data exchange rate of the core remains quite high and is comparable to L2 Cache bandwidth for single-thread access. Such an application just "doesn't see" other applications that potentially compete for L2 Cache. The data exchange rate drops (to the level of system memory bandwidth and lower) when the cache requirements of a thread/application grow, that is as the volume of processed data increases. In our conditions, this happens with a per-thread data block of 1.5 MB and higher. The following situation is quite possible here: an unsuspecting application that uses only half of the Shared L2 Cache (2 MB) may lose much of its data exchange rate just because of another application (even one that is not critical to memory bandwidth) that operates with a larger data block (3 MB). This application-aggressor will not only be executed inefficiently by Intel Core 2 processors with Shared L2 Cache (as its data do not fit into its part of L2 Cache), but it will also significantly reduce the efficiency of the first application, even though the L2 Cache size seems more than sufficient for it.
Thus, in our opinion, the system of distributing Shared L2 Cache in Intel Core 2 Duo / Core 2 Extreme processors is not very efficient in "the conflict zone", where the L2 Cache size requirements of the two cores are more or less identical. In fact, this inefficiency shows up as a relatively wide "conflict zone", which undeservedly covers about 2 MB, that is half of the L2 Cache. Let's hope that the next implementations of Shared L2 Cache in Intel processors will distribute L2 Cache between processor cores more efficiently, depending on their requirements.
Dmitri Besedin (firstname.lastname@example.org)
October 23, 2006.