Not long ago we analyzed the theory behind NUMA memory operation in dual-processor AMD Opteron platforms and compared it to traditional SMP solutions. It's time to find out how efficient this memory organization is in practice, in real calculations represented by the SPEC CPU2000 tests, and to compare it with a classic SMP-like solution built on the very same AMD K8 cores. As you may have already guessed, we mean dual-core Opteron processors.
In our analysis we shall use the SPEC CPU2000 tests compiled with the new Intel C++/Fortran Compiler 9.0, whose efficiency versus the previous compiler version we have just reviewed.
We used the rate metric in the SPEC tests. Remember that it measures "the number of tasks executed per unit of time" and is thus better suited to evaluating multi-processor systems than the classic "execution speed of a single task". Since the tested platforms are based on AMD processors, all test binaries are taken from our previous review, where they had been modified to remove the checks for an Intel-manufactured processor :).
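To make the difference between the two metrics concrete, here is a minimal Python sketch. It is not SPEC's actual scoring formula (which also normalizes against reference run times); the function names and the sample timings are hypothetical and serve only as an illustration.

```python
def speed_score(elapsed_seconds: float) -> float:
    """Classic metric: runs per hour achieved by a single copy of the task."""
    return 3600.0 / elapsed_seconds


def rate_score(copies: int, elapsed_seconds: float) -> float:
    """Throughput metric: tasks completed per hour with `copies` running at once."""
    return copies * 3600.0 / elapsed_seconds


# Hypothetical timings: one copy alone takes 100 s; two concurrent copies
# both finish after 120 s because they compete for memory bandwidth.
single = speed_score(100.0)
dual = rate_score(2, 120.0)
scaling = dual / single  # how much more work the system does with two copies

print(f"single copy: {single:.1f} runs/h")
print(f"two copies:  {dual:.1f} runs/h (scaling {scaling:.2f}x)")
```

In this made-up case each individual copy runs slower, yet the system as a whole completes more work per hour, which is what a rate-style metric is meant to capture.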
Dual processor Opteron: symmetric and asymmetric NUMA, Node Interleave
We have already analyzed dual-processor Opteron systems in SPEC CPU2000 before. Admittedly, we used different system configurations and different compilers then, but we still won't repeat ourselves here: we shall not dwell on comparing 1 CPU vs. 2 CPU performance. We shall take up exactly what we set out to do, an efficiency comparison of various NUMA memory configurations: symmetric (2+2), both in Node Interleave mode and in regular mode, and asymmetric (4+0).
It didn't take us long to decide upon the standard (that is, 100%). We have chosen the following configuration: symmetric NUMA in Node Interleave mode. The reason for choosing symmetric is quite clear: most normal (not low-end) dual-processor Opteron platforms are based on this configuration. The choice of Node Interleave is also clear, considering how Windows NT systems operate on multi-processor hardware. We wrote about it in our article on NUMA. Here is the main idea: if processes are not assigned to physical processors (which unfortunately cannot be done in the SPEC rate tests), memory for data is always allocated on one of the processors, while the code is executed equally by all processors. As a result, a task sometimes accesses local memory with lower latency and sometimes remote memory with higher latency, resulting in unstable test results at the very least. Node Interleave mode (reviewed in the same article) should eliminate this nuisance. It actually evens out both the positive and the negative aspects of NUMA and turns it into a "pseudo-symmetric" memory architecture: in this case, any processor accesses any memory area with the same latency, exactly as in traditional SMP systems.
So, we've settled what the results will be compared against. Let's proceed to the analysis. First, let's evaluate the symmetric NUMA organization without Node Interleave.
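The latency argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the two latency values are hypothetical placeholders rather than measurements, and Node Interleave is modeled simply as half of all accesses landing on each node.

```python
def mean_latency(local_ns: float, remote_ns: float, local_fraction: float) -> float:
    """Average latency when `local_fraction` of accesses go to local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


LOCAL_NS = 60.0    # hypothetical local-memory latency
REMOTE_NS = 100.0  # hypothetical remote-memory latency

# Plain NUMA without process binding: a run may land anywhere between
# "all data local" and "all data remote", hence the unstable results.
best_case = mean_latency(LOCAL_NS, REMOTE_NS, 1.0)
worst_case = mean_latency(LOCAL_NS, REMOTE_NS, 0.0)

# Node Interleave: data is spread evenly, so every run sees the same average.
interleaved = mean_latency(LOCAL_NS, REMOTE_NS, 0.5)

print(best_case, worst_case, interleaved)
```

Under these assumptions interleaving gives up the best case in exchange for a repeatable result, which is the property wanted from a baseline.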
As you can see, disabling Node Interleave in integer tests (SPECint_rate2000), that is, turning the memory system into "true NUMA", may be either an advantage (2.3 to 3.0% with non-optimized code as well as with the relatively "old" -QxK and -QxW optimizations) or a disadvantage (-1.1 to -1.7% with both new optimization variants, -QxN and -QxB). The loss comes from a performance drop in just three tasks: 164.gzip, 176.gcc, and 256.bzip2. But this drop is rather noticeable (from -5.6% to -29.2%) and persistent: it shows up with different code optimizations, so it cannot be put down solely to the "ambiguities" of how operating systems and regular applications behave in a NUMA environment.
Nevertheless, in all other cases we see either performance parity or an advantage, sometimes quite a significant one (the best result is demonstrated by 181.mcf, with a gain of 14.0% to 17.2%), which cannot be written off to the operating peculiarities of applications and the OS.
The situation with real numbers is different. We can see a noticeable performance gain in practically all cases except for -QxB. The 16-17% performance drops in that variant look very strange, though, since the same places show gains in all other cases. This is probably the fault of the behavioral "ambiguities" of the operating system on NUMA platforms. The total gain in SPECfp_rate2000 (except for the last case) is from 5.5 to 6.6%. In some tasks it may be significant (for example, 26.9% in non-optimized 179.art).
Thus, from the point of view of calculation performance, symmetric NUMA as such is better than its "balanced" Node Interleave mode. Nevertheless, the performance gain is far from what it might have been had the tasks been fully optimized for NUMA and demanding of memory bandwidth. Theoretically, performance might have been 1.5 times as high (judging by memory bandwidth: 12.8 GB/s versus 8.4 GB/s). But in practice the gain does not exceed 3-5%. We repeat that it may be noticeably higher in some cases.
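The "1.5 times" estimate is simple arithmetic on the two bandwidth figures quoted above. A short Python check follows. The constant and function names are ours, and the assignment of each figure to a mode is our reading of the text.

```python
NUMA_PEAK_GBPS = 12.8        # quoted figure, read here as symmetric "true NUMA"
INTERLEAVE_PEAK_GBPS = 8.4   # quoted figure, read here as Node Interleave mode


def percent_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


ratio = NUMA_PEAK_GBPS / INTERLEAVE_PEAK_GBPS
theoretical_gain = percent_gain(NUMA_PEAK_GBPS, INTERLEAVE_PEAK_GBPS)

print(f"bandwidth ratio:  {ratio:.2f}x")             # 1.52x
print(f"theoretical gain: {theoretical_gain:.1f}%")  # 52.4%
```

Set against a ceiling of roughly 52%, a measured gain of a few percent shows how little of the extra bandwidth these tests actually exploit.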
Let's proceed to the cheaper asymmetric NUMA "4+0" configuration. Of course, we simply reconfigured our testbed accordingly rather than switching to another motherboard that allows memory modules to be installed for only one of the processors.
Here are the results: when "pseudo-homogeneous" NUMA (symmetric with Node Interleave) is turned into asymmetric NUMA, performance deteriorates noticeably in almost all cases. The result is quite stable, and the reasons are clear: the random behavior of the operating system towards NUMA does not manifest itself in this case. The entire memory space belongs to one processor, so each task accesses local and remote memory in a 50/50 ratio, just as in the Node Interleave mode of symmetric NUMA.
This comparison can actually be considered a direct evaluation of the memory bandwidth requirements of individual tasks. The most demanding tasks in this respect are (in descending order): 300.twolf, 175.vpr, 181.mcf, 176.gcc, 256.bzip2. Only one task, 252.eon, is practically independent of memory bandwidth, and 253.perlbmk shows only a slight dependence. As expected, these two tasks gain almost nothing from true NUMA (see above). But we cannot say the converse about the demanding tasks listed above: there is no obvious correspondence between the loss in asymmetric NUMA and the gain in the symmetric configuration. As for the total SPECint_rate2000 score, the symmetric NUMA configuration with Node Interleave outperforms the asymmetric one by 21-27% in integer performance.
Tests with real numbers. There are no fundamental differences from the integer tests: the majority of tasks still suffer from the asymmetric memory organization (in fact, from the reduction of its peak bandwidth). The total loss is similar to that in integer tasks, but with a narrower spread of 22-23%. Only one task, 200.sixtrack, is practically independent of memory bandwidth, and 177.mesa depends on it very little. There is no clear correlation between the gain in the symmetric configuration and the loss in the asymmetric configuration in these tests. We have noted the reason for this phenomenon many times: it's the chaotic behavior of the operating system and single-threaded tests in a NUMA environment.
Dual-processor dual-core Opteron: dual-core vs. 2x single-core and 2x dual-core
Let's proceed to the second part of our review. It is a logical continuation of the first part, as it also compares different memory organizations in multi-processor Opteron platforms. This time we take the most natural configuration as the standard: a single dual-core processor. To test it on our dual-processor dual-core platform, we assigned the test application to system processors 0 and 1 (the two cores of the first physical processor). First of all, this configuration is compared to an equally clocked dual-processor platform, for which we assigned the test to processors 0 and 2, that is, the first core of each physical processor (unfortunately, this does not mean that individual threads of the test are strictly bound to one of the processors). The second comparison is two dual-core processors versus one dual-core processor. In this case the test was run in rate mode with the "number of users" set to four (the total number of available processor cores).
So, let's proceed to the results of integer tests in the "2x Single-core vs. Dual-core" mode. This case is actually a comparison between a true SMP system on AMD K8 cores and a true NUMA system on the same cores.
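On Windows, processor assignments of this kind are commonly expressed as an affinity bitmask in which bit n stands for system processor n (this is the form taken, for instance, by the Win32 `SetProcessAffinityMask` call). The article does not say which tool was used, so the Python sketch below is only an illustration; the helper and constant names are hypothetical.

```python
from typing import Iterable


def affinity_mask(cpus: Iterable[int]) -> int:
    """Build a bitmask with bit n set for every system processor n in `cpus`."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("processor index must be non-negative")
        mask |= 1 << cpu
    return mask


# The three configurations described above.
ONE_DUAL_CORE = affinity_mask([0, 1])      # both cores of the first processor
TWO_SINGLE_CORES = affinity_mask([0, 2])   # first core of each processor
TWO_DUAL_CORES = affinity_mask(range(4))   # all four cores

for name, mask in [("1x dual-core", ONE_DUAL_CORE),
                   ("2x single-core", TWO_SINGLE_CORES),
                   ("2x dual-core", TWO_DUAL_CORES)]:
    print(f"{name:15s} mask = 0x{mask:X} ({mask:04b})")
```

A process-wide mask like this restricts which processors the scheduler may use, but it does not pin any individual thread to one processor, which matches the caveat made in the text.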
The measurement results were expectedly chaotic. It's absolutely impossible to detect any logic in them. For example, 300.twolf, which was quite demanding of memory bandwidth in the previous tests, gets practically no benefit from full NUMA, while 256.bzip2 suddenly gains up to 84.6% with some optimizations versus the SMP-like memory organization. We can comment only on the total average score: the dual-processor configuration outperforms the equally clocked dual-core configuration in integer performance by 6-14%.
The tests with real numbers demonstrate no less of a spread, except for the expected lack of gain in 177.mesa and 200.sixtrack, which agrees with the data above. In all other cases we see a significant performance gain, which sometimes reaches fantastic values: for example, 124.7% in 178.galgel with the -QxW optimization (while the expected theoretical limit is just 100%, considering that memory bandwidth doubles when we switch from SMP to NUMA). The overall performance gain in SPECfp_rate2000 is from 16 to 40%. This speaks in favor of true dual-processor Opteron systems (and NUMA in particular) versus dual-core (SMP-like) solutions.
And in conclusion we offer a comparison of a dual-processor system on dual-core processors and a single dual-core processor.
As expected, the second processor provides a performance gain. It's hard to say what the maximum gain can be. But it's quite clear that it may exceed 100%, considering that besides adding a second computing device we potentially double the memory bandwidth (which would not be the case if we compared a dual-core processor with a hypothetical quad-core processor). That's what we witness in practice, though not in all cases. The leaders here are 164.gzip and 256.bzip2. The majority of the other tasks show the standard 100% gain, except for 176.gcc and 181.mcf, where it's even lower, about 70-80%. Because the deviations in these tasks compensate each other, the total score gain amounts to nearly the same 100% (except for -QxW, which offers a slightly better result and demonstrates a 112.8% gain).
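The way individual deviations cancel out in the total follows from how SPEC composite scores are built: they are geometric means of per-task results, so the ratio of two composite scores equals the geometric mean of the per-task ratios. The Python sketch below illustrates this with made-up speedups, not with the measured data.

```python
import math
from typing import Iterable


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values."""
    items = list(values)
    if not items or any(v <= 0 for v in items):
        raise ValueError("need at least one value, all strictly positive")
    return math.exp(sum(math.log(v) for v in items) / len(items))


# Hypothetical per-task speedups (score with 2x dual-core divided by the
# score with 1x dual-core): one leader, two "standard" tasks, one laggard.
speedups = [2.5, 2.0, 2.0, 1.6]

overall = geometric_mean(speedups)
overall_gain = (overall - 1.0) * 100.0

print(f"overall speedup: {overall:.2f}x, gain: {overall_gain:.0f}%")
```

With these invented numbers a 150% leader and a 60% laggard offset each other, and the composite lands at about a 100% gain.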
The "2x Dual-core vs. Dual-core" situation in floating point tests shows a wider spread. The majority of tasks gain less than 100% of performance here, except for 301.apsi. Nevertheless, even here we can see fantastic surges with the same -QxW optimization option (for example, 178.galgel gains 253%!). The average performance gain in these tasks is about 80% (the spread is from 66 to 114%).
We can draw only three clear conclusions about performance in "more or less real" applications, as opposed to purely synthetic tests like RMMA. All of them are qualitative rather than quantitative:
A direct consequence of the last conclusion is that it's impossible to evaluate, on a strictly quantitative level, the advantage of NUMA on the whole and of dual-processor Opteron systems versus dual-core configurations in particular. To be more exact, we tried to evaluate it (see the preliminary results in the article). But considering the current state of affairs, we may just as well get quite a different picture if we repeat the tests. It will resemble the previous one only in general outline, on the qualitative level.
Dmitri Besedin (email@example.com)
September 15, 2005.