In previous reviews we have examined multi-processor and multi-core AMD Opteron configurations several times (the results are available in the following articles: SPEC CPU2000. Part 12.1. AMD Opteron: Integrated Memory Controller and Dual-Processor Configurations and SPEC CPU2000. Part 21. AMD Opteron: Dual-processor and "Dual-processor - Dual-core" Systems). In this article we return to these platforms, represented this time by the STSS QDD4400 server, based on four dual-core AMD Opteron 875 processors. An important feature of this platform is its more complex 4-node NUMA configuration, the theory of which was reviewed in a separate article. Now we shall evaluate the efficiency of the 8-core AMD Opteron platform in real computational tasks, namely the tasks of the SPEC CPU2000 suite.
Testbed configuration and software
We use the new official Version 1.3 of the SPEC CPU2000 tests, compiled with Intel C++/Fortran Compiler 9.0 (both compilers are Version 9.0.024). Since this platform is based on AMD processors, we used our ICC Patcher to remove the Intel processor checks from all libraries of the Intel C++/Fortran compilers.
We used the rate metric in the SPEC tests. Recall that it measures the number of tasks executed per unit of time and is therefore better suited to evaluating multi-processor systems than the classic metric, the execution speed of a single task.
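As a rough illustration of the difference between the two metrics, the sketch below computes a simplified throughput figure from the number of concurrent copies and the elapsed time. It is not the official SPEC rate formula, which also involves per-benchmark reference times and normalization; the function name and the sample timings are hypothetical.

```python
def throughput(copies: int, elapsed_seconds: float) -> float:
    """Simplified rate: completed task copies per hour of wall-clock time."""
    if copies < 1 or elapsed_seconds <= 0:
        raise ValueError("copies must be >= 1 and elapsed_seconds must be > 0")
    return copies * 3600.0 / elapsed_seconds


# Hypothetical timings (not measured): one copy finishes in 100 s,
# eight concurrent copies all finish in 110 s.
single = throughput(1, 100.0)
eight = throughput(8, 110.0)
print(f"1 copy: {single:.1f}/h, 8 copies: {eight:.1f}/h, "
      f"gain: {eight / single:.2f}x")
```

With a single-task speed metric the second run would simply look about 10% slower; the throughput view shows how much more work the system completes per unit of time.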
As our test time was limited, we carried out only the minimum set of tests: running 1, 2, 4 and 8 instances of the test tasks (startup options: --rate 1, --rate 2, --rate 4, and --rate 8, respectively). In all cases the NUMA settings included SRAT mode, which is enabled on this platform by default. SRAT (System Resource Affinity Table) means that a table of the same name is created in the ACPI data, which allows the OS to correctly associate processors with their memory areas, a very useful thing for NUMA systems. This works only with a NUMA-aware OS; Windows Server 2003 SP1, installed on this server, is one such operating system.
Note that SRAT support in Windows Server 2003 SP1 is clearly visible in the CPU load shown in Windows Task Manager. As is known, ordinary Windows NT-family systems distribute processor time equally between all processors. For example, if the system contains eight processors (physical or logical, it does not matter) and four active applications are running, each processor will be loaded at 50%. We found experimentally that Windows Server 2003 with SRAT support does not distribute the load evenly at any given moment. For example, CPUs 1, 2, 5, and 7 are fully loaded (100%) at one moment; about 5 seconds later it is CPUs 3, 4, 6, and 8. After that the load may be redistributed again. As a result, averaged over a relatively long time span, all processors still get a balanced load, but it is obtained by a different method. Since each thread/application is not spread across all processors but runs on one of them for a long period of time, on a NUMA system it has a better chance of accessing local memory, that is, memory in the address space of that processor's own controller. Of course, the worst case is just as possible: an application is assigned to one processor but works with data in the address space of another processor's memory controller (or even with "double remote" data). But let's hope that an OS with SRAT support knows how to distribute applications across processors with memory access in mind.
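The averaging argument can be checked with a small sketch. It replays the load pattern described above (four tasks occupying one group of four CPUs for about 5 seconds, then the other group) and computes the time-averaged load per CPU. The strict alternation between exactly two groups and the number of intervals are simplifying assumptions; the real scheduler may redistribute the tasks differently.

```python
from collections import defaultdict

NUM_CPUS = 8


def average_load(schedule):
    """schedule: list of (duration_seconds, set_of_fully_busy_cpus).

    Returns the time-averaged load of each CPU in percent.
    """
    total = sum(duration for duration, _ in schedule)
    if total <= 0:
        raise ValueError("schedule must cover a positive time span")
    busy = defaultdict(float)
    for duration, cpus in schedule:
        for cpu in cpus:
            busy[cpu] += duration
    return {cpu: 100.0 * busy[cpu] / total for cpu in range(1, NUM_CPUS + 1)}


# Assumed pattern: CPUs 1, 2, 5, 7 busy for 5 s, then CPUs 3, 4, 6, 8,
# repeated six times (one minute in total).
schedule = [(5.0, {1, 2, 5, 7}), (5.0, {3, 4, 6, 8})] * 6
loads = average_load(schedule)
for cpu, load in loads.items():
    print(f"CPU {cpu}: {load:.0f}%")
```

Over the whole minute every CPU averages the same load as under even distribution, although at any instant four CPUs are fully loaded and four are idle.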
OK, let's proceed to the SPEC CPU2000 test results of the 8-core AMD Opteron 875 platform. A single instance of each task (--rate 1 startup option) is the logical reference point. Note that SPEC CPU2000 1.3 presented an interesting surprise (at least with the Intel compilers we use): 255.vortex refused to run correctly (its result did not match the reference result) with all code optimization options. That is why this task is not included in the graphs below.
Let's analyze the first case: running two concurrent tasks, starting with the SPECint2000 integer tests. As mentioned above, at any given moment the two tasks can run either on two independent physical processors or on both cores of the same processor; this changes over time, so that all available execution units are eventually covered.
Strange as it may seem, the average SPECfp_base2000 result shows a very small spread in values: the gain falls within 1.89-1.96 times. Let's see whether the same holds for four and eight concurrent instances of the tasks.
Let's proceed to the next case — running four concurrent instances of SPEC CPU2000 test tasks.
And finally, let's analyze the last case: running eight concurrent instances of the tasks, which equals the number of processor cores in the system under review. Despite the exact match between the number of tasks and processor cores, the expected performance gain for anything that depends on memory access cannot be exactly 8 times, as there are half as many memory controllers as cores. Thus, any two tasks running on the two cores of the same processor will inevitably be limited by the throughput of the shared memory controller, resulting in an SMP-like scenario. We have already written about the advantages of a true NUMA configuration over classic SMP and SMP-like dual-core configurations.
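The reasoning above can be illustrated with a deliberately crude toy model. It is our own simplification, not a measured characteristic of the platform: a task spends a fraction `mem_fraction` of its single-instance run time limited by memory bandwidth, tasks are spread evenly over the memory controllers, all memory accesses are local, and the memory-bound part slows down in proportion to the number of tasks sharing a controller. All names are hypothetical.

```python
def toy_gain(tasks: int, controllers: int, mem_fraction: float) -> float:
    """Aggregate throughput gain over a single instance in the toy model."""
    if tasks < 1 or controllers < 1:
        raise ValueError("tasks and controllers must be >= 1")
    if not 0.0 <= mem_fraction <= 1.0:
        raise ValueError("mem_fraction must be within [0, 1]")
    # Assumption: tasks are spread evenly, so each controller serves
    # tasks / controllers of them (never fewer than one).
    sharing = max(1.0, tasks / controllers)
    # Only the memory-bound part of the run time stretches when shared.
    slowdown = (1.0 - mem_fraction) + mem_fraction * sharing
    return tasks / slowdown


# Eight tasks on four memory controllers, as in the system under review.
for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"memory-bound fraction {fraction:.2f}: "
          f"gain {toy_gain(8, 4, fraction):.2f}x")
```

In this model the gain for eight instances ranges from 8 times for a purely compute-bound task down to 4 times for a purely bandwidth-bound one, which is the SMP-like limit described in the text.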
Nevertheless, let's see what happens in the real conditions of our experiment. Note that the results below do not cover all possible code optimization options; we could not test the others for this case because our time was strictly limited.
In conclusion, let's evaluate the "scalability" of SPEC CPU2000 tasks: the average performance gain depending on the number of processors. The word is in quotes because scalability in the strict sense is out of the question here: running several concurrent instances of a task does not make the task itself parallel. Gains per single processor (the efficiency of a given processor) are given in brackets.
We can easily see that the SPEC CPU2000 tasks, both integer and real, show practically 100% "scalability" on two processor cores. The high "scalability" of the SPECint2000 integer tasks also holds for 4 and 8 processor cores, while the efficiency of a single processor core for real tasks drops noticeably as the number of concurrent instances grows. As we have already noted, this is due to the higher memory bandwidth requirements of the SPECfp2000 tasks compared to the integer SPEC tests.
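For reference, the gain and the per-core efficiency discussed above are related in a simple way, sketched below. The function name is hypothetical; the only input figures used are the two-instance gains of 1.89 and 1.96 quoted earlier for SPECfp_base2000, expressed relative to a single-instance rate normalized to 1.0.

```python
def gain_and_efficiency(rate_1: float, rate_n: float, n: int):
    """Return (gain over one instance, efficiency per core as a fraction)."""
    if rate_1 <= 0 or n < 1:
        raise ValueError("rate_1 must be > 0 and n must be >= 1")
    gain = rate_n / rate_1
    return gain, gain / n


# Two-instance gains quoted in the text; the single-instance rate is
# normalized to 1.0, so rate_n equals the gain itself.
for quoted_gain in (1.89, 1.96):
    gain, efficiency = gain_and_efficiency(1.0, quoted_gain, 2)
    print(f"{gain:.2f} ({efficiency:.1%})")
```

An efficiency of 100% corresponds to a gain equal to the number of concurrent instances.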
Dmitri Besedin (email@example.com)
January 27, 2006.