SPEC CPU2000. Part 23. Four-processor Dual-core AMD Opteron 875

In previous reviews we have examined various multi-processor and multi-core AMD Opteron configurations several times (the results are available in these articles: SPEC CPU2000. Part 12.1. AMD Opteron: Integrated Memory Controller and Dual-Processor Configurations and SPEC CPU2000. Part 21. AMD Opteron: Dual-processor and "Dual-processor - Dual-core" Systems). In this article we return to these platforms, represented this time by the STSS QDD4400 server, which is based on four dual-core AMD Opteron 875 processors. An important feature of this platform is its more complex 4-node NUMA configuration, the theory of which was reviewed in a separate article. Here we evaluate the efficiency of the 8-core AMD Opteron platform in real computational tasks, namely those of the SPEC CPU2000 suite.

Testbed configuration and software

Platform: STSS QDD4400
Processors: 4x AMD Dual Core Opteron 875, 2.2 GHz, E1 core revision
Motherboard: NEWISYS 3U4P, BIOS V2.34.5.1, 11/08/2005
Chipset: AMD 8111 & 8131 PCI-X Tunnel
Memory: 16x Corsair 1GB DDR-333 ECC Registered
Video: Trident Blade3D
HDD: Seagate ST336807LC, Ultra320 SCSI, 10000 rpm, 36 GB
OS: Windows Server 2003 SP1

We use the new official Version 1.3 of the SPEC CPU2000 tests, compiled with the Intel C++ and Fortran Compilers 9.0 (both Version 9.0.024). Since this platform is based on AMD processors, we used our ICC Patcher to remove the Intel processor checks from all libraries of the Intel C++/Fortran compilers.

We used the rate metric in the SPEC tests. Recall that it measures the number of tasks executed per unit of time and is therefore better suited to evaluating multi-processor systems than the classic metric, the execution speed of a single task.
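To make the difference between the two kinds of metric concrete, here is a minimal Python sketch of a throughput-style measure. It is a simplified illustration, not SPEC's official rate formula (which also involves per-benchmark reference times); the function names and the timings are hypothetical.

```python
def throughput(copies: int, elapsed_seconds: float) -> float:
    """Completed copies per second when `copies` instances run concurrently."""
    if copies < 1 or elapsed_seconds <= 0:
        raise ValueError("copies must be >= 1 and elapsed_seconds must be > 0")
    return copies / elapsed_seconds


def gain(copies: int, elapsed_seconds: float, single_copy_seconds: float) -> float:
    """Throughput of an N-copy run relative to a single-copy run."""
    return throughput(copies, elapsed_seconds) / throughput(1, single_copy_seconds)


# Hypothetical timings (not measured data): one copy takes 100 s,
# two concurrent copies both finish after 104 s.
print(f"gain with two copies: {gain(2, 104.0, 100.0):.2f}x")
```

A gain close to the number of copies means that the extra copies barely slow each other down; the gains quoted in this article can be read in this sense.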

As our test time was limited, we carried out only a minimal set of tests: running 1, 2, 4, and 8 instances of the test tasks (startup options: --rate 1, --rate 2, --rate 4, and --rate 8, respectively). In all cases the NUMA settings included SRAT mode, which is enabled on this platform by default. SRAT (System Resource Affinity Table) is a table of that name in the ACPI data that allows a NUMA-aware OS to correctly associate processors with their memory areas, which is very useful on NUMA systems. Windows Server 2003 SP1, installed on this server, is one such operating system.

Note that SRAT support in Windows Server 2003 SP1 is clearly visible in the CPU load shown in Windows Task Manager. Ordinary systems of the Windows NT family distribute processor time evenly among all processors. For example, if a system contains eight processors (physical or logical, it makes no difference) and four active applications are running, each processor will be loaded at 50%. We found experimentally that Windows Server 2003 with SRAT support does not distribute the load this way. For example, at one moment CPUs 1, 2, 5, and 7 are fully loaded (100%); about 5 seconds later it is CPUs 3, 4, 6, and 8. After that the load may be redistributed again. Averaged over a sufficiently long time span, all processors still receive a balanced load, but it is achieved by a different method. Because each thread or application is not spread across all processors but runs on one of them for a long period, it has a better chance of accessing local memory, that is, memory in the address space of that processor's own controller. Of course, a worst case is just as possible: an application is assigned to one processor but works with data in the address space of another processor's memory controller (or even with "double remote" data). We can only hope that an OS with SRAT support knows how to assign applications to processors with memory access in mind.
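The time-averaging argument can be checked with a few lines of Python. This is only an illustrative sketch: the CPU numbers follow the example in the text, while the schedule structure, the exact 5-second slices, and the function name are our assumptions, not something read from the OS.

```python
from collections import defaultdict

# Hypothetical schedule: (duration in seconds, set of CPUs busy at 100%).
# Four applications occupy four of the eight CPUs in each time slice.
schedule = [
    (5.0, {1, 2, 5, 7}),
    (5.0, {3, 4, 6, 8}),
]


def average_load(schedule, cpu_count=8):
    """Return the time-averaged load (0..100 %) of each CPU."""
    busy = defaultdict(float)  # seconds each CPU spent fully loaded
    total = sum(duration for duration, _ in schedule)
    for duration, cpus in schedule:
        for cpu in cpus:
            busy[cpu] += duration
    return {cpu: 100.0 * busy[cpu] / total for cpu in range(1, cpu_count + 1)}


loads = average_load(schedule)
for cpu, load in loads.items():
    print(f"CPU {cpu}: {load:.0f}%")
```

Over the whole interval every CPU averages the same load as under even distribution, but within each slice an application stays on a single CPU.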

Test results

Let's proceed to the SPEC CPU2000 results for the 8-core AMD Opteron 875 platform. A single instance of each task (--rate 1 startup option) serves as the logical reference point. Note that SPEC CPU2000 1.3 presented a surprise (at least with the Intel compilers we use): 255.vortex refused to run correctly (its result did not match the reference result) with any code optimization option. That is why this task is not included in the graphs below.

Two tasks

Let's analyze the first case: running two concurrent tasks, starting with the SPECint2000 integer tests. As mentioned above, at any given moment the two tasks may run on two separate physical processors or on both cores of the same processor; the assignment changes over time, so that all available execution units are eventually covered.

Most tasks behave as expected and demonstrate a twofold gain (or very close to it) versus a single instance. The exceptions are 181.mcf (the largest spread depending on code optimization, from 1.50 to 2.06 times), 254.gap and 300.twolf (a narrower spread, from ~1.9 to 2.10 times). 175.vpr also shows an outlier (-QxW option, 1.87-fold gain). According to SPECint_base2000, the average gain is from 1.93 to 2.01 times. For unknown reasons, the SPEC tests failed to produce an overall score for the -QxP optimization option (SSE3 instructions, although they make no difference for integer code), even though all integer tests started and completed successfully (except for 255.vortex, which produced correct results with none of the options).

While a twofold gain was the rule rather than the exception in the integer tests, the floating-point tests demonstrate the opposite: a noticeable spread in almost all tests. 177.mesa (1.95 to 2.00 times) and 200.sixtrack (1.98 to 2.00 times) are the most stable. 187.facerec (1.88 to 1.97 times) and 301.apsi (1.89 to 2.02 times) are somewhat less stable. In other cases we see outliers in both directions for a couple of code optimization options (for example 171.swim, 172.mgrid, 173.applu, 178.galgel, 179.art, 191.fma3d). There are also tasks with erratic results, for example 183.equake (from 1.43 to 1.97 times) and 189.lucas (from 1.64 to 1.97 times). Across all tasks and optimizations, the minimum gain is just 1.43 times and the maximum is 2.42 times. Evidently the latter, like every gain above twofold, reflects a relatively unlucky run of the baseline configuration (one instance) rather than a real advantage of two running instances. Alas, unstable results on multi-processor NUMA platforms apparently cannot be ruled out even with a SRAT-aware OS.

Surprisingly, the average SPECfp_base2000 result shows a very small spread: the gain falls within 1.89 to 1.96 times. Let's see whether the same holds for four and eight concurrent instances of the tasks.
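One likely reason for the small spread in the average is that SPEC's overall metrics are geometric means of the per-benchmark results, so the gain in the overall score equals the geometric mean of the per-task gains, and a single outlier moves it only slightly. The sketch below illustrates this in Python; the list of per-task gains is hypothetical (chosen to resemble the range quoted above), not our measured data.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-task gains for one optimization option (NOT measured data):
# mostly near 2.0, with one low and one high outlier.
task_gains = [1.43, 1.95, 1.97, 1.98, 2.00, 2.00, 2.42]

overall = geometric_mean(task_gains)
print(f"min {min(task_gains):.2f}, max {max(task_gains):.2f}, "
      f"geometric mean {overall:.2f}")
```

With a dozen or more tasks in the suite, outliers in opposite directions also partly cancel each other, which keeps the overall figure in a narrow band.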

Four tasks

Let's proceed to the next case — running four concurrent instances of SPEC CPU2000 test tasks.

Integer tests: the picture described above holds with four concurrent instances as well. Most tests still get a performance gain close to the number of instances (in this case, close to 4.00). As before, the exceptions are 181.mcf (a gain from 3.05 to 3.74 times) and 175.vpr (a gain reduced to 3.72 times with the -QxW option). 197.parser, 254.gap, 256.bzip2, and 300.twolf also show somewhat reduced gains. The average gain across all tasks (SPECint_base2000) is from 3.85 to 3.94 times, which looks good.

As usual, the floating-point tests spoil the excellent picture. As with two instances, there is a noticeable spread in the results of nearly all tasks, except for 168.wupwise, 177.mesa, and 200.sixtrack. The minimum gain is demonstrated by 171.swim with the -QxW option (2.78 times). Curiously, the maximum gain is demonstrated by the same task in its -QxN variant (3.99 times). This time no gain exceeds the theoretical maximum (4.00 times). Once again the average gain across all SPECfp_base2000 floating-point tests varies little: the values fall within 3.46 to 3.54 times.

Eight tasks

And finally, let's analyze the last case: running eight concurrent instances of the tasks, which equals the number of processor cores in the system under review. Despite this exact match between the number of tasks and the number of cores, the expected gain for tasks that depend on memory access cannot be exactly 8 times, because there are half as many memory controllers as cores. Any two tasks running on the two cores of one processor are inevitably limited by the throughput of the shared memory controller, which results in an SMP-like scenario. We have already written about the advantages of a true NUMA configuration over classic SMP and SMP-like dual-core configurations.
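This reasoning can be turned into a rough estimate. The Python sketch below is a toy model of our own, not a measurement: it assumes that the copies are spread evenly over the cores, that all accesses go to local memory, and that copies sharing a memory controller split its bandwidth equally. The `memory_fraction` parameter (the share of single-copy run time limited by memory bandwidth) is a hypothetical input.

```python
def expected_gain(memory_fraction: float, cores: int = 8, controllers: int = 4) -> float:
    """Toy estimate of the throughput gain of `cores` copies versus one copy.

    memory_fraction: share of single-copy run time limited by memory
    bandwidth (0.0 = pure compute, 1.0 = fully bandwidth-bound).
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be within [0, 1]")
    sharers = max(1, cores // controllers)  # copies per memory controller
    # The compute part is unchanged; the memory-bound part is stretched
    # by the number of copies competing for one controller.
    slowdown = (1.0 - memory_fraction) + memory_fraction * sharers
    return cores / slowdown


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"memory-bound share {fraction:.2f}: about {expected_gain(fraction):.2f}x")
```

Under these assumptions the gain lies between 4 and 8 times. The model ignores remote-node accesses and cache effects, so real results may well fall outside this range.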

Nevertheless, let's see what happens in the real conditions of our experiment. Note that the results below do not cover all possible code optimizations; we could not test the other options in this case because our time was strictly limited.

Integer tests again demonstrate stable and well reproducible results in the majority of tasks. In some cases (300.twolf) we even see a gain of more than 8 times, which reflects a relatively poor single-instance baseline result rather than a real gain. As before, the worst result is demonstrated by 181.mcf: just 4.82 to 5.56 times. It is logical to assume that this task has the highest memory bandwidth requirements, while the other integer tasks are characterized by high data locality. Even with this poor result included, the average SPECint_base2000 gain is about 7.6 times.

As usual, the floating-point tasks show a noticeably smaller performance gain. 168.wupwise, 177.mesa, 200.sixtrack, and 301.apsi are relatively insensitive to the number of running instances; we can reasonably assume that these tasks, like the majority of SPEC CPU2000 integer tests, have high data locality. By contrast, the tasks most sensitive to memory bandwidth include 179.art (a gain of just 2.52(!) to 3.82 times) and 171.swim (4.47 to 4.68 times). The other tasks are only moderately sensitive to memory bandwidth. The average gain across all SPECfp_base2000 tasks is about 6.0 to 6.3 times.

Conclusion

In conclusion, let's evaluate the "scalability" of the SPEC CPU2000 tasks, that is, the average performance gain as a function of the number of processor cores. (The word is in quotes because running several concurrent instances of a task does not make the task itself parallel, so scalability in the strict sense is out of the question.) Per-core efficiency, the gain divided by the number of instances, is given in parentheses.

Value	Two instances	Four instances	Eight instances
SPECint_base2000	1.97 (98.5%)	3.88 (97.0%)	7.55 (94.4%)
SPECfp_base2000	1.92 (96.0%)	3.50 (87.5%)	6.21 (77.6%)
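
The percentages in the table follow directly from the gains: efficiency is the achieved gain divided by the ideal gain, which equals the number of instances. A short Python sketch (the variable and function names are ours) reproduces them:

```python
# Average gains from the table above: {metric: {instances: gain}}.
gains = {
    "SPECint_base2000": {2: 1.97, 4: 3.88, 8: 7.55},
    "SPECfp_base2000": {2: 1.92, 4: 3.50, 8: 6.21},
}


def efficiency_percent(gain: float, instances: int) -> float:
    """Per-core efficiency: achieved gain as a percentage of the ideal gain."""
    return 100.0 * gain / instances


for metric, row in gains.items():
    cells = ", ".join(
        f"{n} instances: {g:.2f} ({efficiency_percent(g, n):.1f}%)"
        for n, g in row.items()
    )
    print(f"{metric}: {cells}")
```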

We can easily see that the SPEC CPU2000 tasks, both integer and floating-point, show practically 100% "scalability" with two processor cores. The high "scalability" of the SPECint2000 integer tasks persists with 4 and 8 cores, while the per-core efficiency of the floating-point tasks drops noticeably as the number of concurrent instances grows. As we have already noted, this is due to the higher memory bandwidth requirements of the SPECfp2000 tasks compared with the integer SPEC tests.

Dmitri Besedin (dmitri_b@ixbt.com)
January 27, 2006.
