SPEC CPU2000. Part 12.2. AMD Opteron Architecture. AMD64 Technology & Compilers.

In this article we continue our examination of the AMD Opteron processors and of the AMD64 architecture (formerly x86-64). Note that the conclusions will also partly apply to the Athlon64, which is to be launched this fall.

Detailed information on the AMD64 is available at http://www.x86-64.org. Here are the key features:

full compatibility with existing 32bit code;
64bit registers and operands supported;
the number of general-purpose and SSE registers doubled;
flat (linear) memory addressing;
the 4GB limit on addressable memory removed.

The key element of AMD64 software support is the operating system. Certainly, you can install a 32bit OS on the Opteron, such as Microsoft Windows XP or Red Hat Linux (as well as MS-DOS, Windows 98 and other Linux distributions :), but in this case the 64bit extensions of the CPU won't be used. 64bit applications need a 64bit OS. While Microsoft is still developing its product, 64bit Linux versions are already finished. One of them is SLES (SuSE Linux Enterprise Server) for the AMD64.

Certainly, an operating system is of little interest without applications, and over time a great many of them have accumulated for all kinds of user tasks.

Applications can be ported to the AMD64 in different ways. The simplest one is to recompile the application with a 64bit compiler (if it comes with source code). However, it's difficult to reach high efficiency this way: despite the standards of high-level programming languages, the code may need fine tuning for the new architecture. In particular, porting can include checking for pointers stored in 32bit variables, bit masks and binary offsets.
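The kinds of problems such checks look for can be shown with a small model. The sketch below is in Python purely for illustration: it imitates fixed-width C variables by masking, and the address value and helper names are hypothetical, not taken from any of the benchmarks.

```python
# Model of two classic 64bit porting bugs, using masking to imitate
# fixed-width C variables. All names and values here are hypothetical.

MASK32 = 0xFFFFFFFF            # what fits in a 32bit variable
MASK64 = 0xFFFFFFFFFFFFFFFF    # what fits in a 64bit variable


def store_in_32bit_variable(value):
    """Imitate assigning a value to a 32bit unsigned variable: upper bits are lost."""
    return value & MASK32


def align_down(address, alignment, mask_width):
    """Round an address down to a power-of-two alignment.

    mask_width imitates the width of the type in which the bit mask
    ~(alignment - 1) was computed.
    """
    mask = ~(alignment - 1) & mask_width
    return address & mask


# A made-up address above the 4GB mark, as a 64bit process may receive.
address = 0x7F123456789A

print(hex(store_in_32bit_variable(address)))   # pointer kept in a 32bit variable
print(hex(align_down(address, 16, MASK32)))    # bit mask built in a 32bit type
print(hex(align_down(address, 16, MASK64)))    # bit mask built in a 64bit type
```

For addresses below the 4GB mark all three results stay valid, which is why such code works in a 32bit build and breaks only after recompilation for 64 bits.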

Many modern algorithms that would be more conveniently expressed with 64bit variables are deliberately written for the 32bit environment because it is more widespread. In such cases the code should be rewritten for higher efficiency.
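As a sketch of what writing for the 32bit environment means in practice, the Python model below adds two 64bit values using only 32bit pieces and compares the result with a single 64bit operation. The function names are hypothetical; in real 32bit code this split is done by the compiler or by the programmer in C or assembly.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_via_32bit_halves(a, b):
    """Add two unsigned 64bit values using only 32bit pieces and a carry."""
    low = (a & MASK32) + (b & MASK32)
    carry = low >> 32                      # 1 if the low halves overflowed
    high = ((a >> 32) + (b >> 32) + carry) & MASK32
    return (high << 32) | (low & MASK32)


def add64_native(a, b):
    """The same addition as one 64bit operation (wrapping on overflow)."""
    return (a + b) & MASK64


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
print(hex(add64_via_32bit_halves(a, b)))
print(hex(add64_native(a, b)))
```

Both functions return the same value, but the first needs several operations where the second needs one. That difference is what a rewrite for 64bit variables removes.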

That is why the next important factor for the AMD64, after the OS, is compilers. They determine how well existing applications can be ported to the new architecture and how effectively they can use its potential.

This review covers the existing compilers for the AMD64 and the effect of the transition to the new architecture on the SPEC CPU2000 tasks. This information will later help us test the Opteron in real applications, in particular in the 64bit mode.

Testbed

AMD Opteron 240 (1.4 GHz)
Rioworks HDAMA (AMD 8000 chipset)
4 x 512 MB PC2700 Registered (Transcend)

In the tests we used two OSs from the same supplier - SuSE SLES 8.1 i386 and SLES 8 AMD64.

Compilers tested:

Intel C/C++/Fortran 7.1
GNU gcc 3.3.1
Portland Group Compiler Technology PGI Workstation 5.0-1

Intel's compiler is the latest version. gcc was first tested in version 3.3, but then we also ran v3.3.1, released on August 8. As to PGI, version 5.0 had broken libraries, v5.0-1 had problems with its scripts, and the following v5.0-2 couldn't compile one of the SPEC CPU2000 benchmarks. Their site now offers newer files of an unknown version. That is why we decided to fix the scripts and use v5.0-1. Note that v5.0-2 differs from it only slightly in the tests.

Another important factor is the optimization options for the compilers. Intel's solution is very transparent, while gcc and PGI require more investigation. To choose their options, both were run under SLES 8 AMD64 with a single pass of the SPEC CPU2000 benchmarks.

gcc was tested 12 times with different options. In the end we selected "-O3 -funroll-all-loops +FDO" (FDO stands for two-pass compilation with profile feedback). In general, all the variants were close in speed, but the greatest (negative) effect came from dropping "-funroll-all-loops". 252.eon (C++) was built with "-ffast-math", without which it failed the result accuracy check. The "-m32" option was used to compile 32bit code under the 64bit OS. Unfortunately, it can be difficult to choose options for gcc because the compiler uses different defaults depending on the architecture. That is why options found in the 64bit environment may be less effective in the 32bit one. But using identical optimizations in all configurations (apart from those the compiler applies depending on the architecture) serves our purpose of comparing the same compiler in different environments.

The x86-64 version of the PGI compiler includes 32bit and 64bit parts. We used only the 32bit part under SLES 8.1 i386 and both parts under SLES 8 AMD64. Five sets of options were tested in all (for each of the two parts), and "-fastsse -Mipa=fast" was singled out (it also includes two-pass compilation with profiling). With these options all tests compiled, and no failures were noticed.

Note that we work only with the base metrics of SPEC CPU2000. This implies using the same compiler and the same optimization options (4 at most) for every test written in a given programming language. Testing is approached from the standpoint of an average user who doesn't have much time to search for the best options and who tries to find something better than the defaults, guided by the documentation and the test results of his own program.

Code performance in different conditions

Let's look at each compiler under the different OSs and modes, starting with Intel's solution.

In all tests except 172.mgrid the code runs slower under the AMD64 OS than under the i386 one. Since the compiler was the same in both cases, the speed drop in 164.gzip (10%), 176.gcc (15%) and 253.perlbmk (11%) can probably be explained by the different libraries installed in the system. The other tests had much closer scores. In the CFP2000 test suite the speed drop didn't exceed 3%.

For gcc the tests were carried out in three combinations:

32bit OS + gcc
64bit OS + gcc with the -m32 option
64bit OS + gcc

In the second case the -m32 option makes the compiler generate 32bit code despite the 64bit environment.

On the one hand, two subtests, 186.crafty and 252.eon, gained almost 50%. On the other hand, in some tests the scores fall considerably: 181.mcf (-25%) and 197.parser (-15%). While the former is known as the most memory-sensitive test in CINT2000, the latter hasn't stood out in any way so far. We suppose that types such as 'long' and pointers become 64 bits wide in the 64bit environment, which makes the data structures larger and slows their processing. 186.crafty is the only test that gains from the 64bit OS and compiler because its original code uses exactly 64bit variables and was only adapted for the 32bit environment. It is also the only test whose executable code got shorter with the transition to SLES AMD64 (by 18%). In the other CINT2000 subtests the code grew by 3-15% (and by 48% in 181.mcf). So, testing gcc in the CINT2000 shows that applications have to be purposely developed, redesigned or adapted to gain from the transition to the 64bit environment. Otherwise the performance can even fall. However, we can compile such programs into 32bit code to keep performance at the same level.
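The supposition about 'long' and pointers can be put into numbers. The sketch below assumes the usual data models (ILP32 for 32bit Linux, LP64 for 64bit Linux), natural alignment of every field and a 64-byte cache line. The node layout is hypothetical and is not taken from 181.mcf or 197.parser.

```python
# Sizes in bytes of basic C types under the two data models (assumed:
# ILP32 for 32bit Linux, LP64 for 64bit Linux). Each type is aligned to
# its own size.
ILP32 = {"char": 1, "short": 2, "int": 4, "long": 4, "pointer": 4}
LP64 = {"char": 1, "short": 2, "int": 4, "long": 8, "pointer": 8}


def round_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def struct_size(field_types, model):
    """Size of a C struct with the given fields, including padding."""
    offset = 0
    largest_alignment = 1
    for field_type in field_types:
        size = model[field_type]
        offset = round_up(offset, size) + size   # pad, then place the field
        largest_alignment = max(largest_alignment, size)
    return round_up(offset, largest_alignment)   # tail padding


# Hypothetical linked-structure node: two links, a 'long' and an 'int'.
NODE_FIELDS = ["pointer", "pointer", "long", "int"]
CACHE_LINE = 64  # bytes, an assumption for this illustration

for name, model in (("ILP32", ILP32), ("LP64", LP64)):
    size = struct_size(NODE_FIELDS, model)
    print(name, size, "bytes per node,", CACHE_LINE // size, "nodes per cache line")
```

Under these assumptions the node doubles in size, so the same caches and the same memory bandwidth serve half as many nodes. That is consistent with the largest drop appearing in the most memory-sensitive test.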

Results are not shown for the four tests written in Fortran 90, since gcc does not include a compiler for it. As in the CINT2000, the speed of the 32bit code differs between the i386 and AMD64 versions of SLES. The scores fall only in 171.swim and 177.mesa (by 9% and 7% respectively). But since the 64bit build also scores low in 171.swim, the performance drop is most probably caused by the kernel of the 64bit OS. Floating-point tasks gain nothing from the 64bit width as such, because the width of floating-point variables doesn't change, so gains as dramatic as in 186.crafty are not to be expected. However, the doubled number of registers and the SSE2 instructions can bring some gain (the current version of gcc can use the additional instructions but doesn't support vectorization, which reduces the effect). Nevertheless, the scores grow markedly in 8 tests out of 10. In particular, in 179.art the growth reaches 94%! It implies that gcc makes effective use of the advantages of the AMD64 architecture in these tasks. On the other hand, Intel's 32bit compiler also scores about 1100 in this test.

The compiler from Portland Group has versions for i386 (x86) and x86-64. The former also works in the 64bit environment, so we get 3 combinations again.

In the CINT2000 the compiler behaves much like gcc: there's almost no difference in how the 32bit code runs under the different OSs, while with the 64bit code the scores grow by 91% in 186.crafty and fall in the other subtests. But the 64bit code looks better than that of gcc, as the average growth comes to 14% against 7% for gcc.

In the CFP2000 the 32bit code has similar scores in the 32bit and 64bit OSs. The 64bit code brings the scores up in all the tests except 179.art. The largest gains are recorded in 172.mgrid (51%), 177.mesa (28%), 187.facerec (24%) and 178.galgel (19%).

Compilers comparison

Now let's compare performance of the compilers in the i386 and AMD64 operating systems on the AMD Opteron platform.

On the 32bit platform Intel's compiler takes the lead in the CINT2000. It loses only in 176.gcc. The non-commercial gcc has grown stronger since our last test and almost catches up with its competitor in some tests. Portland Group's product can't boast of its speed. Its scores are close to the other contestants only in 4 subtests. 252.eon has a very low score, which badly affects the overall result.
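The effect of a single weak subtest on the overall result follows from how SPEC CPU2000 forms its composite score: it is the geometric mean of the per-benchmark ratios. The sketch below uses made-up ratios (not our measurements) for a 12-test suite such as CINT2000.

```python
import math


def geometric_mean(ratios):
    """Geometric mean, computed through logarithms to avoid huge products."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical ratios: one compiler is even across the suite, the other
# matches it everywhere except a single subtest where it is twice as slow.
even = [1000.0] * 12
one_weak = [1000.0] * 11 + [500.0]

print(round(geometric_mean(even)))
print(round(geometric_mean(one_weak)))
```

Halving one ratio out of twelve lowers the composite by about 5.6%, so one very poor subtest can account for a visible share of the gap between two compilers.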

As expected, Intel's compiler is ahead in the CFP2000. gcc is either very close to the leader or far behind. Here PGI grows stronger and runs close on the heels of Intel's solution in 9 tests. The gap shrinks from 34% in CINT2000 to 14% in CFP2000.

Now, for the next two diagrams remember that Intel's compiler doesn't use the AMD64 architecture. But it successfully competes against the 64bit products thanks to its high efficiency.

The high scores of gcc in 186.crafty and 252.eon let it outscore Intel's solution by 0.25% in the overall integer result. In 3 out of 12 subtests it easily beats the competitor, and in 2 more they are almost equal. PGI falls into last place again, though it gains 11% compared to its 32bit version.

In the CFP2000 this is the only configuration where the 64bit compiler from PGI catches up with the leader. In 9 subtests they have similar scores. gcc has a noticeable drop in 171.swim, though it wipes the floor with its rivals in 179.art. By the way, it also goes ahead in two other subtests, 183.equake and 188.ammp. Unfortunately, the lack of Fortran 90 support makes it weaker than the others, but such scores are not that bad either.

Conclusion

All the compilers tested proved able to work well on a wide range of computational problems and under various OSs (especially considering that there are over 1400 source files in SPEC CPU2000). We expected it from Intel, but the other solutions were new to us. (Frankly speaking, I was quite surprised to see that gcc compiled at the first attempt :) ). So, let's sum it up:

64bit code is already usable, and the AMD64 architecture is properly supported by the gcc and PGI compilers;
Intel's compiler, though made by AMD's rival, works perfectly on the AMD Opteron and shows the best scores in a wide range of tasks compared to the other compilers;
with the transition to the 64bit architecture the scores grow on specially developed, prepared or ported programs, but this is due to eliminating the downsides of the 32bit code rather than to the 64bit variables as such;
in some other tasks the scores are also higher, but this effect comes from other features of the AMD64, not from its 64bit width;
in certain tasks the performance can even fall with the transition to the AMD64, but this can be avoided with 32bit compilers;
gcc wins among the 64bit compilers in CINT2000-type tasks, and PGI looks preferable in the CFP2000.

The AMD64 architecture has taken one more step toward the market (after the launch of 64bit OSs): it now has compilers with AMD64 support. But it's not a revolutionary breakthrough, at least in the sphere of computational tasks. I hope that when the AMD64 architecture gets popular enough (certainly, together with the AMD Opteron and Athlon64 processors), the number of applications that can benefit from it will be much greater. Anyway, you can already write, compile and test :)

Kirill Kochetkov (kochet@ixbt.com)
