SPEC CPU2000. Part 5

By Kirill Kochetkov

In the fifth part of our SPEC CPU2000 review we will examine how the test results depend on the compilers used to build the executables. We will also measure the effect of automatic (compiler-generated) optimization for the SIMD instructions of modern processors.

The SPEC CPU2000 tests are supplied as source code written in the high-level languages C, C++ and Fortran. Working with the test involves the following stages:

writing configuration files
compiling the tests
running the tests

The first stage is the most important. The configuration files contain the rules for compiling the tests. Here are the basic ones:

the compiler to use
the compiler's optimization switches
portability options that make the tests compile on a given platform

Here we examine the base metrics, which impose stricter compilation rules than the peak metrics, for example:

only one compiler for each programming language
no more than 4 optimization options
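As a rough illustration of the second restriction, the C sketch below counts the switches in an optimization string. It rests on a simplifying assumption of ours: every whitespace-separated token counts as one option. The SPEC run rules define what counts as an option more carefully, and the function names here are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* Counts whitespace-separated tokens in an option string.
   Simplifying assumption: each token is one optimization switch. */
int count_switches(const char *opts)
{
    int count = 0;
    bool in_token = false;
    for (const char *p = opts; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            in_token = false;        /* a separator ends the current token */
        } else if (!in_token) {
            in_token = true;         /* first character of a new token */
            count++;
        }
    }
    return count;
}

/* Hypothetical check against the base-metric limit of 4 options. */
bool within_base_limit(const char *opts)
{
    return count_switches(opts) <= 4;
}
```

Under this simplification, the Intel line "-Qipo -QxW -O3 -Qprof_use" from our SSE2 configuration counts as exactly four switches.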

Remember that several methods are used to prevent falsification of results: checksums of the configuration and executable files, signed results, and correctness checks (comparison of the output data with reference data). These requirements must be met to obtain official results. During debugging, testing, tuning and searching for optimal settings, however, you are free to break all of them :)
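Comparing output with reference data cannot be an exact text match for floating-point results, because different compilers can legitimately produce slightly different last digits, so a tolerance is needed. A minimal C sketch of such a check, assuming hypothetical abstol/reltol parameters; it illustrates the idea and is not SPEC's actual comparison tool:

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `value` matches `reference` within an
   absolute OR a relative tolerance (illustrative values only). */
bool within_tolerance(double value, double reference,
                      double abstol, double reltol)
{
    double diff = fabs(value - reference);
    if (diff <= abstol)
        return true;                        /* tiny absolute difference */
    return diff <= reltol * fabs(reference); /* scale-aware difference */
}

/* Compares n output values with n reference values.
   Returns the index of the first mismatch, or -1 if all values match. */
long first_mismatch(const double *out, const double *ref, size_t n,
                    double abstol, double reltol)
{
    for (size_t i = 0; i < n; i++) {
        if (!within_tolerance(out[i], ref[i], abstol, reltol))
            return (long)i;
    }
    return -1;
}
```

The relative tolerance matters for large values, while the absolute one handles results that should be (close to) zero.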

Today we will compare the quality of different compilers. "Quality" can cover a development environment, ease of use, integrated help and compilation speed, but in this review the main criterion is the speed of the generated code in the SPEC CPU2000 tests.

The compilers will be compared on several platforms based on modern processors. The Intel Pentium 4 was tested on an i850-based mainboard, and the AMD Athlon and Athlon XP on a VIA KT333-based board. We will also show results for an Intel Pentium III installed on an i815-based board. Note that the choice of platform has little effect on the compiler characteristics we examine.

The most frequently used x86 compilers in the configurations at http://www.spec.org/ are, of course, Intel's products. Firstly, they use every possible optimization for the Pentium III and Pentium 4 processors (including MMX/SSE/SSE2); secondly, trial versions are available at the manufacturer's site. AMD also uses them to obtain results for its latest processors.

We will use only the Win32 versions of the compilers, though many companies also offer versions for other operating systems, including Linux.

The SPEC CPU2000 package includes a configuration file for the Microsoft Visual C++ 6.0 SP5 and Compaq Visual Fortran 6.0 compilers, which serves as a simple example of how to work with the test. We used it, together with the files published at http://www.spec.org/, to develop our own configuration files.

The configuration files for all compilers follow a single principle: first come the general settings, then the compilation flags, and then the portability options (though this order is not fixed). For each compiler we also created variants optimized for different processors and SIMD instruction sets, where possible.

The optimization options fall into two classes: general high-level code optimization and the use of special SIMD instructions of a particular processor. The former works on almost all processors and is always desirable, while the latter is processor-specific; that is why we tested each processor with every instruction set it supports.

As an example, let's take a configuration file for Compaq Visual Fortran 6.5 optimized for the Pentium III:

tune= base
output_format= asc
ext=ixbt.cf65p3
check_md5=1

default=default=default=default:
FC=f90
ONESTEP=YES

default=base=default=default:
FOPTIMIZE=-fast -optimize:5 -tune:pn3 -architecture:pn3

178.galgel=default=default=default:
EXTRA_FFLAGS=-fixed
LDOPT=-Fe$@ -link -stack:300000000

The versions for the Pentium 4 and the AMD K7 will have different "ext" and "FOPTIMIZE" lines.
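Each section of these files starts with a header such as 178.galgel=default=default=default:. Assuming the SPEC CPU2000 field order benchmark=tuning=extension=machine, where "default" matches anything and "int"/"fp" select a whole suite, such a header can be split into its four fields as in the C sketch below (a hypothetical helper of ours, not part of the SPEC tools):

```c
#include <stdbool.h>
#include <string.h>

#define FIELD_MAX 64

/* Hypothetical representation of a section header such as
   "178.galgel=default=default=default:". */
struct section_header {
    char benchmark[FIELD_MAX];
    char tuning[FIELD_MAX];
    char extension[FIELD_MAX];
    char machine[FIELD_MAX];
};

/* Splits a header into its four '='-separated fields.
   Returns false if the line is not a well-formed header. */
bool parse_section_header(const char *line, struct section_header *h)
{
    char *fields[4] = { h->benchmark, h->tuning, h->extension, h->machine };
    size_t len = strlen(line);

    if (len == 0 || line[len - 1] != ':')
        return false;                       /* headers end with ':' */

    const char *start = line;
    const char *colon = line + len - 1;
    for (int i = 0; i < 4; i++) {
        /* Fields 0..2 end at the next '=', field 3 at the trailing ':'. */
        const char *end = (i < 3) ? strchr(start, '=') : colon;
        if (end == NULL || end <= start || end - start >= FIELD_MAX)
            return false;                   /* missing, empty or too long */
        if (i == 3 && memchr(start, '=', (size_t)(end - start)) != NULL)
            return false;                   /* more than four fields */
        memcpy(fields[i], start, (size_t)(end - start));
        fields[i][end - start] = '\0';
        start = end + 1;
    }
    return true;
}
```

Read this way, the default=base=default=default section applies its FOPTIMIZE line to every benchmark built with base tuning, while the 178.galgel section adds settings for that one test only.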

For Microsoft Visual C++ we used the -Ox -G6 optimization switches:

tune= base
output_format=asc
ext=ixbt.msvc6
check_md5=1

default=default=default=default:
CC= cl
CXX=cl

default=base=default=default:
COPTIMIZE=-Ox -G6
CXXOPTIMIZE=-Ox -G6

176.gcc=default=default=default:
EXTRA_CFLAGS=-Dalloca=_alloca -Op
EXTRA_LDFLAGS= -F10000000

178.galgel=default=default=default:
EXTRA_FFLAGS=-fixed
LDOPT=-Fe$@ -link -stack:300000000

186.crafty=default=default=default:
EXTRA_CFLAGS= -DNT_i386

252.eon=default=default=default:
SOURCE_PREFIX_CXX=-Tp

253.perlbmk=default=default=default:
EXTRA_CFLAGS= -DSPEC_CPU2000_NTOS -DPERLDLL /MT

254.gap=default=default=default:
EXTRA_CFLAGS=-DSYS_HAS_MALLOC_PROTO -DSYS_HAS_CALLOC_PROTO

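The Intel file is more complicated because it uses two-pass compilation with profile feedback (-Qprof_gen in the first pass, -Qprof_use in the second) together with interprocedural optimization across modules (-Qipo). Here is the SSE2 version: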

tune= base
ext=ixbt.060202.sse2

check_md5=1
reportable=1

default=default=default=default:
CC= icl
CXX=icl
F77=ifl
FC=ifl
OBJ=.obj

int=base=default=default:
PASS1_CFLAGS=-Qprof_gen
PASS2_CFLAGS=-QxW -Qipo -Qprof_use
PASS1_LDFLAGS=-Qprof_gen
PASS2_LDFLAGS=-QxW -Qipo -Qprof_use

fp=base=default=default:
PASS1_CFLAGS=-Qprof_gen
PASS2_CFLAGS=-Qipo -QxW -O3 -Qprof_use
PASS1_FFLAGS=-Qprof_gen
PASS2_FFLAGS=-Qipo -QxW -O3 -Qprof_use
PASS1_LDFLAGS=-Qprof_gen
PASS2_LDFLAGS=-Qipo -QxW -O3 -Qprof_use

176.gcc=default=default=default:
CPORTABILITY=-Dalloca=_alloca /F10000000
EXTRA_LDFLAGS= /F10000000

178.galgel=default=default=default:
EXTRA_FFLAGS=-FI
EXTRA_LDFLAGS= /F32000000

186.crafty=default=default=default:
CPORTABILITY= -DNT_i386

253.perlbmk=default=default=default:
CPORTABILITY= -DSPEC_CPU2000_NTOS -DPERLDLL /MT
EXTRA_LDFLAGS=/MT

254.gap=default=default=default:
CPORTABILITY=-DSYS_HAS_CALLOC_PROTO -DSYS_HAS_MALLOC_PROTO

252.eon=base=default=default:
OPTIMIZE=-QxW -Qipo -GX -GR
feedback=no

Fortran compilers

Despite its age, Fortran remains one of the most popular languages for computational problems involving floating-point operations. Fifteen years ago there were many interesting Fortran compilers, and Microsoft's products were not the fastest among them. Today the most popular are the Compaq and Intel compilers, though rumor has it that the two will soon merge.

The Compaq compiler can optimize for both Intel and AMD architectures by means of the -tune and -architecture switches. We will compare the Generic, P2, P3, P4 and K7 (for AMD K6 and higher) variants. The Intel compiler can also use SIMD instructions in the generated code; we tested it without SIMD and with MMX, SSE and SSE2. Note that support for AMD's 3DNow! is not stated explicitly.

We used versions 6.6 and 5.01 of these compilers, respectively. The latest versions are now 6.6A and 6.0, but they do not differ much in speed. We will use them next time, by which point PGI Workstation 4.0 may also be available; we expect good results from it on AMD processors.

Let's start with the Intel Pentium 4 platform. The last set (GEOMEAN) is the geometric mean over all Fortran tests.

First, it is not clear why the compilers respond so differently to architecture-specific optimization: the Intel compiler improves its results with SSE/SSE2, while the Compaq compiler is indifferent to the target architecture.

Secondly, in some subtests (for example, 189.lucas and 301.apsi) speed does not depend on SIMD usage. In most cases Intel's code is faster than Compaq's, but in 178.galgel CVF is 50% faster than Intel Fortran without SSE/SSE2. The same applies to 171.swim, where only SSE/SSE2 allow Intel Fortran to catch up with Compaq Visual Fortran. This test depends heavily on memory, so these SIMD extensions probably help through their memory-related instructions (SSE added prefetch and non-temporal store instructions).

However, the overall scores of the Intel compiler are 17% and 27% higher with SSE and SSE2 code generation, respectively. Without SIMD the results are almost equal.

On the Athlon XP the overall averages almost coincide again. In 171.swim the Compaq compiler unexpectedly gains 8% with the "k7" option. I think this is due to memory optimization rather than to the 3DNow! arithmetic instructions. Intel's code pulls ahead considerably with SSE (though by relatively less than on the Pentium 4), so SSE is quite useful on the Athlon XP even though the code was generated by Intel's compiler :)

In 178.galgel the Compaq compiler performs better than Intel's. This is more noticeable on the Athlon XP, where SSE in Intel's code has less effect on speed, whereas on the Pentium 4 it helped IFC. Also note that CVF code runs 11% faster with the Athlon optimization. That is why AMD uses CVF 6.6 for 178.galgel when calculating the peak metrics it publishes at http://www.spec.org/.

The situation in 200.sixtrack and 301.apsi differs from the first diagram: while the compilers are equal on the Pentium 4, here Intel outscores Compaq.

Although the Pentium III is quite old, it is still used for scientific calculations. With this processor the compilers give clearer results. Using the earlier, only partially supported SIMD sets made the results worse; this applies especially to MMX optimization, which is sometimes even harmful.

The averages of the two compilers do not differ much, apart from the advantage of Intel's SSE-optimized code.

Here the Compaq compiler scores better with the Pentium III optimization switches, which was not the case before. However, the tests where this happens do not always coincide with those where the Intel compiler behaves the same way.

The last diagram shows that for scientific calculations on the AMD Athlon it is better to use the Compaq compiler, especially with the "k7" optimization. The gain can reach 55%.

Summary on the Fortran compilers

The tests show that a lot depends on the compiler. Higher scores can be obtained with faster processors and memory or with optimized algorithms, but the choice of compiler alone can double the speed. As far as SIMD is concerned, the largest gain was obtained with SSE2 code for the Intel Pentium 4 in 171.swim and 187.facerec: the former benefits from memory optimization, the latter from vectorized calculations. Hand-coding algorithms for SIMD could give an even greater effect, but even the gains obtained here demonstrate the high quality of Intel's compiler.
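The vectorization gain mentioned for 187.facerec can be illustrated with a generic example (our own, not code from the benchmark). A loop whose iterations are independent, with arrays declared non-overlapping via C99 restrict, is the kind of code an auto-vectorizing compiler can map onto packed SSE2 instructions; a loop where each step needs the previous result is not. Whether a particular compiler actually vectorizes the first loop is not guaranteed.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] (the classic "axpy" kernel).
   Each iteration is independent, and `restrict` promises the compiler
   that x and y do not overlap, so it may process several elements per
   SIMD instruction (two doubles per SSE2 register). */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence: v[i] needs the freshly updated v[i - 1],
   which prevents straightforward vectorization of this loop. */
void prefix_sum(size_t n, double *v)
{
    for (size_t i = 1; i < n; i++)
        v[i] += v[i - 1];
}
```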

Using SSE on the Athlon XP is also successful, although the "k7" option of the CVF compiler has a positive effect as well. Remember that SSE operates on single-precision floating-point numbers (double-precision SIMD arrived with SSE2), while most scientific calculations require double precision.
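The practical difference between the two precisions is easy to demonstrate. In the C sketch below (our own illustration), 0.1 is added ten million times: the single-precision total ends up several percent above the exact 1,000,000, because each partial sum is rounded to a 24-bit mantissa, while the double-precision total is correct to better than 0.001.

```c
/* Adds 0.1 `n` times in single precision. `volatile` forces every
   partial sum to be stored as a 32-bit float, even on x87 FPUs that
   would otherwise keep it in an 80-bit register. */
float sum_tenths_float(long n)
{
    volatile float sum = 0.0f;
    for (long i = 0; i < n; i++)
        sum += 0.1f;
    return sum;
}

/* The same computation in double precision (53-bit mantissa). */
double sum_tenths_double(long n)
{
    volatile double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += 0.1;
    return sum;
}
```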

I can recommend the Compaq compiler for calculations on AMD platforms.

C and C++ compilers

This part also has two participants: Microsoft Visual C++ 6.0 SP5 and Intel C/C++ Compiler 5.01. We will update the results when new interesting compilers appear. Since Visual C++ is currently the most popular C/C++ compiler, it is interesting to see whether Microsoft's quality can be trusted...

Well, the speed of the code generated by the Microsoft compiler is not impressive at all: Intel's advantage ranges from 13% to 350%(!). The smallest difference is in 179.art, which is very memory-intensive. Intel wins even without SIMD.
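The percentages in this section compare the SPEC ratios of two builds of the same test. A sketch of the arithmetic in C, assuming the usual definition (higher score divided by lower score, minus one); the function names are ours. With this definition, a 350% advantage means the Intel build scored 4.5 times as high as the Microsoft one.

```c
/* Advantage of score a over score b, in percent: 100 * (a / b - 1).
   Positive if a is higher, negative if lower. Higher SPEC ratios
   mean faster code. */
double advantage_percent(double a, double b)
{
    return 100.0 * (a / b - 1.0);
}

/* Converts an advantage in percent back into the score ratio a / b. */
double ratio_from_percent(double percent)
{
    return 1.0 + percent / 100.0;
}
```

Note that the measure is asymmetric: if build A is 350% ahead of build B, then B is only about 78% behind A (1 - 1/4.5 ≈ 0.78).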

Now look at 252.eon. It is the only test written in C++, and its speed depends only on the processor, which makes such a gap all the sadder. Hmm, I wonder which compiler was used for Microsoft Office :) Maybe that is why the company doesn't want to open the Windows source code: it is afraid of faster clones...

Given that modern office PCs spend most of their time processing integer data, I fear for the industry as a whole. Although I understand that archivers, video and audio processing programs and games are mostly written in assembler, the impression remains unpleasant.

SIMD code generated by Intel's compiler has a major effect only in 177.mesa; the CINT2000 tasks depend less on additional instruction sets.

Intel's overall advantage is 94% in the integer tests and 76% in the floating-point tests.

On the AMD Athlon XP the situation is similar: Intel's compiler is ahead. Note that the SIMD gain is almost the same as on the Pentium 4, so I can say that code generated by Intel's C/C++ compiler is not harmful for the Athlon XP.

On the Intel Pentium III and AMD Athlon processors Intel's compiler also performs better. Note that MMX has a large effect in 252.eon on the Athlon.

Summary on the C/C++ compilers

The C/C++ tests show that Intel's compiler generates much faster code than Microsoft Visual C++. Remember, though, that the latter has not been updated for a long time (Service Pack 5 was released more than a year ago) and cannot optimize for the SIMD instructions of modern processors. On the other hand, it has a better development environment (into which Intel's compilers can also be integrated) and compiles quickly.

Using SSE2 on the Pentium 4 noticeably improves the scores of the Intel C/C++ Compiler in some tests (by 26% in 252.eon and by 77% in 177.mesa). That is quite good considering that the source code was not adapted for convenient vectorization.

The MMX/SSE code generated by the Intel compiler also performs excellently on the Athlon XP.

Memory allocation optimization

If you look closely at the configuration files at http://www.spec.org/, you will see that many of them use shlW32M.lib. This library, SmartHeap from http://www.microquill.com/, optimizes memory allocation and costs over $700. We obtained a trial version and ran several tests. Since the trial can be used for only 60 days, we tested a single PC configuration to assess the real effect of shlW32M.lib, using our configuration for the Intel compilers as the base. Note that for MSVC, installing SP5 changes the situation: without it, SmartHeap has a greater effect.

Below are the results of an Intel Pentium 4 1.7 GHz system (i850, RDRAM) with and without this library. The diagrams show the percentage difference in results when SmartHeap is used.

As you can see, some tests speed up while others slow down. For the peak metric, shlW32M.lib can be linked only into the tests where it improves scores, raising the CINT2000 result by a dozen points or so...
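The per-test selection described here amounts to keeping, for each test, the better of two measured ratios. A minimal C sketch, assuming hypothetical arrays of per-test ratios measured without and with the library (the names are ours, and no real results are used):

```c
#include <stddef.h>

/* For each test i, keeps the better of the ratios measured without
   (plain[i]) and with (smartheap[i]) the allocator library, and records
   which build was chosen. Returns the number of tests where the
   library build was selected. */
size_t pick_best_per_test(const double *plain, const double *smartheap,
                          size_t n, double *best, int *used_library)
{
    size_t library_count = 0;
    for (size_t i = 0; i < n; i++) {
        if (smartheap[i] > plain[i]) {
            best[i] = smartheap[i];
            used_library[i] = 1;
            library_count++;
        } else {
            best[i] = plain[i];     /* ties keep the plain build */
            used_library[i] = 0;
        }
    }
    return library_count;
}
```

Since the suite score is a geometric mean of the per-test ratios, replacing each ratio with the larger of the two can only raise the score or leave it unchanged.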

Given its price, we will not use this product any more. This is one more reason why our results differ from those published by Intel and AMD at http://www.spec.org/.

Conclusion

We have studied several popular C/C++/Fortran compilers in SPEC CPU2000. Compilers were the last of the factors affecting this package's results that we set out to examine. Like other SPEC members, we will keep using Intel's compilers to publish data, but we had to find out why that choice is worthwhile :)

Well, there are several concluding points, and some of them apply beyond SPEC CPU2000:

SPEC CPU2000 results depend on the compiler (and its settings) more than on anything else
a good compiler (i.e. Intel's :) delivers a considerable SIMD gain even on non-optimized source code
AMD 3DNow! is not widely supported by compiler developers
Microsoft Visual C++ is a good and handy compiler, but there are others out there as well
