SPEC CPU2000. Part 24. Efficiency of Profile-Guided Optimizations in Intel C++/Fortran Compiler 9.0, Intel Pentium 4 660 and Athlon 64 4000+ Processors

We have been benchmarking SPEC CPU2000 tasks on various platforms for quite a long time, and for most of that time (probably since Intel C++/Fortran Compilers 7.0) we have compiled the test tasks with so-called profile-guided optimizations. We use them by default, on the assumption that they inevitably produce faster machine code. Nevertheless, it will do no harm to check at least once that this is really true. Keep in mind also that Intel compilers are designed to deliver maximum code performance primarily on processors from the same manufacturer (it would be strange otherwise), while we use these tasks to test competing processors as well :). For this reason, today's contenders are two "more or less top" single-core processors from the leading competing manufacturers: Intel Pentium 4 660 and AMD Athlon 64 4000+. But first, a brief account of the optimization method itself, profile-guided optimization.

Profile-guided optimizations in Intel C++/Fortran Compiler 9.0

Profile-guided optimizations (PGO) tell the compiler which paths of an application are the most heavily traveled. With this information, the compiler can optimize those code fragments more selectively and specifically. PGO offers the following advantages in Intel C++/Fortran 9.0 compilers:

Optimization of instruction cache usage
Optimization of branch prediction by experimental detection of the most typical execution paths (which cannot be determined during compilation)
Prevention of unrolling of small loops that execute only a few iterations
Optimization of automatic function inlining.

Profiled applications are built in three phases. Phase I is instrumented compilation: the compiler creates an instrumented program from the source code. Phase II is instrumented execution: the developer runs the instrumented program, and each run (with an input data set) produces an execution profile. Phase III is feedback compilation: the compiler uses the profile summary file to optimize the most heavily traveled paths in the finished application.
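The three phases can be sketched as a small driver script. This is only an illustration, not the harness used for these tests: the `icl` driver name, the source file, the executable name and the training input are assumptions, and only the `-Qipo -O3 -Qprof_gen` and `-Qprof_use` switches are taken from this article. The script builds the command lines and prints them; it does not run anything.

```python
def pgo_phases(sources, train_run, compiler="icl"):
    """Return the three PGO phases as (name, argument list) pairs.

    Nothing is executed. `compiler` and `train_run` are hypothetical
    placeholders supplied by the caller.
    """
    common = ["-Qipo", "-O3"]  # switches shared by both compilation passes
    return [
        # Phase I: build an instrumented binary that records a profile.
        ("instrumented compilation", [compiler, *common, "-Qprof_gen", *sources]),
        # Phase II: run the instrumented binary on a training input set.
        ("instrumented execution", list(train_run)),
        # Phase III: recompile, letting the compiler read the profile.
        ("feedback compilation", [compiler, *common, "-Qprof_use", *sources]),
    ]


if __name__ == "__main__":
    # File names below are made up for the example.
    for name, cmd in pgo_phases(["app.c"], ["app.exe", "train.in"]):
        print(f"{name}: {' '.join(cmd)}")
```

Keeping the shared switches in one place mirrors the PASS1/PASS2 setup: the two compilation passes differ only in the profiling switch.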

So the key to PGO is detecting the most frequently executed code fragments (conditional as well as unconditional). It follows that the success of PGO depends heavily on how similar the tasks (input data sets) run in the optimized application are to those used for profiling. Similar tasks will most likely exercise the same execution paths, while different tasks may in theory exercise quite different ones (depending on the complexity of the compiled application and how it is used). As for SPEC tasks, the scenario is practically ideal: the input data used for PGO are reduced versions of the input data sets used later to measure test task performance. Keep in mind, though, that the situation may be different in the general case.
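Why input similarity matters can be shown with a toy model. The sketch below is not how the Intel compiler collects profiles; it simply counts the outcomes of one hypothetical conditional (`v < threshold`) for three made-up input sets and reports which side is the "hot" one. All data and names are invented for illustration.

```python
from collections import Counter


def branch_profile(values, threshold=100):
    """Count how often each side of one conditional is taken for an input set."""
    profile = Counter()
    for v in values:
        profile["taken" if v < threshold else "not_taken"] += 1
    return profile


def hot_path(profile):
    """Return the more frequent outcome of the branch."""
    return profile.most_common(1)[0][0]


# Hypothetical inputs: a reduced training set, a larger run with a similar
# value distribution, and a run with a very different distribution.
train = [3, 7, 150, 12, 40, 9]
ref_similar = [5, 8, 11, 300, 2, 60, 14, 33]
ref_different = [500, 900, 20, 700, 650, 810]

for name, data in [("train", train),
                   ("ref_similar", ref_similar),
                   ("ref_different", ref_different)]:
    print(name, dict(branch_profile(data)), "hot:", hot_path(branch_profile(data)))
```

A profile gathered on `train` predicts the hot path of `ref_similar` correctly, but it would steer the optimizer toward the wrong side of the branch for `ref_different`.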

Test Results

We used the following compiler versions in our tests:

Intel(R) C++ Compiler for 32-bit applications, Version 9.0 Build 20050912Z Package ID: W_CC_C_9.0.024
Intel(R) Fortran Compiler for 32-bit applications, Version 9.0 Build 20050912Z Package ID: W_FC_C_9.0.024

As usual, for the PGO builds we used the following compiler switches:

PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
PASS2_CFLAGS= -Qipo -O3 -Qprof_use

Switches for compilation without PGO were simpler:

COPTIMIZE= -Qipo -O3

Pentium 4 660

Let's start with the results obtained with Pentium 4 660, the native processor for Intel compilers. We can reasonably expect maximum PGO gains in this very case.

In practically all cases, integer SPEC CPU2000 tasks gain from PGO (up to 30%). The exceptions are 164.gzip and 256.bzip2 (which almost always show insignificant performance drops), 175.vpr (an equally small performance gain), and 181.mcf, whose PGO results depend on the code optimization variant. For this task, non-optimized code and code optimized for SSE (-QxK) and SSE2 (-QxW) show PGO gains not exceeding 0.7% (we cannot speak of pure SSE/SSE2 here, since only integer instructions are used). But enabling Pentium 4 specific optimizations (-QxN and -QxP) raises the PGO gain to as much as 25%. It is reasonable to assume that this has to do with particularly effective use of branch hint prefixes (2Eh and 3Eh) in this task. 252.eon also shows some additional gain (up to 10%) with the -QxN and -QxP optimizations.

The averaged result, SPECint_base2000 relative to non-optimized code performance, illustrates the advantage of PGO for integer SPEC CPU2000 tasks on Pentium 4 processors: the performance gain amounts to 7-10%, depending on the code optimization variant (in favour, of course, of the specific -QxN and -QxP optimizations).
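For reference, here is how such figures can be computed. SPEC CPU2000 composite scores are geometric means of per-task ratios, so the overall PGO gain is the ratio of two geometric means. The task names and scores below are made up purely to show the arithmetic; they are not measurements from this article.

```python
import math


def geometric_mean(values):
    """Geometric mean, the averaging used for SPEC composite scores."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gain_percent(with_pgo, without_pgo):
    """Relative gain of the PGO build over the non-PGO build, in percent."""
    return (with_pgo / without_pgo - 1.0) * 100.0


# Hypothetical per-task SPEC ratios (higher is better).
no_pgo = {"task_a": 1000.0, "task_b": 1200.0, "task_c": 800.0}
pgo = {"task_a": 1100.0, "task_b": 1188.0, "task_c": 1000.0}

per_task = {t: gain_percent(pgo[t], no_pgo[t]) for t in no_pgo}
overall = gain_percent(geometric_mean(pgo.values()),
                       geometric_mean(no_pgo.values()))

for task, gain in per_task.items():
    print(f"{task}: {gain:+.1f}%")
print(f"composite: {overall:+.1f}%")
```

Because the composite is a geometric mean, a small drop in one task (as in the hypothetical `task_b`) is outweighed by larger gains elsewhere rather than cancelling them arithmetically.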

We know well from our previous SPEC CPU2000 reviews of various platforms that, unlike integer tasks, floating-point SPEC tasks usually paint a more ambiguous picture. The same is true of the comparison between non-profiled and profiled code, even on the native Pentium 4 processor.

Nevertheless, even here the picture is more or less clear. The maximum PGO gain is demonstrated by 177.mesa (up to 22%), 168.wupwise (up to 16%, except for the negative results in -QxK code), and 187.facerec (up to 11%). 191.fma3d and 301.apsi also show some performance gains (up to 3.5%). In the other cases we see practically zero or, at times, paradoxical results: for example, 179.art suddenly shows a 20% performance drop in one of the two optimization variants closest to Pentium 4 (-QxN) and an 8% gain in the other (-QxP).

Since there are almost no dips in individual tests, the SPECfp_base2000 total score also benefits to some degree: from 1.3% (the worst result, -QxK) to 3.9% (-QxP, which corresponds directly to the Prescott core).

As expected, PGO in Intel C++/Fortran Compiler 9.0 does deliver performance gains when the code runs on a processor from the same manufacturer, Intel Pentium 4 660. Let's see what happens with the competing processor, AMD Athlon 64 4000+. The diagrams for this processor do not include the -QxP code, as it does not support SSE3 instructions. Nevertheless, the similar -QxN version is enough to get the whole picture.

Strange as it may seem, integer SPEC CPU2000 tests show qualitatively the same situation as on the Pentium 4 processor. Quantitatively, the PGO gains even exceed those obtained on Pentium 4: the maximum gain, for example, amounts to 49%. The ranking of tasks is the same: the smallest gains are shown by 164.gzip, 175.vpr, and 256.bzip2. 181.mcf again shows a performance gain only in the specific -QxN code, which uses conditional branch hints. The same code shows the best results for the majority of other tasks as well as for the SPECint_base2000 total score: a 15% gain versus 9-10% in the other cases.

Athlon 64 results in floating-point tasks are more ambiguous. In particular, the results of one code optimization variant often fall out of the general line (especially the "no optimization" variant). We cannot come up with a rational explanation for this phenomenon. As for the best and the worst results, they generally coincide with those obtained on the Pentium 4 processor: on average, the highest performance gain is enjoyed by 177.mesa, while the largest performance drop is shown by 179.art.

We cannot give a simple assessment of the average SPECfp_base2000 result: it depends heavily on the code optimization variant and ranges from -3.3% to +1.4%.

Conclusion

Our test results show that two-pass compilation of SPEC CPU2000 code with PGO in Intel C++/Fortran Compiler 9.0 is generally more advantageous in terms of performance than the usual single-pass compilation. PGO gains appear both when the code runs on the Intel Pentium 4, the "native" processor for Intel compilers, and when it runs on the AMD Athlon 64. The largest performance gains are shown by integer SPEC CPU2000 tasks, while floating-point tasks do not give unambiguous results.

Thus, using profiled code of SPEC CPU2000 tests to benchmark platforms with processors from different manufacturers is justified, and the results of such tests are reliable.

Dmitri Besedin (dmitri_b@ixbt.com)
February 20, 2006
