iXBT Labs - Computer Hardware in Detail






SPEC CPU2000. Part 24. Efficiency of Profile-Guided Optimizations in Intel C++/Fortran Compiler 9.0, Intel Pentium 4 660 and Athlon 64 4000+ Processors

We have been benchmarking SPEC CPU2000 tasks on various platforms for quite a long time, compiling the test tasks with so-called profile-guided optimizations (probably since Intel C++/Fortran Compilers 7.0). We have been using them by default, assuming that they inevitably yield higher-performance machine code. Nevertheless, it will do no harm to verify at least once that this is really true. Note also that Intel compilers are primarily intended to produce the fastest code on processors from the same manufacturer (it would have been strange otherwise), while we use these tasks for testing competing processors as well :). For this reason, today's contenders are two "more or less top" single-core processors from the leading competing manufacturers: Intel Pentium 4 660 and AMD Athlon 64 4000+. But first, a brief account of the optimization method itself, profile-guided optimization.

Profile-guided optimizations in Intel C++/Fortran Compiler 9.0

Profile-guided optimizations (PGO) tell the compiler which paths of an application are most heavily traveled. With this information, the compiler can optimize these code fragments more selectively and specifically. PGO offers the following advantages in Intel C++/Fortran 9.0 compilers:

  • Optimization of instruction cache usage
  • Optimization of branch prediction by experimentally detecting the most typical execution paths (which cannot be determined at compile time)
  • Avoidance of unrolling small loops that execute only a few iterations
  • Optimization of automatic function inlining.

Profiled applications are built in three phases. Phase I: Instrumented Compilation. The compiler creates an instrumented program from the source code. Phase II: Instrumented Execution. The developer runs the instrumented program; each run (with an input data set) creates an execution profile. Phase III: Feedback Compilation. The compiler uses the profile summary file to optimize the most heavily traveled paths in the finished application.
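The three phases can be mimicked in a few lines of Python. This is only a toy sketch of the idea and says nothing about how the Intel compiler implements it: the `classify` function, the path names, and the training data are all hypothetical.

```python
from collections import Counter

def classify(value, profile=None):
    """Toy workload with a data-dependent branch.

    When `profile` is given, the function also records which path was
    taken, playing the role of the instrumented build (Phase I).
    """
    if value < 100:
        path = "small"
    else:
        path = "large"
    if profile is not None:
        profile[path] += 1
    return path

def collect_profile(training_input):
    """Phase II: run the instrumented code on a training input."""
    profile = Counter()
    for v in training_input:
        classify(v, profile)
    return profile

def hot_path(profile):
    """What Phase III would consume: the most frequently executed path."""
    return profile.most_common(1)[0][0]

training_input = [3, 7, 250, 12, 41, 8, 999, 5]  # hypothetical training data
profile = collect_profile(training_input)
print(dict(profile), "hot path:", hot_path(profile))
```

A real compiler would use such counts to lay out the hot path contiguously and to guide inlining, rather than just reporting it.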

So, the key factor of PGO is detecting the most frequently executed fragments of code (conditional as well as unconditional). Obviously, the success of PGO depends heavily on how similar the tasks (input data sets) used for profiling are to those the optimized application will actually process. Similar tasks will most likely exercise the same execution paths, while different tasks may in theory exercise quite different ones (depending on the complexity of the compiled application and its usage). As for SPEC tasks, they present a practically ideal scenario: the input data used for PGO are reduced versions of the input data sets used for the subsequent performance measurements. Nevertheless, don't forget that the situation may be different in the general case.
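The dependence on input similarity can be illustrated with a small Python sketch. It assumes a single branch of the form `value < threshold` and measures how often the direction learned from a training set matches the behaviour on another data set. All function names and data sets here are hypothetical and serve only as an illustration.

```python
def hot_branch(data, threshold):
    """Return True if 'value < threshold' is the majority outcome in data."""
    taken = sum(1 for v in data if v < threshold)
    return taken * 2 >= len(data)

def prediction_hit_rate(train, reference, threshold):
    """Fraction of reference items whose branch outcome matches the
    direction that the training run marked as hot."""
    predicted = hot_branch(train, threshold)
    hits = sum(1 for v in reference if (v < threshold) == predicted)
    return hits / len(reference)

train = [1, 2, 3, 4, 500]            # hypothetical training set
similar_ref = [2, 3, 5, 8, 13, 700]  # same character as the training set
different_ref = [400, 500, 600, 7]   # quite different character

print(prediction_hit_rate(train, similar_ref, 100))
print(prediction_hit_rate(train, different_ref, 100))
```

With the similar data set, most executions follow the path the profile marked as hot; with the different one, the profile points the optimizer in the wrong direction most of the time.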

Test Results

We used the following compiler versions in our tests:

  • Intel(R) C++ Compiler for 32-bit applications, Version 9.0 Build 20050912Z Package ID: W_CC_C_9.0.024
  • Intel(R) Fortran Compiler for 32-bit applications, Version 9.0 Build 20050912Z Package ID: W_FC_C_9.0.024

As usual, we used the following compiler switches for PGO:

PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
PASS2_CFLAGS= -Qipo -O3 -Qprof_use

Switches for compilation without PGO were simpler:


Pentium 4 660

Let's start with the results obtained on Pentium 4 660, the native processor for Intel compilers. This is where we can reasonably expect the largest PGO gains.

In practically all cases, integer SPEC CPU2000 tasks gain from PGO (up to 30%). The only exceptions are 164.gzip and 256.bzip2 (which almost always show insignificant performance drops), 175.vpr (an equally insignificant gain), and 181.mcf, whose PGO results depend on the code optimization profile. Non-optimized code and code optimized for SSE (-QxK) and SSE2 (-QxW) show PGO gains not exceeding 0.7% (we cannot speak of pure SSE/SSE2 here, since only integer instructions are used). But enabling Pentium 4 specific optimizations (-QxN and -QxP) improves the result by up to 25%. It is reasonable to assume that these results stem from very effective use of branch hint prefixes (2Eh and 3Eh) in this task. 252.eon also shows some additional gain (up to 10%) with the -QxN and -QxP optimizations.
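On Pentium 4, the prefix 2Eh on a conditional jump hints "branch not taken" and 3Eh hints "branch taken". The Python sketch below shows how profile counts could, in principle, be turned into such a hint. It is purely illustrative: the `choose_hint` function and the 90% bias threshold are assumptions, not the heuristic the Intel compiler actually uses.

```python
BRANCH_NOT_TAKEN_HINT = 0x2E  # prefix byte: hint that the branch is not taken
BRANCH_TAKEN_HINT = 0x3E      # prefix byte: hint that the branch is taken

def choose_hint(taken, not_taken, min_bias=0.9):
    """Pick a hint prefix from profile counts, or None if the branch
    is not biased strongly enough (min_bias is a hypothetical cutoff)."""
    total = taken + not_taken
    if total == 0:
        return None  # branch never executed during profiling
    if taken / total >= min_bias:
        return BRANCH_TAKEN_HINT
    if not_taken / total >= min_bias:
        return BRANCH_NOT_TAKEN_HINT
    return None

# Hypothetical profile counts for three branches.
for counts in [(95, 5), (1, 99), (50, 50)]:
    hint = choose_hint(*counts)
    print(counts, "->", "no hint" if hint is None else hex(hint))
```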

The average result, SPECint_base2000 relative to the performance of non-optimized code, illustrates the advantage of PGO for integer SPEC CPU2000 tasks on Pentium 4 processors: the gain amounts to 7-10%, depending on the code optimization (in favour, of course, of the specific -QxN and -QxP optimizations).
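SPECint_base2000 is a geometric mean of per-task ratios, so an overall PGO gain can be derived from two sets of per-task results as sketched below. The function names are hypothetical and the numbers are made up for illustration; they are not measurements from this article.

```python
import math

def geometric_mean(values):
    """Geometric mean, the way SPEC aggregates per-task ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def pgo_gain_percent(ratios_without_pgo, ratios_with_pgo):
    """Relative change of the aggregate score, in percent."""
    base = geometric_mean(ratios_without_pgo)
    pgo = geometric_mean(ratios_with_pgo)
    return (pgo / base - 1.0) * 100.0

# Hypothetical per-task ratios (NOT data from the article).
without_pgo = [1000.0, 1200.0, 800.0]
with_pgo = [1100.0, 1200.0, 1000.0]

gain = pgo_gain_percent(without_pgo, with_pgo)
print(f"aggregate PGO gain: {gain:.1f}%")
```

Because the mean is geometric, one task with a large gain moves the total less than it would in an arithmetic average, which is one reason the aggregate gains are much smaller than the best per-task gains.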

We know well from our previous SPEC CPU2000 reviews of various platforms that, unlike integer tasks, SPEC tasks with real numbers (floating-point tasks) usually show a more ambiguous picture. The same applies to the comparison of non-profiled and profiled code, even on the native Pentium 4 processor.

Nevertheless, even here the picture is fairly simple. The maximum PGO gain is shown by 177.mesa (up to 22%), 168.wupwise (up to 16%, except for the negative results in -QxK code), and 187.facerec (up to 11%). 191.fma3d and 301.apsi also show some gains (up to 3.5%). In other cases we see practically zero or ... paradoxical results: for example, 179.art suddenly shows a 20% performance drop in one of the optimizations closest to Pentium 4 (-QxN) and an 8% gain in the other (-QxP).

Since there are practically no dips in individual tests, the SPECfp_base2000 total score also gains to some degree: from 1.3% (the worst result, -QxK) to 3.9% (-QxP, which corresponds directly to the Prescott core).

As expected, PGO in Intel C++/Fortran Compiler 9.0 does yield performance gains when the code runs on a processor from the same manufacturer, Intel Pentium 4 660. Let's see what happens with the competing processor, AMD Athlon 64 4000+. The diagrams below do not include the -QxP code, as this processor does not support SSE3 instructions. Nevertheless, the similar -QxN version will suffice to get the whole picture.

Strange as it may seem, integer SPEC CPU2000 tests show qualitatively the same picture as on the Pentium 4 processor. Quantitatively, PGO gains even exceed those obtained on Pentium 4: for example, the maximum gain amounts to 49%. The ranking of tasks in the diagrams is the same: the smallest gains are shown by 164.gzip, 175.vpr, and 256.bzip2. 181.mcf again gains only in the specific -QxN code, which uses conditional branch hints. The same code shows the best results for most other tasks as well as for the SPECint_base2000 total score: a 15% gain versus 9-10% in the other cases.

Athlon 64 results in floating-point tasks are more ambiguous. In particular, the results for one of the code optimizations often deviate from the general trend (especially the "no optimization" variant). We cannot offer a rational explanation for this phenomenon. As for the best and the worst results, they generally coincide with those obtained on the Pentium 4 processor: on average, the highest performance gain is shown by 177.mesa, and the largest performance drop by 179.art.

We cannot give a simple assessment of the average SPECfp_base2000 result: it depends heavily on the code optimization and ranges from -3.3% to +1.4%.


The results of our tests show that two-pass compilation of SPEC CPU2000 code with PGO in Intel C++/Fortran Compiler 9.0 is generally more advantageous in terms of performance than the usual single-pass compilation. PGO gains appear both when the code is executed by Intel Pentium 4, the "native" processor for Intel compilers, and when it is executed by AMD Athlon 64. The largest gains are shown by integer SPEC CPU2000 tasks, while floating-point tasks do not show unambiguous results.

Thus, using profiled code of SPEC CPU2000 tests to benchmark platforms with processors from different manufacturers is justified, and the results of such tests are reliable.

Dmitri Besedin (dmitri_b@ixbt.com)
February 20, 2006


Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.