In mid-June, Intel released Version 9 of its C++ and Fortran compilers. The new version does not differ fundamentally from the previous Version 8.1. Its main features are the integration of the compilers for the IA-32, IA-64, and EM64T (x86-64) platforms into a unified package and, as far as code optimization is concerned, additional options for Hyper-Threading and multi-core processors, in particular Software-based Speculative Pre-Computation (SSP).
In this article we analyze how fast the new compilers are compared to the previous version on top (or almost top) single-core processors: from Intel (Pentium 4 and Pentium M) and also from AMD (Athlon 64 FX-57, with some code adjustments, of course; see below).
We used the following compilers:
As a reference, we used the test code compiled by Intel C++ Compiler 8.1.022 and Intel Fortran Compiler 8.1.025.
As usual, we used identical general compilation keys in all cases (Compilers 8.1 and 9.0, different code optimizations):
PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
Pentium 4 670
Let's start with the results of the "native" processor, the Pentium 4 670 (3.8 GHz) with the Prescott core. It supports all the necessary instruction sets and can execute code compiled with every processor-specific optimization key: -QxK, -QxW, -QxN, -QxB, and -QxP.
Nevertheless, we shall start with the non-optimized variant. One important point: this code version, compiled by both the previous and the new compiler, caused errors in the 175.vpr and 176.gcc sub-tests regardless of the processor type. That is why we started the tests with the --noreportable key, ignoring errors in some sub-tests (--ignore_errors).
Integer tests: the new version has an advantage in some sub-tests (252.eon, 253.perlbmk, 254.gap, 255.vortex), but this cannot compensate for the significant performance drop in 197.parser (about 34%!) and in 300.twolf. As a result, the total score in SPECint_base2000 is 1604, which is 4.6% lower than the score obtained with Version 8.1 (1682).
Floating point tests: the new version shows only a minor advantage in some sub-tests, while others drop noticeably (13.3% in 179.art). As a result, the total score in SPECfp_base2000 (1489) is 1.5% lower than the result obtained with the previous version (1511).
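The percentages quoted for the totals can be recomputed from the scores themselves. The short Python sketch below does that, and also estimates how much a single 34% drop costs on its own, under the assumption that the total is a geometric mean over 12 integer sub-tests (the way SPEC CPU2000 totals are formed). The function names are ours and purely illustrative.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old` (negative means slower)."""
    return (new - old) / old * 100.0


def single_subtest_impact(ratio, n_subtests):
    """Percentage change of a geometric-mean total when exactly one of
    `n_subtests` sub-test scores is multiplied by `ratio` and all the
    others stay unchanged."""
    return (ratio ** (1.0 / n_subtests) - 1.0) * 100.0


# Totals quoted in the text (Pentium 4 670, no processor-specific key).
print(f"SPECint_base2000: {relative_change(1604, 1682):+.1f}%")
print(f"SPECfp_base2000:  {relative_change(1489, 1511):+.1f}%")

# Assumption: 12 integer sub-tests combined by a geometric mean.
print(f"One sub-test at 66% speed: {single_subtest_impact(0.66, 12):+.1f}%")
```

Under that assumption, the 197.parser drop alone would account for most of the loss in the integer total, which fits the observation that gains elsewhere fail to offset it.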
The next optimization variant uses SSE instructions (-QxK). The situation in integer tests is similar: an insignificant advantage of the new version in some sub-tests and a 1.5-fold performance drop in 197.parser. 300.twolf, however, gains performance in this case (2.2%). The integral score is approximately 2.5% lower than with Version 8.1. The situation in floating point tests is different: most tasks run faster after the switch to Version 9.0, with the largest gains in the 171.swim (7.7%) and 187.facerec (5.6%) sub-tests. The integral score in SPECfp_base2000 is 1.3% higher than with the previous version.
As for the rest of the code optimization variants (-QxW, -QxN, and -QxP), the situation in integer tests is similar to the -QxK variant: we still see the 1.5-fold performance drop in 197.parser, resulting in a lower integral score in SPECint_base2000. In floating point tests these variants differ from each other, both in the integral score and in individual sub-tests. For example, SSE2/Willamette (-QxW) demonstrates a noticeable gain in 200.sixtrack (11.2%) and 187.facerec (5.0%) along with a significant drop in 179.art (-10.5%); the new version wins just 0.9% in SPECfp_base2000. SSE2/Northwood (-QxN), on the contrary, is outperformed by the previous version in total score (by 0.5%) because of significant drops in 168.wupwise (-21.1%) and 179.art (-11.1%), which outweigh the gains in a number of sub-tests (178.galgel, 187.facerec, 191.fma3d, and 200.sixtrack). Finally, the native variant for Prescott, SSE3 (-QxP), wins 2.6% in total score thanks to gains in 178.galgel (7.9%), 183.equake (12.2%), 187.facerec (5.0%), and 200.sixtrack (9.5%), accompanied by a nearly imperceptible slowdown in a few other sub-tests (3.8% at most, in 301.apsi).
On the whole (judging by the integral scores), absolute performance in both integer and floating point tasks increases in the order -QxK < -QxB < -QxW < -QxN < -QxP, which is reasonable for the Prescott core.
Pentium M 770
We proceed to the second "nearly flagship" from Intel, the Pentium M 770 with the Dothan core (2.13 GHz). Tests with this processor were carried out on a desktop-mobile system: a DFI 855GME-MGF motherboard with the Intel 855GM chipset, which is not the fastest or, to be more exact, does not have the fastest memory system (single-channel DDR-333).
Integer tests without code optimizations: the new version demonstrates the highest gain in 254.gap (~10%) and the largest drop, once again, in 197.parser (about 27%, a bit smaller than on Pentium 4). The total score in SPECint_base2000 is 3% lower than with the previous version. Floating point tests show a small spread in values, both upward and downward, but judging by the integral score, the execution speed of code compiled with ICC/IFC 8.1 and 9.0 is practically identical. Surprisingly, the absolute results in some sub-tests and the total score in SPECfp_base2000 are much lower than the Pentium 4 results, while integer test results are only slightly lower. This is probably because these tests are sensitive to memory bandwidth, which is much lower on the Pentium M system with single-channel DDR-333 (2.67 GB/s versus 6.4 GB/s). It certainly has nothing to do with the FPU, which in Pentium M is no worse than in Pentium 4 and is in fact much better.
Optimization keys (this processor allows -QxK, -QxW, -QxN, and -QxB) don't change the situation significantly, apart from increasing overall performance (it grows in the order mentioned above, and the native code optimization for the Banias core turns out to be the best for the Dothan core as well). Integer tests still show slightly lower results (by approximately 2.5%) than with the previous version, because of the noticeably reduced performance in 197.parser and the lack of noticeable gains in other sub-tests, while floating point tests are practically equal in performance. The latter effect, however, is again the result of a self-compensating spread in results, both upward and downward (especially prominent with -QxK and -QxN, up to 10% in some sub-tests), rather than of identical sub-test results.
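The bandwidth figures quoted for the two memory systems are theoretical peaks and can be reproduced with simple arithmetic. The sketch below assumes a 64-bit memory channel; the text does not describe the Pentium 4 platform's memory, so the dual-channel 400 MT/s configuration used here is only one setup that happens to give 6.4 GB/s. The helper name is ours.

```python
def peak_bandwidth_gb_s(transfers_per_second, bus_width_bits=64, channels=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# DDR-333: 166.67 MHz clock, two transfers per clock, one 64-bit channel.
ddr333_single = peak_bandwidth_gb_s(2 * 166.67e6)

# Assumption (not stated in the article): two 64-bit channels at 400 MT/s.
ddr400_dual = peak_bandwidth_gb_s(400e6, channels=2)

print(f"Single-channel DDR-333: {ddr333_single:.2f} GB/s")
print(f"Dual-channel 400 MT/s:  {ddr400_dual:.2f} GB/s")
print(f"Ratio: {ddr400_dual / ddr333_single:.1f}x")
```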
Athlon 64 FX-57
We have saved the most interesting part for the end of the article: test results of Intel C++/Fortran Compiler 8.1/9.0 on the latest single-core processor from the competitor, the AMD Athlon 64 FX-57. You may wonder how we managed it. It is very simple. All it took was to study the processor type check performed by an application compiled with Intel compilers. Here is how it looks:
1. Vendor String validation for "GenuineIntel";
2. Detecting a processor model type (Pentium III/Pentium M — Model 6, or Pentium 4/Xeon — Model 15);
3. Determining the availability of necessary extended instruction sets (SSE, SSE2, SSE3).
Judging from this algorithm, all you need to do to make AMD processors execute code compiled with Intel C++/Fortran Compiler is remove Check #1, provided that the processor supports the necessary instruction sets. This works because Intel and AMD processors have matching model numbers (the family field, in CPUID terms): Model 6 corresponds to AMD K7 processors (most of which support SSE), and Model 15 to AMD K8 processors (which support SSE and SSE2, and in the latest core revision E also SSE3). However, even if the numbers had not matched, we could just as well have removed Check #2. In that case, whether an application runs would depend solely on the presence of the necessary extensions in the processor.
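The three checks and the effect of dropping the first one can be modeled in a few lines of Python. This is only a sketch of the logic as described in this article, not Intel's actual dispatch code: the `Cpu` record, the `can_run` function, and the feature sets given to the sample processors are our own illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    vendor: str          # CPUID vendor string
    family: int          # what the article calls the "model" (6 or 15)
    features: frozenset  # supported extended instruction sets


def can_run(cpu, required, check_vendor=True, allowed_families=(6, 15)):
    """Model of the three-step check; check_vendor=False mimics the patch."""
    if check_vendor and cpu.vendor != "GenuineIntel":    # Check #1
        return False
    if cpu.family not in allowed_families:               # Check #2
        return False
    return frozenset(required) <= cpu.features           # Check #3


# Hypothetical sample processors for illustration.
prescott = Cpu("GenuineIntel", 15, frozenset({"SSE", "SSE2", "SSE3"}))
k8_rev_e = Cpu("AuthenticAMD", 15, frozenset({"SSE", "SSE2", "SSE3"}))
k7 = Cpu("AuthenticAMD", 6, frozenset({"SSE"}))

for cpu in (prescott, k8_rev_e, k7):
    original = can_run(cpu, {"SSE3"})
    patched = can_run(cpu, {"SSE3"}, check_vendor=False)
    print(cpu.vendor, cpu.family, "original:", original, "patched:", patched)
```

With Check #1 removed, the outcome depends only on the family and feature checks, so the K7 sample is still rejected for SSE3 code while the K8 revision E sample is accepted.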
Binary files can be corrected manually, but we have written a small utility for this, ICC Patcher (you can download it here). It scans a binary file for suspicious GenuineIntel validations and replaces them with NOPs. The utility can patch not only compiled executables but also the libraries of Intel C++/Fortran Compiler itself, including those for EM64T. In the latter case, compiled applications will always run on processors from both Intel and AMD. Note that this patching is not "crude": for example, code compiled with the -QxP key will run only on AMD Athlon 64/Opteron processors of core revision E, and on earlier core revisions and AMD K7 processors it will display a warning that it cannot be executed.
Let's proceed to the test results. To save time, we decided not to recompile all test sources with the "corrected" Intel libraries, but to patch the existing binaries. For this reason we set the check_md5=0 option in the config files of the tests, because patching the executables changes their checksums.
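To give an idea of what such a scan looks for, here is a minimal Python sketch. It is not the ICC Patcher's actual logic, which this article does not describe in detail. It relies on one known fact: CPUID returns the vendor string in three 32-bit registers ("Genu", "ineI", "ntel"), so a vendor comparison in machine code often contains these four-byte chunks as immediates. The sketch only reports candidate offsets and modifies nothing; the function names and the synthetic sample bytes are ours, and real binaries can produce false positives.

```python
VENDOR = b"GenuineIntel"
# The three 4-byte chunks as they would appear in register-sized compares.
VENDOR_CHUNKS = [VENDOR[i:i + 4] for i in range(0, len(VENDOR), 4)]


def find_all(data, needle):
    """Return every offset at which `needle` occurs in `data`."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def vendor_check_candidates(data):
    """Map each vendor-string chunk to the offsets where it was found."""
    return {chunk.decode("ascii"): find_all(data, chunk) for chunk in VENDOR_CHUNKS}


# Synthetic sample (not taken from a real binary): a NOP, then
# "cmp ebx, imm32" and "cmp edx, imm32", each followed by a short jump.
sample = (
    b"\x90"
    + b"\x81\xFB" + b"Genu" + b"\x75\x10"
    + b"\x81\xFA" + b"ineI" + b"\x75\x08"
)

for chunk, offsets in vendor_check_candidates(sample).items():
    print(chunk, offsets)
```

A real tool would still have to decode the instructions around each hit to decide whether it really belongs to a vendor check before changing anything.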
Non-optimized code: in integer tests, 197.parser is noticeably slower on this processor as well (27.3%, the same result as on Pentium M). The same applies to 300.twolf (13.3%), which is compensated to some extent by the lead in the 252.eon (9.3%) and 254.gap (10.2%) tasks. The total score in SPECint_base2000 is approximately 3% lower than with the previous compiler version, which again resembles the Pentium M results. Floating point test results are again close to those of the previous version, and again because of a self-compensating spread in results rather than identical performance in individual sub-tests. As a result, the total score in SPECfp_base2000 is just 0.6% lower than for code compiled with ICC/IFC 8.1.
Optimized variants of the integer tests do not noticeably change the picture we saw on the other processors. Namely, the noticeable lag of 197.parser (27-28%) remains, while no sub-test pulls ahead at all (the one exception is the 253.perlbmk task compiled with -QxP, which gains 5.3%). The 197.parser lag is responsible for the 3% drop in the total SPECint_base2000 score in all cases. As for absolute performance, it increases in the order -QxK < -QxW < -QxN < -QxP < -QxB. That is, the best optimization (not by much, and only in some tests and in the total score) is the one for the Banias core. This result is hardly surprising, considering that the AMD K8 architecture is similar to Intel Pentium III/Pentium M rather than to Pentium 4 (NetBurst).
Let's proceed to the optimized SPECfp code. Here the Athlon 64 FX-57 gains performance with the new compiler version in every variant. The size of the gain varies with the optimization type, and so does the way it is obtained. For example, the SSE variant (-QxK) shows a noticeable 8.7% drop in 171.swim (note that the Pentium 4 gained in this task), while 172.mgrid gains 10.6% and 187.facerec gains 6.8%; the resulting gain in the total SPECfp_base2000 score is 1.1%. In the old SSE2 variant for the Willamette core (-QxW, which can run on AMD K8 even without patching), a clear lead is retained only in 187.facerec (6.7%), and the overall advantage is just 0.6%. The newer SSE2 variant for the Northwood core (-QxN) also shows a small increase in SPECfp_base2000 (0.7%), but the spread in individual sub-tests is noticeable (-15.1%(!) in 168.wupwise, +6.6% in 172.mgrid, and +19.8% in 178.galgel). Finally, the best optimization, SSE3 (Prescott core, -QxP), shows almost no performance drops (we should just mention the 3% drop in 301.apsi) and considerable gains in a number of tasks (172.mgrid by 10.4%, 178.galgel by 17.8%, 179.art by 8.3%, and 183.equake by 6.5%). As a result, the total score in SPECfp_base2000 is 3% higher than with the previous version. As for code efficiency, we have already noted that it is the highest with SSE3. Next comes SSE2 for the Banias core (-QxB), which again agrees with our idea of the AMD K8 architecture, followed by -QxN, -QxW, and -QxK.
The new Intel C++/Fortran Compiler 9.0 presents a mixed picture in "typical" code compilation (that is, compilation with profiling). In general, the resulting integer code is slightly slower (by 3-5%) than code compiled with the previous Version 8.1. A significant performance drop occurs in only one task, but it is a heavy one: from 27% to 34%, depending on the processor. You will be lucky if your code does not resemble this task :).
Nevertheless, the new compilers have a number of advantages over the previous version in floating point calculations (where SSE, SSE2, and SSE3 instructions are used), although they are quite small (from 0 to 3%). The choice of optimization key for a given processor microarchitecture remains straightforward: -QxP for Pentium 4/Prescott and -QxB for Pentium M/Dothan, while for AMD K8 processors we can recommend experimenting with -QxB and -QxP.
A few more words on AMD processors. According to our research, both ICC/IFC versions (8.1 and 9.0) produce code that demonstrates very good (in some cases even the best) performance on AMD processors... provided that we "patch" it :), or "patch" the compiler libraries. It would be great if Intel replaced the current processor type check with a wiser one, similar to what we have used.
This modification would benefit end users in the first place. Even if a software developer uses "automatic" optimizations like -Qax*, the most optimized code path would then be chosen for execution depending only on the availability of the necessary extended instruction sets, not on the CPU manufacturer. Note that one of AMD's complaints against Intel is that AMD processors may be much slower than competing processors when executing such "automatic" code, even though they have the necessary extensions.
It would be no less beneficial to software developers and testers: there would be no need to use different compilers for different processors or to develop applications for processors of a particular manufacturer.
And of course, AMD itself would profit a lot: there would be no need to develop its own compiler, something that has been pending for a long time already :).
Dmitri Besedin (firstname.lastname@example.org)
September 5, 2005.