
SPEC CPU2000. Part 14.
Intel's C, C++, and Fortran Compilers Version 8.0 in SPEC CPU2000 Tests for Intel CPUs



Early in 2003 we started using version 7.0 of Intel's compilers for our SPEC CPU2000 tests and then gradually moved to 7.1. It was not until the end of last year that Intel came up with a new version (8.0) of its compilers. One of the reasons for our rapid transition to 8.0 was the announcement of CPUs on the new Prescott core early in 2004: apart from significant architectural modifications, the new core also brings the new SSE3 instructions.

This article gives an account of the performance changes brought by the new compiler version on Intel CPUs, including those based on the Northwood, Banias, and Prescott cores.

The following compiler versions are used in the tests:

  • Intel(R) C++ Compiler for 32-bit applications, Version 8.0   Build 20040125Z Package ID: W_CC_PC_8.0.041_PE044.1
  • Intel(R) Fortran Compiler for 32-bit applications, Version 8.0   Build 20040125Z Package ID: w_fc_pc_8.0.040

In addition, we continued our push for universality and went as far as using completely identical switches for all the compilers. They now look like this:

PASS1_CFLAGS=   -QxN -Qipo -O3 -Qprof_gen
PASS2_CFLAGS=   -QxN -Qipo -O3 -Qprof_use

Formally, a simpler variant can also be used:

PASS1_CFLAGS=   -Qprof_gen
PASS2_CFLAGS=   -QxN -Qipo -O3 -Qprof_use

since -Qipo doesn't work together with -Qprof_gen, and using -QxN and -O3 on the first pass can only slow down code generation without improving the performance of the compiled file. Indeed, a comparison of the two variants showed no performance difference between them. However, the first variant is the one other companies use when publishing results on www.spec.org, and there is no need to invent our own way here. Still, one would think they'd have checked whether the combination makes sense :).

As for optimisation for various architectures, this version adds new variants: the -Qx{W,N,B,P} switches are suggested for modern CPUs. Evidently, the letters are the initials of the core names. Notably, the first three variants run on the entire modern line of Intel Pentium CPUs, which makes it all the more interesting to see whether there is any difference in performance.
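
For illustration, here is how these switches would typically appear on the command line (a sketch: icl is the command-line driver of the Intel C++ compiler on Windows, and prog.c is an arbitrary source file):

icl -QxW -Qipo -O3 prog.c     (generic SSE2 code for the whole modern Pentium line)
icl -QxN -Qipo -O3 prog.c     (code tuned for the Northwood Pentium 4)
icl -QxB -Qipo -O3 prog.c     (code tuned for the Banias Pentium M)
icl -QxP -Qipo -O3 prog.c     (code tuned for Prescott; may use SSE3)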

We will start with the test results of the new compiler and then proceed to the peculiarities of new optimisation variants.

Intel Pentium 4 2.4

On this CPU we will compare the performance of the new compiler's code with that of the previous version, as well as across the different optimisation variants.

The system we have chosen for testing is not the fastest one, but it is quite up to the mark: a board on the SiS 645DX chipset with DDR333 memory.

As usual, we will start with integer arithmetic.

In all the subtests except 197.parser the new compiler is several percent faster than its predecessor (for the variant with the -QxW switch, and leaving aside 252.eon, where the increase exceeded 35 percent, probably due to a change of switches). The integral reading grew by more than 5 percent. The picture gets more complicated if we compare the different 8.0 optimisations. Only two subtests show a visible difference in performance: 176.gcc fell by 7 percent with -QxN, while 181.mcf rose by 16 percent. The integral estimate increased by a mere 0.72 percent. So in general the CPU in question gains virtually nothing from the new optimisation. Given that 181.mcf has earlier shown itself to be "hungry for memory speed", the new optimisation may be exploiting the peculiarities of cache organisation and memory access.

As for the Banias optimisation variant, it works perfectly on a standard Pentium 4 and proves to be more effective in some subtests than the original -QxN: 176.gcc +6.27%, 181.mcf +4.69%, and 197.parser +6.22%.

The real arithmetic task set gives mixed results: the old compiler wins 3-4 percent on subtests 191.fma3d, 301.apsi and loses 3-7 percent on 171.swim, 177.mesa, 178.galgel, and 200.sixtrack. Thus, according to the integral reading, the new compiler only has a less-than-one-percent advantage.

Comparing the -QxN and -QxW optimisations reveals an insignificant difference of -0.17%..+0.22%. Using -QxB, however, clearly makes things worse: there is no test where this variant beats the original ones (the decrease reaches 12 percent in some subtests and about 3 percent in the integral reading).

Intel Pentium 4 3.06

This CPU will serve to verify the Pentium 4 2.4 data and to estimate the new compiler's effectiveness with the HyperThreading technology. The computer used for testing has an i875 chipset and dual-channel DDR333 memory.

As far as compiler comparisons go, the general picture is identical to the Pentium 4 2.4 results. The SPECint_base2000 integral reading rose by about 5 percent. In real arithmetic there are some minor differences: the gain in 171.swim was 7 percent against 2.7 percent on Pentium 4 2.4, and 187.facerec slowed down by 9-10 percent against the one-percent loss on Pentium 4 2.4. But in SPECfp_base2000 the difference is still less than one percent.

The comparison of the -QxN and -QxW optimisation variants is virtually identical to the Pentium 4 2.4 results. The only exceptions are 181.mcf, which gained 14 percent with -QxN, and 176.gcc, which lost 3 percent.

In order to estimate the HyperThreading effect, we compared the changes in rate-mode readings with the parameters -users 1 and -users 2. The diagrams show the percentage change on two virtual CPUs. The last time we tested HT in SPEC CPU2000 was in late 2002, when we used compiler version 5.0.1 and a single-channel memory system, so the results may differ considerably. We'd also note that it is not entirely correct to use SPEC CPU2000 for measuring HT performance, as the test runs two identical tasks simultaneously, while HT is mainly aimed at programs of different types.

The first diagram shows that compiler 8.0 beats 7.1 in some tasks and loses in others, so no definite conclusion can be drawn about qualitative changes in the new compiler.

The situation is similar with the second test set. Noteworthy are the improvements in 171.swim, 183.equake, and 301.apsi, and the worse readings in 177.mesa.

In general, none of the tests show qualitative changes that could be attributed to the new compiler.

Intel Pentium M

The 8.0 compilers are the first to enable code optimisation for the Banias core. In our tests we used a 1.5 GHz CPU installed in a Versia laptop. Recall that Centrino uses an i855 chipset and single-channel DDR266 memory. As we have already seen, the -QxB switch is even harmful for standard Pentium 4 CPUs. Now let's see whether it gives anything to its native core.

First of all, on this core compiler 8.0 is faster than its predecessor with all optimisation variants. Also worthy of attention is the significant growth in 181.mcf (almost 30 percent). -QxB always proves to be the best choice for this CPU, though the difference is usually small. The integral reading has risen by more than 8 percent in version 8.0, and the native -QxB switch is about 4 percent better than the universal -QxW.

The real arithmetic task set shows a substantial increase in several subtests, such as 168.wupwise, 171.swim, and 200.sixtrack, whereas Pentium 4 had no such changes in these tasks. 187.facerec has once again shown a performance drop compared to the previous compiler version. Still, the integral reading rose by more than 4 percent this time, which is the greatest SPECfp_base2000 increase among all the CPUs we tested.

Intel Pentium 4 3.2E (Prescott)

Intel presented its Pentium 4 CPU on the new core not long ago, and the first performance tests gave a complicated picture: it turned out to be slower than its predecessor in many tests (at the same frequency). We'll publish their SPEC CPU2000 comparison in the next article; right now let's see how the latest compiler version can be useful for the new core. With this information we'll have fewer problems analysing further SPEC CPU2000 results on it. The PC configuration: an i875 chipset with dual-channel DDR400 memory.

Here we can see that the new compiler is better than the previous one in all optimisation variants. All variants have almost the same performance except 181.mcf which has a worse reading with -QxW. According to the integral estimate, the new compiler has an advantage of around 8 percent. It shows a considerable increase in 181.mcf, 252.eon, and 255.vortex.

The CFP2000 tests are marked by a 25-percent advantage of the -QxP-optimised version in 168.wupwise, very likely the result of using SSE3. Other noteworthy subtests are 171.swim (+15%), 178.galgel (+16%), and 172.mgrid (-9%). Performance changes in these tasks differ greatly from those in the same tasks on Pentium 4 3.06 (which showed 171.swim +7%, 178.galgel +9%, 172.mgrid +0.19%). Interestingly, these changes appear with all optimisation variants and thus can't be caused by SSE3. The SPECfp_base2000 integral reading rose by only 2.8 percent on this CPU.

Now we will try to construct HT effect diagrams analogous to those we made for Pentium 4 3.06.

It is also interesting that the nature of the changes has altered greatly. 164.gzip shows a significant (more than 30 percent) increase from HT, while Pentium 4 3.06 gave less than 20 percent (and even less with the new compiler, about 10 percent). 176.gcc and 255.vortex now fall into the "HT-unaffected" class. 181.mcf, on the contrary, now runs into difficulties and shows -10% (Pentium 4 3.06: +4%). 186.crafty and 252.eon have suffered serious losses, too. Thus, it can be concluded that HyperThreading behaviour has changed greatly on the new core.

The same goes for the CFP2000 tests, except that the readings in 168.wupwise and 179.art have worsened, which leads to the disappearance of the HT effect in the integral reading.

Features of the optimisations implemented in Intel compilers

Since SPEC CPU2000 is, among other things, a compiler test, code optimisation capabilities are of great interest here, as they help explain test results on different CPUs.

Code optimisation in Intel compilers can be divided into two parts: user code optimisation and optimised libraries.

To achieve faster execution of user code, various methods are used, both independent of the CPU architecture and closely tied to it. The former include general optimisation for speed (-O3), interprocedural optimisation (-Qipo), and profiling (-Qprof_gen/-Qprof_use). The latter, in fact, comprises just one switch, -Q[a]x{K,W,N,B,P} (in version 8.0), whose basic function is to vectorise the code using SIMD instructions. Note that this version no longer supports optimisation for Pentium MMX/Pentium II. There are other optimisation switches as well, but they are used far less often and are less effective.
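
To illustrate what the vectoriser targets, here is a minimal sketch (our own example, not code from SPEC): a loop over float arrays of the kind the compiler can turn into packed SSE instructions when one of the -Qx switches is given.

/* a loop of this shape is a typical candidate for automatic
   vectorisation with -QxW/-QxN/-QxB/-QxP */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    /* with a -Qx switch the compiler can process four floats per
       iteration using packed SSE instructions (MOVAPS/MULPS/ADDPS)
       instead of scalar x87 code */
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}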

There are no problems concerning the first group as it works with all CPUs. But the second one is of great interest.

Further comparisons will be made on the tasks from the SPEC CPU2000 suite, so some of the statements below should be read with the implicit ending "on SPEC CPU2000 tasks". Of course, other source code may give different results, but the fact that the suite contains many types of tasks lets us hope we are not far from the truth. It was not our aim to investigate the workings of Intel compilers thoroughly, but the research we have done can contribute to a better interpretation of performance test results.

One should also keep in mind that the tests were conducted with exactly the compiler builds mentioned at the beginning. While the tests were in progress, the builds changed a couple of times, and by the time the article appeared new updates had been released :). We do try to use the latest versions for our tests, but it's not always easy to keep up with Intel :).

First of all, let us see how the compiler identifies the CPU type and where that information is used. For this purpose a special function is invoked at the beginning of the program, or, more precisely, at any point of the program where the type is still unknown when a check is needed. This function sets an internal flag depending on the CPU type. Version 7.1 was the first where the CPU manufacturer is checked: if it is not Intel, the CPU's capabilities are not examined at all and it is treated as a generic CPU.

All three switches use the new CPU-type check at run time, and if a critical mismatch occurs (for instance, code built for Prescott is run on a Northwood), the user sees the following message:

Fatal Error : This program was not built to run on the processor in your system.

Besides, thanks to this new procedure, running code built with the new switches on non-Intel CPUs ends with the same message. Previously, a CPU's support for SSE2 was detected regardless of the manufacturer, and if an SSE2-supporting program was run on a CPU without this instruction set, the CPU raised an execution error; it was the programmer's job to choose a proper combination.
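
For illustration, here is a rough sketch of what such a check involves (our own code, not Intel's actual start-up routine; it assumes a compiler that provides the __cpuid intrinsic in <intrin.h>): the vendor string and the SSE2/SSE3 feature bits are read via the CPUID instruction.

#include <intrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int regs[4];                  /* EAX, EBX, ECX, EDX */
    char vendor[13];
    int sse2, sse3;

    __cpuid(regs, 0);             /* leaf 0: vendor string in EBX, EDX, ECX */
    memcpy(vendor + 0, &regs[1], 4);
    memcpy(vendor + 4, &regs[3], 4);
    memcpy(vendor + 8, &regs[2], 4);
    vendor[12] = '\0';

    __cpuid(regs, 1);             /* leaf 1: feature flags */
    sse2 = (regs[3] >> 26) & 1;   /* EDX bit 26 */
    sse3 = regs[2] & 1;           /* ECX bit 0 (PNI) */

    printf("vendor: %s, SSE2: %d, SSE3: %d, Intel: %d\n",
           vendor, sse2, sse3, strcmp(vendor, "GenuineIntel") == 0);
    return 0;
}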

The code initialisation procedure for -QxW doesn't respond to the flag with a fatal error, so we can easily get code that uses SSE2 and still works on non-Intel CPUs. The catch is that with the -QxN switch the compiler generates different code even for a Pentium 4 CPU, and that performance difference can be crucial in some tasks. It is significant that for both -QxW and -QxN, optimised library functions (e.g. sine computation using SSE2 code) are used in calculation tasks without checking the CPU type. Although the libraries do contain a universal variant of each procedure that chooses the code path according to the CPU type, it is not used with the -Qax* optimisations; at least, we did not manage to invoke the universal procedures from the code of the SPEC CPU2000 tasks. Instead, the compiler created two variants of the whole program code with a dispatcher at the beginning. That is understandable: SIMD optimisation can reshape the code substantially, so it is better to have two versions of the whole program. The peculiarity is that if a procedure is "simple", i.e. can be executed by a single instruction on the FPU (like that sine), things go as described above. But for a library function such as the inverse trigonometric functions, the -QxW/N/B switches insert a direct call to an SSE2-optimised function, while non-SSE2 code (built without -Qx* switches, or the alternative variant for -Qax*) calls a library function with a built-in dispatcher by CPU type. So it can happen that programs not meant to use SSE2 (judging by the compilation switches) will nevertheless use these instructions on CPUs that support them (only on Intel CPUs for compilers 7.1/8.0 :) ).
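
As an illustration of the dispatching scheme described above, here is a hand-written sketch (our own names and simplifications, not the compiler's actual internals): one public entry point whose implementation is chosen once, at the first call, according to the CPU check.

#include <math.h>

extern int cpu_supports_sse2(void);          /* e.g. the CPUID check shown earlier */

static double sin_generic(double x) { return sin(x); }  /* plain x87 path */
static double sin_sse2(double x)    { return sin(x); }  /* stand-in for an
                                                            SSE2-optimised body */

static double (*sin_impl)(double);           /* chosen implementation */

static double sin_select(double x)
{
    /* first call: pick the variant once, then all later calls go directly */
    sin_impl = cpu_supports_sse2() ? sin_sse2 : sin_generic;
    return sin_impl(x);
}

/* the entry point the rest of the program calls */
double my_sin(double x)
{
    return (sin_impl ? sin_impl : sin_select)(x);
}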

There are also functions for which the compiler doesn't immediately insert a call to an optimised variant into the code, so the CPU is checked right inside those functions. They include, for example, routines that work with memory blocks (clearing, initialisation, copying). These invoke a new subprogram designed to fine-tune the memory-block routines: for instance, it identifies the CPU cache parameters, which makes it possible to use code that is optimal in each case. This is also where the CPU manufacturer is checked once again. However, the check is perfectly appropriate here, as Intel is the only leading vendor that provides proper information about CPU caches, and it is this information that the parameters of the above-mentioned functions are based on.
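
A simplified sketch of the idea (our own illustration, not the actual library code; the threshold value is an assumption that would normally be derived from the CPUID cache information): for blocks that are large relative to the cache, non-temporal stores keep the copy from evicting useful data.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

#define CACHE_THRESHOLD (256 * 1024)  /* assumed: about half of a 512 KB L2 */

void fast_copy(void *dst, const void *src, size_t n)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t i, chunks = n / 16;

    if (n < CACHE_THRESHOLD || (((size_t)dst | (size_t)src) & 15) != 0) {
        memcpy(dst, src, n);                 /* small or unaligned: plain copy */
        return;
    }
    for (i = 0; i < chunks; i++) {
        __m128i v = _mm_load_si128(s + i);   /* MOVDQA: aligned load          */
        _mm_stream_si128(d + i, v);          /* MOVNTDQ: non-temporal store   */
    }
    _mm_sfence();                            /* order the streamed stores     */
    memcpy((char *)dst + chunks * 16,        /* copy the tail, if any         */
           (const char *)src + chunks * 16, n % 16);
}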

Since today's Intel Pentium 4 CPUs can run code built with any of the -Qx{W,N,B} switches, and since we have seen a significant difference in execution speed, the user code must be compiled differently in each case.

A detailed comparison of the -QxW and -QxN variants on the SPEC CPU2000 subtests has shown that the assembler code generated by the new compiler is almost the same: for instance, it is absolutely identical in 164.gzip and 172.mgrid (except for the initialisation function call). Still, the differences that do exist make the 181.mcf reading rise dramatically.

A comparison of the assembler listings produced with -QxN, -QxB, and -QxP, however, reveals that the compiler adjusts more closely to each particular core: it changes the instruction order, allocates data to registers differently, and chooses different patterns for some function implementations.

Now let's take a look at how the new SSE3 instructions are used for the Pentium 4 Prescott CPU. An examination of the SPEC CPU2000 tests has shown that these instructions really are used in the user code, though not all 13 of them:

Subtest         SSE3 instructions found
252.eon         HADDPD
300.twolf       MOVDDUP, HADDPD

168.wupwise     MOVDDUP, ADDSUBPD
171.swim        MOVDDUP
172.mgrid       MOVDDUP, HADDPD
173.applu       MOVDDUP, HADDPD
178.galgel      MOVDDUP, ADDSUBPD, HADDPD
183.equake      MOVDDUP, HADDPD
187.facerec     MOVSLDUP, MOVSHDUP, ADDSUBPS, HADDPS, HSUBPS, HADDPD
188.ammp        MOVDDUP, HADDPD
191.fma3d       MOVDDUP, HADDPD
200.sixtrack    MOVDDUP, HADDPD
301.apsi        MOVDDUP, ADDSUBPD, HADDPD

No SSE3 instructions were found in the remaining subtests: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2 (CINT2000) and 177.mesa, 179.art, 189.lucas (CFP2000).

It should be noted that the data above potentially depend on the choice of optimisation switches and may change if different switches are used. The frequency of the instructions also varies greatly: MOVSLDUP occurs only once, in 187.facerec, while MOVDDUP is much more frequent in 168.wupwise. As for the code of the library functions, SSE3 instructions are apparently not used there yet (at least not in those invoked by the SPEC CPU2000 tests at the moment).
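
For reference, here is the kind of operation these instructions were designed for: a sketch of our own (not SPEC code) that multiplies two pairs of single-precision complex numbers using the SSE3 intrinsics from <pmmintrin.h>; MOVSLDUP, MOVSHDUP, and ADDSUBPS map directly onto the steps of a complex product. It would be compiled with an SSE3-enabled switch such as -QxP and run on a Prescott-class CPU.

#include <pmmintrin.h>   /* SSE3 intrinsics */

/* a and b each hold two complex numbers laid out as (re0, im0, re1, im1) */
__m128 complex_mul(__m128 a, __m128 b)
{
    __m128 b_re = _mm_moveldup_ps(b);   /* MOVSLDUP: (br0, br0, br1, br1) */
    __m128 b_im = _mm_movehdup_ps(b);   /* MOVSHDUP: (bi0, bi0, bi1, bi1) */
    __m128 t1   = _mm_mul_ps(a, b_re);  /* (ar*br, ai*br, ...)            */
    __m128 a_sw = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1)); /* swap re/im */
    __m128 t2   = _mm_mul_ps(a_sw, b_im);                        /* (ai*bi, ar*bi, ...) */
    return _mm_addsub_ps(t1, t2);       /* ADDSUBPS: (ar*br-ai*bi, ai*br+ar*bi, ...) */
}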

Conclusions

The version 8.0 compilers show a performance increase in the SPEC CPU2000 tests compared to the previous version, though the gain in SPECint_base2000 is only around 5 percent. More interesting are the optimisations for the Pentium M Banias and, of course, the Pentium 4 Prescott CPUs. The former gains 10-15 percent in several subtests when the right switch is used (this also applies to real arithmetic tasks). The appearance of a new option for Pentium 4 CPUs (-QxN) looks a bit strange, though, as that core has practically "left the stage". Besides, it produces no visible effect: only two CINT2000 subtests showed a clear-cut response to it.

Intel tries not to leave its CPUs without software support, so the appearance of a Prescott optimisation (-QxP) is quite natural. Notably, this optimisation uses the new SSE3 instructions for the user code, but they only have a substantial effect in 168.wupwise (+25 percent); in the other tasks performance virtually coincides with the -QxN result.

As for the introduction of the CPU manufacturer check, it is not so bad in our opinion: in any case, there is always the option of using the -QxW switch, which enables SSE/SSE2 support. Besides, it is odd to blame Intel for tuning its compilers to its own CPUs, considering that CPUs differ not only in features and instruction sets, but also in internal architecture (branch prediction, prefetch, caches, etc.). The company has made a sensible move that can contribute to higher code performance.

By the way, other CPU manufacturers could also come up with their own compilers, or at least cooperate with Intel so that their products are supported in Intel's compilers :).

For testers, the present situation means one thing: from now on, CPUs made by different companies will be tested on very different code. Which is not very good, in general. But then again, synthetics remain synthetics, no matter what.


Kirill Kochetkov kochet@ixbt.com,

22.04.2004

