Not long ago we reviewed the new Intel C++/Fortran 9.0 compilers in the SPEC CPU2000 tests. Back then we analyzed the performance of only the 32-bit (x86) code generated by these compilers. On the one hand, because it is the most interesting case: 32-bit applications currently make up the larger part of software. On the other hand, because... for a long time we simply could not get any results for EM64T code (seemingly native for Intel) on the Pentium 4 670 processor.
There are only two participants today: Pentium M 770 leaves the battlefield for an obvious reason, as it does not support 64-bit extensions even in theory. For AMD Athlon 64 FX-57 we again had to modify the code, this time the 64-bit code; our ICC Patcher utility handles it just as well. In the case of Pentium 4 670 we had to... make a different kind of modification. The fact is, this processor, running under Windows XP x64 Edition, refused to execute SPEC CPU2000 tasks compiled for EM64T: the system would reboot within a few minutes, regardless of the task and optimizations. At first we had no idea what caused this incompatibility between an Intel processor and native EM64T code. Suspecting overheating (to exclude the negative effect of throttling on the results, we had to disable Thermal Monitor 1 and 2, contrary to Intel's guidelines), we reduced the CPU clock to the minimum (2.8 GHz). At that clock the processor aced all our tests, but in the end the reason turned out to be not overheating but our defective engineering sample of Pentium 4 670: it simply could not run at the nominal 3.8 GHz, specifically in 64-bit mode. At 3.6 GHz, however, the processor executed EM64T code perfectly, so it takes part in today's tests as a Pentium 4 660.
To make the comparison valid, we used the same compiler versions as in the previous analysis of 32-bit code performance (though newer versions have appeared since then):
As usual, we used the same general optimization flags for compiling the code in all cases (32/64-bit code with various optimizations):
PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
As for specific code optimizations, there are only three of them, since the EM64T version of the compiler does not support the other options:
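For reference, the two-pass scheme can be sketched in SPEC-config form as follows. The PASS1_CFLAGS line is from our actual configuration; the PASS2 line and the per-variant note are our reconstruction of the standard profile-guided build with the Intel 9.0 compilers, not a verbatim quote from our config file:

```
# Pass 1 instruments the code to collect an execution profile;
# pass 2 recompiles using that profile (reconstruction, see above).
PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
PASS2_CFLAGS= -Qipo -O3 -Qprof_use
# The three code-generation variants: no specific optimization,
# SSE2 (-QxW) and SSE3/Prescott (-QxP) appended to both passes.
```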
Before we proceed to the actual results, note that it is not quite correct to directly compare the performance of the first code variant ("without optimization") between the x86 and x86-64/EM64T platforms. The reason: in the first case (x86) this code uses x87 FPU instructions, while the same code compiled for EM64T uses SSE/SSE2 instructions, as the compiler does not generate native x87 FPU code for this platform.
Pentium 4 660
So, let's start with the test results of the Pentium 4 670 forced to run as a Pentium 4 660 (3.6 GHz core, TM1/TM2 disabled).
SPEC CPU2000 Integer Tests. As usual, we can pick out several groups of sub-tests. The first consists of tasks that are more or less indifferent to 64 bits, i.e. they neither gain nor lose performance versus 32-bit code (164.gzip, 175.vpr, 176.gcc). The second group includes sub-tests demonstrating an insignificant gain of about 5% (255.vortex and 256.bzip2). Higher performance gains, from 15% to 33%, are demonstrated by 186.crafty, 197.parser, and 252.eon. At the same time, SPECint2000 has a task demonstrating a comparable performance drop, up to 40% in the case of Prescott/SSE3 optimization: 181.mcf. And finally, an equally interesting group is made up of tasks that demonstrate either a gain or a loss depending on the optimization option: 253.perlbmk, 254.gap, and 300.twolf.
SPECint_base2000 grand total: in general, EM64T usage is justified, but the performance gain is rather small, from 1.6% to 3.8%.
Comparing the floating-point tests is much more interesting. First of all, the advantage of the 64-bit code "without optimization" is striking in almost all cases (except 171.swim). That is the effect of the situation mentioned above: the transition from the 32-bit to the 64-bit version is accompanied by switching from the 8 FPU registers of the inconvenient stack architecture to 16 flat SSE/SSE2 registers. In other words, the result is not a "performance gain from 64 bits in pure form". That is why it is correct to compare only the -QxW and -QxP optimization results, where flat SSE/SSE2 registers are used in both cases (8 and 16 respectively).
Quite a lot of sub-tests are practically indifferent to 64 bits (neither noticeable gains nor drops): 171.swim, 178.galgel, 188.ammp, 189.lucas, and 200.sixtrack. A small advantage in 64-bit mode is demonstrated by 187.facerec (5.7-6.1%), 191.fma3d (4.1-5.6%), and 301.apsi (5.0-5.4%). Performance in the other tasks depends heavily on the code optimization (SSE2 or SSE3). For example, 168.wupwise gains much only with SSE3 (12.5%), while its performance drops slightly with SSE2 code (-2.4%). On the contrary, the largest gain is demonstrated by 172.mgrid with SSE2 (27.6%). 177.mesa demonstrates a comparable gain in both cases (27.1% and 20.8% respectively). However, the most shocking behavior comes from 179.art. While the SSE3 gain is reasonable (34.7%), the SSE2 code acts in a fantastic manner, demonstrating a 140.5% performance increase for the 64-bit code. It cannot be a coincidence: Athlon 64 FX-57 also demonstrates a significant gain there. So we can only assume that the 32-bit Intel compiler generates poorly optimized code for this task with the SSE2 optimization. As for the last SPECfp task (183.equake), we managed to get results only for the SSE2 code (13.7% gain). We failed to compile 64-bit SSE3 code: the first-pass binary produced errors at the profiling stage regardless of the processor. As a result, we did not get the SPECfp_base2000 total score for -QxP. With -QxW, the average performance gain for 64-bit code is 13.7%. Interestingly, almost the same result (13.4%) is demonstrated by the non-optimized code, although, as already mentioned, a direct comparison of its 32- and 64-bit versions is not entirely valid.
Athlon 64 FX-57
Let's proceed to the second contender, Athlon 64 FX-57. It supports everything necessary for our tests (x86-64, SSE2, and SSE3).
The difference between 32-bit and 64-bit code in the integer tests is more pronounced here than on the Intel platform. First of all, there are practically no "mixed" results, where the same task demonstrates either a gain or a loss depending on the code optimization (the only exception is 164.gzip). There are also almost no indifferent results: for example, 175.vpr, 176.gcc, and 300.twolf demonstrate a large performance drop with the 64-bit code, while 255.vortex and 256.bzip2, on the contrary, gain much more from 64 bits than they did on the Pentium 4. Nevertheless, the maximum gains (even if with different absolute results and ratings) are still demonstrated by 186.crafty, 197.parser, and 252.eon, and the maximum drop by 181.mcf. Most interestingly, taken together it all yields practically zero overall difference by SPECint_base2000: from +0.3% to -0.6%.
The floating-point tests also demonstrate some behavioral differences between the 64-bit platforms from Intel and AMD. First of all, the non-optimized code on the AMD platform does not always look advantageous; that is, using the 8 stack-architecture FPU registers may sometimes be better than using 16 flat XMM registers. The problem is most likely not in the "registers" (their number and the processor's execution units) but in code-generation specifics: don't forget that the compiler under review is developed by Intel, not by AMD :).
Let's look closer at the -QxW and -QxP options. As we have already noted, their comparison is more valid, as it shows the pure result of the transition from 32 to 64 bits. So, 168.wupwise demonstrates a more even result, a 3.2-6.3% performance gain (it was more erratic on the Intel platform). 171.swim and 200.sixtrack do not get anything from 64 bits either. 172.mgrid demonstrates a smaller gain (3.8-10.8%), as does 177.mesa (4.8-9.0%). On the contrary, 173.applu on the AMD platform behaves more erratically: it gains 23.8% with SSE2 code, but its performance drops by 1.9% when the code is optimized for SSE3. 178.galgel demonstrates a larger gain with the SSE2 optimization, but is almost no different from the 32-bit version with SSE3. As mentioned above, 179.art with SSE2 code also enjoys a fantastic gain of 81.7%; its gain is much lower (16.0%) with SSE3 code. 183.equake (we managed to compile runnable code only for SSE2) demonstrates a 3.6% performance drop on the AMD platform. Nevertheless, some tasks get a much higher gain from the transition to 64-bit code precisely on the AMD platform: 188.ammp, 189.lucas, and 191.fma3d. The two remaining tasks, 187.facerec and 301.apsi, demonstrate similar gains on both the Intel and AMD platforms.
And finally, the overall SPECfp_base2000 results. For the non-optimized code, the gain from 64 bits (even if not in its pure form) is just 5.3%; there is nothing surprising about it, as this category includes many tests that got negative results in 64-bit mode. For the SSE2-optimized code (-QxW) the gain is 11.6%, a tad lower than on the Intel platform (13.7%).
Here are the main conclusions we can draw from the results.
1. Converting integer code to 64-bit mode may lead to either performance gains or losses. As always, it all depends on the task: sometimes we get up to 30% gain, sometimes up to 40% drop. Nevertheless, the average result is a win rather than a loss.
2. In most cases, floating-point tasks benefit from the transition from 32-bit (x86) to 64-bit (x86-64/EM64T) code. Even when there is a performance drop, it is small, not exceeding 5%.
3. In general, 64-bit AMD processors demonstrate a smaller gain from the 64-bit code generated by the EM64T compiler than Intel's 64-bit processors do. Of course, this in no way speaks to the efficiency of either "64 bit" implementation. It just means that the EM64T version of the Intel compilers generates code that is better optimized for the P4 than for the K8. There is nothing surprising about that; the opposite would have been strange :).
Dmitri Besedin (email@example.com)
October 26, 2005.