SPEC CPU2000. Part 15. AMD64 and the 64-bit Code. Second Try.

AMD64 CPUs have been under test for a long time now, and each of our articles on them draws many responses reproaching us for an unfair attitude to the 64-bit mode of these models.

Indeed, we have nothing to offer here except a couple of synthetic tests and benchmarks. Even our search through AMD's document listing 64-bit software was to no avail. Most benchmarks turned out to be at the development stage, and in other cases the authors themselves knew nothing of any 64-bit versions :). However, we did manage to find a couple of real applications, and we'll certainly try to make the most of them in future articles. As for today, we'll once again test compilers made for AMD64.

We have already tried to test AMD Opteron CPUs in a 64-bit OS with 64-bit compilers. It made no profound impression on us then, although performance in some of SPEC CPU2000 subtests was quite promising.

It has been seven months since that material appeared, and today we'll try to do it once again hoping that OSs and compilers have grown more mature.

Tests were carried out on the following platform:

AMD Athlon 64 FX-53 CPU
ASUS SK8V motherboard (VIA K8T800 chipset)
two Corsair PC3200 Registered ECC 512-MB DIMMs (timings: 2-3-2-5)

Linux

Under Linux, we used the SuSE 9.0 Pro and SuSE 9.0 Pro for AMD64 distributions. A standard workstation package set was installed, after which we updated the kernels and gcc compilers (using ready-made SuSE RPMs). These are the resulting versions:

i386 platform: kernel 2.4.21-209, compiler 3.3.2-26
x86-64 platform: kernel 2.4.21-199, compiler 3.3.2-29

The tests were run with the standard gcc/g77/g++ compilers as well as the Portland Group (PGI) compiler, version 5.1-3 (released January 14, 2004).

The gcc suite does not include a Fortran 90 compiler, so we cannot obtain an official SPECfp_base2000 result with it; only 10 of the 14 subtests are reported for gcc. With PGI, we obtained complete official results.

The following optimisation switches were used in the tests:

gcc/g77/g++: -O3 -funroll-all-loops +PGO (-fprofile-arcs/-fbranch-probabilities)
pgi: -fastsse -Mipa=fast (two compiler passes for using IPA)
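
The gcc line above implies a two-pass, profile-guided build: compile with instrumentation, run a training workload to collect arc counts, then recompile using those counts. The following is only a minimal Python sketch of that sequence; it assembles the command lines and does not run them. The file names (`bench.c`, `bench`) and the helper `pgo_commands` are hypothetical; only the flags are taken from the list above.

```python
# Sketch of the two-pass PGO build implied by the gcc switches above.
# Assumption: the file names and this helper are illustrative only,
# they are not taken from the actual test configuration.

BASE_FLAGS = ["-O3", "-funroll-all-loops"]


def pgo_commands(source="bench.c", exe="bench", compiler="gcc"):
    """Return the three steps of a profile-guided build as argument lists."""
    # Pass 1: build an instrumented binary that counts executed arcs.
    instrumented = [compiler, *BASE_FLAGS, "-fprofile-arcs", source, "-o", exe]
    # Training run: executing the instrumented binary produces the profile data.
    training_run = ["./" + exe]
    # Pass 2: rebuild, letting the compiler use the recorded branch probabilities.
    optimised = [compiler, *BASE_FLAGS, "-fbranch-probabilities", source, "-o", exe]
    return [instrumented, training_run, optimised]


if __name__ == "__main__":
    for step in pgo_commands():
        print(" ".join(step))
```

In a real SPEC run, the suite's configuration file drives both passes; the sketch only makes the order of the steps explicit.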

Because there are too many figures and we're mostly interested in the changes caused by the transition to 64 bits, we'll confine ourselves to tables.

We'll start with CINT2000, as usual.

	gcc 32	pgi 32	gcc 64	pgi 64	gcc, change, %	pgi, change, %
164.gzip	1027	766	1246	978	21.32	27.68
175.vpr	1086	1011	1150	1006	5.89	-0.49
176.gcc	1429	1306	1475	1248	3.22	-4.44
181.mcf	1079	1016	673	670	-37.63	-34.06
186.crafty	1327	1019	2011	1521	51.54	49.26
197.parser	1011	813	1105	786	9.30	-3.32
252.eon	1143	314	1963	387	71.74	23.25
253.perlbmk	1533	1163	1655	1183	7.96	1.72
254.gap	1150	955	1296	951	12.70	-0.42
255.vortex	1527	1479	1748	1662	14.47	12.37
256.bzip2	1081	940	1233	1011	14.06	7.55
300.twolf	1424	1303	1148	1074	-19.38	-17.57
SPECint_base2000	1221	950	1338	979	9.58	3.05

Well, the transition to 64 bits gives a rather ambiguous picture. The overall score has risen, but individual subtests range from a 38-percent loss to a 72-percent gain.

The situation with the gcc compiler is better than in the previous testing (gcc 3.3.1): only two subtests lose performance now (vs. four with gcc 3.3.1), and the overall score has risen by 9.6 percent (vs. 4.8 percent with gcc 3.3.1). But the new PGI version seems worse than the previous one: the overall score has increased by only 3 percent (vs. 11 percent in version 5.0-1), while the drops in 181.mcf and 300.twolf have become deeper.
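
For readers who want to check the arithmetic, here is a small Python sketch using the gcc columns of the table above. It recomputes the per-subtest change and the SPECint_base2000 figure, which SPEC defines as the geometric mean of the subtest ratios. The helper names are our own, and because the table values are already rounded, the recomputed totals match the published ones only to within about a point.

```python
from math import exp, log

# gcc columns of the CINT2000 table: subtest -> (32-bit score, 64-bit score)
GCC_CINT = {
    "164.gzip": (1027, 1246),
    "175.vpr": (1086, 1150),
    "176.gcc": (1429, 1475),
    "181.mcf": (1079, 673),
    "186.crafty": (1327, 2011),
    "197.parser": (1011, 1105),
    "252.eon": (1143, 1963),
    "253.perlbmk": (1533, 1655),
    "254.gap": (1150, 1296),
    "255.vortex": (1527, 1748),
    "256.bzip2": (1081, 1233),
    "300.twolf": (1424, 1148),
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


def geometric_mean(values):
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


changes = {name: percent_change(a, b) for name, (a, b) in GCC_CINT.items()}
overall_32 = geometric_mean(a for a, _ in GCC_CINT.values())
overall_64 = geometric_mean(b for _, b in GCC_CINT.values())
slower_in_64 = sorted(name for name, c in changes.items() if c < 0)

if __name__ == "__main__":
    for name, c in changes.items():
        print(f"{name:12s} {c:+7.2f} %")
    print(f"overall: {overall_32:.0f} -> {overall_64:.0f}")
    print("slower in 64-bit:", ", ".join(slower_in_64))
```

The geometric mean is why one deep drop such as 181.mcf does not drag the total down as much as an arithmetic average of the scores would suggest.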

However, it must be taken into account that the previous testing was carried out on Opteron 240 CPUs that have a 1.4 GHz frequency and DDR333 memory.

Now let's take a look at CFP2000.

	gcc 32	pgi 32	gcc 64	pgi 64	gcc, change, %	pgi, change, %
168.wupwise	1174	1519	1287	1725	9.63	13.56
171.swim	1411	1910	1437	1999	1.84	4.66
172.mgrid	798	1155	916	1357	14.79	17.49
173.applu	870	1150	947	1292	8.85	12.35
177.mesa	1199	1129	1749	1263	45.87	11.87
178.galgel		2077		2636		26.91
179.art	741	1344	1429	1272	92.85	-5.36
183.equake	1435	1174	1428	1262	-0.49	7.50
187.facerec		1416		2072		46.33
188.ammp	1117	932	1368	1267	22.47	35.94
189.lucas		1403		1415		0.86
191.fma3d		1333		1459		9.45
200.sixtrack	451	688	597	718	32.37	4.36
301.apsi	784	1129	984	1386	25.51	22.76
SPECfp_base2000		1266		1446		14.22

And again, 179.art stands out with the gcc compiler. The 64-bit mode all but doubles its result (although this is due to a bad result in 32 bits rather than a good one in 64 bits). Most other subtests gain as well; averaged over the ten subtests measured (179.art included), the increase is 25.4 percent. Also noteworthy are the better 171.swim readings: a 1.8-percent rise instead of an 8.6-percent fall. Thus, CFP2000, too, demonstrates a general performance increase of the 64-bit code for gcc.

With pgi, 179.art drops while the other tasks gain, just as last time. The overall score rises by 14.2 percent (vs. 12.5 percent in version 5.0-1).

We also managed to test the much-discussed PathScale EKO Compiler Suite, version 1.0, although only in 64-bit mode, since 32-bit code generation is still at the alpha stage. Nevertheless, the "-m32" switch is officially used for the peak results of some SPEC CPU2000 subtests. As for optimisation switches, we used the supplied configuration file, which is almost identical to the one used for the results published on SPEC's site. Note that the manufacturer was wise enough to install four DIMMs in its test system and to enable interleaving, which raises the results significantly (exactly what we showed last summer). Unfortunately, we have only two DIMMs now, so keep that headroom in mind :). For comparison, we'll take the results of the 64-bit gcc and pgi versions.
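
The 25.4-percent figure can be reproduced from the gcc columns of the CFP2000 table as a plain arithmetic mean of the per-subtest gains. The Python sketch below (helper names are ours) also shows how much of that average comes from 179.art alone.

```python
# gcc columns of the CFP2000 table: subtest -> (32-bit score, 64-bit score).
# The four Fortran 90 subtests are absent because gcc could not build them.
GCC_CFP = {
    "168.wupwise": (1174, 1287),
    "171.swim": (1411, 1437),
    "172.mgrid": (798, 916),
    "173.applu": (870, 947),
    "177.mesa": (1199, 1749),
    "179.art": (741, 1429),
    "183.equake": (1435, 1428),
    "188.ammp": (1117, 1368),
    "200.sixtrack": (451, 597),
    "301.apsi": (784, 984),
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


def mean_gain(table, skip=()):
    """Arithmetic mean of the per-subtest gains, optionally skipping outliers."""
    gains = [percent_change(a, b) for name, (a, b) in table.items() if name not in skip]
    return sum(gains) / len(gains)


mean_all = mean_gain(GCC_CFP)
mean_without_art = mean_gain(GCC_CFP, skip=("179.art",))

if __name__ == "__main__":
    print(f"mean gain, all ten subtests: {mean_all:.1f} %")
    print(f"mean gain, without 179.art:  {mean_without_art:.1f} %")
```

With 179.art left out, the average gain of the remaining nine subtests is noticeably lower, which is worth keeping in mind when reading the headline number.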

	psc 1.0	gcc 64	pgi 64
164.gzip	1376	1246	978
175.vpr	1084	1150	1006
176.gcc	1585	1475	1248
181.mcf	671	673	670
186.crafty	2026	2011	1521
197.parser	1047	1105	786
252.eon	1738	1963	387
253.perlbmk	1596	1655	1183
254.gap	1261	1296	951
255.vortex	2287	1748	1662
256.bzip2	1245	1233	1011
300.twolf	1204	1148	1074
SPECint_base2000	1361	1338	979

	psc 1.0	gcc 64	pgi 64
168.wupwise	1641	1287	1725
171.swim	2070	1437	1999
172.mgrid	1428	916	1357
173.applu	1369	947	1292
177.mesa	1777	1749	1263
178.galgel	2510		2636
179.art	1649	1429	1272
183.equake	1428	1428	1262
187.facerec	1606		2072
188.ammp	1356	1368	1267
189.lucas	1387		1415
191.fma3d	1343		1459
200.sixtrack	673	597	718
301.apsi	1434	984	1386
SPECfp_base2000	1493		1446

The results show that psc can compete with gcc in integer calculations and with pgi in floating-point arithmetic. So PathScale is clearly telling us that a first attempt does not have to be a flop. AMD64 can be said to have found solid support in this manufacturer.

Unfortunately, as soon as we were done with the PathScale part, we found out that a new compiler version (1.1) had just appeared (such things happen quite often :)), so we decided to delay the article for several days in order to include the new results (especially since the list of bug fixes is long and many of them concern SPEC CPU2000 tasks). We also used the new configuration file supplied with version 1.1. Apart from the bug fixes, this version moved 32-bit code support from the alpha to the beta stage. A test run in that mode showed that almost all SPEC CPU2000 tasks (except 178.galgel, which never finished) compiled and passed validation. On average, though, the results were 1.5-2 times lower than in 64 bits. Compared to version 1.0, the 64-bit results changed little: SPECint_base2000 increased by 2.4 percent, SPECfp_base2000 fell by 0.2 percent. Interestingly, the AMD ACML 2.0 math library was used for the peak run of 178.galgel. Obviously, this was the cause of its almost 5-percent increase.

We normally don't use peak readings in our tests. This is partly due to our conviction that tuning the settings of each subtest is the business of compiler and CPU manufacturers, while most users seldom do it. For example, could you guess that it is "-O3 -ipa -LNO:fusion=2:interchange=OFF:blocking=OFF:ou_prod_max=10:ou_max=5:prefetch=2 -OPT:IEEE_arith=1:ro=3:unroll_size=0 -TENV:X=4 -WOPT:mem_opnds=on:retype_expr=on:val=0" that will show the best result? :) And when it comes down to a subtle selection of multiple options, one can often get the best result on a user program by rewriting the code instead (e.g. based on a profiler's findings). Thus, peak readings in the SPEC CPU2000 synthetic tests serve to measure a compiler's "capabilities" rather than to compare CPU performance precisely. But this time around, we'll please AMD fans :) and include the PathScale product's peak readings in our table. And we'll compare them with Intel's fastest compiler for IA32, which ran under Windows XP.
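
One simple way to back up such a statement is a head-to-head tally per subtest. The Python sketch below uses the figures from the two tables above; the `tally` helper and the column constants are our own naming, and `None` marks the Fortran 90 subtests for which there is no gcc result.

```python
# Rows: subtest -> (psc 1.0, gcc 64, pgi 64), copied from the tables above.
CINT = {
    "164.gzip": (1376, 1246, 978), "175.vpr": (1084, 1150, 1006),
    "176.gcc": (1585, 1475, 1248), "181.mcf": (671, 673, 670),
    "186.crafty": (2026, 2011, 1521), "197.parser": (1047, 1105, 786),
    "252.eon": (1738, 1963, 387), "253.perlbmk": (1596, 1655, 1183),
    "254.gap": (1261, 1296, 951), "255.vortex": (2287, 1748, 1662),
    "256.bzip2": (1245, 1233, 1011), "300.twolf": (1204, 1148, 1074),
}
CFP = {
    "168.wupwise": (1641, 1287, 1725), "171.swim": (2070, 1437, 1999),
    "172.mgrid": (1428, 916, 1357), "173.applu": (1369, 947, 1292),
    "177.mesa": (1777, 1749, 1263), "178.galgel": (2510, None, 2636),
    "179.art": (1649, 1429, 1272), "183.equake": (1428, 1428, 1262),
    "187.facerec": (1606, None, 2072), "188.ammp": (1356, 1368, 1267),
    "189.lucas": (1387, None, 1415), "191.fma3d": (1343, None, 1459),
    "200.sixtrack": (673, 597, 718), "301.apsi": (1434, 984, 1386),
}
PSC, GCC, PGI = 0, 1, 2  # column indices


def tally(table, a, b):
    """Count subtests where column a wins, ties, or loses against column b."""
    wins = ties = losses = 0
    for row in table.values():
        x, y = row[a], row[b]
        if x is None or y is None:
            continue  # no result for one of the compilers
        if x > y:
            wins += 1
        elif x == y:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


if __name__ == "__main__":
    print("CINT2000, psc vs gcc:", tally(CINT, PSC, GCC))
    print("CFP2000,  psc vs pgi:", tally(CFP, PSC, PGI))
    print("CFP2000,  psc vs gcc:", tally(CFP, PSC, GCC))
```

A tally ignores the size of each gap, so it complements the overall scores rather than replacing them.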

	ic 8.0	psc 1.1	psc-peak 1.1
164.gzip	1303	1413	1413
175.vpr	1350	1124	1152
176.gcc	1239	1597	1597
181.mcf	1156	674	1056
186.crafty	1694	2043	2043
197.parser	1487	1048	1222
252.eon	2538	1795	1864
253.perlbmk	1598	1619	1728
254.gap	1626	1403	1403
255.vortex	2444	2303	2440
256.bzip2	1283	1274	1274
300.twolf	1654	1218	1553
SPECint_base2000	1566	1393	1518

	ic 8.0	psc 1.1	psc-peak 1.1
168.wupwise	1601	1636	1876
171.swim	2210	2059	2004
172.mgrid	1224	1422	1569
173.applu	1201	1344	1450
177.mesa	1771	1777	1930
178.galgel	2146	2468	2716
179.art	1864	1631	2286
183.equake	1505	1415	1393
187.facerec	1647	1747	1907
188.ammp	1193	1359	1372
189.lucas	1824	1349	1553
191.fma3d	1404	1301	1359
200.sixtrack	631	680	700
301.apsi	1373	1444	1455
SPECfp_base2000	1480	1490	1612

We finally got a small-scale sensation: for the first time, an Intel compiler loses to a 64-bit rival in SPECfp_base2000 (to be precise, this is also true of psc version 1.0). Reactions to this fact may be mixed. Some may think that the era of 64-bit calculations has come and everybody has to rush in that direction :). Others may calmly analyse the situation and say that users now have one more reason to try AMD64 on their own tasks. The gap is not that big, especially considering that the Intel compiler was tested under another OS and its result under Linux may be a little different (see this article).

Windows

Windows XP AMD64 version released in February 2004 (build 1069) served as a 64-bit OS. We found two compilers: one from DDK for Windows 2003 Server build 3790 released in March 2003 (version 14.00.2207.0), the other from the Visual Studio «Whidbey» preview (version 14.0.30702.27) (it is named msvc8 in the table).

Unfortunately, there are fewer figures in this chapter: first, only a C/C++ compiler was used, and second, some of the tests could not be compiled or run under the 64-bit OS. All the results in this chapter are unofficial, partly because each test was run only once.

	msvc8 32	msvc8 64	ddk 32	ddk 64	msvc8, change, %	ddk, change, %
164.gzip	1233	1173	1154	1023	-4.87	-11.35
175.vpr	1132	1195	1183	1113	5.57	-5.92
176.gcc	1554	1534	1549	1534	-1.29	-0.97
181.mcf	1152	769	1158	747	-33.25	-35.49
186.crafty	1612	2021	1576	1699	25.37	7.80
197.parser	1133	1089	1134	940	-3.88	-17.11
252.eon	1465		1402
253.perlbmk	1530		1517
254.gap	1279		1261
255.vortex	1557	1611	1556	1433	3.47	-7.90
256.bzip2	1206	1221	1202	1143	1.24	-4.91
300.twolf	1434	1146	1437	1103	-20.08	-23.24
SPECint_base2000	1346		1333

Two conclusions can be drawn from the results. First, the transition to 64 bits is, to say the least, not always good for performance. And second, the newer compiler is better adapted to the 64-bit mode. But we can't draw really serious conclusions about performance based on nothing but the results of beta-version compilers. However, it is good news that nine out of twenty tasks written over three years ago could be compiled to work correctly in the 64-bit mode.
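
Both conclusions can be read off the CINT2000 table mechanically. The Python sketch below, with helper names of our own, skips the subtests that have no 64-bit result (marked `None`) and counts how many of the rest gain from the transition with each compiler.

```python
# Rows: subtest -> (msvc8 32, msvc8 64, ddk 32, ddk 64) from the table above.
# None marks a subtest that could not be compiled or run in 64-bit mode.
WIN_CINT = {
    "164.gzip": (1233, 1173, 1154, 1023),
    "175.vpr": (1132, 1195, 1183, 1113),
    "176.gcc": (1554, 1534, 1549, 1534),
    "181.mcf": (1152, 769, 1158, 747),
    "186.crafty": (1612, 2021, 1576, 1699),
    "197.parser": (1133, 1089, 1134, 940),
    "252.eon": (1465, None, 1402, None),
    "253.perlbmk": (1530, None, 1517, None),
    "254.gap": (1279, None, 1261, None),
    "255.vortex": (1557, 1611, 1556, 1433),
    "256.bzip2": (1206, 1221, 1202, 1143),
    "300.twolf": (1434, 1146, 1437, 1103),
}


def changes(table, col32, col64):
    """Percent change 32 -> 64 bits for every subtest that has both results."""
    out = {}
    for name, row in table.items():
        old, new = row[col32], row[col64]
        if old is not None and new is not None:
            out[name] = (new / old - 1.0) * 100.0
    return out


msvc8 = changes(WIN_CINT, 0, 1)
ddk = changes(WIN_CINT, 2, 3)
msvc8_gains = sum(1 for c in msvc8.values() if c > 0)
ddk_gains = sum(1 for c in ddk.values() if c > 0)

if __name__ == "__main__":
    print(f"subtests with 64-bit results: {len(msvc8)} of {len(WIN_CINT)}")
    print(f"msvc8: {msvc8_gains} subtests faster in 64-bit")
    print(f"ddk:   {ddk_gains} subtests faster in 64-bit")
```

Since each test was run only once, small differences of a percent or two in either direction should not be over-interpreted.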

Interestingly, significant performance falls of the 64-bit code occur exactly in the same places as in gcc/pgi — 181.mcf and 300.twolf.

Only four CFP2000 tests are written in C, so we'll examine no others.

	msvc8 32	msvc8 64	ddk 32	ddk 64	msvc8, change, %	ddk, change, %
168.wupwise
171.swim
172.mgrid
173.applu
177.mesa	858	1652	811	979	92.54	20.72
178.galgel
179.art	1752	1711	1647	1391	-2.34	-15.54
183.equake	1466	1103	1471	1046	-24.76	-28.89
187.facerec
188.ammp	1175	1440	1159	404	22.55	-65.14
189.lucas
191.fma3d
200.sixtrack
301.apsi
SPECfp_base2000

And again, the new compiler delivers better 64-bit code performance than last year's version, although its result in 183.equake is rather bad too.

In our opinion, there is no point in comparing the MSVC results with those of the Linux compilers. While the overall SPEC CPU2000 scores could be compared after a fashion, a comparison of individual subtests would be uninteresting and far-fetched (e.g. MSVC scores better in 179.art but is visibly inferior to gcc in 32-bit 177.mesa).

Conclusions

First of all, according to the overall scores, all the tested compilers (except PathScale in CFP2000) lose to Intel's 32-bit compiler. This alone can spoil the pleasure of increased performance.

The fact that the compilers cannot really be compared with one another indicates their immaturity (as well as the rather poor adaptation of the code to AMD64). But we can certainly also note some progress in the development of the standard compilers for the Linux and Windows platforms, although such compilers are expected to just work rather than to produce maximally efficient code.

Compilers (good compilers :)) for AMD64 have an unclear future ahead of them. On the one hand, Intel has announced support for the 64-bit mode and its instructions in its own CPUs; on the other hand, it is possible that the company's compilers will work with Intel CPUs only.

As for the products we have tested, gcc has its license as an advantage and will continue to develop, while PGI is relatively well established on the market of compilers for cluster systems. The PathScale product has shown adequate results from its very first version, and hopefully it will remain competitive with its more famous rivals.

As for the Windows platform and its standard Microsoft compiler, it aims at high compatibility and timely support for developers rather than at setting performance records.