SPEC CPU2000. Part 20. Intel C++/Fortran Compiler 9.0, Intel Pentium 4 670, Pentium M 770 and AMD Athlon 64 FX-57

In mid-June Intel released Version 9 of its C++ and Fortran compilers. The new version is not fundamentally different from the previous Version 8.1. Its main features are the integration of the compilers for the IA-32, IA-64, and EM64T (x86-64) platforms into a unified package and, as far as code optimization is concerned, additional options for Hyper-Threading and multi-core processors, in particular Software-based Speculative Pre-Computation (SSP).

In this article we shall analyze how fast the code generated by the new compilers is compared to the previous version on top (or almost top) single-core processors, both from Intel (Pentium 4 and Pentium M) as well as... from AMD (Athlon 64 FX-57; of course, with some code adjustments, see below).

We used the following compilers:

Intel(R) C++ Compiler for 32-bit applications, Version 9.0 Build 20050624Z Package ID: W_CC_C_9.0.020
Intel(R) Fortran Compiler for 32-bit applications, Version 9.0 Build 20050624Z Package ID: W_FC_C_9.0.019

As a reference, we used the test code compiled by Intel C++ Compiler 8.1.022 and Intel Fortran Compiler 8.1.025.

As usual, we used identical general compilation keys in all cases (Compilers 8.1 and 9.0, different code optimizations):

PASS1_CFLAGS= -Qipo -O3 -Qprof_gen
PASS2_CFLAGS= -Qipo -O3 -Qprof_use

Pentium 4 670

Let's start with the results of the "native" processor: Pentium 4 670 (3.8 GHz) with the Prescott core, which supports all the necessary instruction sets and can execute code compiled with any of the processor-specific optimization keys: -QxK, -QxW, -QxN, -QxB, and -QxP.

	No Opt.		-QxK		-QxW		-QxN		-QxB		-QxP
	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0
164.gzip	1150	1130 (-1.7%)	1253	1239 (-1.1%)	1255	1248 (-0.6%)	1265	1251 (-1.1%)	-	1247	1267	1241 (-2.1%)
175.vpr	x	x	1207	1201 (-0.5%)	1290	1283 (-0.5%)	1288	1272 (-1.2%)	-	1255	1286	1270 (-1.2%)
176.gcc	x	x	2142	2119 (-1.1%)	2132	2122 (-0.5%)	2146	2125 (-1.0%)	-	2116	2155	2116 (-1.8%)
181.mcf	1595	1594 (-0.1%)	1599	1599 (0.0%)	1598	1600 (0.1%)	2125	2125 (0.0%)	-	2113	2131	2115 (-0.8%)
186.crafty	1251	1260 (0.7%)	1272	1285 (1.0%)	1371	1398 (2.0%)	1375	1406 (2.3%)	-	1387	1387	1389 (0.1%)
197.parser	1553	1030 (-33.7%)	1562	1031 (-34.0%)	1562	1026 (-34.3%)	1560	1025 (-34.3%)	-	1019	1560	1031 (-33.9%)
252.eon	1640	1762 (7.4%)	1795	1836 (2.3%)	2188	2153 (-1.6%)	2391	2360 (-1.3%)	-	2101	2359	2320 (-1.7%)
253.perlbmk	1997	2021 (1.2%)	1954	2015 (3.1%)	1923	2012 (4.6%)	1940	1991 (2.6%)	-	2018	1947	2006 (3.0%)
254.gap	2033	2110 (3.8%)	1936	1990 (2.8%)	2019	2035 (0.8%)	2022	2061 (1.9%)	-	2029	2032	2049 (0.8%)
255.vortex	2876	2941 (2.3%)	2871	2971 (3.5%)	2869	2970 (3.5%)	2854	2970 (4.1%)	-	2852	2833	2948 (4.1%)
256.bzip2	1423	1428 (0.4%)	1390	1399 (0.6%)	1378	1372 (-0.4%)	1360	1348 (-0.9%)	-	1354	1372	1415 (3.1%)
300.twolf	1867	1526 (-18.3%)	1840	1880 (2.2%)	1859	1898 (2.1%)	1865	1910 (2.4%)	-	1879	1869	1908 (2.1%)
SPECint_base2000	1682	1604 (-4.6%)	1682	1642 (-2.4%)	1734	1687 (-2.7%)	1790	1739 (-2.8%)	-	1708	1792	1739 (-3.0%)

168.wupwise	1882	1843 (-2.1%)	2031	2074 (2.1%)	2235	2304 (3.1%)	2198	1735 (-21.1%)	-	1762	2860	2914 (1.9%)
171.swim	2089	2088 (0.0%)	2362	2544 (7.7%)	2524	2596 (2.9%)	2525	2595 (2.8%)	-	2553	2526	2595 (2.7%)
172.mgrid	1022	1023 (0.1%)	1237	1216 (-1.7%)	1518	1511 (-0.5%)	1674	1661 (-0.8%)	-	1306	1675	1661 (-0.8%)
173.applu	1419	1438 (1.3%)	1404	1414 (0.7%)	1481	1472 (-0.6%)	1655	1670 (0.9%)	-	1555	1638	1691 (3.2%)
177.mesa	1399	1371 (-2.0%)	1496	1476 (-1.3%)	1666	1669 (0.2%)	1662	1668 (0.4%)	-	1574	1659	1653 (-0.4%)
178.galgel	1445	1440 (-0.3%)	3036	3119 (2.7%)	3581	3637 (1.6%)	3564	3866 (8.5%)	-	3626	3603	3889 (7.9%)
179.art	2716	2356 (-13.3%)	2370	2393 (1.0%)	2918	2613 (-10.5%)	2987	2655 (-11.1%)	-	2524	4648	4597 (-1.1%)
183.equake	2074	2105 (1.5%)	2143	2118 (-1.2%)	2155	2154 (0.0%)	2158	2148 (-0.5%)	-	2092	2156	2420 (12.2%)
187.facerec	1736	1773 (2.1%)	2035	2148 (5.6%)	2049	2151 (5.0%)	2037	2165 (6.3%)	-	2114	2075	2179 (5.0%)
188.ammp	1305	1226 (-6.1%)	1240	1213 (-2.2%)	1365	1345 (-1.5%)	1371	1346 (-1.8%)	-	1210	1369	1346 (-1.7%)
189.lucas	2109	2101 (-0.4%)	2007	2025 (0.9%)	2285	2320 (1.5%)	2279	2331 (2.3%)	-	1984	2302	2306 (0.2%)
191.fma3d	1316	1342 (2.0%)	1291	1342 (4.0%)	1600	1648 (3.0%)	1581	1683 (6.5%)	-	1371	1606	1646 (2.5%)
200.sixtrack	604	606 (0.3%)	597	605 (1.3%)	678	754 (11.2%)	679	746 (9.9%)	-	621	683	748 (9.5%)
301.apsi	1309	1277 (-2.4%)	1317	1301 (-1.2%)	1386	1370 (-1.2%)	1408	1357 (-3.6%)	-	1300	1410	1357 (-3.8%)
SPECfp_base2000	1511	1489 (-1.5%)	1636	1657 (1.3%)	1826	1842 (0.9%)	1854	1845 (-0.5%)	-	1690	1956	2007 (2.6%)

We shall nevertheless start with the non-optimized variant. One important point: this code version, compiled by either the previous or the new compiler, caused errors in the 175.vpr and 176.gcc sub-tests regardless of the processor type. That's why we ran the tests with the --noreportable key in order to ignore errors in some sub-tests (--ignore_errors).

Integer tests. The new version has an advantage in some sub-tests (252.eon, 253.perlbmk, 254.gap, 255.vortex), but it cannot compensate for the significant performance drops in 197.parser (about 34%!) and 300.twolf (18.3%). As a result, the total SPECint_base2000 score is 1604, which is 4.6% lower than the score obtained with Version 8.1 (1682).

Floating point tests. The new version demonstrates only a minor performance advantage in some sub-tests, while there are noticeable performance drops in others (13.3% in 179.art). As a result, the total SPECfp_base2000 score (1489) is 1.5% lower than the result obtained with the previous version (1511).
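To illustrate how much a single sub-test can pull the total down, here is a minimal Python sketch. It assumes that the totals in the table are geometric means of the sub-test scores that completed, which is how SPEC defines its aggregate metrics; the helper name geometric_mean is ours and is not part of the SPEC tools. The input is the "No Opt." ic8.1 column for Pentium 4 670 without the failed 175.vpr and 176.gcc.

```python
import math


def geometric_mean(scores):
    """Geometric mean of positive sub-test scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Pentium 4 670, "No Opt." column, ic8.1 (175.vpr and 176.gcc are left out).
# Order: gzip, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf.
ic81_no_opt = [1150, 1595, 1251, 1553, 1640, 1997, 2033, 2876, 1423, 1867]
total = geometric_mean(ic81_no_opt)

# Hypothetical experiment: only 197.parser slows down (1553 -> 1030),
# every other sub-test keeps its ic8.1 score.
parser_only = list(ic81_no_opt)
parser_only[3] = 1030
parser_effect = geometric_mean(parser_only) / total - 1.0

print(f"total score: {total:.0f}")
print(f"effect of the 197.parser drop alone: {parser_effect * 100:+.1f}%")
```

Under this assumption the ic8.1 column reproduces the published total of 1682, and the 197.parser drop alone costs roughly 4% of the total, which is why the gains in other sub-tests cannot make up for it.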

The next optimization variant uses SSE instructions (-QxK). The situation in integer tests is similar: an insignificant advantage of the new version in some sub-tests and a 1.5-fold performance drop in 197.parser. 300.twolf, however, now performs better with the new version (by 2.2%). The total score is about 2.4% lower than with Version 8.1. The situation in floating point tests is different: the performance of most tasks grows when we switch to Version 9.0, with the maximum gain in the 171.swim (7.7%) and 187.facerec (5.6%) sub-tests. The total SPECfp_base2000 score is 1.3% higher than with the previous version.
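The percentages in the tables and the "1.5-fold" figure describe the same change in two ways. A small Python sketch of the arithmetic, using the 197.parser scores with -QxK on Pentium 4 670 (the function names are ours, chosen for illustration):

```python
def percent_change(old_score, new_score):
    """Relative change of the new score against the old one, in percent."""
    return (new_score - old_score) / old_score * 100.0


def slowdown_factor(old_score, new_score):
    """How many times lower the new score is than the old one."""
    return old_score / new_score


# 197.parser, -QxK, Pentium 4 670: ic8.1 versus ic9.0
old_score, new_score = 1562, 1031

change = percent_change(old_score, new_score)
factor = slowdown_factor(old_score, new_score)

print(f"change: {change:+.1f}%")
print(f"slowdown: {factor:.2f}x")
```

A drop of about 34% in the score is thus the same thing as the code running about 1.5 times slower.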

As for the rest of the code optimization variants (-QxW, -QxN, and -QxP), the situation in integer tests is similar to the -QxK variant: we can still see the 1.5-fold performance drop in 197.parser, resulting in a lower total SPECint_base2000 score. There are some differences between these optimization variants in floating point tests, in the total score as well as in some sub-tests. For example, SSE2/Willamette (-QxW) demonstrates a noticeable performance gain in 200.sixtrack (11.2%) and 187.facerec (5.0%) along with a significant performance drop in 179.art (-10.5%). The new version wins just 0.9% in SPECfp_base2000. On the contrary, SSE2/Northwood (-QxN) is outperformed by the previous version in the total score (by 0.5%) due to a significant performance drop in 168.wupwise (-21.1%) and 179.art (-11.1%), accompanied by some performance gain in a number of sub-tests (178.galgel, 187.facerec, 191.fma3d, and 200.sixtrack). And finally, the native variant for Prescott SSE3 (-QxP) wins 2.6% in the total score thanks to the performance gain in 178.galgel (7.9%), 183.equake (12.2%), 187.facerec (5.0%), and 200.sixtrack (9.5%), accompanied by a nearly imperceptible drop in the execution speed of a few other sub-tests (3.8% at most, in 301.apsi).

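Absolute performance on the whole (judging by the total scores) grows in the order -QxK < -QxB < -QxW < -QxN < -QxP, which is reasonable for the Prescott core. This order holds exactly for the floating point scores; in integer tests with Version 9.0, -QxB (1708) comes out slightly ahead of -QxW (1687), and -QxN ties with -QxP (1739).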

Pentium M 770

We proceed to the second "nearly flagship" processor from Intel: Pentium M 770 with the Dothan core (2.13 GHz). This processor was tested on a desktop-mobile system: a DFI 855GME-MGF motherboard on the Intel 855GM chipset, which is not the fastest one or, to be more exact, does not have the fastest memory system (single-channel DDR-333).

	No Opt.		-QxK		-QxW		-QxN		-QxB
	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0
164.gzip	1143	1091 (-4.5%)	1248	1245 (-0.2%)	1236	1238 (0.2%)	1247	1246 (-0.1%)	-	1251
175.vpr	x	x	1321	1316 (-0.4%)	1367	1381 (1.0%)	1364	1377 (1.0%)	-	1361
176.gcc	x	x	1822	1805 (-0.9%)	1805	1803 (-0.1%)	1825	1806 (-1.0%)	-	1814
181.mcf	1042	1059 (1.6%)	1054	1052 (-0.2%)	1051	1047 (-0.4%)	1504	1507 (0.2%)	-	1507
186.crafty	1320	1303 (-1.3%)	1312	1313 (0.1%)	1455	1460 (0.3%)	1455	1456 (0.1%)	-	1631
197.parser	1381	1004 (-27.3%)	1392	1002 (-28.0%)	1392	990 (-28.9%)	1388	1008 (-27.4%)	-	1001
252.eon	1589	1736 (9.3%)	1688	1668 (-1.2%)	1922	1930 (0.4%)	2096	2066 (-1.4%)	-	2127
253.perlbmk	1724	1716 (-0.5%)	1736	1755 (1.1%)	1750	1775 (1.4%)	1752	1760 (0.5%)	-	1811
254.gap	1163	1282 (10.2%)	1151	1168 (1.5%)	1280	1302 (1.7%)	1282	1298 (1.2%)	-	1337
255.vortex	2456	2484 (1.1%)	2492	2497 (0.2%)	2466	2492 (1.1%)	2491	2488 (-0.1%)	-	2482
256.bzip2	1225	1238 (1.1%)	1156	1178 (1.9%)	1196	1176 (-1.7%)	1192	1178 (-1.2%)	-	1205
300.twolf	2102	1823 (-13.3%)	2111	2149 (1.8%)	2223	2252 (1.3%)	2220	2256 (1.6%)	-	2241
SPECint_base2000	1459	1416 (-2.9%)	1489	1453 (-2.4%)	1544	1507 (-2.4%)	1605	1564 (-2.6%)	-	1591

168.wupwise	1249	1264 (1.2%)	1327	1356 (2.2%)	1133	1145 (1.1%)	1149	1045 (-9.1%)	-	1285
171.swim	713	722 (1.3%)	854	782 (-8.4%)	841	822 (-2.3%)	845	821 (-2.8%)	-	821
172.mgrid	777	786 (1.2%)	835	839 (0.5%)	817	829 (1.5%)	818	820 (0.2%)	-	842
173.applu	612	617 (0.8%)	631	638 (1.1%)	611	608 (-0.5%)	701	703 (0.3%)	-	729
177.mesa	898	906 (0.9%)	1379	1506 (9.2%)	1578	1570 (-0.5%)	1579	1552 (-1.7%)	-	1651
178.galgel	1753	1694 (-3.4%)	2499	2503 (0.2%)	2224	2237 (0.6%)	2218	2428 (9.5%)	-	2803
179.art	2600	2495 (-4.0%)	2388	2360 (-1.2%)	2472	2437 (-1.4%)	2645	2575 (-2.6%)	-	2634
183.equake	888	906 (2.0%)	905	901 (-0.4%)	898	898 (0.0%)	900	899 (-0.1%)	-	900
187.facerec	1165	1156 (-0.8%)	1244	1274 (2.4%)	1237	1275 (3.1%)	1252	1273 (1.7%)	-	1268
188.ammp	1019	980 (-3.8%)	983	968 (-1.5%)	922	905 (-1.8%)	904	891 (-1.4%)	-	963
189.lucas	799	809 (1.3%)	793	791 (-0.3%)	891	899 (0.9%)	895	898 (0.3%)	-	897
191.fma3d	808	821 (1.6%)	801	812 (1.4%)	829	840 (1.3%)	839	853 (1.7%)	-	845
200.sixtrack	542	540 (-0.4%)	533	513 (-3.8%)	464	474 (2.2%)	452	475 (5.1%)	-	528
301.apsi	916	903 (-1.4%)	916	913 (-0.3%)	851	853 (0.2%)	856	846 (-1.2%)	-	902
SPECfp_base2000	963	960 (-0.3%)	1038	1038 (0.0%)	1015	1018 (0.3%)	1031	1030 (-0.1%)	-	1085

Integer tests without code optimizations: the new version demonstrates the highest gain in 254.gap (~10%) and the largest drop in 197.parser again (it's a tad smaller than on Pentium 4, about 27%). On average, the total SPECint_base2000 score is about 3% lower than with the previous version. Floating point tests demonstrate a small spread in values, both upward and downward. But according to the total score, the execution speed of the code compiled with ICC/IFC 8.1 and 9.0 is practically identical. Surprisingly, the absolute results in some sub-tests and the total SPECfp_base2000 score are much lower than the Pentium 4 results, while integer test results are only a tad lower. This is probably because these tests are sensitive to memory bandwidth, which is much lower in a Pentium M system with single-channel DDR-333 (2.67 GB/s versus 6.4 GB/s). It certainly has nothing to do with the FPU, which is no worse in Pentium M than in Pentium 4, but rather much better.
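The bandwidth figures can be checked with simple arithmetic. The Python sketch below computes the theoretical peak from the transfer rate and the bus width. The 2.67 GB/s value follows directly from single-channel 64-bit DDR-333. For the 6.4 GB/s value we assume an 800 MT/s 64-bit path on the Pentium 4 system (for example, its front side bus or dual-channel DDR-400), since the exact memory configuration is not listed here; the function name is ours.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Pentium M system: single-channel DDR-333 (333.33 million transfers per second)
pentium_m = peak_bandwidth_gb_s(333.33)

# Pentium 4 system: ASSUMED 800 MT/s over a 64-bit path
pentium_4 = peak_bandwidth_gb_s(800)

print(f"Pentium M: {pentium_m:.2f} GB/s")
print(f"Pentium 4: {pentium_4:.2f} GB/s")
print(f"ratio: {pentium_4 / pentium_m:.1f}x")
```

These are peak values; the sustained bandwidth seen by the tests is lower on both systems, but the gap of about 2.4 times remains a plausible explanation for the SPECfp difference.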

Optimization keys (this processor allows -QxK, -QxW, -QxN, and -QxB) don't change the situation significantly, except for the increased overall performance. The integer score grows in exactly the order the keys are listed, and the native code optimization for the Banias core (-QxB) turns out to be the best for the Dothan core in both integer and floating point tests. Integer tests still demonstrate slightly lower results (by approximately 2.5%) than with the previous version (due to the noticeably reduced performance in 197.parser and the lack of a noticeable gain in other sub-tests), while the floating point tests are practically equal in performance. But the latter is again the result of a compensating spread in results, both upward and downward (especially prominent with -QxK and -QxN, up to 10% in some sub-tests), rather than of identical sub-test results.

Athlon 64 FX-57

The most interesting thing is reserved for the end of the article: the test results of Intel C++/Fortran Compiler 8.1/9.0 on the latest single-core processor from the competitor, AMD Athlon 64 FX-57. You may wonder how we have done it. It is very simple. All it took was to study the algorithm of the processor type check in an application compiled by Intel compilers. Here is what it looks like:

1. Vendor String validation for "GenuineIntel";

2. Detecting a processor model type (Pentium III/Pentium M — Model 6, or Pentium 4/Xeon — Model 15);

3. Determining the availability of necessary extended instruction sets (SSE, SSE2, SSE3).

Judging from this algorithm, all you need to do is remove Check #1 to make AMD processors execute code compiled with Intel C++/Fortran Compiler, given that the processor supports the necessary instruction sets. This works because Intel and AMD processors have matching model numbers: Model 6 corresponds to AMD K7 processors (most of them support SSE), while Model 15 corresponds to AMD K8 processors (which support SSE and SSE2, and their latest E core revision also supports SSE3). However, even if there had been no match, we could have just as well removed Check #2. In that case the operability of applications would have depended solely on the presence or absence of the necessary extensions in a processor.
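The three checks and the effect of removing the first one can be sketched in Python as follows. This is only a model of the logic described above, not the actual code emitted by the Intel compilers: the Cpu class, the can_run function, and the check_vendor switch are hypothetical names, and the "model" field follows the numbering used in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    vendor: str           # vendor string reported by the processor
    model: int            # 6 or 15 in the numbering used in the article
    features: frozenset   # supported extensions, e.g. {"SSE", "SSE2"}


def can_run(cpu, required, check_vendor=True):
    """Model of the startup check; check_vendor=False mimics the patch."""
    # Check 1: vendor string (this is the check the patch removes)
    if check_vendor and cpu.vendor != "GenuineIntel":
        return False
    # Check 2: processor model
    if cpu.model not in (6, 15):
        return False
    # Check 3: all required instruction sets must be present
    return set(required) <= cpu.features


# Illustrative descriptions of two AMD processors
athlon_64_rev_e = Cpu("AuthenticAMD", 15, frozenset({"SSE", "SSE2", "SSE3"}))
athlon_xp = Cpu("AuthenticAMD", 6, frozenset({"SSE"}))

print(can_run(athlon_64_rev_e, {"SSE3"}))                      # original check
print(can_run(athlon_64_rev_e, {"SSE3"}, check_vendor=False))  # patched
print(can_run(athlon_xp, {"SSE2"}, check_vendor=False))        # still refused
```

The last call shows why removing only the first check is safe: a processor that lacks the required extensions is still turned away by the third one.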

Binary files can be corrected manually, but we have written a small utility, ICC Patcher (you can download it here). It scans a binary file for suspicious GenuineIntel validations and replaces them with NOPs. This utility can patch not only compiled executables, but also the libraries shipped with Intel C++/Fortran Compiler, including those for EM64T. In the latter case, compiled applications will always run on processors from both Intel and AMD. Note that this patching is not crude: the instruction set checks stay in place. For example, code compiled with the -QxP key will run, among AMD processors, only on Athlon 64/Opteron with core revision E, and will display a warning that it cannot be executed on earlier core revisions and AMD K7 processors.

Let's proceed to the test results. In order to save time, we decided not to recompile all the test sources with "correct" Intel libraries, but to patch the existing binaries. For that reason we set the check_md5=0 option in the config files of the tests, because patching executables changes their checksums.
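Why patching forces check_md5=0 is easy to demonstrate. The Python sketch below overwrites a few bytes of a stand-in byte string with the single-byte x86 NOP opcode (0x90) and compares the MD5 digests before and after. It does not reproduce how ICC Patcher locates the checks: the nop_out function, the offset, and the length are hypothetical and chosen only for illustration.

```python
import hashlib

NOP = 0x90  # single-byte x86 NOP opcode


def nop_out(data, start, length):
    """Return a copy of data with `length` bytes from `start` replaced by NOPs."""
    patched = bytearray(data)
    patched[start:start + length] = bytes([NOP]) * length
    return bytes(patched)


# Stand-in for an executable image (not a real binary)
original = bytes(range(32))

# Hypothetical location of a check: 4 bytes at offset 8
patched = nop_out(original, 8, 4)

md5_before = hashlib.md5(original).hexdigest()
md5_after = hashlib.md5(patched).hexdigest()

print("size unchanged:", len(patched) == len(original))
print("checksum changed:", md5_before != md5_after)
```

The file size stays the same, so nothing else in the binary moves, but any stored MD5 of the executable no longer matches.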

	No Opt.		-QxK		-QxW		-QxN		-QxB		-QxP
	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0	ic8.1	ic9.0
164.gzip	1437	1363 (-5.1%)	1568	1571 (0.2%)	1546	1546 (0.0%)	1566	1540 (-1.7%)	-	1584	1574	1558 (-1.0%)
175.vpr	x	x	1429	1406 (-1.6%)	1515	1510 (-0.3%)	1516	1503 (-0.9%)	-	1483	1514	1486 (-1.8%)
176.gcc	x	x	2178	2184 (0.3%)	2161	2173 (0.6%)	2182	2143 (-1.8%)	-	2192	2199	2158 (-1.9%)
181.mcf	1149	1150 (0.1%)	1153	1149 (-0.3%)	1152	1148 (-0.3%)	1498	1500 (0.1%)	-	1501	1506	1505 (-0.1%)
186.crafty	1892	1877 (-0.8%)	1903	1921 (0.9%)	1952	1945 (-0.4%)	1935	1939 (0.2%)	-	2011	2011	1992 (-0.9%)
197.parser	1733	1257 (-27.5%)	1773	1275 (-28.1%)	1754	1253 (-28.6%)	1766	1267 (-28.3%)	-	1256	1764	1251 (-29.1%)
252.eon	2216	2622 (18.3%)	2463	2410 (-2.2%)	2973	2901 (-2.4%)	3220	3124 (-3.0%)	-	3176	3177	3133 (-1.4%)
253.perlbmk	2105	2104 (0.0%)	2093	2121 (1.3%)	2123	2148 (1.2%)	2142	2132 (-0.5%)	-	2209	2137	2250 (5.3%)
254.gap	1858	1869 (0.6%)	1889	1910 (1.1%)	1960	1999 (2.0%)	1974	1968 (-0.3%)	-	1990	1990	1952 (-1.9%)
255.vortex	2875	2799 (-2.6%)	2823	2829 (0.2%)	2797	2719 (-2.8%)	2856	2881 (0.9%)	-	2797	2835	2902 (2.4%)
256.bzip2	1480	1514 (2.3%)	1462	1460 (-0.1%)	1431	1437 (0.4%)	1433	1430 (-0.2%)	-	1442	1451	1445 (-0.4%)
300.twolf	1934	1777 (-8.1%)	1940	1939 (-0.1%)	1958	1950 (-0.4%)	1959	1962 (0.2%)	-	1944	1947	1953 (0.3%)
SPECint_base2000	1814	1761 (-2.9%)	1837	1787 (-2.7%)	1879	1823 (-3.0%)	1943	1879 (-3.3%)	-	1894	1950	1893 (-2.9%)

168.wupwise	2121	2131 (0.5%)	2166	2200 (1.6%)	2128	2174 (2.2%)	2456	2085 (-15.1%)	-	2197	2385	2366 (-0.8%)
171.swim	1448	1448 (0.0%)	2130	1944 (-8.7%)	2136	2110 (-1.2%)	2138	2110 (-1.3%)	-	2118	2134	2111 (-1.1%)
172.mgrid	1231	1244 (1.1%)	1330	1471 (10.6%)	1432	1463 (2.2%)	1458	1554 (6.6%)	-	1486	1418	1566 (10.4%)
173.applu	1230	1251 (1.7%)	1224	1243 (1.6%)	1205	1196 (-0.7%)	1530	1498 (-2.1%)	-	1530	1538	1513 (-1.6%)
177.mesa	1569	1587 (1.1%)	1893	1939 (2.4%)	2075	2046 (-1.4%)	2072	2075 (0.1%)	-	2018	2077	2046 (-1.5%)
178.galgel	2080	2056 (-1.2%)	2437	2459 (0.9%)	2495	2464 (-1.2%)	2445	2928 (19.8%)	-	2980	2475	2915 (17.8%)
179.art	1798	1804 (0.3%)	1785	1811 (1.5%)	1844	1839 (-0.3%)	1852	1847 (-0.3%)	-	1839	2686	2910 (8.3%)
183.equake	1657	1680 (1.4%)	1678	1669 (-0.5%)	1685	1680 (-0.3%)	1674	1671 (-0.2%)	-	1693	1679	1788 (6.5%)
187.facerec	1862	1722 (-7.5%)	1896	2024 (6.8%)	1902	2030 (6.7%)	1955	2036 (4.1%)	-	1989	1963	2001 (1.9%)
188.ammp	1390	1331 (-4.2%)	1333	1298 (-2.6%)	1319	1299 (-1.5%)	1276	1277 (0.1%)	-	1285	1298	1301 (0.2%)
189.lucas	1615	1624 (0.6%)	1570	1570 (0.0%)	1727	1734 (0.4%)	1729	1724 (-0.3%)	-	1722	1730	1723 (-0.4%)
191.fma3d	1525	1537 (0.8%)	1462	1483 (1.4%)	1566	1564 (-0.1%)	1593	1607 (0.9%)	-	1570	1614	1630 (1.0%)
200.sixtrack	779	778 (-0.1%)	781	791 (1.3%)	757	779 (2.9%)	750	779 (3.9%)	-	820	748	793 (6.0%)
301.apsi	1493	1456 (-2.5%)	1475	1484 (0.6%)	1484	1492 (0.5%)	1519	1471 (-3.2%)	-	1474	1510	1464 (-3.0%)
SPECfp_base2000	1515	1506 (-0.6%)	1596	1613 (1.1%)	1633	1642 (0.6%)	1681	1693 (0.7%)	-	1698	1725	1776 (3.0%)

Non-optimized code: 197.parser is noticeably slower in integer tests on this processor as well (by 27.5%, nearly the same as on Pentium M). The same goes for 300.twolf (8.1%), which is compensated to some extent by the breakaway in 252.eon (18.3%). The total SPECint_base2000 score is lower than with the previous compiler version by approximately 3%, which again resembles the Pentium M results. Floating point results are again close to those of the previous version, again thanks to a self-compensating spread in results rather than identical performance in sub-tests. As a result, the total SPECfp_base2000 score is just 0.6% lower than for the code compiled with ICC/IFC 8.1.

Optimized variants of integer tests show much the same picture as on the other processors. Namely, the noticeable lag in 197.parser (28-29%) remains, while there are no breakaways in other sub-tests (the one exception is the 253.perlbmk task compiled with -QxP, which gains 5.3%). The 197.parser lag causes the roughly 3% drop in the total SPECint_base2000 score in all cases. As for the absolute performance values, they grow in the order -QxK < -QxW < -QxN < -QxP < -QxB. That is, the best optimization (though not by much, and only in some tests and the total score) is the one for the Banias core. Such a result is not at all surprising, considering that the AMD K8 architecture is closer to Intel Pentium III/Pentium M than to Pentium 4 (NetBurst).

Let's proceed to optimized SPECfp code. Athlon 64 FX-57 demonstrates a performance gain in the total score with the new compiler version in every optimized variant (on the Intel processors this was true for most, but not all, variants). The relative gain varies with the optimization type, and so does the way it is obtained. For example, the SSE variant (-QxK) demonstrates a noticeable 8.7% drop in 171.swim (note that Pentium 4 gained in this task), while 172.mgrid gains 10.6% and 187.facerec gains 6.8%; the total SPECfp_base2000 score grows by 1.1%. In the old SSE2 variant for the Willamette core (-QxW, which can run on AMD K8 even without patching), the only obvious lead is in 187.facerec (6.7%), and the overall advantage is just 0.6%. The new SSE2 variant for the Northwood core (-QxN) shows a similarly small increase in SPECfp_base2000 (0.7%), but the spread in some sub-tests is noticeable (-15.1%(!) in 168.wupwise, +6.6% in 172.mgrid, and +19.8% in 178.galgel). And finally, the best optimization, for SSE3 (Prescott core, -QxP), is characterized by an almost complete lack of performance drops (we should just mention the 3% drop in 301.apsi) and a considerable performance increase in a number of tasks (172.mgrid by 10.4%, 178.galgel by 17.8%, 179.art by 8.3%, 183.equake by 6.5%). As a result, the total SPECfp_base2000 score is 3% higher than with the previous version.

As for code efficiency, we have already noted that it is the highest with SSE3 (-QxP). Next comes SSE2 for the Banias core (-QxB), which again does not contradict our idea of the AMD K8 architecture, followed by -QxN, -QxW, and -QxK.
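The two orderings for Athlon 64 FX-57 can be reproduced from the ic9.0 totals in the table with a few lines of Python. The dictionary layout and the ranking helper are ours; the numbers are the SPECint_base2000 and SPECfp_base2000 values from the table above.

```python
# Athlon 64 FX-57, ic9.0 totals taken from the table above
athlon_ic90 = {
    "-QxK": {"int": 1787, "fp": 1613},
    "-QxW": {"int": 1823, "fp": 1642},
    "-QxN": {"int": 1879, "fp": 1693},
    "-QxB": {"int": 1894, "fp": 1698},
    "-QxP": {"int": 1893, "fp": 1776},
}


def ranking(scores, suite):
    """Optimization keys sorted from the lowest to the highest total score."""
    return sorted(scores, key=lambda key: scores[key][suite])


int_order = ranking(athlon_ic90, "int")
fp_order = ranking(athlon_ic90, "fp")

print("SPECint:", " < ".join(int_order))
print("SPECfp: ", " < ".join(fp_order))
```

Note how narrow the integer lead of -QxB over -QxP is (1894 versus 1893), while in floating point tests -QxP is ahead of -QxB by a clear margin.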

Conclusions

The new Intel C++/Fortran Compiler 9.0 presents a mixed picture in its "typical" code compilation (we mean compiling with profiles). In general, the resulting integer code is a tad slower (by roughly 3-5%) than the code compiled with the previous Version 8.1. A significant performance drop is demonstrated in only one task, but it is quite weighty: from 27 to 34% depending on the processor. You will be lucky if your code does not resemble this task :).

Nevertheless, the new version of the compilers demonstrates a number of advantages over the previous version in floating point calculations (where SSE, SSE2, and SSE3 instructions are used), though quite insignificant ones (from 0 to 3%). The choice of optimization keys for a given processor micro architecture remains adequate (-QxP for Pentium 4/Prescott, -QxB for Pentium M/Dothan; for AMD K8 processors we can recommend experimenting with -QxB and -QxP).

By the way, let's say a few words about AMD processors. According to our research, both ICC/IFC versions (8.1 and 9.0) compile code that demonstrates very good (even the best in some cases) performance on AMD processors... provided we "patch" it :), or "patch" the compiler libraries. It would be great if Intel, the manufacturer, replaced the current processor type check with a wiser one, similar to what we have used.

This modification would benefit end users in the first place. In this case, even if a software developer uses "automatic" optimizations like -Qax*, the most optimized code path would be chosen for execution depending only on the availability of the necessary extended instruction sets, not on the CPU manufacturer. Note that one of AMD's charges against Intel is that AMD processors may be much slower than competing processors when executing such "automatic" code, even though they have the necessary extensions.
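The difference between the two dispatch policies can be modeled in a few lines of Python. This is a simplified sketch, not the dispatcher that -Qax* actually generates: the list of code paths, the "generic" fallback name, and the pick_path function are hypothetical, and only the key-to-instruction-set pairs come from this article.

```python
# Code paths from the most to the least specialized, with the instruction
# set each one needs (pairs as described in the article).
CODE_PATHS = [
    ("-QxP", "SSE3"),
    ("-QxN", "SSE2"),
    ("-QxK", "SSE"),
]


def pick_path(vendor, features, vendor_gated):
    """Choose the code path to execute; "generic" is the non-optimized one."""
    # Vendor-gated policy: anything that is not Intel gets the generic path
    if vendor_gated and vendor != "GenuineIntel":
        return "generic"
    # Feature-based policy: take the first path whose extension is present
    for key, needed in CODE_PATHS:
        if needed in features:
            return key
    return "generic"


k8_rev_e = {"SSE", "SSE2", "SSE3"}

print(pick_path("AuthenticAMD", k8_rev_e, vendor_gated=True))
print(pick_path("AuthenticAMD", k8_rev_e, vendor_gated=False))
```

With the vendor gate in place the AMD processor falls back to the generic path despite supporting SSE3; with the feature-based policy it gets the same path as a Prescott.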

It would be no less beneficial to software developers and testers — there would be no need to use different compilers for different processors or to develop applications for processors of a given manufacturer.

And of course, AMD itself would profit a lot: there would be no need to develop its own compiler, which has been long overdue :).

Dmitri Besedin (dmitri_b@ixbt.com)
September 5, 2005.
