The Standard Performance Evaluation Corporation (SPEC) was founded in 1988 by several computer vendors to develop and maintain a wide range of benchmarks for measuring computer system performance. Today the corporation has over 60 well-known member companies.
SPEC offers benchmarks for mail servers, web servers, file servers, supercomputers and clusters, compute systems, professional graphics applications and so on. Some tests are free to download, others are quite expensive; the most popular one is SPECviewperf, which measures the performance of OpenGL applications. Today, however, we will talk about a lesser-known but also very interesting test, SPEC CPU2000.
CPU2000 is designed to measure the performance of the central processor(s). Since the CPU always works together with RAM and the chipset, it is more accurate to say that CPU2000 measures the performance of the whole compute subsystem on compute-intensive workloads.
The results depend very little on components such as the video card, the hard drive or the CD-ROM drive. First, the test runs from the command line and does not display its results on screen as it runs. Second, most operations are carried out in RAM so as not to stress the disk subsystem.
To make it portable across platforms, the test is distributed as source code and workloads in C, C++ and Fortran. On the one hand, this makes it possible to compare systems as different as, say, an AMD Athlon computer running Windows NT and a cluster of 32 dual-processor Intel Xeon machines running a Unix clone. On the other hand, it brings one more factor into the test: the choice of compilers and their settings used to build the test executables for the chosen platform.
All the applications are divided into two suites. CINT2000 includes 12 applications which operate mainly on integer data (and logical operations): 11 are written in C and 1 in C++. The second suite, CFP2000, consists of 14 applications (6 in Fortran 77, 4 in Fortran 90 and 4 in C) that make intensive use of floating-point operations. The final scores are based on the measured run times of these applications.
SPEC CPU is, in fact, a synthetic test. Although all the tasks are taken from real life (e.g., archiving and compilation), they differ from real programs. The differences come from improvements in algorithms and from the choice of compiler: the test is normally built with the latest compiler version, while a real application was probably compiled with last year's. That is why the test results cannot be directly generalized to your favorite application.
But precisely because it is impossible to account for every task, SPEC CPU has become an industry standard: it gives an average score which can serve as a common reference point in performance comparisons.
Unfortunately, tests are usually either based on a script driving a real application (and then there are endless arguments about which version to use) or synthetic (and then the results must be generalized very carefully, as such workloads may never occur in practice).
SPEC tried to find a compromise by using real applications in source form (which means freezing the algorithms at a particular version and limiting hand optimization of the code). We will try to see whether the attempt was successful.
We will not discuss whether it is correct to use a single final score. However, it should be noted that, since SPEC CPU2000 is a synthetic test, a single final figure is better suited to estimating performance over a wide range of tasks: individual applications may be untypical of the platform under test (for example, the software implementation of OpenGL), and it is even harder to generalize their individual results to similar tasks.
Test system and utilization
The SPEC CPU2000 CD contains:
The total size of the files is over 300 MB.
So almost any system can be evaluated with this test.
Running the test involves the following stages:
If you need to compare several similar systems which differ, for example, only in the processor, there is no need to compile the tests several times. You can build the executables on one system and then run them on each test configuration.
The core of the test is a set of utilities for compiling and running the benchmarks. For portability they are written mainly in Perl, and the interpreter is shipped with the test. These utilities are also used to produce official results, which can then be published on SPEC's site.
Major vendors often cite this heavyweight test, but publishing results is not easy: the test uses techniques similar to electronic signatures. The utilities generate and verify checksums of both the executables and their results while the tests are compiled and run, which guarantees that the published results were obtained with these particular builds and that no figures have been altered.
Of course, each application is also checked for correct execution, i.e. its output data are compared with reference output.
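The checksum idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which checksum algorithm the SPEC utilities use, so SHA-256 is chosen arbitrarily here, and the function names file_checksum and verify_files are hypothetical.

```python
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()  # assumption: the real tools may use another algorithm
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(recorded):
    """Return the paths whose current checksum differs from the recorded one.

    recorded maps a file path to the checksum stored when the file was built.
    """
    return [path for path, expected in recorded.items()
            if file_checksum(path) != expected]
```

Recording a checksum right after the build and calling verify_files before the run would flag any executable that was replaced in between.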
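A comparison of this kind might look like the following Python sketch. It assumes that numeric values may differ slightly between platforms and are therefore compared with a relative tolerance; the article does not describe the actual rules, so both the tolerance and the name outputs_match are hypothetical.

```python
import math


def outputs_match(actual, reference, rel_tol=1e-6):
    """Compare two text outputs token by token.

    Numeric tokens may differ within rel_tol (an assumed tolerance);
    all other tokens must be identical.
    """
    actual_tokens = actual.split()
    reference_tokens = reference.split()
    if len(actual_tokens) != len(reference_tokens):
        return False
    for got, want in zip(actual_tokens, reference_tokens):
        try:
            got_num, want_num = float(got), float(want)
        except ValueError:
            # At least one token is not a number: require an exact match.
            if got != want:
                return False
            continue
        if not math.isclose(got_num, want_num, rel_tol=rel_tol):
            return False
    return True
```

A run whose output fails such a check would simply not count, however fast it was.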
The most important step here is creating the configuration file. It contains all the parameters needed to compile the tests, including the compilers used, optimization flags, libraries etc. Publishing results makes no sense without this file, because its contents have the greatest effect on the results.
A disadvantage of the test is that it does not identify the system configuration automatically; therefore, the configuration file alone is not sufficient to repeat the test under the same conditions. In principle, everything can be written into the descriptive notes in this file, but this is inconvenient: to test processors that differ only in clock frequency, for example, you must prepare several files which in fact differ in a single line.
As each suite includes a large number of subtests, different tasks may need different optimization flags and even different compilers. To keep comparisons fair, results are divided into base and peak.
Base has stricter limits on compilation: one compiler must be used (for all tests in the same language) with the same optimization flags (no more than 4). Two-pass compilation is allowed (for intermodule optimization). Peak allows different flags and even different compilers for each subtest. In some configurations this raises the final score by approximately 7%.
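The base restrictions described above can be expressed as a small check. The Python sketch below is not part of the SPEC tools and does not use the real configuration file format; the settings dictionary, the function name base_rule_violations, and the compiler and flag strings in the usage note are all hypothetical.

```python
MAX_BASE_FLAGS = 4  # limit quoted in the text above


def base_rule_violations(settings):
    """List violations of the base rules as the article describes them.

    settings maps a benchmark name to (language, compiler, flags), where
    flags is a sequence of optimization flag strings.
    """
    violations = []
    first_seen = {}  # language -> (compiler, flags) of the first benchmark seen
    for bench, (language, compiler, flags) in settings.items():
        if len(flags) > MAX_BASE_FLAGS:
            violations.append(f"{bench}: more than {MAX_BASE_FLAGS} flags")
        key = (compiler, tuple(flags))
        reference = first_seen.setdefault(language, key)
        if key != reference:
            violations.append(
                f"{bench}: compiler or flags differ from other {language} benchmarks")
    return violations
```

For example, base_rule_violations({"164.gzip": ("C", "cc", ["-O2"]), "176.gcc": ("C", "cc", ["-O3"])}) reports 176.gcc, because its placeholder flags differ from those of the other C benchmark; a peak configuration would be allowed to do exactly that.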
Another choice to make is the metric. The test offers two: speed and rate. The first measures the ability of a computer to complete a single task and expresses the result as a percentage of the base system's speed. A compiler that generates multithreaded code may be used, but since the source code was not written with this in mind, no positive effect is observed.
The second measures the throughput, or rate, of a machine carrying out a number of tasks, and the result is expressed in tasks per hour. As a rule, the number of simultaneously running tasks equals the number of processors (unless, of course, there are 32 of them, in which case you can leave one processor for the system). A small drawback of this approach is that identical tasks are started at the same time. Incidentally, you can also run the rate test with two or more tasks on a single processor, which can be useful for estimating behavior in a multitasking system. It is also interesting to run the rate test on a dual-processor system while specifying that only one user should be emulated. Comparing that figure with the one for two simultaneous tasks lets us estimate how well the architecture of the system scales.
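The scalability estimate from the last comparison can be written down as simple arithmetic. The Python sketch below works on raw throughput (completed tasks per hour) rather than on the official rate score, and the function names and the timings in the usage note are hypothetical.

```python
def tasks_per_hour(copies, elapsed_seconds):
    """Throughput when `copies` identical tasks finish in elapsed_seconds."""
    return copies * 3600.0 / elapsed_seconds


def scaling_factor(single_elapsed, multi_copies, multi_elapsed):
    """Ratio of multi-copy throughput to single-copy throughput.

    A value close to multi_copies means near-ideal scaling; a value
    close to 1 means the extra copies bring almost no extra throughput.
    """
    return (tasks_per_hour(multi_copies, multi_elapsed)
            / tasks_per_hour(1, single_elapsed))
```

If one copy takes 100 seconds and two simultaneous copies take 110 seconds on a dual-processor machine, scaling_factor(100, 2, 110) is about 1.82, against an ideal of 2.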
The following formulas are used to calculate the final scores of the test:
"speed" SPEC int/fp = GEOMEAN(reftime / runtime * 100)
"rate" SPEC int/fp = GEOMEAN(1.16 * N * reftime / runtime)
The Sun Ultra 10 is used as the base system. Remember that for official publication each test must be run at least three times, and the median runtime is used.
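The two formulas above translate directly into code. The following Python sketch assumes that reftime and runtime are given per subtest in the same order and that N is the number of simultaneously running copies; the function names are made up for this illustration.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speed_score(ref_times, run_times):
    """"speed" score: GEOMEAN(reftime / runtime * 100)."""
    return geomean(100.0 * ref / run for ref, run in zip(ref_times, run_times))


def rate_score(ref_times, run_times, copies):
    """"rate" score: GEOMEAN(1.16 * N * reftime / runtime)."""
    return geomean(1.16 * copies * ref / run
                   for ref, run in zip(ref_times, run_times))
```

A system that matches the base machine on every subtest gets a speed score of 100, and one that is twice as fast everywhere gets 200.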
Using the geometric mean instead of the arithmetic mean smooths over differences between the runtimes of individual tests. This matters a great deal because the test suite does not change often: CPU2000 replaced CPU95, and applications are currently being collected for CPU2004.
As mentioned above, the test consists of two suites of applications, measuring the speed of integer and floating-point processing. Every subtest has its own name and a unique number, and they are usually written as, for example, 176.gcc.
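The smoothing effect is easy to see on made-up numbers. In the Python sketch below, the four per-test ratios are hypothetical: three subtests score 100 and a single one scores 400.

```python
import math


def arithmetic_mean(values):
    """Plain average of the values."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of n positive values, via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = [100.0, 100.0, 100.0, 400.0]  # hypothetical per-test ratios

print(f"arithmetic mean: {arithmetic_mean(ratios):.1f}")  # prints 175.0
print(f"geometric mean:  {geometric_mean(ratios):.1f}")   # prints 141.4
```

The single outlier pulls the arithmetic mean up to 175, while the geometric mean rises only to about 141, so one unusually well-optimized subtest has less influence on the final score.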
Brief descriptions of all the applications are given below.
The suite is far from monotonous. The applications were carefully selected by SPEC over a long time, and giants such as AMD, Compaq, HP and Intel agreed that they were of interest to users.
Now let's take a look at the CFP2000 tests, which are primarily computational tasks on double-precision floating-point numbers.
The CFP2000 suite is also varied. Note, however, that most of the tasks are highly specialized. Besides, as you know, most algorithms keep being improved and become much faster. For this test, therefore, it is more appropriate to use the single final CFP2000 score than the scores of individual subtests.
In the next parts of this SPEC CPU2000 review we will try to find out what the test actually measures and how, what the results depend on, and what the obtained figures mean. We will also show a good many interesting charts.