New CPU Performance Test Method: The Description and An Illustrative Sample Usage

Preface substitute

In the good old days everything was better and older; it's a well-known fact :). Processors were scarce, a new one appeared once every half a year at most, and they could be compared in a Favourite Comparative Program, which reported upon completion: this processor demonstrated performance of 26 units, and that one 32.5 units. But nasty manufacturers produced many architectures, technologies, and multiple other concepts, so CPU performance comparison is a real purgatory now: one program prefers one architecture, another prefers a different one, and a third makes no distinction between Pentium and Celeron or Athlon and Duron. That's the mess we live in now.

That's why we have to forget about the Favourite Comparative Program and use various programs. The more, the better (at least in theory). But what works in theory does not necessarily work in practice: detailed results presented on 101 diagrams inevitably put our readers to sleep, which is a necessary thing in itself but somewhat ill-timed in the middle of an article :). The road of a modern tester lies between this Scylla and Charybdis: on the one hand, to present as much information as possible about CPU performance in various programs, and on the other hand, not to overdo the volume, so that readers can read the article to the end.

Besides, a squall of processors gave rise to another problem: to test them each time from scratch is absolutely impossible because there are too many of them, but if we want to use the results from previous articles in future, we should decide on the set of benchmarks from the very beginning and try not to change it afterwards.

The above mentioned reasons gave rise to so called "methods of testing", which are actually fixed sets of software, test data, and options to get comparable test results for rather a long period of time. Today we'll present you a test method which is intended to be used on our web site to test CPU performance in 2005. Let's start with reservations, as always :).

Firstly, do not raise the alarm if you haven't found the Latest Version of Your Favourite Program here, or haven't found this program at all. A common standard method of testing does not mean that other software will not be used at all. It just means that most tests will be carried out within this framework. We can just as well write a separate article devoted to the output rates of X's and O's in the popular noughts and crosses game with a couple of dozen processors and all possible combinations. But that will be a special, "dedicated" test. In general cases the set of benchmarks will be as described below.

Secondly, though this article contains some results, you shouldn't consider it as test material. In fact, all you see below is draft results that were obtained in the process of estimating the suitability of programs for the tests rather than test results proper. Roughly speaking: if a program failed to react to processor changes, it was rejected because its results were useless for CPU performance evaluation. But if the program showed a difference, it was enlisted as a candidate to be included in the test method. The software list you can see below includes the candidates that have passed preliminary tests. Besides, in this article we decided to publish as many results as possible and to comment on which ones look the most interesting. In our opinion, this approach provides a deeper understanding of the test method: you can see for yourself that the results we plan to reject or to average really deserve this treatment.

Testbed configurations

Processors
- AMD Athlon 64 3500+ (Socket 939)
- AMD Athlon 64 3800+ (Socket 939)
- AMD Athlon 64 4000+ (Socket 939 aka Athlon 64 FX-53)
- 3.4 GHz Intel Pentium 4 eXtreme Edition (Socket 775)
- Intel Pentium 4 560 (3.6 GHz, Socket 775)
Motherboards
- ASUS P5AD2-E Premium (i925XE chipset)
- Engineering sample of a motherboard based on ATI Xpress 200P (RX480)
Memory
- 2x512 MB PC3200(DDR400) DDR SDRAM DIMM Corsair, 2-2-2-5
- 2x512 MB PC2-4300 (DDR2-533) DDR2 SDRAM DIMM Micron, 4-4-4-11
ATI Radeon X700 (PCI Express x16) video card
Western Digital WD360 (SATA), 10000 rpm HDD
Windows XP Professional SP2, DirectX 9.0c

Description of the test method

Section 1: "Semisynthetics"

We have entitled this section semisynthetics on purpose, because the programs used here cannot be considered synthetic in the strict meaning of the word. Pure processor synthetics are low-level tests for the execution speed of certain instructions. But in this case, for "synthetic" purposes we use applied formulas and algorithms that are also used in non-test software. For example, CPU RightMark simulates the interaction of solids, a task that is often solved in game engines as well. Of course, in this case the code is optimized as much as possible for performance and for the use of additional instruction sets, so the test results obtained for CPU performance can be called somewhat idealized. But you also cannot consider this task to be purely synthetic and of no practical use.

The second semisynthetic group is archivers. Here is the opposite situation: this task is absolutely real to all appearances, because we use software used on mass scale. However there are various archived data formats. In our case we use special test files, sort of an idealized situation: a lot of files of the same type, all the types are well compressible.

1.1 CPU RightMark (RMCPU 2004B)

In fact, CPU RM comprises two tests. The solver module calculates interaction of solids and the render module displays this interaction on screen. As we have already ascertained in practice before, the correlation of CPU performance of various architectures in solver and render modules is not always the same. Moreover, if the architectures are very different (Intel Pentium 4 and AMD Athlon 64, for example), the differences may be quite impressive. Let's have a look at the diagram below.

You can see that the AMD architecture is consistently faster in the solver module. Does it mean that the same tendency will be seen in the render module? Not necessarily.

On the contrary, the render module shows an obvious preference for the Intel architecture. First of all, this is because the render module in CPU RM is multi-threaded, which lets it benefit from the Hyper-Threading technology, and that is supported only by Intel processors. Of course you may ask a question: why not create a multi-threaded solver? We can say just one thing so far: the work is in progress, but it turned out that physics calculations lend themselves to threading much worse than rendering does.

The total CPU RM result is strangely the least interesting. The fact is, the solver performance amounts to hundreds of frames per second, while the rendering performance — only to dozens. As always happens in these conditions, the total performance is determined by the slowest element. Roughly speaking, the correlation between processors on the diagram and the overall results will almost always be similar to the diagram with rendering results. That's why the diagram with overall results seems not necessary if there are two separate ones for the solver and render modules.
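A rough way to see why the slowest stage dominates is sketched below. It assumes, purely for illustration, that every frame costs one solver step followed by one render step, so the per-frame times add up. This is our simplification and not a description of how CPU RightMark actually computes its total, and the frame rates are made-up numbers.

```python
def combined_fps(solver_fps: float, render_fps: float) -> float:
    """Frame rate when each frame needs one solver step and then one render step.

    Assumption for illustration only: the two stages run back to back,
    so their per-frame times are simply added.
    """
    frame_time = 1.0 / solver_fps + 1.0 / render_fps
    return 1.0 / frame_time


# Hypothetical figures: solver in the hundreds of fps, render in the dozens.
slow_solver = combined_fps(solver_fps=300.0, render_fps=30.0)
fast_solver = combined_fps(solver_fps=600.0, render_fps=30.0)

print(f"solver 300 fps, render 30 fps -> {slow_solver:.2f} fps total")
print(f"solver 600 fps, render 30 fps -> {fast_solver:.2f} fps total")
```

Under this assumption, doubling the solver speed moves the total by less than 5%, while the total stays close to the render figure.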

1.2 Archivers

As has already been said above, we tend to refer archiver tests to semisynthetics rather than to real tasks. Both programs we use demonstrate approximately the same preferences: they love fast memory with low latency and are sensitive to L2 cache capacity. Besides, unlike RAR, 7-zip can use multiprocessing (and its particular case, Hyper-Threading). Our test kit includes file groups of approximately the same size with the following file types: TXT (text documents), DOC (Microsoft Word files), PDF (Adobe format, which is often used for technical documentation), BMP (uncompressed graphics), DBF (one of the still popular database formats), DLL (binary files).

Due to its multiprocessing support, 7-zip shows preference to the architecture from Intel. However, note how slowly it operates: the same files take in general 3-4 times as much time to compress as in case of RAR. On the other hand, this is the only archiver we know that supports multiprocessing. Besides, it's absolutely free and supports unpacking multiple formats.

WinRAR is gradually becoming a de facto standard for compression stronger than what you can get with ZIP. It's not so picky, but in general it tends to show more preference for the AMD platform. Besides, it works really fast, even with the largest dictionary size (4 MB).

There is hardly any point in averaging the results of such different programs, and throwing out one of them is not objective (especially as there are only two of them). That's why we shall publish both diagrams in future articles.

Section 2: 3D Graphics

In this case we do not mean the 3D Graphics in general, but "3D modeling". It's represented by three most popular programs: 3ds max, Maya, and Lightwave. Of course, if you work in this field, you will notice the lack of SoftImage. Unfortunately, we don't have a method for testing this package yet. If you are really interested, contact us and tell us how you imagine it (this challenge mostly concerns professionals working in SoftImage). Perhaps, iXBT will place an interesting and unusual order ;).

2.1 3ds max 6

Though we have our own test method for rendering efficiency in 3ds max, upon thorough analysis we have decided on using a test from www.spec.org for the "common" test method. The main reason for this decision: it covers many more functions. If you read articles with iXBT.com tests, you should have noticed that our method measured only rendering speed, albeit with three different render engines. But final rendering is not the only task whose speed interests people working with this software package. The speed of visualization, rotation, responsiveness to user actions: all of this is no less interesting.

The test from SPEC is a script that imitates user actions. It includes rendering among other things. That's why it looks more preferable from the point of view of measuring CPU performance in 3ds max (not only for rendering). Of course, it doesn't mean that our original method will be discarded – we are going to update and enhance it in future, but it will be used only in "full-scale" tests oriented at 3ds max users, not for regular users, who are interested in a mixed bag.

Note: in the general case, if a diagram contains points or frames per second, the highest result is the best. And if the unit of measurement is time, the best result is the lowest one. If the lowest points happen to be the best result, the article will provide a footnote.
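The rule in the note can be expressed as a small helper that turns any result into a "bigger is better" ratio against a baseline system. This is only a sketch of the convention; the function name and the sample numbers are ours and hypothetical.

```python
def relative_performance(value: float, baseline: float, higher_is_better: bool) -> float:
    """Return how many times faster `value` is than `baseline` (>1 means faster).

    For points or frames per second the ratio is value / baseline;
    for times (or "lowest is best" points) it is inverted.
    """
    if value <= 0 or baseline <= 0:
        raise ValueError("results must be positive")
    return value / baseline if higher_is_better else baseline / value


# Hypothetical results: 66 fps against a 60 fps baseline,
# and 90 seconds against a 100 second baseline.
fps_ratio = relative_performance(66.0, 60.0, higher_is_better=True)
time_ratio = relative_performance(90.0, 100.0, higher_is_better=False)

print(f"fps result:  {fps_ratio:.3f}x the baseline")
print(f"time result: {time_ratio:.3f}x the baseline")
```

Once every result is expressed this way, diagrams with different units can be read in the same direction.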

Results of the rendering test group are of the least interest, because the SPEC test uses the built-in 3ds max render engine, which is practically ignored by professional designers working in this program.

Interactive operations – that's what we are mostly interested in in the SPEC test. There are some differences between processors, but they are hardly considerable. It's quite normal: as a rule, the more "real" the situation is, the less the difference is. Only synthetics may demonstrate manifold advantages, or at least 30-40-50% ones. The pure performance of a given component (processor, video card, memory, hard disk) in real tasks is superimposed by performance of other devices, so a replacement of a single component will not have such an impressive effect. On the whole, we can certify the advantage of AMD.

As this article is mostly focused on the test method description instead of the test results, we'll mention a method for calculating the total Composite scores: they include the results of the two previous groups (rendering and interactive operations) in the following proportion: 20% from rendering and 80% from interactive operations. Thus, interactive operations possess an overwhelming advantage, which makes it quite easy to predict the results: the ultimate winner in the majority of cases will be the winner in interactive tests.
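To illustrate why the interactive group decides the outcome, here is a minimal sketch of a 20/80 composite. The text above gives only the proportion, not the formula, so the weighted geometric mean used here is our assumption (a weighted arithmetic mean would show the same tendency), and the scores are hypothetical.

```python
import math

# Weights from the text: 20% rendering, 80% interactive operations.
WEIGHTS = {"rendering": 0.2, "interactive": 0.8}


def composite(scores: dict) -> float:
    """Weighted geometric mean of the group scores (assumed formula)."""
    log_sum = sum(weight * math.log(scores[name]) for name, weight in WEIGHTS.items())
    return math.exp(log_sum / sum(WEIGHTS.values()))


# Hypothetical scores: CPU A is 10% ahead in interactive operations,
# CPU B is 20% ahead in rendering.
cpu_a = {"rendering": 1.00, "interactive": 1.10}
cpu_b = {"rendering": 1.20, "interactive": 1.00}

print(f"CPU A composite: {composite(cpu_a):.3f}")
print(f"CPU B composite: {composite(cpu_b):.3f}")
```

With these made-up numbers CPU A still comes out ahead overall, even though its rendering deficit is twice as large as its interactive lead.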

In this connection it seems reasonable to publish only interactive test results in future articles: rendering with the built-in engine is of little interest and accounts for a small share of the total score, so the total score is just a slightly blurred version of the interactive test results.

2.2 Maya 6

The latest version of 3D modeling package from Alias|Wavefront is Maya 6. But at the time we decided on the test method, the SPECapc for Maya test was available only for Maya 5. However, it works well in Maya 6 as well, and we jumped at this opportunity. The "native" Maya 6 test is already available at present and later on we'll certainly use this version.

From the point of view of the test process organization, the SPEC test for Maya is similar to the 3ds max test – it's also a script that imitates various user actions. But in this case there is no final render test, that's for the better though. Upon completion the benchmark outputs four results: performance of the graphics system, input/output system, processor, and the total score.

Though theoretically graphics performance should depend on a video card (which was the same in all our tests), the differences between some processors are noticeable even in this test. However, considering that the x86 PC graphics system performance traditionally depends very much on how the drivers are implemented, and the driver code is executed by a processor, the explanation for this phenomenon lies on the surface. By the way, powerful processors demonstrate almost identical results, only weaker CPUs do not allow a video card to make maximum use of its capacity.

Frankly speaking, the SPEC web site does not provide clear explanation of the I/O system performance notion. It may be the data access rate of the hard disk – but we are absolutely uninterested in this parameter. We can still notice differences between some processors, so let's not write this test off.

Of course, the best difference in CPU performance is demonstrated by the test, which is named accordingly: "CPU". However even here we've got some questions: Athlon 64 3800+ turned out to be no different in performance from Athlon 64 4000+. This is theoretically possible, because they have the same clock. But still it's somewhat strange that a 3D modeling package cannot make use of the doubled L2 Cache capacity. However, one can also assume that the results were smoothed over by the medium performance video card. We'll check this assumption in future tests by replacing ATI Radeon X700 with X800.

The total score comprises the results of the three previous tests with the following weight factors: graphics – 70%, processor – 20%, I/O system – 10%. You can easily notice that our most interesting test has a relatively weak effect on the total result. Taking into account this fact and that some CPU influence (though inconsiderable) is still felt in all tests, it seems expedient to publish two diagrams for Maya: CPU test results and the total score.
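How weak the effect of the CPU test on the total is can be estimated with a short sketch. It assumes that the weights act as exponents in a weighted geometric mean; the text above states only the percentages, so the formula is our assumption.

```python
# Weights from the text: graphics 70%, processor 20%, I/O system 10%.
MAYA_WEIGHTS = {"graphics": 0.70, "cpu": 0.20, "io": 0.10}


def total_score_gain(component: str, component_gain: float) -> float:
    """Factor by which the total changes when one sub-score changes by `component_gain`.

    Assumed model: total = product(sub_score ** weight), so a factor g
    in one sub-score multiplies the total by g ** weight.
    """
    return component_gain ** MAYA_WEIGHTS[component]


for name in MAYA_WEIGHTS:
    gain = total_score_gain(name, 1.10)
    print(f"+10% in {name:8s} -> +{100 * (gain - 1):.1f}% in the total score")
```

Under this model a 10% advantage in the CPU test shows up as less than 2% in the total, which is why a separate CPU diagram is worth keeping.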

2.3 Lightwave 3D 8

Unfortunately, we didn't manage to find an interactive test for Lightwave 8, so in this case we'll have to content ourselves with rendering a relatively complex scene. It would also be good to mention that Lightwave still refuses to "understand" automatically the number of CPUs. So the number of threads for the render module has to be specified manually. But our internal tests demonstrated that there was an easier way out: this parameter can always be set to maximum (8 threads). The increased number of rendering threads has no noticeable effect on the performance of single CPU systems.

There is no need to comment on the results; the demonstrated preferences are quite obvious. As the difference between processors of different architectures is noticeable, this test can be safely considered useful and interesting and included in our test method. The diagram choice issue does not come up here because this test outputs only a single result.

Section 3: Bitmap graphics and prepress operations

The main test in this section is a script for Adobe Photoshop CS (8), developed in our testlab. It includes the most frequent operations: Blur and Sharpen filters, RGB -> CMYK -> Lab conversions, lighting effects, image rotation, resize, and transform operations. These actions are performed on a real photograph taken by a digital camera. At multiple requests from our readers we have also included an Adobe Acrobat Distiller test in this section: PS into PDF conversion. It uses several real iXBT.com magazine articles.

3.1 Adobe Photoshop CS (8)

The Blur script comprises image processing with Gaussian Blur, Motion Blur and Radial Blur. The latter filter includes the fewest operations because it's the most time-consuming. Filter parameters vary within a wide range.

The Color script is the simplest one: an image is consecutively converted from original RGB 8 bit/channel into CMYK 8 bit, Lab 8 bit, Lab 16 bit, CMYK 16 bit, RGB 16 bit, and then again into RGB 8 bit. This sequence is repeated a sufficient number of times to minimize the measurement error effect on the results.
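The structure of such a script, a fixed sequence repeated many times under one timer, can be sketched as follows. The real script runs inside Photoshop; here `fake_convert` is a hypothetical stand-in for a mode conversion and the repeat count is arbitrary.

```python
import time
from typing import Callable, List, Sequence

# Conversion targets in the order given in the text (the image starts as RGB 8 bit).
COLOR_SEQUENCE = (
    "CMYK 8 bit", "Lab 8 bit", "Lab 16 bit",
    "CMYK 16 bit", "RGB 16 bit", "RGB 8 bit",
)


def time_repeated(step: Callable[[str], None], modes: Sequence[str], repeats: int) -> float:
    """Run the whole sequence `repeats` times and return the total wall time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for mode in modes:
            step(mode)
    return time.perf_counter() - start


visited: List[str] = []


def fake_convert(mode: str) -> None:
    """Hypothetical placeholder: only records the requested mode."""
    visited.append(mode)


elapsed = time_repeated(fake_convert, COLOR_SEQUENCE, repeats=3)
print(f"{len(visited)} conversions in {elapsed:.6f} s")
```

Timing the whole run rather than each conversion keeps the timer overhead and its resolution small relative to the measured interval.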

The Lighting script operates with separate and multiple light sources of all types available in Adobe Photoshop: Spotlight, Directional, Omni, in various combinations.

The Rotate script consecutively rotates an image clockwise and counterclockwise at various "inconvenient" degrees (3.3, 6.6, 9.9, etc).

The Sharpen script applies Unsharp Mask with various parameters to an image, each parameter varies within a wide range.

The Resize script consecutively (stepwise) enlarges an image by over threefold (in each coordinate) and then shrinks it by threefold relative to the original size. This sequence is repeated a sufficient number of times to minimize the measurement error effect on the results.

The Transform script uses a cycled combination of all transform operations available in Adobe Photoshop: Scale, Rotate, Skew, Distort, Perspective.

The total score is calculated as the geometric mean of the execution times of all seven scripts. Although the value is still formally expressed in units of time, it should be treated as a mere "score", because a geometric mean of several time intervals has practically nothing to do with the actual execution time of the scripts.
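A minimal sketch of this calculation is given below. The script names come from the text, while the times are hypothetical. It also shows one property of the geometric mean that makes it convenient here: a 10% speed-up in any single script changes the score by the same amount, no matter how long that script runs.

```python
from statistics import geometric_mean

# Hypothetical execution times of the seven scripts, in seconds.
script_times = {
    "Blur": 120.0, "Color": 45.0, "Lighting": 80.0, "Rotate": 60.0,
    "Sharpen": 30.0, "Resize": 90.0, "Transform": 150.0,
}


def total_score(times: dict) -> float:
    """Geometric mean of the per-script times (lower is better)."""
    return geometric_mean(list(times.values()))


def with_speedup(times: dict, script: str, factor: float) -> dict:
    """Copy of `times` with one script's time multiplied by `factor`."""
    changed = dict(times)
    changed[script] *= factor
    return changed


base = total_score(script_times)
faster_sharpen = total_score(with_speedup(script_times, "Sharpen", 0.9))
faster_transform = total_score(with_speedup(script_times, "Transform", 0.9))

print(f"base score:            {base:.2f}")
print(f"Sharpen 10% faster:    {faster_sharpen:.2f}")
print(f"Transform 10% faster:  {faster_transform:.2f}")
```

With a plain sum of times, the longest script would dominate the total instead.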

We could get carried away and provide lengthy comments on practically each result and test. The performance difference between various processors and architectures changes unpredictably from test to test. However, remember that the present article is not devoted to testing processors in Adobe Photoshop, but to a general test method for CPU performance. That's why in "all-embracing" articles (as opposed to specialized materials) we'd better confine ourselves to a diagram with total scores, even though each test is interesting in its own way. This does not at all mean that we cannot write a specialized article devoted to in-depth CPU performance tests solely in Adobe Photoshop based on these test scripts.

3.2 Adobe Acrobat 6 Distiller

Those who use this program professionally do not need to be told what it's used for. Those who have no idea of this program would require a lengthy lecture. Let's just make do with a brief explanation: Distiller converts data (files) from one format into another. It's an inevitable prepress operation in most cases, it's done rather often, and it takes up much time. In other words, it's a classic task whose execution speed is vital for users.

Processors demonstrate noticeable differences, sometimes quite considerable, which is actually one of the key requirements to a test. Practical character of Distiller for a certain group of users is also beyond doubt. This test possesses all prerequisites to be included into the standard method of testing.

Section 4: CAD/CAM

This is our landmark decision, so we tried not to include a lot of programs into this section. In fact, it currently consists only of one program: SolidWorks 2003. We use a relatively old version, because SPEC.org has not yet released a test package for later versions. However, software develops relatively slowly in this sector, so one can hardly expect cardinal differences in CPU preferences between neighbouring versions.

Let's provide a brief description of the area where this software is used for those who have never dealt with it before. It's mostly used in complex machinery design: engines, cars... up to space ships, actually. SolidWorks is a "respectable" engineering CAD, and in our opinion the results of CPU test in this program will be appreciated by engineers working in this field.

4.1 SolidWorks 2003

According to the SPECapc tradition, the test script imitates user operations and outputs four results upon completion: total score, performance of the graphics system, input/output system, and processor. It should be noted that SPECapc for SolidWorks 2003 retains the performance rating in points, but the best results here are the lowest ones.

Both the previous diagram and the graphics system performance test demonstrate no substantial differences between processors. But on the other hand, we haven't run the processor test yet...

...Fortunately, this test demonstrates much more developed differences between processors. Perhaps we should try and install a more powerful video card: the difference may grow larger.

The mysterious I/O test... also demonstrates a considerable difference! It's especially prominent in the comparison of Athlon 64 3500+ and Athlon 64 3800+. It's not quite clear why it happens: the only difference between the systems is their processors, and all other components, including the motherboard chipset, are the same. Perhaps SPEC does not mean HDD performance by I/O system performance, otherwise the differences between the results would be hard to explain.

So far the most appropriate candidate to be included into our future articles is the diagram with CPU performance test. Though we should try and install a more powerful video card, and if the graphics results get more informative, there may be a point in publishing the total score as well.

Section 5: Media encoding

This section comprises everything connected with encoding video and audio data, that is classic WAV -> MP3 conversion as well as video compression with popular codecs. The audio section limited to a single MP3 LAME codec may seem an unjustified narrowing of the topic, but let's face the truth: the overwhelming majority of users encode audio into MP3 (not OGG, not WMA), and they actually do it using LAME. The video section provides a tad more variety: we use four encoders: DivX, XviD, Windows Media Video 9, and MPEG2 Mainconcept MPEG Encoder.

5.1 Encoding audio

Good old LAME... Due to the great number of presets and their votaries, we have taken the simplest tack: encoding with maximum possible quality: 320 kbps CBR, q=0.

We already found out before that the q=0 parameter results in plummeted "cache-mongering" of the latest codec version, which is confirmed by the results of the diagram above (best results are demonstrated by Pentium 4 eXtreme Edition with 2 MB L3 Cache).

5.2 Encoding video

It would have been strange not to include DivX into the encoding speed test... so we have done it. The new version brought no special surprises.

As for XviD, the older it grows, the more it resembles its commercial brother DivX in performance.

...Windows Media Video 9 also demonstrates a similar picture...

...And MPEG2 encoding with Mainconcept MPEG Encoder does not change this picture either!

Thus, we see one diagram, which certainly deserves to be analyzed separately from all the others (encoding audio with LAME) due to its different character, as well as four diagrams with results for various codecs, which look very much alike. In this situation it looks "conceptually correct" to combine all the four video diagrams into one (by geometric mean) and to publish results for separate codecs only if their behaviour suddenly becomes different from the described above.
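The combining rule described above can be sketched as follows: one geometric mean per processor across the four codecs, plus a simple check that flags a codec whose ordering of processors differs from the combined one. The check by ranking is our own hypothetical criterion for "behaviour becomes different", and all the times are made up.

```python
from statistics import geometric_mean

# Hypothetical encoding times in seconds (lower is better).
results = {
    "DivX":  {"CPU A": 100.0, "CPU B": 110.0, "CPU C": 125.0},
    "XviD":  {"CPU A": 95.0,  "CPU B": 104.0, "CPU C": 120.0},
    "WMV9":  {"CPU A": 140.0, "CPU B": 150.0, "CPU C": 170.0},
    "MPEG2": {"CPU A": 60.0,  "CPU B": 66.0,  "CPU C": 75.0},
}


def combined_video_score(data: dict) -> dict:
    """Geometric mean of each processor's times over all codecs."""
    cpus = list(next(iter(data.values())))
    return {cpu: geometric_mean([per_cpu[cpu] for per_cpu in data.values()]) for cpu in cpus}


def ranking(per_cpu: dict) -> list:
    """Processors ordered from fastest (shortest time) to slowest."""
    return sorted(per_cpu, key=per_cpu.get)


def divergent_codecs(data: dict) -> list:
    """Codecs whose processor ordering differs from the combined ordering."""
    overall = ranking(combined_video_score(data))
    return [codec for codec, per_cpu in data.items() if ranking(per_cpu) != overall]


# Same data, except that CPU B overtakes CPU A in WMV9.
changed = {codec: dict(per_cpu) for codec, per_cpu in results.items()}
changed["WMV9"].update({"CPU A": 150.0, "CPU B": 140.0})

print("combined:", combined_video_score(results))
print("divergent (original data):", divergent_codecs(results))
print("divergent (changed data): ", divergent_codecs(changed))
```

A codec that ends up in the divergent list would be a candidate for a separate diagram.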

Section 6: 3D visualization

That's how we decided to entitle this section, most of which is devoted to... game tests. In fact everything is clear: when we test PC performance with demos, we test 3D visualization, nothing more. No game AI, even if available, is ever used in demos: movements of characters are programmed once and for all, and they have no "personal will" in test runs. So the load is mostly on the part of the engine that feeds data to the video system. It may also fall on physical calculations (if the engine can do that). That's why it's generally wrong to say that demos reflect the real PC performance in real games: for example, demos don't use AI. Of course, AI works only in single-player mode, so we can say that demos reflect the real game speed in multiplayer mode.

But not only games are included into this section. SPEC viewperf falls in beside the games at the very end of the queue – it tests 3D graphics visualization performance in professional engineering software with OpenGL API support. Why not? What's the principal difference between the demos of DOOM3 or FarCry and 3ds max or Pro/ENGINEER? There is no difference except for the visualization engine. That's why SPEC viewperf is placed in this section. But we'll start with games anyway...

6.1 Modern 3D games

We have tested all the games in two modes: at 640x480, 32-bit color, and low quality settings, and at 800x600, 32-bit color, and medium image quality settings. Internal tests demonstrated that more complex graphics modes in these games (e.g. 1024x768x32 with high image quality) level out the CPU influence on performance, which is why they are of no interest from the point of view of the main objective of our test method.

Graphics settings and resolution have no special effect on the comparative performance of various systems... but what is worse, there is absolutely no considerable difference between CPUs, which differ much in their performance! Of course, we can blame the weak video card, but DOOM3 does not qualify for a proper CPU benchmark.

The plot thickens in Far Cry, because it has four standard demos. Considering two resolutions and graphics quality settings, we have eight (!) diagrams in total. We'll publish all of them here for you to see with your own eyes that...

...the performance picture is the same from level to level. On the above diagram you can see the geometric mean of the low-quality graphics settings in all demos; it repeats the picture of the four previous ones. Are there any changes at the increased resolution and quality settings?...

No, we can see no changes, only the spread of results got a tad smaller, which can be easily explained by the increased influence of the video card. Thus, the summary result of the four demos at 640x480x32 with low quality settings can stand for the picture of CPU performance in 3D visualization in the Far Cry engine. It's the best at demonstrating differences between processors. And don't forget that this is our only priority here.
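The choice between the two modes can be put in numbers by comparing the spread of results, that is, the ratio of the best to the worst processor in each mode. The sketch below uses hypothetical frame rates. The mode labels follow the text, but the selection rule itself is our illustration.

```python
def spread(fps_by_cpu: dict) -> float:
    """Ratio of the fastest to the slowest result (1.0 means no difference)."""
    values = list(fps_by_cpu.values())
    return max(values) / min(values)


# Hypothetical summary results in frames per second.
modes = {
    "640x480x32, low quality":    {"CPU A": 120.0, "CPU B": 105.0, "CPU C": 90.0},
    "800x600x32, medium quality": {"CPU A": 100.0, "CPU B": 92.0,  "CPU C": 84.0},
}

for name, fps in modes.items():
    print(f"{name}: spread {spread(fps):.3f}")

# The mode that separates the processors best is the one worth publishing.
best_mode = max(modes, key=lambda name: spread(modes[name]))
print("best at showing CPU differences:", best_mode)
```

A spread close to 1.0 would mean the test hardly reacts to the processor at all.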

Painkiller is certainly one of the most interesting games for a CPU test, specifically its new C5L2 benchmark (the older C5L1 depends little on the processor, as we found out in our internal tests). Pentium 4 eXtreme Edition demonstrates an impressive result: compared to the 3.6 GHz Pentium 4 on the Prescott core, the P4 XE results are like a splash of cold water. And that's what a tester needs, plenty of "cold water". :)

For a test in Unreal Tournament 2004 we used two "third-party" (not part of the bundle) demos: Ons_dria and Primeval. They are much more intense in the number of events, participants, and special effects per unit of time than the standard ones. However, we still don't see any significant difference between processors; only the traditional rearguard of most game diagrams (Prescott) dropped even further behind. But it's Prescott that keeps the difference large enough not to exclude the game from our test method. Perhaps the summary diagram for the low resolution is worthy of being included in articles.

There may be also a point in drawing a little "gaming bottom line" as a summary diagram with the average result for all the four (or three, if DOOM3 does not justify hopes) tests. Though its necessity is doubtful, frankly speaking. We are looking forward to your comments...

Some words about the sore subject: about Half-Life 2. Of course, we are going to consider the introduction of this game test into our test method, especially as we already have the real candidate to get the bounce (DOOM 3). But in practice, using Version 1.0 of a game in testing (that is the first release, no patches) often does not pay its way, because as soon as patches come out, the performance situation may change considerably. That's why it makes sense to wait at least for the first patch or better for two or three of them, and only then introduce this game test into the standard method of testing.

6.2 3D modeling packages

In case of SPEC viewperf, there is only one diagram worthy of being included into articles – the one with total test scores. Trust us, we just didn't want to crowd the article with extra diagrams, there are already no less than 50 of them :).

Conclusion

So, we have reviewed the method of testing modern CPU performance proposed for 2005. Of course, it will get modified with the release of new programs and updated versions of the old ones, but we'd like to hope that the current backbone will survive: it took us rather long to work it out, we tried to take into account the majority of users' wishes, and we "wouldn't wish it on our enemy" to rewrite this test method from scratch. From our point of view, the new method of testing is quite adequate: most resource-intensive areas are taken into account, performance differences between processors are easy to make out, the summary diagrams are factual. In the nearest future (if our readers don't find out any obvious glitches in the suggested test method), we are going to publish the first summary test of a larger CPU range, including not only topnotch models, but popular CPUs as well. Let's summarize the set of diagrams, which we accepted as an optimal choice after we reviewed all test results:

CPU RightMark 2004B
- Solver module
- Render module
Archivers
- 7-zip
- WinRAR
3D graphics
- SPEC for 3ds max 6, interactive test
- SPEC for Maya 6, CPU test or the total score
- Rendering in Lightwave 3D 8
Bitmap graphics and prepress operations
- Test script for Adobe Photoshop 8, total scores
- PS -> PDF conversion in Adobe Acrobat Distiller
CAD/CAM
- SPEC for SolidWorks 2003, CPU test or the total score
Media encoding
- LAME encoding
- Summary results for DivX/XviD/WMV9/Mainconcept MPEG Encoder
3D visualization
- DOOM3, low resolution (dubious)
- Far Cry, summary diagram for 4 demos in low resolution
- Painkiller, C5L2 benchmark, any resolution (even 800x600x32)
- Unreal Tournament 2004, summary diagram for 2 low resolution demos
- SPEC viewperf 8.01, summary diagram for all tests.

17 diagrams in total. On the one hand it seems quite enough for an adequate CPU performance evaluation; and on the other hand, it's not enough to put readers to sleep before they read an article up to the end :). We are looking forward to your feedback!

Stanislav Garmatiuk (nawhi@ixbt.com)

February 17, 2005
