Battle of ATI and NVIDIA: where are fair duels? or dishonest treating of 3DMark

NVIDIA

No sooner had the disputes around the 3DMark2003 and Detonator FX drivers faded away than www.tech-report.com published some information about new cheats for this benchmark in the same Detonator FX, namely about the aggressive optimization of anisotropic filtering with the name of an active application being the criterion of its activation. The ShaderMark from ToMMTi-Systems also had its shader code slightly modified which noticeably affected the performance; it indicates that the Detonator FX plays tricks on it as well. Since this hunting is acquiring the mass character we decided to investigate this issue ourselves. But we realize that to reveal the driver's tricks with benchmarks we need to cooperate with developers of given applications in 99% of cases. Moreover, even the developers are not always able to find and eliminate such optimizations because drivers often use very sophisticated methods of detection of one or another application. That is why we decided to look at the problem differently and to analyze the driver inside to locate and lock all ways of detection of one or another Direct3D application. It will disarm the driver and bring to minimum the risk of getting wrong benchmark results.

So, the starting point in our investigation was that the driver reacts to renaming of an executable file of at least one benchmark. It means that its code must contain an algorithm that analyzes names of active Direct3D applications. The first our step was to look for it. The research showed that the Detonator FX really contains a base of applications detectable with a command line that covers about 70 elements. When creating a Direct3D context the driver calculates two 64bit checksums using the name of the executable file and a list of captions of all process windows to get a unique 128bit identifier of the Direct3D application and to make settings for this application if it's available in the base. Unfortunately, it's impossible to find out names of applications from 128bit identifiers from the base. The only thing we can do to decode names of applications is to write a utility able to calculate checksums for an arbitrarily entered line according to the algorithm used by the driver. It can help to check whether the base contains a given application. And now we can only guess what applications beside 3DMark2003 can be found there. Its size indicates that the most part of it is used to avoid certain problems and probably to optimize certain gaming applications because the overall number of existant synthetic and gaming Direct3D benchamrks is noticeably less than 70. Supposedly, the same base contains compatibility settings for forcing full-screen anti-aliasing by the super-sampling method in certain DirectX6 and DirectX7 applications (the base of compatibility settings for supersampling of quite a big size wasn't decoded in the earlier versions). But it mustn't be ruled out that some of applications are detected by the driver in order to realize tricks undesirable for users, in particular, to activate aggressive optimizations for one or another benchmark. Whether it's right or not will be found out a bit later, and now we are starting the procedure of calculating the checksum which is used by the driver to identify applications.

This procedure is used by the driver not only for identification of an active application by creating a 128bit checksum when the context is made, but also uses it to form additional checksums. We found the links to this procedure and very suspicious calculations of checksums in handlers of many D3DDP2OP_... tokens. Let me digress a little to explain to you what are D3DDP2OP_... tokens. If you have some knowledge of programming in the API Direct3D, you probably know that an application communicates with a Direct3D accelerator on the COM interface IDirect3DDevice..., but do you know when and how transferred from the application to the interface get to Direct3D driver? Requests for many methods of this interface are transformed by the API into special control words (tokens) that get into the internal command buffer, and its contents is the input data for the Direct3 driver. For example, when creating a pixel shader by the command CreatePixelShader the API puts into the command buffer the token D3DDP2OP_CREATEPIXELSHADER and a structure unique for this token which contains a shader code to be created. In other words, the internal command buffer contains D3DDP2OP_... tokens that identify one or another command a given Direct3D application fulfills and some data provided by the application. The Direct3D driver, or rather its D3dDrawPrimitives2 function processes the command buffer. This stage of the driver's processing of buffer tokens suits best of all for realization of detection of applications by analyzing the data supplied. At this moment the driver can identify vertex and pixel shaders created by a given application, analyze the format and contents of textures when it's loaded into the video memory and identify many other specific things. That is why we should focus on the command buffer processor realized in the body of the D3dDrawPrimitives2 function if we want to find code parts used by the driver for identification of certain applications.

So, we said that we noticed suspicious calculations of checksums in functions processing certain D3DDP2OP_... tokens. Unfortunately, there were also D3DDP2OP_CREATEPIXELSHADER and D3DDP2OP_CREATEVERTEXSHADER, and the checksum was calculated according to the tokenized input shader code. It means that this is the very attempt of the driver to identify one or another pixel or vertex shader or one or another Direct3D application. The scandal around the 3DMark2003 and DetonatorFX proved that the driver identifies and replaces some pixel and vertex shaders in this benchmark, but it's confusing that only the number of calculated checksums of vertex shaders is about 50, which is incomparable to the number of optimizations for the 3DMark2003 found and voiced by Futuremark. That is why we decided to lock the checkups found to estimate a pure performance of the Direct3D driver. The well structured Detonator code helped us: since the driver uses the same procedure of checksum calculation in all mechanisms of identification of Direct3D applications, we created a small patch-script NVAntiDetector that slightly modifies the calculation algorithm by changing the original value of its seed. The procedure thus calculates checksums different from the original ones and locks all drivers' detecton methods, the command line analysis etc. Surely, we can't guarantee that this patch-script disables all tricks, as well as locks admissible optimizations and algorithms of avoiding problems in certain applications. We admit that but it's not an obstacle because our aim is to estimate performance in benchmarks, and no detection algorithms for one or another popular synthetic test application used for performance estimation only should take place in the driver's body.

The results of the Detonator FX tested together with the NVAntiDetector exceeded all expectations. The GPU family with its flagship GeForceFX 5900Ultra got much slower in almost all synthetic and gaming Direct3D benchmarks. The performance drop in the 3DMark2003 is incomparable to the results of the patch 330 that locked all Detonator FX's cheats used in this benchmark detected by the Futurtemark. Only the beta version of our RightMark3D showed the same scores, but this benchmark is not very popular yet, and its shaders are unknown to the Direct3D driver. At the moment we are working on a full-scale review of production cards based on the GeForceFX 5900Ultra which will also cover operation of this card together with the NVAntiDetector. And now let's estimate the scores of the benchmark which is becoming a thing of the past. As you understand, we will speak about the predecessor of the 3DMark2003 - 3DMark2001 SE. You may say that this is an obsolete benchmark, and it makes no sense in the tests. But if you remember, Futuremark always tried to make benchmarks with an eye to the future, that is why its complexity corresponds to what we can see in today's games. Moreover, looking for tricks in this benchmark will help us prove or disprove NVIDIA's statements made right after the release of the Futuremark's patch 330 that eliminates Detonator FX's cheats for the 3DMark2003. If you remember, the company tried to justify the tricks saying that the 3DMark2003 tests didn't comply to what we could see in future games and the 3DMark2003 discredited their products. Let's have a look at the previous Futuremark's solution which didn't produce such a negative reaction and wasn't so biased from NVIDIA's standpoint.

So, to repeat results of our tests you will need:

AntiDetector scripts (support all Detonator FX versions available (44.03, 44.61 and 44.65) under the Windows 2000/XP).
The latest RivaTuner version because all patch-scripts are RTS (RivaTuner Script) files.

For the scripts you will need to install and start the RivaTuner at least once to register this application as a processor of RTS files. Then you can start a script required from the OS explorer, and install the patched version having changed the distrtibutive of the Direct3D driver. It's enough to lock in the Direct3D driver all found mechanisms of detection of Direct3D applications. Well, let's see whether the results of the 3DMark2001 will change after installation of the NVAntiDetector.

Testbed:

AthlonXP based computer:

AMD AthlonXP 2100+ Throughbred @ 2400+ (2000MHz, 133MHz FSB);
ASUS A7N8X on the NVIDIA nForce2;
512 MB PC2700 DDRAM, CAS2;
GeForce4 Ti4600;
Windows XP SP1; DirectX 9.0a;

It's clear that the situation with tricks in the synthetic benchmark is not new for NVIDIA, and the optimization for the 3DMark2003 is not an exception:

With the benchmark detection being impossible the total score of the 3DMark2001 goes down by almost 10%. The total score is based on the game test results and is calculated by the following formula:

3DMark score = (Game1LowDetail + Game2LowDetail + Game3LowDetail) * 10 + (Game1HighDetail + Game2HighDetail + Game3HighDetail + Game4) * 20

The formula indicates that the half of difference (i.e. almost 600 points, (20 * (74.6 - 42.8))) between the original and modified driver is dependent on the Game4 results. What does the driver do to be able to lift the scores so greatly when it detects a given scene?

Let's take a look at the test4 and try to find where the driver could save on calculations. Most probably, the rendering speed of the Nature scene must depend a lot on animated grass and leaves of trees which make a considerable part of the scene geometry. They are the main candidates for optimization, and the driver should save on them first. To clear that out we are going to compare the image quality on the screenshots made with the original and modified drivers (ALL FULL-SIZE SCREENSHOTS are in the BMP format, that is why EACH IS ABOUT 2.5MB, they are packed in RAR achives each measuring from 1.3 to 3 MB):

Detonator 44.03	Detonator 44.03 + NVAntiDetector

If you look attentively you will see that the leaves and the grass turn by a certain angle with the installation of the NVAntiDetector. To demonstrate the difference we used Adobe Photoshop to calculate the per-pixel difference between them and then intensified it using the Auto Contrast function:

I can't say that the image is better or worse - it's just a different scene with different geometry. It will get clearer if we take into account that the animation of leaves and grass that imitates a wind effect is realized with vertex shaders. This fact and the difference between the screenshots indicate that there must be some tricks with the vertex shaders. Let's have a closer look at the shader's code used for animation of leaves in the Nature scene. It's simple to do because the 3DMark2001 keeps the shader code in the text form in data.ras. So, please have a look at tree.vsh:

vertex shader file for dx8 scene trees ;
; constants:
; c0-c3 view matrix 4x4
; c4 x = time, y = amplitude, z = bias + PI, w = 1.5f
; c5 x = diffuse amplitude, y = diffuse adder, z = 1.0f / 40320.0f, w = 1.0f / 362880.0f
; c6-c8 cam->world rotation 3x3
; c9-c11 world->mesh 4x3
; c12 vecsin ( 1.0f, -1.0f / 6.0f, 1.0f / 120.0f, -1.0f / 5040.0f )
; c13 veccos ( 1.0f, -1.0f / 2.0f, 1.0f / 24.0f, -1.0f / 720.0f )
; c14 pis( PI, 1.0f / (2.0f * PI), 2.0f * PI, 0 )
;
; in:
; v0 diffuse
; v1 uv0
; v2 x = X, y = Y, z = phase1, w = phase2
; v3 xyz = world position, w = 1
;
; out:
; oPos, oD0, oT0

; float sin2 = sin( vd.fPhase2 + fTime * 1.5f );
;
; float fAngle = sin( vd.fPhase1 + C_fTime ) * sin2 * C_fAmplitude + C_fBias;
; float s = sin( fAngle );
; float c = cos( fAngle );
;
; v.x = c * vd.x - s * vd.y;
; v.y = -( s * vd.x + c * vd.y );
; v.z = 0.0f;
;
; v *= C_matrix;
;
; vWorldPos = vd.vWorldPosition;
; vWorldPos *= C_mWorldToObject;
; v += vWorldPos;
;
; vDiffuse = vd.vDiffuse;
; vDiffuse *= sin2 * C_fDiffuseAmplitude + C_fDiffuseAdder;
;
; pPrim->setPosition( i, v );
; pPrim->setDiffuseRGB( i, vDiffuse );

vs.1.0

;;;;;;;;;; sin2 = sin( vd.fPhase2 + fTime * 1.5f );
;;;;;;;;;; 13 instructions
; v2.w = phase2
; c4.x = time
; c4.w = 1.5

mad r0.x, c4.x, c4.w, v2.w

; wrap theta to -pi..pi
mul r0.x, r0.x, c14.y ; divide by 2*PI
expp r5.y, r0.x ; get fractional part only
;;mul r0.x, r5.y, c14.z ; multiply with 2*PI
;;add r0.x, r0.x,-c14.x ; subtract PI
mad r0.x, r5.y, c14.z, -c14.x ; multiply with 2*PI and subtract PI

; compute first 4 values in sin series
mul r5.x, r0.x, r0.x ; r5.x = x2
; r0.x = x
mul r0.y, r0.x, r5.x ; r0.y = x3
mul r0.z, r0.y, r5.x ; r0.z = x5
mul r0.w, r0.z, r5.x ; r0.w = x7
mul r5.x, r0.w, r5.x ; r5.x = x9

; sin 9th order Taylor approximation:
; x - x3 / 6.0f + x5 / 120.0f - x7 / 5040.0f + x9 / 362880.0f;
mul r0, r0, c12
dp4 r0.x, r0, c12.x
mad r0.x, r5.x, c5.w, r0.x

;; now r0.x = sin2

;;;;;;;;;; float fAngle = sin( vd.fPhase1 + C_fTime ) * sin2 * C_fAmplitude + C_fBias;
;;;;;;;;;; 15 instructions
; v2.z = phase1
; c4.x = time

add r1.x, v2.z, c4.x

; wrap theta to -pi..pi
;;add r1.x, r1.x, c14.x ; add PI
mul r1.x, r1.x, c14.y ; divide by 2*PI
expp r5.y, r1.x ; get fractional part only
;;mul r1.x, r5.y, c14.z ; multiply with 2*PI
;;add r1.x, r1.x,-c14.x ; subtract PI
mad r1.x, r5.y, c14.z, -c14.x ; multiply with 2*PI and subtract PI

; compute first 4 values in sin and cos series
mul r5.x, r1.x, r1.x ; r5.x = x2
; r1.x = x
mul r1.y, r1.x, r5.x ; r1.y = x3
mul r1.z, r1.y, r5.x ; r1.z = x5
mul r1.w, r1.z, r5.x ; r1.w = x7
mul r5.x, r1.w, r5.x ; r5.x = x9

; sin 9th order Taylor approximation:
; x - x3 / 6.0f + x5 / 120.0f - x7 / 5040.0f + x9 / 362880.0f;
mul r1, r1, c12
dp4 r1.x, r1, c12.x
mad r1.x, r5.x, c5.w, r1.x

; r1.x = sin( vd.fPhase1 + C_fTime )
; r0.x = sin2
; c4.y = amplitude
; c4.z = bias

mul r1.x, r1.x, r0.x
mad r1.x, r1.x, c4.y, c4.z

; now r1.x = fAngle

;;;;;;;;;; float s = sin( fAngle );
;;;;;;;;;; float c = cos( fAngle );
;;;;;;;;;; 16 instructions

; wrap theta to -pi..pi
mul r1.x, r1.x, c14.y ; divide by 2*PI
expp r2.y, r1.x ; get fractional part only
;;mul r1.x, r2.y, c14.z ; multiply with 2*PI
;;add r1.x, r1.x,-c14.x ; subtract PI
mad r1.x, r2.y, c14.z, -c14.x ; multiply with 2*PI

; compute first 4 values in sin and cos series
mov r2.x, c12.x ; theta^0 ;r2.x = 1
mul r2.y, r1.x, r1.x ; theta^2
mul r1.y, r1.x, r2.y ; theta^3
mul r2.z, r2.y, r2.y ; theta^4
mul r1.z, r1.x, r2.z ; theta^5
mul r2.w, r2.y, r2.z ; theta^6
mul r1.w, r1.x, r2.w ; theta^7

; sin
mul r1, r1, c12
dp4 r1.x, r1, c12.x

; cos
mul r2, r2, c13
dp4 r2.x, r2, c12.x

; now r1.x = s, r2.x = c

;;;;;;;;;;; v.x = c * vd.x - s * vd.y;
;;;;;;;;;;; v.y = -( s * vd.x + c * vd.y );
;;;;;;;;;;; v.z = 0.0f;
;;;;;;;;;; 5 instructions

; !!!!!!!!! use 2x2 matrix (dp3) for this?

; r1.x = s
; r2.x = c
; v2.x = vd.x
; v2.y = vd.y

mul r3.x, v2.x, r2.x
mad r3.x, v2.y, -r1.x, r3.x

mul r3.y, v2.x, -r1.x
mad r3.y, v2.y, -r2.x, r3.y

mov r3.z, c14.w

; now r3.xyz = v

;;;;;;;;;;;; v *= C_matrix;
;;;;;;;;;; 3 instructions
; r3.xyz = v
; c6-c8 = C_matrix

dp3 r4.x, c6, r3
dp3 r4.y, c7, r3
dp3 r4.z, c8, r3

; now r4.xyz = v

;;;;;;;;;;;;; vWorldPos = vd.vWorldPosition;
;;;;;;;;;;;;; vWorldPos *= C_mWorldToObject;
;;;;;;;;;;;;; v += vWorldPos;
;;;;;;;;;; 5 instructions
; v3 = vd.vWorldPosition
; c9-c11 = C_mWorldToObject

;dp4 r3.x, c9, v3
;dp4 r3.y, c10, v3
;dp4 r3.z, c11, v3
;add r4.xyz, r4.xyz, r3.xyz
add r4.xyz, r4.xyz, v3.xyz
mov r4.w, c12.x

; now r4 = v

;;;;;;;;;;;;; pPrim->setPosition( i, v );
;;;;;;;;;; 6 instructions

m4x4 oPos, r4, c0

;;;;;;;;;;;;; vDiffuse = vd.vDiffuse;
;;;;;;;;;;;;; vDiffuse *= sin2 * C_fDiffuseAmplitude + C_fDiffuseAdder;
; r0.x = sin2
; c5.x = diffuse amplitude
; c5.y = diffuse adder

mad r1.w, r0.x, c5.x, c5.y
mul oD0.xyz, v0.xyz, r1.www

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;; 2 instructions

mov oT0.xy, v1.xy

The 3DMark2001 uses trigonometrical functions approximated with the Taylor's series of the 9th order for animation of leaves of trees. If the code above is not a gobbledegook for you, you must understand that this is the most probable place where the driver can make its optimizations. It's possible to use fewer terms of the polymons in the calculations and get a rougher approximation of the sinusoid to reduce the number of GPU's clock cycles. In this case we will get a close but not identical trajectory for the leaves. And we do witness that in the screenshots.

To prove that the driver works exactly on the vertex shader which deals with leave animation we tried to change a little its code to eliminate its possible detection and replacement. The engine of the 3DMark2001 can read necessary data both from the file system and from its own file storage data.ras, with the file system having a higher priority. If we create the file tree.vsh in .\data\vsh in the 3DMark2001 folder we can make the benchmark use our own vertex shader instead of the one stored in data.ras. Well, we created tree.vsh with the contents given above. As expected, its even inconsiderable change (for example, insertion of the nop instruction in any place, or replacement of the v1.0 with 1.1) brought the leaves back to their places and reduced the performance.

This situation reminds me the Detonator 40.xx and wonders promised by NVIDIA. I remember the 25% gain in the driver performance in this version and the 3DMark2001 Nature scores demonstrated to prove it. With the NVAntiDetector the scores became much closer to the Detonator 30.xx and brought to nothing all shader optimizations.

ATI

Unfortunately, it also reminds me the mysterious performance boost in the same test which took place right after the announcement of the R200 processor. But this time the gain was noticed in the drivers of NVIDIA's competitor - ATI. To find out whether their drivers work correctly we tried to find and disable optimization for benchmarks in the latest Catalyst version. If you remember, the ATI's position in the dispute around the 3DMark2003 looked not so shameful as NVIDIA's one. The research of the ATI's driver could help us understand how far or close to the truth ATI's announcements were. I would want to believe them, but they looked too naive saying that they never used optimizations before and now were going to adopt the competitor's rules.

So, let's take a look inside the Catalyst drivers. This time we have no reference points like we had for the Detonator, but nevertheless, we decided to check ati3duag.dll, a uniform Direct3D driver for chips based on the R300 core and higher. The handlers of some D3DDP2OP_... tokens most probable for realization of application detection were analized. Even superficial analysis of the code reveals a dozen of places which can be identified as attempts to detect certain Direct3D applications. There are obvious attempts to check the code of shaders at the moment of their creation. ATI used a simpler approach - they inserted the tokenized code of identified shaders right in the driver's body. But the number of detectable pixel shaders is far different from what we saw in the NVIDIA drivers: they are three in all, and two of them (v2.0) are probably the optimizations reflected in the Futuremark's patch 330 for the 3DMark2003. If we put up with the fact that we could miss some thicks in the code, and if we consider that all optimizations of this kind are revealed, ATI would seem to have tenfold fewer shader optimizations than its competitor.

Besides, the Catalyst also have algorithms for searching texture patterns in the D3DDP2OP_TEXBLT processor, algorithms of application detection by creating a series of textures of a definite size and format and other methods which can be identified as attempts to detect Direct3D applications. Almost all procedures of application detection worked in a similar manner: they returned the unique identifier if an active Direct3D application matched their detection criterion, otherwise they returned zero. When such procedure was started it initialized the internal driver variable with the value which identifies a Direct3D application detected by the driver. With that in mind we built the ATIAntiDetector script on the principle of locking all attempts of initializing this variable that could be found not to let the driver identify an active application. Besides, with the script we changed the first eight bytes of each of two pixel shaders 2.0 identified by the driver because their identification mechanism changed the code directly instead of initialization of the internal variable.

No we are going to see whether the performance changes with the Catalyst driver processed with the ATIAntiDetector script. The tests were carried out on the same testbed and RADEON 9500 64MB video adapter with the RADEON 9500PRO capabilities unlocked:

The same 3DMark2001, the same Game test 4... The Nature test also suffers when the application detection code is disabled. Let's try to find out what ATI saves on and look at the image quality (ALL FULL-SIZE SCREENSHOTS are in the BMP format, that is why EACH IS ABOUT 2.5MB, they are packed in RAR achives each measuring from 1.3 to 3 MB):

Catalyst 3.4	Catalyst 3.4 + ATIAntiDetector

Like in case with NVIDIA, the scenes are identical at first glance. But again we used the careful examination of the screenshots. Again, we used Adobe Photoshop to calculate the per-pixel difference between them and then intensified it with the Auto Contrast function:

It reminds me the picture I saw above when analyzed optimizations of the NVIDIA's drivers. It's obvious that ATI also pulls up the results at the expense of the alternative rendering of leaves and grass, i.e. at the expense of the objects the scene is full of and which are a bottleneck of this test. But the character of the difference is unlike to NVIDIA's case. It seems that the optimizations take place during polygon texturing, not at the stage of transformation and lighting. Since the difference is most noticeable on the leaves' edges we can affirm that the driver detects the Nature scene and then replaces the texture format of the leaves and grass with a more advantageous one (in order to save on the bit capacity of the alpha channel) by forced compression of an uncompressed transparent texture or by repacking DXT3 textures into the DXT1 format. To see if our hypothesis is correct we tried to change settings of the texture format in the 3DMark2001 and analyze appearance of the leaves and grass in case of different formats. As expected, in case of 16bit textures the performance dropped much and the speed was lower even in comparison with the more resourse-hungry 32bit format. Finally, we looked through the code to find the place used for detection of the Nature in the 3DMark2001. The results proved that the Catalyst detected the scene by creation of three successive textures of a definite format and size by the Direct3D application.

Finishing the analysis of the Catalyst driver we decided to check the insides of ati3d2ag.dll - a Direct3D driver for the R200 processors because this chip gave us an idea on possible optimizations in the ATI's driver. Like in case of ati3duag.dll, the study of the code reveals multiple attempts to detect applications, and they are even more aggressive. Even without the patch-script we checked it the same way, i.e. altered the code of the shader tree.vsh. The result on the same system with the RADEON 8500LE video adapter is rather sad: almost 30% drop (from 42 to 31 FPS) in the same test 4. Again Nature? Deja vu? Again optimized vertex shaders?

So, let's sum it up. You may ask us why's bad if both companies are trying to boost performance in the 3DMark? There is no difference in quality like, for example, in case of the Game test 2 - Dragothic and Detonator FX. But even if it's noticeable like in the Game test 4 - Nature, the image doesn't get worse at all. So, why is it bad then?

The answer is simple: first of all, all optimizations we detected and locked work in only one benchmark, which indicates that it took a lot of time for the programmers to analyze its behavior and specific character, bottlenecks of certain tests and parts of the scene where it's possible to save the processor's resources even without the detriment to the image quality. But any optimizations which work when a benchmark is identified are based on the careful analyses of its specificity and rigidly tied to a certain scene, i.e. they make existance of benchmarks senseless as they do not reflect real performance in most other applications. Synthetic benchmarks are not aimed to reveal who faster renders a certain frame by culling out one or another hidden object or by using alternative rendering procedures. Benchmarks have a single aim which is to estimate the speed of execution of a strickly determined set of calculations by a given video adapter. It's nice if driver developers who know peculiarities of their chips much better than third software developers are able to notice an ineffective shader and can markedly lift up its speed even by altering the image quality. Althoutgh we don't welcome such approach, such optimizations can be used sometimes, but only in gaming applications. Nevertheless, such optimizations are not admissible for test applications which are meant exceptionally for speed estimation. Otherwise, it would make no sense to use benchmarks to compare performance of video adapters from different manufacturers because volumes of calculations would be different.

Conclusion

Unfortunately, the conclusion is rather distressing. First of all, I want to ask Futuremark where they were before and why they started fighting the cheating just a short time ago. Weren't they aware of aggressive optimizations made by vendors for their benchmark? I don't think so. Weren't there any reasons or facts before? They were. Just consider the case when the splash screen in the Detonator go off or the notorious drivers of the SiS Xabre. Weren't they qualified enough to locate and do away with the cheats? I don't believe it. It took us only one day to find and eliminate tricks in the drivers of two largest vendors though we didn't have the source code of the 3DMark2001. We don't want to believe that participation in a beta-program automatically gives the right for such optimizations, but this is the only idea we have. We hope we are wrong.

Secondly, it's a sorry sight to see how vendors stubbornly try to outrun one another using unfair means. Here is the history of optimizations of our ill-fated Nature: the announcement and release of the R200 - criticism of performance of the R200's shaders based on comparison with the NV20's scores - optimizations for the Nature in the ATI's drivers - the release of the R300 and the same pulling up of performance in the Nature - NVIDIA's response in the form of the Detonator 40.xx that lifted up the speed in the same notorious test allegedly at the expense of the higher shaders' performance... Unfortunately, the race has no end. I don't like hearing pathetic annoucements of both companies when one of them is caught red-handed. In the scandal around Quake/Quack NVIDIA stated that they never used optimizations for certain applications and that they advocated only general approaches for optimizations. In the disputes around the 3DMark2003 ATI took a position of a hurt child and was inconsistent in its actions: first it repented and promised to remove its optimizations saying that their programmers never used any rendering tricks to get higher scores in benchmarks, then they promised to optimize the drivers to catch up with the impious competitor. A sad thing. Do you still believe that there are honest players on this market? Unfortunately, none of the vendors is able to stop such an unfair race. The final scores of the popular benchmarks have little in common with the actual performance. Unfortunately, sales volumes of both ATI and NVIDIA are directly dependent on these scores. They will use any means to make them higher. Do we have any chance to see a fair situation in benchmarks? Probably, only in two cases: when only one player remains on the market or when one of the competitors is much weaker and it's useless to affect the real scores. At the moment we have two graphics processors with similar parameters from different manufacturers, and everyone is tempted to take the first place. That is why benchmarks are going to reflect a competition between the driver developers who optimize the code to get as high scores as possible even at the expense of an alternative rendering, rather than a real performance of video adapters.

What's next? Being well aware of how both companies react to such patch-scripts and taking into account the distructive methods they used against SoftQuadro and SoftR9x00 we are almost sure that in the next driver version they will try to prove that the scripts work incorrectly and disable a vital part of the code. NVIDIA will probably distribute all existant checkups along the Direct3D driver's body to make difficult their localization and disabling, and will use the checksum calculation procedure modified by the NVAntiDetector to calculate checksums by the fixed pattern included in the driver's body, and if it doesn't coincide with the original it will imitate certain problems.

Will the Direct3D application detection mechanisms disappear in the Detonator and Catalyst drivers? Unlikely. Judging by the test results of the drivers released after the scandal around 3DMark2003 none of the companies is going to step back. Thus, the new NVIDIA's Detonator FX 44.6x is able to detect all changes in the shader code made by Futuremark in the patch 330 and even increases performance as compared to what was before the patch. Obviously, NVIDIA decided to keep to the tactics of aggressive optimizations irrespective of the negative attitude of most users towards such things. Nevertheless, the NVAntiDetector script still makes the GeForceFX 5900Ultra work much slower in this benchmark. ATI also released a new build of the Catalyst drivers where they really removed two optimizations for the 3DMark2003 revealed by Futuremark. They proudly announced it in the following press-release, but the optimizations for the previous benchmark are still there. The company seems to follow the principle "innocent until proven guilty", and the fine guesture of locking undesirable optimizations will touch only those cheats which the company was thrusted under its nose. It's very convenient to be partially honest. Everyone is satisfied.

I do want to believe that this review will influence the attitude of the manufacturers of video adapters towards consumers. I think that the latest events clearly show that there are certain things which are much more important than the 3DMark scores, namely, reputation and trust of users which are hard to gain and easy to distroy. Actually we hope that we were wrong, and the real situation is not that sad, though it's very unlikely. That is why all comments from all parties this research concerns are welcome. Just note that when we were preparing this review for publication, ATI contacted us and assured that the parts of the code for application detection are used exceptionally for elimination of problems of applications themselves. Nevertheless, the problem with the 3DMark2001 remained without the answer...

Aleksei Nikolaichuk AKA Unwinder (unwinder@ixbt.com)

Write a comment below. No registration needed!