iXBT Labs - Computer Hardware in Detail


ATI RADEON 9500 64MB, 9500 128MB,
9500 PRO, 9700 and 9700 PRO
in DirectX 9.0:
Part 2 - New Synthetic Tests for DX9

January 9, 2003






"There is such a term as the Art of Testing"



CONTENTS

  1. General information
  2. Philosophy of the synthetic tests
  3. Description of RightMark 3D: Pixel Filling synthetic tests 
  4. Description of RightMark 3D: Geometry Processing Speed synthetic tests 
  5. Description of RightMark 3D: Hidden Surface Removal synthetic tests 
  6. Description of RightMark 3D: Pixel Shading synthetic tests 
  7. Description of RightMark 3D: Point Sprites synthetic tests 
  8. Test systems
  9. Test results: Pixel Filling 
  10. Test results: Geometry Processing Speed 
  11. Test results: Hidden Surface Removal 
  12. Test results: Pixel Shading 
  13. Test results: Point Sprites 
  14. Conclusion

General information

As we have mentioned many times, the release of the new DirectX 9.0 API could not pass unnoticed, especially since there is already a whole line of video cards supporting DX9. The first part was devoted to testing this line of 5 cards in a large number of gaming benchmarks (none of which can yet work under DX9, which is why it was mostly an estimation of the R9500-9700 line as a whole). 

But first, have a look at the earlier reviews dealing with the RADEON 9500-9700 video cards. 

Theoretical materials and reviews of video cards covering the functional properties of the ATI RADEON 9700 (PRO) / RADEON 9500 (PRO) VPUs

Today we will study these cards in several synthetic tests from the new RightMark3D suite, which will soon be released on the Internet (a fragment of one of the test scenes is shown above). The suite is meant for DX9 cards, though some tests can run on DX8.1 cards as well. 

Here are the price niches the cards belong to:

  1. RADEON 9500 64MB (4 rendering pipelines, 128bit memory bus, 275/270 (540) MHz) - $120-140; 
  2. RADEON 9500 128MB (4 rendering pipelines, 256bit memory bus, 275/270 (540) MHz) - $150-160; 
  3. RADEON 9500 PRO 128MB (8 rendering pipelines, 128bit memory bus, 275/270 (540) MHz) - $180-200; 
  4. RADEON 9700 128MB (8 rendering pipelines, 256bit memory bus, 275/270 (540) MHz) - $230-250; 
  5. RADEON 9700 PRO 128MB (8 rendering pipelines, 256bit memory bus, 325/310 (620) MHz) - $290-330. 
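From these specs one can derive the peak figures that will matter in the tests below. A minimal sketch of the arithmetic (our own, not part of RightMark 3D):

```python
def peak_fillrate_mpix(pipelines, core_mhz):
    # Peak pixel fillrate in Mpix/s: one pixel per pipeline per clock.
    return pipelines * core_mhz

def memory_bandwidth_gb(bus_bits, eff_mem_mhz):
    # Peak memory bandwidth in GB/s: bus width (in bytes) * effective clock.
    return bus_bits / 8 * eff_mem_mhz / 1000.0

# RADEON 9700 PRO: 8 pipelines at 325 MHz, 256-bit bus at 620 MHz effective
print(peak_fillrate_mpix(8, 325))      # 2600 (Mpix/s)
print(memory_bandwidth_gb(256, 620))   # 19.84 (GB/s)
```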

All the cards were described in detail in the first part.

New synthetic tests for DX9

Today we will describe the suite of synthetic tests we are currently developing for the DX9 API and obtain the first results with it. 

The RightMark 3D test suite, currently under development, includes the following synthetic tests at this moment: 

  1. Pixel Filling Test; 
  2. Geometry Processing Speed Test; 
  3. Hidden Surface Removal Test; 
  4. Pixel Shader Test; 
  5. Point Sprites Test; 
  6. State Changes Test (performance of the drivers and their operation with the accelerator). 

Today we will deal with the first five tests and examine the data obtained on ATI's and NVIDIA's accelerators. So, we have two aims: to estimate performance of the accelerators under the DX9 API, and to estimate the behavior of the RightMark 3D synthetic tests themselves in different situations. The latter is important to establish the applicability, repeatability and soundness of the results of our synthetic tests, which we are going to use widely for benchmarking various DX9 accelerators and make available as a free download for our readers and all computer graphics enthusiasts. But first of all, a small digression on the ideology of the tests: 

Philosophy of the synthetic tests

The main idea of all our tests is to focus on the performance of a particular chip subsystem. In contrast to real applications, which measure the overall effectiveness of an accelerator in one or another practical task, synthetic tests stress separate performance aspects. The matter is that the release of a new accelerator is usually a year ahead of the applications which can use all its capabilities effectively. Users who want to be at the front line of technology have to buy an accelerator almost blindly, warmed only by the results of tests carried out on outdated software, and no one can guarantee that the situation won't change with the games they are waiting for. Apart from the enthusiasts who take such a risk, there are some other categories of people in this complicated situation: 

  • First category - people who don't want to bother with upgrades and buy a maximum-configuration computer for a long time. It's very important for them that their machines stay suitable for upcoming applications as long as possible. 
  • Second category - software developers; they have to keep an eye on the capabilities and balance of new accelerators to competently design and balance the engine (code) and content (levels, models), taking into account the effective use of hardware which will become widespread by the time their applications reach the market. The synthetic tests will help them choose ways to realize their ideas and restrain the bounds of their imagination :-). 
  • Third category - IT analysts (for example, from big trade companies) and hardware reviewers, i.e. people who have to estimate the potential of products even before they are officially announced. 

So, synthetic tests allow estimating performance and capabilities of separate subsystems of accelerators in order to forecast accelerator's behavior in some or other applications, both existing (overall estimation of suitability and prospects for a whole class of applications) and developing, provided that a given accelerator demonstrates peculiar behavior under such applications. 

Description of the RightMark 3D synthetic tests

Pixel Filling

This test has several functions, namely: 

  1. Measurement of frame buffer filling performance 
  2. Measurement of performance of different texture filtering modes
  3. Measurement of effectiveness of operation (caching) with textures of different sizes
  4. Measurement of effectiveness of operation (caching and compression) with textures of different formats
  5. Measurement of multitexturing effectiveness 
  6. Visual comparison of quality of implementation of some or other texture filtering modes

The test draws a pyramid whose base lies in the monitor's plane and the vertex is moved away to the maximum: 




Each of its four sides consists of triangles. The small number of triangles allows us to avoid any dependence on geometrical performance, which has nothing to do with what is studied here. From 1 to 8 textures are applied to each pixel during filling. You can also disable texturing (0 textures) and measure the pure fill rate using a constant color value. 

During the test the vertex moves around at a constant speed, and the base rotates around the axis Z: 




So, the pyramid's sides take all possible angles of inclination in both planes, the number of shaded pixels stays constant, and all possible distances from the minimal to the maximal are present. The inclination of the shaded plane and the distance to the shaded pixels govern many filtering algorithms, in particular anisotropic filtering and various modern implementations of trilinear filtering. By rotating the pyramid we put the accelerator in all conditions that can occur in real applications. This allows us to estimate the filtering quality in all possible cases and get weighted performance data. 

The test can be carried out in different modes - the same operations can be accomplished by shaders of different versions and fixed pipelines inherited from the previous DX generations. That is why you can find out the performance gap between different shader versions. 

A special texture with different colors and figures eases investigation of quality aspects of the filtering and its interaction with full-screen anti-aliasing. Mip levels can have different colors: 




so that you can estimate the algorithm of their blending and selection. 

Here are the adjustable test parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Color mip levels
  • Operating mode (and the maximum number of textures per 1 pixel): 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages (up to 8 textures) 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages (up to 8 textures) 
    • Vertex Shaders 1.1 and Pixel Shaders 1.1 (up to 4 textures) 
    • Vertex Shaders 1.1 and Pixel Shaders 1.4 (up to 6 textures) 
    • Vertex Shaders 2.0 and Pixel Shaders 2.0 (up to 8 textures) 

  • Textures per pixel: 
    • 0 (only filling) 
    • from 1 to 8 

  • Texture size: 
    • 128x128 
    • 256x256 
    • 512x512 

  • Texture format: 
    • A8R8G8B8 
    • X8R8G8B8 
    • A1R5G5B5 
    • X1R5G5B5 
    • DXT1 
    • DXT2 
    • DXT3 
    • DXT4 
    • DXT5 

  • Filtering type: 
    • no
    • bilinear
    • trilinear
    • anisotropic
    • anisotropic + trilinear

The test reports its results in FPS and FillRate. The latter plays two roles: in the no-texture mode we measure the pure frame buffer write speed, i.e. the number of pixels filled per second (Pixel FillRate); in the texture mode it indicates the number of texture values sampled and filtered per second (Texturing Rate, or Texture Fill Rate). 
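A minimal sketch of how the two reported figures relate (the per-frame pixel count is fixed by the pyramid construction; the names here are ours):

```python
def fill_rates(fps, shaded_pixels_per_frame, textures_per_pixel):
    # Pixel FillRate: pixels written to the frame buffer per second.
    pixel_rate = fps * shaded_pixels_per_frame
    # Texturing Rate: texture values sampled and filtered per second;
    # with 0 textures only the pixel figure is meaningful.
    texture_rate = pixel_rate * textures_per_pixel
    return pixel_rate, texture_rate
```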

Here is an example of a pixel shader used for filling in case of the most intensive version of this test (PS/VS 2.0, 8 textures): 

ps_2_0

dcl t0
dcl t1
dcl t2
dcl t3
dcl t4
dcl t5
dcl t6
dcl t7

dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
dcl_2d s6
dcl_2d s7

texld r0, t0, s0
texld r1, t1, s1
texld r2, t2, s2
texld r3, t3, s3
texld r4, t4, s4
texld r5, t5, s5
texld r6, t6, s6
texld r7, t7, s7

mov r11, r0
lrp r11, c0, r11, r1
lrp r11, c0, r11, r2
lrp r11, c0, r11, r3
lrp r11, c0, r11, r4
lrp r11, c0, r11, r5
lrp r11, c0, r11, r6
lrp r11, c0, r11, r7

mov oC0, r11
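The lrp chain at the end of the shader folds the eight samples together with a constant weight c0. A scalar sketch of the same arithmetic (lrp computes w*a + (1-w)*b per component):

```python
def lrp(w, a, b):
    # Direct3D lrp: w*a + (1 - w)*b, applied per color component
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def blend8(samples, w=0.5):
    # mov r11, r0 followed by seven lrp instructions, as in the shader above
    result = samples[0]
    for s in samples[1:]:
        result = lrp(w, result, s)
    return result
```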

Geometry Processing Speed

This test measures the geometry processing speed in different modes. We tried to minimize the influence of filling and other accelerator subsystems, and to make the geometrical information and its processing as close to real models as possible. The main task is to measure the peak geometrical performance in different transform and lighting tasks. At present, the test supports the following lighting models (calculated at the vertex level): 

  1. Ambient Lighting - simplest constant lighting
  2. 1 Diffuse Light 
  3. 2 Diffuse Lights
  4. 3 Diffuse Lights
  5. 1 Diffuse + Specular Light 
  6. 2 Diffuse + Specular Lights 
  7. 3 Diffuse + Specular Lights 

The test draws several instances of the same model with a great number of polygons. Each instance has its own geometrical transformation parameters and relative positions of light sources. The model's polygons are extremely small (most are comparable to or smaller than a screen pixel): 




thus, the resolution and filling do not affect the test results: 




The light sources move in different directions during the test to underline various combinations of the initial parameters. 

There are three degrees of scene detail; they influence the total number of polygons transformed in one frame. This lets us make sure that the test results do not depend on the scene and fps at all. 

Here are the adjustable test parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Fixed Function TCL and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Pixel Shaders 1.1 
    • Vertex Shaders 1.1 and Pixel Shaders 1.4 
    • Vertex Shaders 2.0 and Pixel Shaders 2.0 

  • Geometry detailing: 
    • 1 (low) 
    • 2 (middle) 
    • 3 (high) 

  • Lighting model (determines complexity of calculations): 
    • Ambient Lighting - simplest constant lighting
    • 1 Diffuse Light
    • 2 Diffuse Lights
    • 3 Diffuse Lights
    • 1 (Diffuse + Specular) Light 
    • 2 (Diffuse + Specular) Lights
    • 3 (Diffuse + Specular) Lights

The test results are available in FPS and PPS (Polygons Per Second). 
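The PPS figure follows directly from the frame rate and the per-frame triangle count. A minimal sketch (the actual triangle counts per detail level are internal to the test; the names here are ours):

```python
def polygons_per_second(fps, instances_per_frame, triangles_per_instance):
    # PPS = triangles transformed in one frame * frames per second
    return fps * instances_per_frame * triangles_per_instance

# e.g. 60 fps with 10 model instances of 50,000 triangles -> 30 MPolys/s
```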

Here is an example of a vertex shader (VS 2.0) used in this test for transformation and lighting with an externally set number of diffuse + specular lights: 

vs_2_0

dcl_position v0
dcl_normal v3

//
// Position Setup
//

m4x4 oPos, v0, c16

//
// Lighting Setup
//

m4x4 r10, v0, c8 // transform position to world space
m3x3 r0.xyz, v3.xyz, c8 // transform normal to world space

nrm r7, r0 // normalize normal

add r0, -r10, c2 // get a vector toward the camera position

nrm r6, r0 // normalize eye vector 

mov r4, c0 // set diffuse to 0,0,0,0

mov r2, c0 // setup diffuse,specular factors to 0,0
mov r2.w, c94.w // setup specular power

//
// Lighting
//

loop aL, i0

    add r1, c[40+aL], -r10 // vertex to light direction
    dp3 r0.w, r1, r1
    rsq r1.w, r0.w

    dst r9, r0.wwww, r1.wwww // (1, d, d*d, 1/d)
    dp3 r0.w, r9, c[70+aL] // a0 + a1*d + a2*d*d
    rcp r8.w, r0.w // 1 / (a0 + a1*d + a2*d*d) 

    mul r1, r1, r1.w // normalize the vertex to the light vector

    add r0, r6, r1 // calculate half-vector (light vector + eye vector)

    nrm r11, r0 // normalize half-vector

    dp3 r2.x, r7, r1 // N*L
    dp3 r2.yz, r7, r11 // N*H

    sge r3.x, c[80+aL].y, r9.y // (range > d) ? 1:0
    mul r2.x, r2.x, r3.x
    mul r2.y, r2.y, r3.x

    lit r5, r2 // calculate the diffuse & specular factors
    mul r5, r5, r8.w // scale by attenuation

    mul r0, r5.y, c[30+aL] // calculate diffuse color
    mad r4, r0, c90, r4 // add (diffuse color * material diffuse)

    mul r0, r5.z, c[60+aL] // calculate specular color
    mad r4, r0, c91, r4 // add (specular color * material specular)

endloop

mov oD0, r4 // final color
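The dst/dp3/rcp/sge sequence inside the loop implements the standard D3D range-limited distance attenuation. A Python sketch of the same math (function and parameter names are ours):

```python
import math

def attenuation(light_pos, vertex_pos, a0, a1, a2, light_range):
    # Mirrors the shader above: dst yields (1, d, d*d, 1/d), dp3 with
    # (a0, a1, a2) evaluates the attenuation polynomial, rcp inverts it,
    # and sge zeroes the contribution beyond the light's range.
    dx = [l - v for l, v in zip(light_pos, vertex_pos)]
    d2 = sum(c * c for c in dx)        # dp3 r0.w, r1, r1
    d = math.sqrt(d2)
    atten = 1.0 / (a0 + a1 * d + a2 * d2)
    return atten if light_range > d else 0.0
```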

Hidden Surface Removal

This test examines techniques for removing hidden surfaces and pixels and estimates their effectiveness, i.e. the effectiveness of operation with a traditional depth buffer, and the availability and effectiveness of early culling of hidden pixels. The test generates a pseudorandom scene of a given number of triangles: 




which will be rendered in one of three modes: 

  1. sorted, front to back order 
  2. sorted, back to front order 
  3. unsorted 

In the second case the test renders all pixels in turn, including hidden ones, if the accelerator is based on a traditional or hybrid architecture (a tile accelerator can optimize this case as well, but remember that sorting will take place anyway, albeit at the hardware or driver level). 

In the first case the test can draw only the small number of visible pixels, and the others can be rejected before filling. The third case is a middle ground, similar to what the HSR mechanism encounters in real applications that do not optimize the order of scene rendering. To get an idea of the peak effectiveness of the HSR algorithm, one should collate the results of the first and second modes (the most favorable first mode against the least convenient second one). Comparing the optimal mode with the unsorted one (i.e. the first and the third) gives an approximate degree of effectiveness in real applications. 
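The comparison just described can be reduced to a simple ratio. A minimal sketch (our own formulation; the suite may report the percentage differently):

```python
def hsr_gain_pct(fps_optimal, fps_reference):
    # Relative speed-up of the front-to-back (optimal) mode over a
    # reference mode: back-to-front for peak HSR effectiveness,
    # unsorted for an approximation of real applications.
    return (fps_optimal / fps_reference - 1.0) * 100.0
```

For instance, hsr_gain_pct(fps_front_to_back, fps_back_to_front) estimates the peak effectiveness, while substituting the unsorted mode's fps gives the real-world estimate.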

The scene rotates around the Z axis during the test to smooth away potential peculiarities of different early HSR algorithms, which are primarily based on zoning of the frame buffer. As a result, the triangles and their edges take all possible positions. 

You can also change the number of rendered triangles to see how the test depends on other chip subsystems and drivers. We can expect the results to improve as the number of triangles grows, but on the other hand the growth is justified only up to a certain point, after which the influence of other subsystems on the test can start rising again. That is why this parameter was brought in: to gauge the quality of the test with respect to the number of triangles. 

Here are the adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Fixed Function TCL and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 

  • Number of triangles: 
    • 1000 to 20000 

  • Sorting mode for a rendered scene: 
    • none
    • back to front polygons
    • front to back polygons

Pixel Shading

This test estimates the performance of various 2.0 pixel shaders. In the case of PS 1.1, the execution speed of shaders translated into stage settings was easy to determine; all that was needed was a test like Pixel Filling carried out with a great number of textures. In the case of PS 2.0 the situation is much more complicated. Per-clock instruction execution and new data formats (floating-point numbers) can create a significant difference in performance not only between different accelerator architectures, but also between combinations of separate instructions and data formats inside one chip. To test the performance of the pixel processors of modern accelerators, we decided to use an approach similar to CPU benchmarking, i.e. to measure the performance of the following set of pixel shaders, which have real prototypes and applications: 

  1. per-pixel diffuse lighting with per-pixel attenuation - 1 point source
  2. per-pixel diffuse lighting with per-pixel attenuation - 2 point sources
  3. per-pixel diffuse lighting with per-pixel attenuation - 3 point sources
  4. per-pixel diffuse lighting + specular lighting with per-pixel attenuation (1 point source)
  5. per-pixel diffuse lighting + specular lighting with per-pixel attenuation (2 point sources)
  6. marble animated procedural texturing
  7. fire animated procedural texturing

The last two tests implement procedural textures (pixel color values are calculated according to a certain formula), which are an approximate mathematical model of a material. Such textures take little memory (only comparatively small tables for accelerated calculation of various factors are stored) and support almost infinite detail! They are easy to animate by changing the basic parameters. It's quite possible that future applications will use exactly such texturing methods as the capabilities of accelerators grow. 

The geometrical test scene is simplified, so dependence on the chip's geometrical performance is almost eliminated. Hidden surface removal is absent as well - all surfaces of the scene are visible at any moment. The load falls only on the pixel pipelines. 

Here are adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Pixel shader: 
    • 1 point light ( per-pixel diffuse with per-pixel attenuation ) 
    • 2 point lights ( per-pixel diffuse with per-pixel attenuation ) 
    • 3 point lights ( per-pixel diffuse with per-pixel attenuation ) 
    • 1 point light ( per-pixel diffuse + specular with per-pixel attenuation ) 
    • 2 point lights ( per-pixel diffuse + specular with per-pixel attenuation ) 
    • Procedural texturing (Marble) 
    • Procedural texturing (Fire) 

Below is the code of some of the shaders. Per-pixel diffuse + specular with per-pixel attenuation for 2 light sources: 

ps_2_0

// 
// Texture Coords
//

dcl t0 // Diffuse Map
dcl t1 // Normal Map
dcl t2 // Specular Map

dcl t3.xyzw // Position (World Space)

dcl t4.xyzw // Tangent
dcl t5.xyzw // Binormal
dcl t6.xyzw // Normal

//
// Samplers
//

dcl_2d s0 // Sampler for Base Texture
dcl_2d s1 // Sampler for Normal Map
dcl_2d s2 // Sampler for Specular Map

//
// Normal Map
//

texld r1, t1, s1
mad r1, r1, c29.x, c29.y

//
// Light 0
//

// Attenuation

add r3, -c0, t3 // LightPosition-PixelPosition
dp3 r4.x, r3, r3 // Distance^2
rsq r5, r4.x // 1 / Distance
mul r6.x, r5.x, c20.x // Attenuation / Distance

// Light Direction to Tangent Space

mul r3, r3, r5.x // Normalize light direction

dp3 r8.x, t4, -r3 // Transform light direction to tangent space
dp3 r8.y, t5, -r3
dp3 r8.z, t6, -r3
mov r8.w, c28.w
 

// Half Angle to Tangent Space

add r0, -t3, c25 // Get a vector toward the camera
nrm r11, r0

add r0, r11, -r3 // Get half angle
nrm r11, r0 
dp3 r7.x, t4, r11 // Transform half angle to tangent space
dp3 r7.y, t5, r11
dp3 r7.z, t6, r11
mov r7.w, c28.w

// Diffuse

dp3 r2.x, r1, r8 // N * L
mul r9.x, r2.x, r6.x // * Attenuation / Distance

mul r9, c10, r9.x // * Light Color

// Specular

dp3 r2.x, r1, r7 // N * H
pow r2.x, r2.x, c26.x // ^ Specular Power
mul r10.x, r2.x, r6.x // * Attenuation / Distance

mul r10, c12, r10.x // * Light Color

//
// Light 1
//

// Attenuation

add r3, -c1, t3 // LightPosition-PixelPosition
dp3 r4.x, r3, r3 // Distance^2
rsq r5, r4.x // 1 / Distance
mul r6.x, r5.x, c21.x // Attenuation / Distance

// Light Direction to Tangent Space

mul r3, r3, r5.x // Normalize light direction

dp3 r8.x, t4, -r3 // Transform light direction to tangent space
dp3 r8.y, t5, -r3
dp3 r8.z, t6, -r3
mov r8.w, c28.w

// Half Angle to Tangent Space

add r0, -t3, c25 // Get a vector toward the camera
nrm r11, r0

add r0, r11, -r3 // Get half angle
nrm r11, r0 

dp3 r7.x, t4, r11 // Transform half angle to tangent space
dp3 r7.y, t5, r11
dp3 r7.z, t6, r11
mov r7.w, c28.w

// Diffuse

dp3 r2.x, r1, r8 // N * L
mul r2.x, r2.x, r6.x // * Attenuation / Distance

mad r9, c11, r2.x, r9 // * Light Color

// Specular

dp3 r2.x, r1, r7 // N * H
pow r2.x, r2.x, c26.x // ^ Specular Power
mul r2.x, r2.x, r6.x // * Attenuation / Distance

mad r10, c13, r2.x, r10 // * Light Color

//
// Diffuse + Specular Maps
//

texld r0, t0, s0
texld r1, t2, s2

mul r9, r9, r0 // Diffuse Map
mad r9, r10, r1, r9 // Specular Map

// Finalize

mov oC0, r9

Fire procedural texture: 

ps_2_0

def c3, -0.5, 0, 0, 1
def c4, 0.159155, 6.28319, -3.14159, 0.25
def c5, -2.52399e-007, -0.00138884, 0.0416666, 2.47609e-005

dcl v0

dcl t0.xyz
dcl t1.xyz
dcl t2.xyz
dcl t3.xyz

dcl_volume s0
dcl_2d s1

texld r0, t0, s0
mul r7.w, c0.x, r0.x
texld r2, t1, s0
mad r4.w, c0.y, r2.x, r7.w
texld r11, t2, s0
mad r1.w, c0.z, r11.x, r4.w
texld r8, t3, s0
mad r10.w, c0.w, r8.x, r1.w
mul r5.w, c2.x, r10.w
mad r7.w, c1.x, t0.x, r5.w
mad r9.w, r7.w, c4.x, c4.w
frc r4.w, r9.w
mad r6.w, r4.w, c4.y, c4.z
mul r1.w, r6.w, r6.w
mad r3.w, r1.w, c5.x, c5.w
mad r5.w, r1.w, r3.w, c5.y
mad r7.w, r1.w, r5.w, c5.z
mad r9.w, r1.w, r7.w, c3.x
mad r11.w, r1.w, r9.w, c3.w
mov r3.xy, r11.w
texld r6, r3, s1
mov oC0, r6
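Most of this shader is a weighted sum of four noise octaves (the texld fetches from the volume texture s0), followed by a sinusoid whose polynomial coefficients are baked into c3-c5; the result then indexes a 1D color ramp (s1). A sketch of the sinusoid part - the frc/mad sequence computes sin(x) via a range-reduced cosine polynomial:

```python
import math

def shader_sin(x):
    # Range reduction: wrap the phase into [-pi, pi)
    # (c4 = 1/(2*pi), 2*pi, -pi, 0.25; frc keeps the fractional part)
    t = x * 0.159155 + 0.25
    t = (t - math.floor(t)) * 6.28319 - 3.14159
    # Even polynomial in t*t: the Taylor series of cos, so that
    # cos(t) = cos(x - pi/2) = sin(x)
    r = t * t
    p = r * -2.52399e-7 + 2.47609e-5   # c5.x, c5.w
    p = r * p - 0.00138884             # c5.y
    p = r * p + 0.0416666              # c5.z
    p = r * p - 0.5                    # c3.x
    return r * p + 1.0                 # c3.w
```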

Point Sprites

This test measures the performance of just one function: the display of point sprites used for creating particle systems. The test draws an animated particle system resembling a human body: 




We can adjust the particle size (which affects the fillrate) and enable or disable lighting and animation. For a particle system, geometry processing is very important, which is why we didn't separate these two aspects - filling and geometrical calculations (animation and lighting) - but made it possible to change the load on one or another part by changing the sprite size and switching animation and lighting on/off. 
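The trade-off described above can be illustrated with a rough estimate of the fill load, assuming each sprite is rasterized as a screen-aligned square (a sketch of our own, not the test's internals):

```python
def sprite_fill_load(num_particles, sprite_size_px):
    # Pixels filled per frame grow quadratically with the sprite size,
    # while the vertex load (animation, lighting) stays proportional
    # to the particle count alone.
    return num_particles * sprite_size_px ** 2
```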

Here are adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 

  • Animation mode: 
    • off
    • on

  • Lighting mode: 
    • off
    • on

Stay with us

In the near future we will finish debugging and publish the first results of the seventh test, which primarily measures the quality of the drivers and how effectively data and parameters are delivered to the accelerator. 

Soon all the synthetic tests will be able to use not only assembly shader versions but also shaders compiled from a higher-level language with Microsoft's HLSL compiler and NVIDIA's Cg/CgFX. 

The most pleasant event is the approaching release of the first beta version of the RightMark 3D suite. At first, the beta version will provide only the synthetic tests for free use. 

Practical estimation

Now comes the most interesting part, where we will show and comment on the data obtained on accelerators of two main families - ATI RADEON 9500/9700 and NVIDIA GeForce4 Ti 4200/4600. 

Test system and drivers

Testbed: 

  • Pentium 4 based computer (Socket 478): 
    • Intel Pentium 4 3066 MHz CPU; 
    • ASUS P4G8X (iE7205, HyperThreading ON) mainboard; 
    • 512 MB DDR SDRAM PC3200; 
    • Seagate Barracuda IV 40GB hard drive; 
    • Windows XP SP1; 
    • ViewSonic P817 (21") monitor. 

In the tests we used ATI's drivers 6.255 (CATALYST 3.0a), DirectX 9.0. VSync was off in the drivers. 

The results of the following video cards are used for comparison: 

  • Albatron Medusa GeForce4 Ti 4600 (300/325 (650) MHz, 128 MB, driver 42.01); 
  • ABIT Siluro GF4 Ti4200-8x (GeForce4 Ti 4200 with AGP 8x, 250/256 (512) MHz, 128 MB, driver 42.01); 

Pixel Filling

  1. The test measures the frame buffer fill rate (Pixel FillRate): constant color, no texture sampling. The scores are given in million pixels per second for different resolutions, both in the standard mode and with 4x MSAA: 



  2. As the resolution grows, the scores of the top R300-based RADEONs increase, coming close to the theoretical level. The NV25-based solutions freeze at a certain level starting from 1024x768, apparently determined by the memory bandwidth (or rather its shortage). The losses connected with AA are less dramatic for the senior RADEONs, but they are greater for the RADEON 9500, even compared with NVIDIA's aging solutions. Interestingly, in the AA modes ATI's solutions reach optimal performance at 1280x1024, after which the frame buffer becomes a burden: even compressed twofold, at MSAA 4x it takes a lot of memory. 

  3. Frame buffer fillrate with simultaneous texturing. Sampling of one simple bilinear texture is added - thus we estimate how a competing read stream from memory cuts down the filling effectiveness. The results are given in million pixels per second for different resolutions, in the standard mode and at 4x MSAA: 



  4. The picture looks similar, though the peak values are lower. Let's see how the measured data correlate with the theoretical limits calculated with the core frequency and number of pipelines: 

    Product Theoretical maximum Measured maximum (without texture) Measured maximum (with 1 texture)
    GeForce4 Ti 4200-8x 1000 978 947
    GeForce4 Ti 4600 1200 1175 1150
    RADEON 9500 128 1100 1051 1036
    RADEON 9500 PRO 2200 (128 bit!) 1737 1363
    RADEON 9700 2200 2070 1982
    RADEON 9700 PRO 2600 2340 2184

    The test results are very close to the theoretical maximums, which proves the test's soundness. Note that NVIDIA's solutions come much closer to the maximum than ATI's. The competing texture stream affected the RADEON 9500 PRO most of all because of its scarce memory throughput and only two memory controllers; as a result, the local bus overflows. The GeForce4 Ti 4600 shines in this test. 
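    The closeness to theory in the table above can be expressed as a percentage; a trivial sketch (our own arithmetic, not part of the suite):

```python
def efficiency_pct(measured, theoretical):
    # How much of the theoretical fillrate the card actually achieves
    return measured / theoretical * 100.0

# From the table above (Mpix/s, no texture):
# GeForce4 Ti 4600: efficiency_pct(1175, 1200) -> ~97.9 %
# RADEON 9500 PRO:  efficiency_pct(1737, 2200) -> ~79.0 %
```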

  5. Now look at the dependence of the Texturing Rate (pixels sampled and filtered from textures, per second) on the number of textures applied in a pass: 



  6. While NVIDIA performs well at least up to the 4 textures that can be applied in a pass, ATI's solutions lose a little performance with each added texture, which is typical of the new generation: instead of stages we now have a pixel pipeline, so each new texture costs an extra instruction. 

    Product Theoretical maximum Measured maximum (2 textures) Measured maximum (max. textures)
    GeForce4 Ti 4200-8x 2000 1682 1839
    GeForce4 Ti 4600 2400 2075 2223
    RADEON 9500 128 1100 (4 TU!) 1055 678
    RADEON 9500 PRO 2200 (128 bits!) 1799 1339
    RADEON 9700 2200 1778 1233
    RADEON 9700 PRO 2600 2070 1430

    It's clear that with a great number of textures (future applications) ATI doesn't depend much on memory (compare the last column for the 9500 PRO and 9700) but is strongly affected by the number of pipelines and the core clock speed; so enabling the pipelines and overclocking the core is what matters. 
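    The theoretical texel rates in the table follow from the pipeline and TMU configuration; a minimal sketch (the TMU counts are our reading of the specs: 2 per pipeline on the NV25, 1 on the R300):

```python
def texel_rate_mtex(pipelines, tmus_per_pipe, core_mhz):
    # Peak texel rate in MTex/s: one bilinear sample per TMU per clock
    return pipelines * tmus_per_pipe * core_mhz

# GeForce4 Ti 4200-8x: 4 pipes * 2 TMUs * 250 MHz = 2000 MTex/s
# RADEON 9500:         4 pipes * 1 TMU  * 275 MHz = 1100 MTex/s
```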

  7. Dependence on the texture format: 



  8. The results remain almost the same - all the chips have long been optimized for 32-bit textures and unpack compressed textures without any delays. But in future reviews, including the GeForce FX, we will take a look at the comparatively new floating-point formats of the render and texture buffers, where a real surprise may await us (considering the influence of the format on the processing speed). 

  9. Dependence on the filtering type: 



  10. With high anisotropy settings NVIDIA's solutions start losing performance. This was discussed in depth in our previous reviews; here I just note that our test showed quite expected results. Soon we will see how the GeForce FX works in the anisotropy mode. 

Geometry Processing Speed

  1. Simple lighting, i.e. peak throughput for triangles: 



  2. The chips show repeatable peak performance results. ATI's scores do not depend on the shader version (nor on the fixed-function T&L emulation - the FFT (Fixed Function Transformation) diagram). 104 million vertices per second is quite a lot; this figure correlates nicely with the peak value quoted by ATI. Note that the results depend only on the core's performance, which is why geometry performance is not cut down in any of the junior R300-based models. NVIDIA's hardware T&L is a bit more effective than the equivalent, actually microscopic (2 instructions - a 4x4 matrix multiplication and one copy of the result) vertex shader. Note also that NVIDIA's products lose markedly because of having only two vertex processors against ATI's four. So the test showed unprecedented results (at least compared to 3DMark 2001), closing the gap between the specified and the measured performance. 

  3. Now comes a more complicated lighting model (two diffuse light sources): 



  4. Now ATI's fixed T&L emulation is less efficient than vertex shader 1.1, and comparable in performance to VS 2.0. As you can see, the second shader version is not free at all - the loops used cause a performance drop. Moreover, the drop is greater than we could expect from one loop instruction per several tens of ordinary instructions. Especially considering that on the R300 the loops are unrolled into linear shader code by the drivers, such losses look really strange. 

    Here is the first question for the ATI drivers developers. Is everything OK, and if it is, then what causes the performance drop? 

    It's also unclear why the RADEON 9700 falls behind all versions of the 9500 (marginally but consistently), but only in the fixed T&L (FFT) emulation modes and in VS 1.1. 

    Meanwhile, with more complicated shaders NVIDIA's solutions perform better, especially in the hardware T&L (FFT) modes - NVIDIA's year-old chip is still able to shine in both new and old games. 

  5. Two more shaders, in the order of increasing complexity (one diffuse specular source and three diffuse specular sources): 



  6. The picture repeats itself, except that here the FFT performs better than the vertex shaders for both companies. The RADEON 9700 keeps performing strangely. 

  7. Now look at the dependence on resolution for different degrees of complexity of geometrical calculations: 





  8. Almost no dependence - only a slight one on the simplest model in the highest resolution. This confirms once again that these synthetic tests are precisely and narrowly focused, as intended. 

  9. Now let's check the dependence on the VS version, with the fixed-function fill pipeline or pixel shaders of the respective versions used together with the VS: 



  10. Nothing strange, except the expected performance drop during software emulation of VS 2.0 on the NV25-based solutions, which do not support this version in hardware. 

  11. And the last test is dependence on the model's detail level: 





  12. As expected, the more polygons in a model, the higher the score, but the dependence is quite weak, and from the second detail level on it can be considered sufficient. Interestingly, NVIDIA's chips reach their optimum at the middle detail level (probably the vertex caches and other balancing aspects at work?), while the ATI models keep scaling - they are designed for more complex scenes. The difference in design dates shows that their notions of an ideal scene differ. 
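The vertex-cache hypothesis is easy to probe with a toy model. The sketch below runs a FIFO post-transform cache (the 10-entry size is a made-up figure, not either vendor's) over a triangulated grid and counts how many vertices must actually be re-transformed per triangle:

```python
from collections import deque

def cache_misses(indices, cache_size=10):
    """Count vertices that must be (re)transformed with a FIFO cache."""
    cache, misses = deque(maxlen=cache_size), 0
    for i in indices:
        if i not in cache:
            misses += 1
            cache.append(i)     # deque drops the oldest entry when full
    return misses

def grid_indices(n):
    """Index list for an n x n quad grid, two triangles per quad."""
    idx = []
    for y in range(n):
        for x in range(n):
            a, b = y * (n + 1) + x, y * (n + 1) + x + 1
            c, d = (y + 1) * (n + 1) + x, (y + 1) * (n + 1) + x + 1
            idx += [a, b, c, b, d, c]
    return idx

for n in (4, 16):
    triangles = 2 * n * n
    print(n, cache_misses(grid_indices(n)) / triangles)
```

With this cache the coarse mesh re-transforms about 0.78 vertices per triangle and the dense one over 1.0: once a mesh row no longer fits the cache, reuse between strips is lost. This is one way a fixed cache size can favor a particular detail level.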

Hidden Surface Removal

  1. Support and maximum performance of HSR percentagewise (for the different number of triangles): 



  2. Isn't it shocking? 

    1. HSR doesn't work on either NVIDIA chip! It's disabled by the drivers (though it remains enabled on the NV20). The registry key that allows turning it on and off doesn't work - we failed to enable this function on the NV25. What's the matter? An error in the chip that makes it impossible? 
    2. It does work (HyperZ) in all the RADEON 9700 and RADEON 9500 PRO chips and demonstrates excellent performance. But it is not supported in any RADEON 9500 chips! It's disabled, probably on the driver level again. Why? Maybe to create additional differentiation in real applications? But the more plausible explanation is that these dies have defects, which is why they are used for RADEON 9500 cards. Besides, to lower performance relative to the RADEON 9500 PRO and 9700, such dies have half of their pipelines disabled. In the first part we discussed turning the RADEON 9500 into a RADEON 9700 (9500 PRO) with RivaTuner, i.e. on the software level. Subsequent events showed it's not as smooth as we would like. First of all, not all R9500 chips work without artifacts after the modification: about 28% have bugs indicating problems in the HyperZ unit. Isn't that the unit that controls HSR? I think ATI disables the crippled HSR unit in software, along with half of the pipelines, and then uses such chips for the RADEON 9500. 

    3. There is more food for thought for owners of the RADEON 9500 who want to increase the performance of their cards in software. Is there a way to enable the 4 pipelines without touching HSR? Well, we are working on it. 

  3. Here is the efficiency in comparison with the unsorted scene: 



  4. Even when the scene is initially unsorted there is some gain, best seen with a small number of polygons. So if you want to reap the benefits of HSR (though half of the tested chips have it disabled), sort the scene front to back before rendering - the performance increase can be considerable (several times). On an unsorted scene HSR has an effect, but a modest one (tens of percent). However, portal applications - and most modern FPS engines are among them - do sort scenes before rendering. That is why the game is worth the candle, first of all for games of this class. 

    So, be that as it may, we see again that most of the accelerators tested today have HSR forcibly disabled. It turns out that when you buy a RADEON 9500 and read about HyperZ in its specification, or buy a GeForce4 Ti 4600 and read about EarlyZ Cull, you are being deceived. 
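The mechanism behind the sorting advice is simple to demonstrate. The toy model below counts shaded fragments for one pixel covered by five surfaces, drawn in the two opposite orders (a sketch of early Z rejection in general, not of any particular chip):

```python
# Toy early-Z model: count how many fragments get shaded when the same
# stack of overlapping surfaces is drawn front-to-back vs. back-to-front.

def shaded_fragments(depths_in_draw_order):
    zbuffer = float("inf")
    shaded = 0
    for z in depths_in_draw_order:      # one fragment per surface at this pixel
        if z < zbuffer:                 # shade only if nearer than the Z-buffer
            zbuffer = z
            shaded += 1
    return shaded

layers = [1, 2, 3, 4, 5]                # five surfaces covering one pixel
print(shaded_fragments(sorted(layers)))                 # front-to-back: 1
print(shaded_fragments(sorted(layers, reverse=True)))   # back-to-front: 5
```

Front to back, only the nearest surface is ever shaded; back to front, every layer is shaded and then overwritten - which is exactly the several-fold gap between the sorted and unsorted cases.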

  5. Dependence on resolution: 



  6. The conclusion is that HSR works best in low resolutions. The explanation is simple: the rejected blocks usually have a fixed size (say, 8x8), so in higher resolutions the number of blocks that must be rejected for the same entirely hidden triangle is greater, and HSR efficiency decreases. This has an effect even with the RADEON's hierarchical Z-buffer. Perhaps in future accelerators the developers should use several base block sizes, switching them per resolution, or simply increase the block size - forcing users of new accelerators toward LCD monitors at 1280x1024 or higher. 
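The block-count argument is plain arithmetic, sketched below (the 8x8 block is the example size from the text; the 50% coverage figure is an arbitrary assumption):

```python
from math import ceil

def blocks_to_reject(width, height, coverage=0.5, block=8):
    """Fixed-size blocks the HSR unit must test for a hidden region
    covering `coverage` of the frame."""
    return ceil(width * height * coverage / (block * block))

for w, h in [(640, 480), (1024, 768), (1600, 1200)]:
    print(f"{w}x{h}: {blocks_to_reject(w, h)} blocks")
```

The same hidden geometry costs 2400 block tests at 640x480 but 15000 at 1600x1200, so per-frame HSR overhead grows linearly with resolution.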

  7. Now, to compare FPS visually, look at the data we used in the beginning to calculate the effectiveness percentagewise: 





Pixel Shading

This test is ATI-only, because hardware PS 2.0 support is its minimum requirement. On the good old GeForce4 Ti 4600 coupled with a 2 GHz Pentium 4, software emulation of PS 2.0 yields approximately one frame per two seconds, in the minimal window. 

  1. Test: 



  2. Well, the core clock and the number of pipelines come first. The memory bus has little influence, and only on some shaders (the first and the seventh). 200 fps against 0.5 is a good argument, at least where games use the new pixel shader version. We'll see how the GeForce FX performs. 

  3. Dependence on resolution: 





  4. Well, this is a telling dependence. The memory bus has almost no effect, for the reasons mentioned above: for more or less complicated pixel shaders the main parameters are core frequency and the number of pipelines. The shift from filling to computation we were promised with the advent of DX9 is plainly visible. 
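The "filling to calculations" shift can be expressed as a simple model in which the shader's arithmetic, not memory, is the bottleneck (the pipe count and clock follow the cards discussed; the 40-cycle shader length is an illustrative assumption):

```python
# Fill model for a long, ALU-bound pixel shader: memory bandwidth does
# not appear anywhere in the formula.

def shader_fps(pipes, clock_hz, cycles_per_pixel, width, height):
    """Frame rate when a pixel shader is purely arithmetic-bound."""
    pixels_per_second = pipes * clock_hz / cycles_per_pixel
    return pixels_per_second / (width * height)

# e.g. 8 pipes at 325 MHz running a 40-cycle PS 2.0 shader at 1024x768
fps = shader_fps(8, 325e6, 40, 1024, 768)
print(f"{fps:.0f} fps")
```

In this regime fps scales with pipes times clock and drops linearly with pixel count - exactly the behavior the diagrams show.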

Point Sprites

  1. With lighting and without, depending on the size: 



  2. As expected, whether lighting is on or off matters only for small sprites; after that the fillrate is the limit (starting at size 8). So for rendering systems consisting of a great number of particles, the optimal size is less than 8. By the way, NVIDIA does better here than ATI: its performance drop is less noticeable, and up to size 8 it can be considered monotonous and rather small, while ATI falters between sizes 4 and 8. The peak values are reached without lighting and come to a bit over 20 million sprites per second for the RADEON 9700 PRO and a bit over 10 million for the GeForce4 Ti 4600. 

    At sizes 2 and 4, both ATI and NVIDIA are limited only by geometry performance, which for simple workloads should be twice as high for ATI - and this is exactly what we see. 

    Also note that point sprites are no cure-all: the figures are quite close to those obtainable with ordinary polygons. However, point sprites are handier for programmers, first of all for all sorts of particle systems. 
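The crossover between the geometry limit and the fill limit can be sketched like this (the ~20M sprites/s peak is the figure measured above; the 1.2 Gpix/s effective fillrate is a hypothetical number chosen so the crossover lands near size 8, as observed):

```python
def sprites_per_second(size, geometry_limit, fillrate):
    """Whichever bound bites first: geometry, or filling size^2 pixels."""
    fill_limit = fillrate / (size * size)
    return min(geometry_limit, fill_limit)

GEOMETRY_LIMIT = 20e6   # ~20M sprites/s, measured peak (RADEON 9700 PRO)
FILLRATE = 1.2e9        # hypothetical effective pixel fillrate, pixels/s

for size in (2, 4, 8, 16):
    rate = sprites_per_second(size, GEOMETRY_LIMIT, FILLRATE)
    print(f"size {size}: {rate / 1e6:.1f}M sprites/s")
```

Below the crossover size the sprite rate is flat at the geometry peak; above it, the rate falls as 1/size^2, which matches the shape of the curves.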

  3. Let's see how animation impacts the performance: 



  4. Well, the contribution of the animation is not great but noticeable, irrespective of the VS version. 

  5. At last, there are detailed diagrams of dependence on resolution for three sizes: 





  6. As expected, the dependence is insignificant; it is most noticeable for small sprites. 

Well, our first extended tests of these cards in synthetic DX9 benchmarks from the RightMark 3D suite are over. 

Conclusion

  1. I must admit that the results perfectly correlate with the theoretical peak values and are almost independent of other subsystems of accelerators. 
  2. It's strange that the performance of the ATI cards falls in case of loops in the vertex shaders 2.0. 
  3. It's also unexpected that HSR is locked on both GeForce4 Ti cards and on all RADEON 9500 cards. This issue was never brought up before. 
  4. The lack of HSR in the RADEON 9500, together with the character of the artifacts on some cards converted into RADEON 9700s, tells us that some RADEON 9500 cards use dies with a crippled HSR unit. On one hand this benefits ATI (less waste), but on the other hand it tricks users, because every RADEON 9500 specification mentions the HyperZ technology, which controls HSR as well. 

Well, it's not the last time we are testing the RADEON 9500-9700 in DX9, we will return if the new drivers deserve it, and now we are preparing for... 




Well, I think you know what I mean :-) 
 
 

Alexander Medvedev (unclesam@ixbt.com)  
Andrey Vorobiev (anvakams@ixbt.com)



Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.