iXBT Labs - Computer Hardware in Detail


ATI RADEON 9500 64MB, 9500 128MB,
9500 PRO, 9700 and 9700 PRO
in DirectX 9.0:
Part 2 - New Synthetic Tests for DX9

January 9, 2003






"There is such a term as the Art of Testing"



CONTENTS

  1. General information
  2. Philosophy of the synthetic tests
  3. Description of RightMark 3D: Pixel Filling synthetic tests 
  4. Description of RightMark 3D: Geometry Processing Speed synthetic tests 
  5. Description of RightMark 3D: Hidden Surface Removal synthetic tests 
  6. Description of RightMark 3D: Pixel Shading synthetic tests 
  7. Description of RightMark 3D: Point Sprites synthetic tests 
  8. Test systems
  9. Test results: Pixel Filling 
  10. Test results: Geometry Processing Speed 
  11. Test results: Hidden Surface Removal 
  12. Test results: Pixel Shading 
  13. Test results: Point Sprites 
  14. Conclusion

General information

As we have mentioned many times, the release of the new DirectX 9.0 API could not pass unnoticed, especially since there is already a whole line of video cards supporting DX9. The first part was devoted to testing this line of 5 cards in a large number of gaming benchmarks (none of which can yet work under DX9, which is why it was mostly an estimation of the R9500-9700 line as a whole). 

But first, have a look at the earlier reviews dealing with the RADEON 9500-9700 video cards. 

Theoretical materials and reviews of video cards covering the functional properties of the ATI RADEON 9700 (PRO) / RADEON 9500 (PRO) VPUs

Today we will study these cards in several synthetic tests from the new RightMark3D suite, which will soon be released on the Internet (a fragment of one of the test scenes is shown above). The suite is meant for DX9 cards, though some tests can run on DX8.1 cards as well. 

Here are the price niches the cards belong to:

  1. RADEON 9500 64MB (4 rendering pipelines, 128bit memory bus, 275/270 (540) MHz) - $120-140; 
  2. RADEON 9500 128MB (4 rendering pipelines, 256bit memory bus, 275/270 (540) MHz) - $150-160; 
  3. RADEON 9500 PRO 128MB (8 rendering pipelines, 128bit memory bus, 275/270 (540) MHz) - $180-200; 
  4. RADEON 9700 128MB (8 rendering pipelines, 256bit memory bus, 275/270 (540) MHz) - $230-250; 
  5. RADEON 9700 PRO 128MB (8 rendering pipelines, 256bit memory bus, 325/310 (620) MHz) - $290-330. 
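From these specs one can derive the peak figures that will matter in the tests below. A minimal sketch of the arithmetic (our own, not part of RightMark 3D):

```python
def peak_fillrate_mpix(pipelines, core_mhz):
    # Peak pixel fillrate in Mpix/s: one pixel per pipeline per clock.
    return pipelines * core_mhz

def memory_bandwidth_gb(bus_bits, eff_mem_mhz):
    # Peak memory bandwidth in GB/s: bus width (in bytes) * effective clock.
    return bus_bits / 8 * eff_mem_mhz / 1000.0

# RADEON 9700 PRO: 8 pipelines at 325 MHz, 256-bit bus at 620 MHz effective
print(peak_fillrate_mpix(8, 325))      # 2600 (Mpix/s)
print(memory_bandwidth_gb(256, 620))   # 19.84 (GB/s)
```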

All the cards were described in detail in the first part.

New synthetic tests for DX9

Today we will describe the suite of synthetic tests we are currently developing for the DX9 API and obtain the first results with it. 

The RightMark 3D test suite, currently under development, includes the following synthetic tests at this moment: 

  1. Pixel Filling Test; 
  2. Geometry Processing Speed Test; 
  3. Hidden Surface Removal Test; 
  4. Pixel Shader Test; 
  5. Point Sprites Test; 
  6. State Changes Test (performance of the drivers and their operation with the accelerator). 

Today we will deal with the first five tests and examine the data obtained on ATI's and NVIDIA's accelerators. So, we have two aims: to estimate performance of the accelerators under the DX9 API, and to estimate the behavior of the RightMark 3D synthetic tests themselves in different situations. The latter is important to establish the applicability, repeatability and soundness of the results of our synthetic tests, which we are going to use widely for benchmarking various DX9 accelerators and make available as a free download for our readers and all computer graphics enthusiasts. But first of all, a small digression on the ideology of the tests: 

Philosophy of the synthetic tests

The main idea of all our tests is to focus on the performance of a particular chip subsystem. In contrast to real applications, which measure the overall effectiveness of an accelerator in one or another practical task, synthetic tests stress separate performance aspects. The matter is that the release of a new accelerator is usually a year ahead of the applications which can use all its capabilities effectively. Users who want to be at the front line of technology have to buy an accelerator almost blindly, warmed only by the results of tests carried out on outdated software, and no one can guarantee that the situation won't change with the games they are waiting for. Apart from the enthusiasts who take such a risk, there are some other categories of people in this complicated situation: 

  • First category - people who don't want to bother with upgrades and buy a maximum-configuration computer for a long time. It's very important for them that their machines stay suitable for upcoming applications as long as possible. 
  • Second category - software developers; they have to keep an eye on the capabilities and balance of new accelerators to competently design and balance the engine (code) and content (levels, models), taking into account the effective use of hardware which will become widespread by the time their applications reach the market. The synthetic tests will help them choose ways to realize their ideas and restrain the bounds of their imagination :-). 
  • Third category - IT analysts (for example, from big trade companies) and hardware reviewers, i.e. people who have to estimate the potential of products even before they are officially announced. 

So, synthetic tests allow estimating performance and capabilities of separate subsystems of accelerators in order to forecast accelerator's behavior in some or other applications, both existing (overall estimation of suitability and prospects for a whole class of applications) and developing, provided that a given accelerator demonstrates peculiar behavior under such applications. 

Description of the RightMark 3D synthetic tests

Pixel Filling

This test has several functions, namely: 

  1. Measurement of frame buffer filling performance 
  2. Measurement of performance of different texture filtering modes
  3. Measurement of effectiveness of operation (caching) with textures of different sizes
  4. Measurement of effectiveness of operation (caching and compression) with textures of different formats
  5. Measurement of multitexturing effectiveness 
  6. Visual comparison of quality of implementation of some or other texture filtering modes

The test draws a pyramid whose base lies in the monitor's plane and the vertex is moved away to the maximum: 




Each of its four sides consists of triangles. The small number of triangles allows us to avoid any dependence on geometrical performance, which has nothing to do with what is studied here. From 1 to 8 textures are applied to each pixel during filling. You can also disable texturing (0 textures) and measure the pure fill rate using a constant color value. 

During the test the vertex moves around at a constant speed, and the base rotates around the axis Z: 




So, the pyramid's sides take all possible angles of inclination in both planes, the number of shaded pixels stays constant, and all possible distances from the minimal to the maximal are present. The inclination of the shaded plane and the distance to the shaded pixels govern many filtering algorithms, in particular anisotropic filtering and various modern implementations of trilinear filtering. By rotating the pyramid we put the accelerator in all conditions that can occur in real applications. This allows us to estimate the filtering quality in all possible cases and get weighted performance data. 

The test can be carried out in different modes - the same operations can be accomplished by shaders of different versions and fixed pipelines inherited from the previous DX generations. That is why you can find out the performance gap between different shader versions. 

A special texture with different colors and figures eases investigation of quality aspects of the filtering and its interaction with full-screen anti-aliasing. Mip levels can have different colors: 




so that you can estimate the algorithm of their blending and selection. 

Here are the adjustable test parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Color mip levels
  • Operating mode (and the maximum number of textures per 1 pixel): 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages (up to 8 textures) 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages (up to 8 textures) 
    • Vertex Shaders 1.1 and Pixel Shaders 1.1 (up to 4 textures) 
    • Vertex Shaders 1.1 and Pixel Shaders 1.4 (up to 6 textures) 
    • Vertex Shaders 2.0 and Pixel Shaders 2.0 (up to 8 textures) 

  • Textures per pixel: 
    • 0 (only filling) 
    • from 1 to 8 

  • Texture size: 
    • 128x128 
    • 256x256 
    • 512x512 

  • Texture format: 
    • A8R8G8B8 
    • X8R8G8B8 
    • A1R5G5B5 
    • X1R5G5B5 
    • DXT1 
    • DXT2 
    • DXT3 
    • DXT4 
    • DXT5 

  • Filtering type: 
    • no
    • bilinear
    • trilinear
    • anisotropic
    • anisotropic + trilinear

The test reports its results in FPS and FillRate. The latter plays two roles: in the no-texture mode we measure the pure frame buffer write speed, i.e. the number of pixels filled per second (Pixel FillRate); in the texture mode it indicates the number of texture values sampled and filtered per second (Texturing Rate, or Texture Fill Rate). 
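A minimal sketch of how the two reported figures relate (the per-frame pixel count is fixed by the pyramid construction; the names here are ours):

```python
def fill_rates(fps, shaded_pixels_per_frame, textures_per_pixel):
    # Pixel FillRate: pixels written to the frame buffer per second.
    pixel_rate = fps * shaded_pixels_per_frame
    # Texturing Rate: texture values sampled and filtered per second;
    # with 0 textures only the pixel figure is meaningful.
    texture_rate = pixel_rate * textures_per_pixel
    return pixel_rate, texture_rate
```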

Here is an example of a pixel shader used for filling in case of the most intensive version of this test (PS/VS 2.0, 8 textures): 

ps_2_0

dcl t0
dcl t1
dcl t2
dcl t3
dcl t4
dcl t5
dcl t6
dcl t7

dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
dcl_2d s4
dcl_2d s5
dcl_2d s6
dcl_2d s7

texld r0, t0, s0
texld r1, t1, s1
texld r2, t2, s2
texld r3, t3, s3
texld r4, t4, s4
texld r5, t5, s5
texld r6, t6, s6
texld r7, t7, s7

mov r11, r0
lrp r11, c0, r11, r1
lrp r11, c0, r11, r2
lrp r11, c0, r11, r3
lrp r11, c0, r11, r4
lrp r11, c0, r11, r5
lrp r11, c0, r11, r6
lrp r11, c0, r11, r7

mov oC0, r11
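The lrp chain at the end of the shader folds the eight samples together with a constant weight c0. A scalar sketch of the same arithmetic (lrp computes w*a + (1-w)*b per component):

```python
def lrp(w, a, b):
    # Direct3D lrp: w*a + (1 - w)*b, applied per color component
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def blend8(samples, w=0.5):
    # mov r11, r0 followed by seven lrp instructions, as in the shader above
    result = samples[0]
    for s in samples[1:]:
        result = lrp(w, result, s)
    return result
```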

Geometry Processing Speed

This test measures the geometry processing speed in different modes. We tried to minimize the influence of filling and other accelerator subsystems, and to make the geometrical information and its processing as close to real models as possible. The main task is to measure the peak geometrical performance in different transform and lighting tasks. At present, the test supports the following lighting models (calculated at the vertex level): 

  1. Ambient Lighting - simplest constant lighting
  2. 1 Diffuse Light 
  3. 2 Diffuse Lights
  4. 3 Diffuse Lights
  5. 1 Diffuse + Specular Light 
  6. 2 Diffuse + Specular Lights 
  7. 3 Diffuse + Specular Lights 

The test draws several instances of the same model with a great number of polygons. Each instance has its own geometrical transformation parameters and relative positions of light sources. The model's polygons are extremely small (most are comparable to or smaller than a screen pixel): 




thus, the resolution and filling do not affect the test results: 




The light sources move in different directions during the test to underline various combinations of the initial parameters. 

There are three degrees of scene detail; they influence the total number of polygons transformed in one frame. This lets us make sure that the test results do not depend on the scene and fps at all. 

Here are the adjustable test parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Fixed Function TCL and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Pixel Shaders 1.1 
    • Vertex Shaders 1.1 and Pixel Shaders 1.4 
    • Vertex Shaders 2.0 and Pixel Shaders 2.0 

  • Geometry detailing: 
    • 1 (low) 
    • 2 (middle) 
    • 3 (high) 

  • Lighting model (determines complexity of calculations): 
    • Ambient Lighting - simplest constant lighting
    • 1 Diffuse Light
    • 2 Diffuse Lights
    • 3 Diffuse Lights
    • 1 (Diffuse + Specular) Light 
    • 2 (Diffuse + Specular) Lights
    • 3 (Diffuse + Specular) Lights

The test results are available in FPS and PPS (Polygons Per Second). 
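The PPS figure follows directly from the frame rate and the per-frame triangle count. A minimal sketch (the actual triangle counts per detail level are internal to the test; the names here are ours):

```python
def polygons_per_second(fps, instances_per_frame, triangles_per_instance):
    # PPS = triangles transformed in one frame * frames per second
    return fps * instances_per_frame * triangles_per_instance

# e.g. 60 fps with 10 model instances of 50,000 triangles -> 30 MPolys/s
```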

Here is an example of a vertex shader (VS 2.0) used in this test for transformation and lighting with an externally set number of diffuse + specular lights: 

vs_2_0

dcl_position v0
dcl_normal v3

//
// Position Setup
//

m4x4 oPos, v0, c16

//
// Lighting Setup
//

m4x4 r10, v0, c8 // transform position to world space
m3x3 r0.xyz, v3.xyz, c8 // transform normal to world space

nrm r7, r0 // normalize normal

add r0, -r10, c2 // get a vector toward the camera position

nrm r6, r0 // normalize eye vector 

mov r4, c0 // set diffuse to 0,0,0,0

mov r2, c0 // setup diffuse,specular factors to 0,0
mov r2.w, c94.w // setup specular power

//
// Lighting
//

loop aL, i0

    add r1, c[40+aL], -r10 // vertex to light direction
    dp3 r0.w, r1, r1
    rsq r1.w, r0.w

    dst r9, r0.wwww, r1.wwww // (1, d, d*d, 1/d)
    dp3 r0.w, r9, c[70+aL] // a0 + a1*d + a2*d*d
    rcp r8.w, r0.w // 1 / (a0 + a1*d + a2*d*d) 

    mul r1, r1, r1.w // normalize the vertex to the light vector

    add r0, r6, r1 // calculate half-vector (light vector + eye vector)

    nrm r11, r0 // normalize half-vector

    dp3 r2.x, r7, r1 // N*L
    dp3 r2.yz, r7, r11 // N*H

    sge r3.x, c[80+aL].y, r9.y // (range > d) ? 1:0
    mul r2.x, r2.x, r3.x
    mul r2.y, r2.y, r3.x

    lit r5, r2 // calculate the diffuse & specular factors
    mul r5, r5, r8.w // scale by attenuation

    mul r0, r5.y, c[30+aL] // calculate diffuse color
    mad r4, r0, c90, r4 // add (diffuse color * material diffuse)

    mul r0, r5.z, c[60+aL] // calculate specular color
    mad r4, r0, c91, r4 // add (specular color * material specular)

endloop

mov oD0, r4 // final color
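The dst/dp3/rcp/sge sequence inside the loop implements the standard D3D range-limited distance attenuation. A Python sketch of the same math (function and parameter names are ours):

```python
import math

def attenuation(light_pos, vertex_pos, a0, a1, a2, light_range):
    # Mirrors the shader above: dst yields (1, d, d*d, 1/d), dp3 with
    # (a0, a1, a2) evaluates the attenuation polynomial, rcp inverts it,
    # and sge zeroes the contribution beyond the light's range.
    dx = [l - v for l, v in zip(light_pos, vertex_pos)]
    d2 = sum(c * c for c in dx)        # dp3 r0.w, r1, r1
    d = math.sqrt(d2)
    atten = 1.0 / (a0 + a1 * d + a2 * d2)
    return atten if light_range > d else 0.0
```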

Hidden Surface Removal

This test examines techniques for removing hidden surfaces and pixels and estimates their effectiveness, i.e. the effectiveness of operation with a traditional depth buffer, and the availability and effectiveness of early culling of hidden pixels. The test generates a pseudorandom scene of a given number of triangles: 




which will be rendered in one of three modes: 

  1. sorted, front to back order 
  2. sorted, back to front order 
  3. unsorted 

In the second case the test renders all pixels in turn, including hidden ones, if the accelerator is based on a traditional or hybrid architecture (a tile accelerator can optimize this case as well, but remember that sorting will take place anyway, albeit at the hardware or driver level). 

In the first case the test can draw only the small number of visible pixels, and the others can be rejected before filling. The third case is a middle ground, similar to what the HSR mechanism encounters in real applications that do not optimize the order of scene rendering. To get an idea of the peak effectiveness of the HSR algorithm, one should collate the results of the first and second modes (the most favorable first mode against the least convenient second one). Comparing the optimal mode with the unsorted one (i.e. the first and the third) gives an approximate degree of effectiveness in real applications. 
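The comparison just described can be reduced to a simple ratio. A minimal sketch (our own formulation; the suite may report the percentage differently):

```python
def hsr_gain_pct(fps_optimal, fps_reference):
    # Relative speed-up of the front-to-back (optimal) mode over a
    # reference mode: back-to-front for peak HSR effectiveness,
    # unsorted for an approximation of real applications.
    return (fps_optimal / fps_reference - 1.0) * 100.0
```

For instance, hsr_gain_pct(fps_front_to_back, fps_back_to_front) estimates the peak effectiveness, while substituting the unsorted mode's fps gives the real-world estimate.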

The scene rotates around the Z axis during the test to smooth away potential peculiarities of different early HSR algorithms, which are primarily based on zoning of the frame buffer. As a result, the triangles and their edges take all possible positions. 

You can also change the number of rendered triangles to see how the test depends on other chip subsystems and drivers. We can expect the results to improve as the number of triangles grows, but on the other hand the growth is justified only up to a certain point, after which the influence of other subsystems on the test can start rising again. That is why this parameter was brought in: to gauge the quality of the test with respect to the number of triangles. 

Here are the adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Fixed Function TCL and Fixed Function Blend Stages 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 

  • Number of triangles: 
    • 1000 to 20000 

  • Sorting mode for a rendered scene: 
    • none
    • back to front polygons
    • front to back polygons

Pixel Shading

This test estimates the performance of various 2.0 pixel shaders. In the case of PS 1.1, the execution speed of shaders translated into stage settings was easy to determine; all that was needed was a test like Pixel Filling carried out with a great number of textures. In the case of PS 2.0 the situation is much more complicated. Per-clock instruction execution and new data formats (floating-point numbers) can create a significant difference in performance not only between different accelerator architectures, but also between combinations of separate instructions and data formats inside one chip. To test the performance of the pixel processors of modern accelerators, we decided to use an approach similar to CPU benchmarking, i.e. to measure the performance of the following set of pixel shaders, which have real prototypes and applications: 

  1. per-pixel diffuse lighting with per-pixel attenuation - 1 point source
  2. per-pixel diffuse lighting with per-pixel attenuation - 2 point sources
  3. per-pixel diffuse lighting with per-pixel attenuation - 3 point sources
  4. per-pixel diffuse lighting + specular lighting with per-pixel attenuation (1 point source)
  5. per-pixel diffuse lighting + specular lighting with per-pixel attenuation (2 point sources)
  6. marble animated procedural texturing
  7. fire animated procedural texturing

The last two tests implement procedural textures (pixel color values are calculated according to a certain formula), which are an approximate mathematical model of a material. Such textures take little memory (only comparatively small tables for accelerated calculation of various factors are stored) and support almost infinite detail! They are easy to animate by changing the basic parameters. It's quite possible that future applications will use exactly such texturing methods as the capabilities of accelerators grow. 

The geometrical test scene is simplified, so dependence on the chip's geometrical performance is almost eliminated. Hidden surface removal is absent as well - all surfaces of the scene are visible at any moment. The load falls only on the pixel pipelines. 

Here are adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Pixel shader: 
    • 1 point light ( per-pixel diffuse with per-pixel attenuation ) 
    • 2 point lights ( per-pixel diffuse with per-pixel attenuation ) 
    • 3 point lights ( per-pixel diffuse with per-pixel attenuation ) 
    • 1 point light ( per-pixel diffuse + specular with per-pixel attenuation ) 
    • 2 point lights ( per-pixel diffuse + specular with per-pixel attenuation ) 
    • Procedural texturing (Marble) 
    • Procedural texturing (Fire) 

Below is the code of some of the shaders. Per-pixel diffuse + specular with per-pixel attenuation for 2 light sources: 

ps_2_0

// 
// Texture Coords
//

dcl t0 // Diffuse Map
dcl t1 // Normal Map
dcl t2 // Specular Map

dcl t3.xyzw // Position (World Space)

dcl t4.xyzw // Tangent
dcl t5.xyzw // Binormal
dcl t6.xyzw // Normal

//
// Samplers
//

dcl_2d s0 // Sampler for Base Texture
dcl_2d s1 // Sampler for Normal Map
dcl_2d s2 // Sampler for Specular Map

//
// Normal Map
//

texld r1, t1, s1
mad r1, r1, c29.x, c29.y

//
// Light 0
//

// Attenuation

add r3, -c0, t3 // LightPosition-PixelPosition
dp3 r4.x, r3, r3 // Distance^2
rsq r5, r4.x // 1 / Distance
mul r6.x, r5.x, c20.x // Attenuation / Distance

// Light Direction to Tangent Space

mul r3, r3, r5.x // Normalize light direction

dp3 r8.x, t4, -r3 // Transform light direction to tangent space
dp3 r8.y, t5, -r3
dp3 r8.z, t6, -r3
mov r8.w, c28.w
 

// Half Angle to Tangent Space

add r0, -t3, c25 // Get a vector toward the camera
nrm r11, r0

add r0, r11, -r3 // Get half angle
nrm r11, r0 
dp3 r7.x, t4, r11 // Transform half angle to tangent space
dp3 r7.y, t5, r11
dp3 r7.z, t6, r11
mov r7.w, c28.w

// Diffuse

dp3 r2.x, r1, r8 // N * L
mul r9.x, r2.x, r6.x // * Attenuation / Distance

mul r9, c10, r9.x // * Light Color

// Specular

dp3 r2.x, r1, r7 // N * H
pow r2.x, r2.x, c26.x // ^ Specular Power
mul r10.x, r2.x, r6.x // * Attenuation / Distance

mul r10, c12, r10.x // * Light Color

//
// Light 1
//

// Attenuation

add r3, -c1, t3 // LightPosition-PixelPosition
dp3 r4.x, r3, r3 // Distance^2
rsq r5, r4.x // 1 / Distance
mul r6.x, r5.x, c21.x // Attenuation / Distance

// Light Direction to Tangent Space

mul r3, r3, r5.x // Normalize light direction

dp3 r8.x, t4, -r3 // Transform light direction to tangent space
dp3 r8.y, t5, -r3
dp3 r8.z, t6, -r3
mov r8.w, c28.w

// Half Angle to Tangent Space

add r0, -t3, c25 // Get a vector toward the camera
nrm r11, r0

add r0, r11, -r3 // Get half angle
nrm r11, r0 

dp3 r7.x, t4, r11 // Transform half angle to tangent space
dp3 r7.y, t5, r11
dp3 r7.z, t6, r11
mov r7.w, c28.w

// Diffuse

dp3 r2.x, r1, r8 // N * L
mul r2.x, r2.x, r6.x // * Attenuation / Distance

mad r9, c11, r2.x, r9 // * Light Color

// Specular

dp3 r2.x, r1, r7 // N * H
pow r2.x, r2.x, c26.x // ^ Specular Power
mul r2.x, r2.x, r6.x // * Attenuation / Distance

mad r10, c13, r2.x, r10 // * Light Color

//
// Diffuse + Specular Maps
//

texld r0, t0, s0
texld r1, t2, s2

mul r9, r9, r0 // Diffuse Map
mad r9, r10, r1, r9 // Specular Map

// Finalize

mov oC0, r9

Fire procedural texture: 

ps_2_0

def c3, -0.5, 0, 0, 1
def c4, 0.159155, 6.28319, -3.14159, 0.25
def c5, -2.52399e-007, -0.00138884, 0.0416666, 2.47609e-005

dcl v0

dcl t0.xyz
dcl t1.xyz
dcl t2.xyz
dcl t3.xyz

dcl_volume s0
dcl_2d s1

texld r0, t0, s0
mul r7.w, c0.x, r0.x
texld r2, t1, s0
mad r4.w, c0.y, r2.x, r7.w
texld r11, t2, s0
mad r1.w, c0.z, r11.x, r4.w
texld r8, t3, s0
mad r10.w, c0.w, r8.x, r1.w
mul r5.w, c2.x, r10.w
mad r7.w, c1.x, t0.x, r5.w
mad r9.w, r7.w, c4.x, c4.w
frc r4.w, r9.w
mad r6.w, r4.w, c4.y, c4.z
mul r1.w, r6.w, r6.w
mad r3.w, r1.w, c5.x, c5.w
mad r5.w, r1.w, r3.w, c5.y
mad r7.w, r1.w, r5.w, c5.z
mad r9.w, r1.w, r7.w, c3.x
mad r11.w, r1.w, r9.w, c3.w
mov r3.xy, r11.w
texld r6, r3, s1
mov oC0, r6
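Most of this shader is a weighted sum of four noise octaves (the texld fetches from the volume texture s0), followed by a sinusoid whose polynomial coefficients are baked into c3-c5; the result then indexes a 1D color ramp (s1). A sketch of the sinusoid part - the frc/mad sequence computes sin(x) via a range-reduced cosine polynomial:

```python
import math

def shader_sin(x):
    # Range reduction: wrap the phase into [-pi, pi)
    # (c4 = 1/(2*pi), 2*pi, -pi, 0.25; frc keeps the fractional part)
    t = x * 0.159155 + 0.25
    t = (t - math.floor(t)) * 6.28319 - 3.14159
    # Even polynomial in t*t: the Taylor series of cos, so that
    # cos(t) = cos(x - pi/2) = sin(x)
    r = t * t
    p = r * -2.52399e-7 + 2.47609e-5   # c5.x, c5.w
    p = r * p - 0.00138884             # c5.y
    p = r * p + 0.0416666              # c5.z
    p = r * p - 0.5                    # c3.x
    return r * p + 1.0                 # c3.w
```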

Point Sprites

This test measures the performance of just one function: the display of point sprites used for creating particle systems. The test draws an animated particle system resembling a human body: 




We can adjust the particle size (which affects the fillrate) and enable or disable lighting and animation. For a particle system, geometry processing is very important, which is why we didn't separate these two aspects - filling and geometrical calculations (animation and lighting) - but made it possible to change the load on one or another part by changing the sprite size and switching animation and lighting on/off. 
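The trade-off described above can be illustrated with a rough estimate of the fill load, assuming each sprite is rasterized as a screen-aligned square (a sketch of our own, not the test's internals):

```python
def sprite_fill_load(num_particles, sprite_size_px):
    # Pixels filled per frame grow quadratically with the sprite size,
    # while the vertex load (animation, lighting) stays proportional
    # to the particle count alone.
    return num_particles * sprite_size_px ** 2
```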

Here are adjustable parameters: 

  • Resolution
  • Window or fullscreen mode
  • Test time (accumulation of statistics) in seconds
  • Vertex shaders software emulation and TCL 
  • Operating modes: 
    • Vertex Shaders 1.1 and Fixed Function Blend Stages 
    • Vertex Shaders 2.0 and Fixed Function Blend Stages 

  • Animation mode: 
    • off
    • on

  • Lighting mode: 
    • off
    • on

Stay with us

In the near future we will finish debugging and publish the first results of the seventh test, which primarily measures the quality of the drivers and how effectively data and parameters are delivered to the accelerator. 

Soon all the synthetic tests will be able to use not only assembly shader versions but also shaders compiled from a higher-level language with Microsoft's HLSL compiler and NVIDIA's Cg/CgFX. 

The most pleasant event is the approaching release of the first beta version of the RightMark 3D suite. At first, the beta version will provide only the synthetic tests for free use. 

Practical estimation

Now comes the most interesting part, where we will show and comment on the data obtained on accelerators of two main families - ATI RADEON 9500/9700 and NVIDIA GeForce4 Ti 4200/4600. 

Test system and drivers

Testbed: 

  • Pentium 4 based computer (Socket 478): 
    • Intel Pentium 4 3066 MHz CPU; 
    • ASUS P4G8X (iE7205, HyperThreading ON) mainboard; 
    • 512 MB DDR SDRAM PC3200; 
    • Seagate Barracuda IV 40GB hard drive; 
    • Windows XP SP1; 
    • ViewSonic P817 (21") monitor. 

In the tests we used ATI's drivers 6.255 (CATALYST 3.0a), DirectX 9.0. VSync was off in the drivers. 

The results of the following video cards are used for comparison: 

  • Albatron Medusa GeForce4 Ti 4600 (300/325 (650) MHz, 128 MB, driver 42.01); 
  • ABIT Siluro GF4 Ti4200-8x (GeForce4 Ti 4200 with AGP 8x, 250/256 (512) MHz, 128 MB, driver 42.01); 

Pixel Filling

  1. The test measures the frame buffer fill rate (Pixel FillRate): constant color, no texture sampling. The scores are given in million pixels per second for different resolutions, both in the standard mode and with 4x MSAA: 



  2. As the resolution grows, the scores of the top R300-based RADEONs increase, coming close to the theoretical level. The NV25-based solutions freeze at a certain level starting from 1024x768, apparently determined by the memory bandwidth (or rather its shortage). The losses connected with AA are less dramatic for the senior RADEONs, but they are greater for the RADEON 9500, even compared with NVIDIA's aging solutions. Interestingly, in the AA modes ATI's solutions reach optimal performance at 1280x1024, after which the frame buffer becomes a burden: even compressed twofold, at MSAA 4x it takes a lot of memory. 

  3. Frame buffer fillrate with simultaneous texturing. Sampling of one simple bilinear texture is added - thus we estimate how a competing read stream from memory cuts down the filling effectiveness. The results are given in million pixels per second for different resolutions, in the standard mode and at 4x MSAA: 



  4. The picture looks similar, though the peak values are lower. Let's see how the measured data correlate with the theoretical limits calculated with the core frequency and number of pipelines: 

    Product Theoretical maximum Measured maximum (without texture) Measured maximum (with 1 texture)
    GeForce4 Ti 4200-8x 1000 978 947
    GeForce4 Ti 4600 1200 1175 1150
    RADEON 9500 128 1100 1051 1036
    RADEON 9500 PRO 2200 (128 bit!) 1737 1363
    RADEON 9700 2200 2070 1982
    RADEON 9700 PRO 2600 2340 2184

    The test results are very close to the theoretical maximums, which proves the test's soundness. Note that NVIDIA's solutions come much closer to the maximum than ATI's. The competing texture stream affected the RADEON 9500 PRO most of all because of its scarce memory throughput and only two memory controllers; as a result, the local bus overflows. The GeForce4 Ti 4600 shines in this test. 
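    The closeness to theory in the table above can be expressed as a percentage; a trivial sketch (our own arithmetic, not part of the suite):

```python
def efficiency_pct(measured, theoretical):
    # How much of the theoretical fillrate the card actually achieves
    return measured / theoretical * 100.0

# From the table above (Mpix/s, no texture):
# GeForce4 Ti 4600: efficiency_pct(1175, 1200) -> ~97.9 %
# RADEON 9500 PRO:  efficiency_pct(1737, 2200) -> ~79.0 %
```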

  5. Now look at the dependence of the Texturing Rate (pixels sampled and filtered from textures, per second) on the number of textures applied in a pass: 



  6. While NVIDIA performs well at least up to the 4 textures that can be applied in a pass, ATI's solutions lose a little performance with each added texture, which is typical of the new generation: instead of stages we now have a pixel pipeline, so each new texture costs an extra instruction. 

    Product Theoretical maximum Measured maximum (2 textures) Measured maximum (max. textures)
    GeForce4 Ti 4200-8x 2000 1682 1839
    GeForce4 Ti 4600 2400 2075 2223
    RADEON 9500 128 1100 (4 TU!) 1055 678
    RADEON 9500 PRO 2200 (128 bits!) 1799 1339
    RADEON 9700 2200 1778 1233
    RADEON 9700 PRO 2600 2070 1430

    It's clear that with a great number of textures (future applications) ATI doesn't depend much on memory (compare the last column for the 9500 PRO and 9700) but is strongly affected by the number of pipelines and the core clock speed; so enabling the pipelines and overclocking the core is what matters. 
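    The theoretical texel rates in the table follow from the pipeline and TMU configuration; a minimal sketch (the TMU counts are our reading of the specs: 2 per pipeline on the NV25, 1 on the R300):

```python
def texel_rate_mtex(pipelines, tmus_per_pipe, core_mhz):
    # Peak texel rate in MTex/s: one bilinear sample per TMU per clock
    return pipelines * tmus_per_pipe * core_mhz

# GeForce4 Ti 4200-8x: 4 pipes * 2 TMUs * 250 MHz = 2000 MTex/s
# RADEON 9500:         4 pipes * 1 TMU  * 275 MHz = 1100 MTex/s
```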

  7. Dependence on the texture format: 



  8. The results remain almost the same - all the chips have long been optimized for 32-bit textures and unpack compressed textures without any delays. But in future reviews, including the GeForce FX, we will take a look at the comparatively new floating-point formats of the render and texture buffers, where a real surprise may await us (considering the influence of the format on the processing speed). 

  9. Dependence on the filtering type: 



  10. With high anisotropy settings NVIDIA's solutions start losing performance. This was discussed in depth in our previous reviews; here I just note that our test showed quite expected results. Soon we will see how the GeForce FX works in the anisotropy mode. 

Geometry Processing Speed

  1. Simple lighting, i.e. peak throughput for triangles: 



  2. The chips show repeatable peak performance results. ATI's scores do not depend on the shader version (nor on the fixed-function T&L emulation - the FFT (Fixed Function Transformation) diagram). 104 million vertices per second is quite a lot; this figure correlates nicely with the peak value quoted by ATI. Note that the results depend only on the core's performance, which is why geometry performance is not cut down in any of the junior R300-based models. NVIDIA's hardware T&L is a bit more effective than the equivalent, actually microscopic (2 instructions - a 4x4 matrix multiplication and one copy of the result) vertex shader. Note also that NVIDIA's products lose markedly because of having only two vertex processors against ATI's four. So the test showed unprecedented results (at least compared to 3DMark 2001), closing the gap between the specified and the measured performance. 

  3. Now comes a more complicated lighting model (two diffuse light sources): 



  4. Now ATI's fixed T&L emulation is less efficient than vertex shader 1.1, and comparable in performance to VS 2.0. As you can see, the second shader version is not free at all - the loops used cause a performance drop. Moreover, the drop is greater than we could expect from one loop instruction per several tens of ordinary instructions. Especially considering that on the R300 the loops are unrolled into linear shader code by the drivers, such losses look really strange. 

    Here is the first question for the ATI drivers developers. Is everything OK, and if it is, then what causes the performance drop? 

    It's also unclear why the RADEON 9700 falls behind all versions of the 9500 (marginally but consistently), but only in the fixed T&L (FFT) emulation modes and in VS 1.1. 

    Meanwhile, with more complicated shaders NVIDIA's solutions perform better, especially in the hardware T&L (FFT) modes - NVIDIA's year-old chip is still able to shine in both new and old games. 

  5. Two more shaders, in the order of increasing complexity (one diffuse specular source and three diffuse specular sources): 



  6. The picture repeats itself, except that here the FFT performs better than the vertex shaders for both companies. The RADEON 9700 keeps performing strangely. 

  7. Now look at the dependence on resolution for different degrees of complexity of geometrical calculations: 





  8. Almost no dependence - only a slight one on the simplest model in the highest resolution. This confirms once again that these synthetic tests are precisely and narrowly focused, as intended. 

  9. Now let's check the dependence on the VS version, with the fixed-function fill pipeline or pixel shaders of the respective versions used together with the VS: 



  10. Nothing strange, except the expected performance drop during software emulation of VS 2.0 on the NV25-based solutions, which do not support this version in hardware. 

  11. And the last test is dependence on the model's detail level: 





  12. As expected, the more polygons in a model, the higher the score, but the dependence is quite weak, and from the second detail level on it can be considered sufficient. Interestingly, NVIDIA's chips reach their optimum at the middle detail level (probably the vertex caches and other balancing aspects at work?), while the ATI models keep scaling - they are designed for more complex scenes. The difference in design dates shows that their notions of an ideal scene differ. 
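The vertex-cache hypothesis is easy to probe with a toy model. The sketch below runs a FIFO post-transform cache (the 10-entry size is a made-up figure, not either vendor's) over a triangulated grid and counts how many vertices must actually be re-transformed per triangle:

```python
from collections import deque

def cache_misses(indices, cache_size=10):
    """Count vertices that must be (re)transformed with a FIFO cache."""
    cache, misses = deque(maxlen=cache_size), 0
    for i in indices:
        if i not in cache:
            misses += 1
            cache.append(i)     # deque drops the oldest entry when full
    return misses

def grid_indices(n):
    """Index list for an n x n quad grid, two triangles per quad."""
    idx = []
    for y in range(n):
        for x in range(n):
            a, b = y * (n + 1) + x, y * (n + 1) + x + 1
            c, d = (y + 1) * (n + 1) + x, (y + 1) * (n + 1) + x + 1
            idx += [a, b, c, b, d, c]
    return idx

for n in (4, 16):
    triangles = 2 * n * n
    print(n, cache_misses(grid_indices(n)) / triangles)
```

With this cache the coarse mesh re-transforms about 0.78 vertices per triangle and the dense one over 1.0: once a mesh row no longer fits the cache, reuse between strips is lost. This is one way a fixed cache size can favor a particular detail level.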

Hidden Surface Removal

  1. Support and maximum performance of HSR percentagewise (for the different number of triangles): 



  2. Isn't it shocking? 

    1. HSR doesn't work on either NVIDIA chip! It's disabled by the drivers (though it remains enabled on the NV20). The registry key that allows turning it on and off doesn't work - we failed to enable this function on the NV25. What's the matter? An error in the chip that makes it impossible? 
    2. It does work (HyperZ) in all the RADEON 9700 and RADEON 9500 PRO chips and demonstrates excellent performance. But it is not supported in any RADEON 9500 chips! It's disabled, probably on the driver level again. Why? Maybe to create additional differentiation in real applications? But the more plausible explanation is that these dies have defects, which is why they are used for RADEON 9500 cards. Besides, to lower performance relative to the RADEON 9500 PRO and 9700, such dies have half of their pipelines disabled. In the first part we discussed turning the RADEON 9500 into a RADEON 9700 (9500 PRO) with RivaTuner, i.e. on the software level. Subsequent events showed it's not as smooth as we would like. First of all, not all R9500 chips work without artifacts after the modification: about 28% have bugs indicating problems in the HyperZ unit. Isn't that the unit that controls HSR? I think ATI disables the crippled HSR unit in software, along with half of the pipelines, and then uses such chips for the RADEON 9500. 

    3. There is more food for thought for owners of the RADEON 9500 who want to increase the performance of their cards in software. Is there a way to enable the 4 pipelines without touching HSR? Well, we are working on it. 

  3. Here is the efficiency in comparison with the unsorted scene: 



  4. Even when the scene is initially unsorted there is some gain, best seen with a small number of polygons. So if you want to reap the benefits of HSR (though half of the tested chips have it disabled), sort the scene front to back before rendering - the performance increase can be considerable (several times). On an unsorted scene HSR has an effect, but a modest one (tens of percent). However, portal applications - and most modern FPS engines are among them - do sort scenes before rendering. That is why the game is worth the candle, first of all for games of this class. 

    So, be that as it may, we see again that most of the accelerators tested today have HSR forcibly disabled. It turns out that when you buy a RADEON 9500 and read about HyperZ in its specification, or buy a GeForce4 Ti 4600 and read about EarlyZ Cull, you are being deceived. 
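The mechanism behind the sorting advice is simple to demonstrate. The toy model below counts shaded fragments for one pixel covered by five surfaces, drawn in the two opposite orders (a sketch of early Z rejection in general, not of any particular chip):

```python
# Toy early-Z model: count how many fragments get shaded when the same
# stack of overlapping surfaces is drawn front-to-back vs. back-to-front.

def shaded_fragments(depths_in_draw_order):
    zbuffer = float("inf")
    shaded = 0
    for z in depths_in_draw_order:      # one fragment per surface at this pixel
        if z < zbuffer:                 # shade only if nearer than the Z-buffer
            zbuffer = z
            shaded += 1
    return shaded

layers = [1, 2, 3, 4, 5]                # five surfaces covering one pixel
print(shaded_fragments(sorted(layers)))                 # front-to-back: 1
print(shaded_fragments(sorted(layers, reverse=True)))   # back-to-front: 5
```

Front to back, only the nearest surface is ever shaded; back to front, every layer is shaded and then overwritten - which is exactly the several-fold gap between the sorted and unsorted cases.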

  5. Dependence on resolution: 



  6. The conclusion is that HSR works best in low resolutions. The explanation is simple: the rejected blocks usually have a fixed size (say, 8x8), so in higher resolutions the number of blocks that must be rejected for the same entirely hidden triangle is greater, and HSR efficiency decreases. This has an effect even with the RADEON's hierarchical Z-buffer. Perhaps in future accelerators the developers should use several base block sizes, switching them per resolution, or simply increase the block size - forcing users of new accelerators toward LCD monitors at 1280x1024 or higher. 
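The block-count argument is plain arithmetic, sketched below (the 8x8 block is the example size from the text; the 50% coverage figure is an arbitrary assumption):

```python
from math import ceil

def blocks_to_reject(width, height, coverage=0.5, block=8):
    """Fixed-size blocks the HSR unit must test for a hidden region
    covering `coverage` of the frame."""
    return ceil(width * height * coverage / (block * block))

for w, h in [(640, 480), (1024, 768), (1600, 1200)]:
    print(f"{w}x{h}: {blocks_to_reject(w, h)} blocks")
```

The same hidden geometry costs 2400 block tests at 640x480 but 15000 at 1600x1200, so per-frame HSR overhead grows linearly with resolution.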

  7. Now, to compare FPS visually, look at the data we used in the beginning to calculate the effectiveness percentagewise: 





Pixel Shading

This test is ATI-only, because hardware PS 2.0 support is its minimum requirement. On the good old GeForce4 Ti 4600 coupled with a 2 GHz Pentium 4, software emulation of PS 2.0 yields approximately one frame per two seconds, in the minimal window. 

  1. Test: 



  2. Well, the core clock and the number of pipelines come first. The memory bus has little influence, and only on some shaders (the first and the seventh). 200 fps against 0.5 is a good argument, at least where games use the new pixel shader version. We'll see how the GeForce FX performs. 

  3. Dependence on resolution: 





  4. Well, this is a telling dependence. The memory bus has almost no effect, for the reasons mentioned above: for more or less complicated pixel shaders the main parameters are core frequency and the number of pipelines. The shift from filling to computation we were promised with the advent of DX9 is plainly visible. 
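The "filling to calculations" shift can be expressed as a simple model in which the shader's arithmetic, not memory, is the bottleneck (the pipe count and clock follow the cards discussed; the 40-cycle shader length is an illustrative assumption):

```python
# Fill model for a long, ALU-bound pixel shader: memory bandwidth does
# not appear anywhere in the formula.

def shader_fps(pipes, clock_hz, cycles_per_pixel, width, height):
    """Frame rate when a pixel shader is purely arithmetic-bound."""
    pixels_per_second = pipes * clock_hz / cycles_per_pixel
    return pixels_per_second / (width * height)

# e.g. 8 pipes at 325 MHz running a 40-cycle PS 2.0 shader at 1024x768
fps = shader_fps(8, 325e6, 40, 1024, 768)
print(f"{fps:.0f} fps")
```

In this regime fps scales with pipes times clock and drops linearly with pixel count - exactly the behavior the diagrams show.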

Point Sprites

  1. With lighting and without, depending on the size: 



  2. As expected, whether lighting is on or off matters only for small sprites; after that the fillrate is the limit (starting at size 8). So for rendering systems consisting of a great number of particles, the optimal size is less than 8. By the way, NVIDIA does better here than ATI: its performance drop is less noticeable, and up to size 8 it can be considered monotonous and rather small, while ATI falters between sizes 4 and 8. The peak values are reached without lighting and come to a bit over 20 million sprites per second for the RADEON 9700 PRO and a bit over 10 million for the GeForce4 Ti 4600. 

    At sizes 2 and 4, both ATI and NVIDIA are limited only by geometry performance, which for simple workloads should be twice as high for ATI - and this is exactly what we see. 

    Also note that point sprites are no cure-all: the figures are quite close to those obtainable with ordinary polygons. However, point sprites are handier for programmers, first of all for all sorts of particle systems. 
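The crossover between the geometry limit and the fill limit can be sketched like this (the ~20M sprites/s peak is the figure measured above; the 1.2 Gpix/s effective fillrate is a hypothetical number chosen so the crossover lands near size 8, as observed):

```python
def sprites_per_second(size, geometry_limit, fillrate):
    """Whichever bound bites first: geometry, or filling size^2 pixels."""
    fill_limit = fillrate / (size * size)
    return min(geometry_limit, fill_limit)

GEOMETRY_LIMIT = 20e6   # ~20M sprites/s, measured peak (RADEON 9700 PRO)
FILLRATE = 1.2e9        # hypothetical effective pixel fillrate, pixels/s

for size in (2, 4, 8, 16):
    rate = sprites_per_second(size, GEOMETRY_LIMIT, FILLRATE)
    print(f"size {size}: {rate / 1e6:.1f}M sprites/s")
```

Below the crossover size the sprite rate is flat at the geometry peak; above it, the rate falls as 1/size^2, which matches the shape of the curves.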

  3. Let's see how animation impacts the performance: 



  4. Well, the contribution of the animation is not great but noticeable, irrespective of the VS version. 

  5. At last, there are detailed diagrams of dependence on resolution for three sizes: 





  6. As expected, the dependence is insignificant; it is most noticeable for small sprites. 

Well, our first extended tests of these cards in synthetic DX9 benchmarks from the RightMark 3D suite are over. 

Conclusion

  1. I must admit that the results perfectly correlate with the theoretical peak values and are almost independent of other subsystems of accelerators. 
  2. It's strange that the performance of the ATI cards falls in case of loops in the vertex shaders 2.0. 
  3. It's also unexpected that HSR is locked on both GeForce4 Ti cards and on all RADEON 9500 cards. This issue was never brought up before. 
  4. The lack of HSR in the RADEON 9500, together with the character of the artifacts on some cards converted into RADEON 9700s, tells us that some RADEON 9500 cards use dies with a crippled HSR unit. On one hand this benefits ATI (less waste), but on the other hand it tricks users, because every RADEON 9500 specification mentions the HyperZ technology, which controls HSR as well. 

Well, it's not the last time we are testing the RADEON 9500-9700 in DX9, we will return if the new drivers deserve it, and now we are preparing for... 




Well, I think you know what I mean :-) 
 
 

Alexander Medvedev (unclesam@ixbt.com)  
Andrey Vorobiev (anvakams@ixbt.com)



Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.