AMD(ATI) RADEON Graphics Cards Reference Information

Reference Information on RADEON R[V]4XX Graphics Cards
Reference Information on RADEON R[V]5XX Graphics Cards
Reference Information on RADEON R[V]6XX Graphics Cards

Specifications of R[V]4XX and some RV3XX GPUs

Code name	R481	R480	R430	R423	R420	RV410	RV380	RV370
Baseline Article	here					here	here
Fabrication Process (nm)	130		110	130		110	130	110
Transistors (M)	160					120	75
Pixel Processors	16					8	4
Texture Units	16					8	4
Blending Units	16					8	4
Vertex Processors	6						2
Memory Bus	256 (64x4)					128 (64x2)
Memory Types	DDR, DDR2, GDDR3						DDR, DDR2
System Bus	AGP 8x	PEG 16x			AGP 8x	PEG 16x	PEG 16x
RAMDAC	2 × 400 MHz
Interfaces	TV-Out; TV-In (a video capture chip is required); 2 × DVI (external interface chips are required)
Vertex Shaders	2.0
Pixel Shaders	2.0b						2.0
Precision of pixel calculations	FP24
Precision of vertex calculations	FP32
Texture component formats	FP32, FP16 (without filtering); I8; DXTC*, S3TC; 3Dc
Rendering formats	FP32 and FP16 (without blending and MSAA); I8
MRT	available
Antialiasing	2x, 4x, and 6x MSAA; pseudo-random arrangement of samples on the 12×12 grid
Z generation	1x in Z-only mode, 2x in MSAA mode
Stencil buffer	Double-sided						Regular
Shadow technologies	No special technologies

Specifications of reference cards based on R[V]4XX and RV3XX

Card	Chip, bus	PS/TMU/VS units	Core clock (MHz)	Memory clock, actual (effective) (MHz)	Memory capacity (MB) and type	Memory bandwidth (GB/s) (bus width, bits)	Texel rate (Mtex/s) and fill rate (Mpix/s)
RADEON X800 PRO	R420 AGP	12/12/6	475	450(900)	256 GDDR3	28.8 (256)	5700
RADEON X800 XT PE	R420 AGP	16/16/6	520	560(1120)	256 GDDR3	35.8 (256)	8320
RADEON X300	RV370 PEG16x	4/4/2	325	200(400)	128 DDR	6.4 (128)	1300
RADEON X300 SE	RV370 PEG16x	4/4/2	325	200(400)	128 DDR	3.2 (64)	1300
RADEON X600 PRO	RV380 PEG16x	4/4/2	400	300(600)	128 DDR	9.6 (128)	1600
RADEON X600 XT	RV380 PEG16x	4/4/2	500	370(740)	128 DDR	11.8 (128)	2000
RADEON X800 XT	R420 AGP	16/16/6	500	500(1000)	256 GDDR3	32.0 (256)	8000
RADEON X800 XT	R423 PEG16x	16/16/6	500	500(1000)	256 GDDR3	32.0 (256)	8000
RADEON X700 [LE]	RV410 PEG16x	8/8/6	400	350(700)	128 GDDR3	11.2 (128)	3200
RADEON X700 PRO	RV410 PEG16x	8/8/6	425	430(860)	256 GDDR3	13.8 (128)	3400
RADEON X700 XT	RV410 PEG16x	8/8/6	475	525(1050)	128 GDDR3	16.8 (128)	3800
RADEON X800 SE	R420 AGP	8/8/6	425	400(800)	256 GDDR3	25.6 (256)	3400
RADEON X800	R430 PEG16x	12/12/6	400	350(700)	256 GDDR3	22.4 (256)	4800
RADEON X800 XL	R430 PEG16x	16/16/6	400	500(1000)	256 GDDR3	32.0 (256)	6400
RADEON X850 PRO	R480 PEG16x	12/12/6	507	520(1040)	256 GDDR3	33.3 (256)	6084
RADEON X850 XT	R480 PEG16x	16/16/6	520	540(1080)	256 GDDR3	34.6 (256)	8320
RADEON X850 XT PE	R480 PEG16x	16/16/6	540	590(1180)	256 GDDR3	37.8 (256)	8640
RADEON X700 LE	RV410 AGP	8/8/6	400	350(700)	128 GDDR3	11.2 (128)	3200
RADEON X700 PRO	RV410 AGP	8/8/6	425	430(860)	256 GDDR3	13.8 (128)	3400
RADEON X800	R430 AGP	12/12/6	400	350(700)	256 GDDR3	22.4 (256)	4800
RADEON X800 XL	R430 AGP	16/16/6	400	500(1000)	256 GDDR3	32.0 (256)	6400
RADEON X850 PRO	R481 AGP	12/12/6	507	520(1040)	256 GDDR3	33.3 (256)	6084
RADEON X850 XT	R481 AGP	16/16/6	520	540(1080)	256 GDDR3	34.6 (256)	8320
RADEON X850 XT PE	R481 AGP	16/16/6	540	590(1180)	256 GDDR3	37.8 (256)	8640
RADEON X300 SE 128 (HM)	RV370 PEG16x	4/4/2	325	300(600)	32 DDR	4.8 (64)	1300
RADEON X300 SE 256 (HM)	RV370 PEG16x	4/4/2	325	300(600)	128 DDR	4.8 (64)	1300
RADEON X740 XL	RV410 PEG16x	8/8/6	425	450(900)	128 GDDR3	14.4 (128)	3400
RADEON X700 SE	RV410 AGP	4/4/6	400	250(500)	128 GDDR3	4.0 (64)	1600
RADEON X550	RV370 PEG16x	4/4/2	400	250(500)	128 DDR	8.0 (128)	1600
RADEON X800 XL	R430 PEG16x	16/16/6	400	500(1000)	512 GDDR3	32.0 (256)	6400
RADEON X850 XT CFE	R480 PEG16x	16/16/6	520	540(1080)	256 GDDR3	34.6 (256)	8320
RADEON X800 CFE	R430 PEG16x	16/16/6	400	500(1000)	128/256 GDDR3	32.0 (256)	6400
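
The bandwidth and rate columns in the table above follow from the clocks, bus width, and pipeline count. A minimal sketch of that arithmetic, assuming double-data-rate memory (two transfers per clock), decimal gigabytes, and one pixel and one texel per pipeline per clock; the function names are illustrative and not taken from any ATI tool:

```python
def memory_bandwidth_gb_s(mem_clock_mhz: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s, assuming two transfers per memory clock."""
    effective_mhz = mem_clock_mhz * 2            # e.g. 560 MHz -> 1120 MHz effective
    bytes_per_transfer = bus_width_bits / 8      # 256 bits -> 32 bytes
    return effective_mhz * bytes_per_transfer / 1000


def fill_rate_mpix_s(pixel_pipelines: int, core_clock_mhz: float) -> float:
    """Peak fill rate in Mpix/s, assuming one pixel per pipeline per clock.

    With one TMU per pipeline the texel rate in Mtex/s is the same number.
    """
    return pixel_pipelines * core_clock_mhz


if __name__ == "__main__":
    # Values for two rows of the table: (name, pipelines, core MHz, memory MHz, bus bits)
    cards = [
        ("RADEON X800 XT PE", 16, 520, 560, 256),
        ("RADEON X700 XT", 8, 475, 525, 128),
    ]
    for name, pipes, core, mem, bus in cards:
        print(name,
              round(memory_bandwidth_gb_s(mem, bus), 1), "GB/s,",
              fill_rate_mpix_s(pipes, core), "Mpix/s")
```

These are theoretical peaks; real throughput depends on memory efficiency and shader load.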

Details: R420, RADEON X800 Series

R420 Specifications

Codename: R420
Fabrication Process: 130 nm (TSMC, low-k, copper connections)
160 million transistors
FC package (flip-chip, flipped chip without a metal cap)
256 bit memory interface
Up to 512 MB of DDR/DDR2/GDDR3 memory
AGP 8x bus interface (there is also a PCI-Express modification of the chip - R423)
16 pixel processors, each with one texture unit
6 vertex processors
Calculating, blending, and writing up to 16 full (color, depth, stencil buffer) pixels per clock
Calculating and writing up to 32 Z-values and stencil values per clock
MSAA 2x/4x/6x, with flexibly programmable sample patterns. Compression of frame and Z buffers in MSAA modes. MSAA patterns can be changed from frame to frame (Temporal AA)
16x Anisotropic Filtering
Everything necessary to support Pixel and Vertex Shaders 2.0
Additional features of pixel shaders based on Enhanced Version 2.0 - 2.0.b
Additional features of vertex shaders, besides the basic 2.0 ones
The new texture compression technique, optimized for compressing two-component normal maps (so called 3Dc, 4:1 compression ratio)
Rendering into a floating-point frame buffer, FP16 and FP32 per component, no blending
3D and FP texture formats without texture filtering
Support for a two-sided stencil buffer
MRT (Multiple Render Targets — rendering into several buffers)
2 × RAMDAC 400 MHz
2 × DVI interfaces
TV-Out and TV-In (interface chips are required)
Programmable video processing - pixel processors are used to process the video stream (compression, decompression, and post processing tasks)
2D accelerator supporting all GDI+ functions

R420 Flow Chart

An attentive reader will notice right away that this flow chart matches that of NV40 almost completely. There is nothing surprising about it: both companies are trying to design an optimal solution, and the general organization of the graphics pipeline has proved effective over several generations. The significant differences are hidden inside the units, first of all in the pixel and vertex processors.

Like NV40, the chip has six vertex processors and four independent pixel processors, each working with one quad (a 2x2 pixel fragment). Unlike NV40, there is probably only one texture caching level. Each of the four quad processors can be disabled independently. Thus, the manufacturer can lock one, two, or even three processors, depending on market demand and the number of partially defective chips, to produce video cards that process 4, 8, 12, or 16 pixels per clock.

As usual, let's examine the most interesting parts in more detail:

Vertex processors and sampling

Here is the flow chart of the R420 vertex processor:

The processor itself is indicated by a yellow rectangle; the other units are shown to complete the picture. The R420 is declared to have six independent processors (visualize the yellow unit copied six times). The vertex units comply neither with the full VS 3.0 specification (there is no texture access or dynamic branching) nor with the extended 2.0 specification as NVIDIA understands it (the so-called VS 2.0.a, which implies support for predicates and dynamic branching). As for arithmetic performance, a vertex processor in the R420, like the one in NV40, can execute one vector operation (up to four FP32 components) and one scalar FP32 operation simultaneously per clock.

Here is a summary table with parameters of vertex processors in modern accelerators from the point of view of vertex shaders in DirectX 9 API:

Vertex Shader Model	2.0 (R3XX, R42X)	2.a (NV3X)	3.0 (NV4X, G7X, R5XX)
Instructions in shader code	256	256	over 512
The number of executed instructions	65535	65535	over 65535
Predicates	No	Available	Available
Temporary Registers	12	13	32
Constant Registers	over 256	over 256	over 256
Static Branching	Yes	Yes	Yes
Dynamic Branching	No	Yes	Yes
Nesting depth of dynamic branching	No	24	24
Texture Sampling	No	No	Available (4)

Another interesting aspect analyzed in our reviews is FFP (T&L) emulation performance. Remember that the R3XX was outperformed by NVIDIA chips in many respects because it lacked the special hardware units for calculating lighting that had accelerated T&L emulation in three generations of NVIDIA chips.

Pixel processors and the fill process

Let's analyze the pixel architecture of the R420 in the order of the data flow. That's what we get after setting up triangle parameters:

We'll dwell on the most interesting facts. Firstly, while the R3XX had at most two quad processors, each processing a block of four pixels (2x2) per clock, there are now four such processors. They are completely independent, and each of them can be excluded from operation (for example, to create a light version of the chip with three processors if one of them is defective).

Note that this flow chart resembles that of the NV40 in many respects, but there are also fundamental differences, which we shall examine in more detail. First, a triangle is divided into first-level blocks (8x8 or 4x4 pixels, depending on the rendering resolution), and the first stage of culling invisible blocks takes place based on the data in the on-chip mini Z buffer. Its size is not published, but to all appearances it is slightly less than 200 KB in the R420. Up to four blocks can be culled per clock at this stage, that is, up to 256 invisible pixels.
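
The coarse culling stage can be illustrated with a small sketch. It assumes the common hierarchical Z convention (smaller Z is nearer, each block remembers the farthest Z already written to it, and an incoming block is rejected only when its nearest point lies behind that value). The actual comparison logic of the R420 is not published, so all names and the data layout here are hypothetical:

```python
BLOCK_SIZE = 8          # first-level blocks are 8x8 (or 4x4) pixels
BLOCKS_PER_CLOCK = 4    # up to four blocks can be rejected per clock


def max_culled_pixels_per_clock(block_size: int = BLOCK_SIZE,
                                blocks: int = BLOCKS_PER_CLOCK) -> int:
    """Upper bound on invisible pixels rejected per clock at the coarse stage."""
    return blocks * block_size * block_size


def block_is_hidden(triangle_nearest_z: float, block_farthest_z: float) -> bool:
    """Conservative test: hidden only if the nearest point of the incoming
    triangle lies behind everything already stored in the block."""
    return triangle_nearest_z > block_farthest_z


def surviving_blocks(incoming: dict, mini_z: dict, far_plane: float = 1.0) -> list:
    """Return block coordinates that pass the coarse test.

    incoming: {(bx, by): nearest Z of the triangle inside that block}
    mini_z:   {(bx, by): farthest Z already stored for that block}
    Blocks absent from mini_z are treated as cleared to the far plane.
    """
    return [xy for xy, z in incoming.items()
            if not block_is_hidden(z, mini_z.get(xy, far_plane))]


if __name__ == "__main__":
    print(max_culled_pixels_per_clock())
    print(surviving_blocks({(0, 0): 0.9, (1, 0): 0.2},
                           {(0, 0): 0.5, (1, 0): 0.5}))
```

Blocks that survive this test are not necessarily visible; they are simply passed on to the finer stages.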

Then follows the second division stage, this time into 2x2 quads. Completely occluded quads are culled based on the L2 Z buffer (2x2 granularity) stored in video memory. Depending on the MSAA mode, one element of this buffer may correspond to 4 (no MSAA), 8 (2x MSAA), 16 (4x MSAA), or even 24 (6x MSAA) samples in the frame buffer. Hence it is separated into a structure of its own, an intermediate level between the on-chip mini Z buffer and the final base-level Z buffer. Thus, NVIDIA products have a two-level hierarchy of HSR and Z buffers, while ATI products offer three levels.
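
The counts 4, 8, 16, and 24 are simply the four pixels of a quad multiplied by the number of MSAA samples per pixel. A tiny sketch of that relation (the function name is illustrative):

```python
QUAD_PIXELS = 2 * 2  # one L2 Z buffer element covers a 2x2 quad


def frame_buffer_values_per_l2_element(msaa_samples: int = 1) -> int:
    """Number of base-level Z values summarized by one L2 element.

    msaa_samples is 1 when MSAA is off, otherwise 2, 4, or 6.
    """
    if msaa_samples not in (1, 2, 4, 6):
        raise ValueError("R420 supports no MSAA, 2x, 4x, or 6x")
    return QUAD_PIXELS * msaa_samples


if __name__ == "__main__":
    for mode in (1, 2, 4, 6):
        print(mode, frame_buffer_values_per_l2_element(mode))
```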

Then quads are set up and distributed among the active pixel processors. Now about the most important differences between R420 and NV40:

Algorithm of NVIDIA's pixel processor:

Cycle of shader commands
    Read microcode of the next command
    Configure a texture module and all ALUs
    Cycle of all the quads in queue
        Run a quad through a processor, TMU, and ALU
    End of quad cycle
End of shader command cycle

Algorithm of ATI's pixel processor:

Cycle of four phases
    Cycle of all the quads in queue
        Cycle of sampled textures in this phase (up to 8)
            Sample a texture
        End of the texture cycle
        Cycle of math commands in this phase (up to 128)
            Execute a command
        End of the math command cycle
    End of quad cycle
End of cycle of the four phases

So, NVIDIA executes commands one after another (to be more exact, superscalar batches of commands, including texture sampling commands), driving all the quads to be processed through each command. ATI divides a shader into four phases (hence the limit of four on the depth of dependent samples). Each phase starts with sampling all textures necessary for this phase, followed by calculations over the data obtained, including calculations of new coordinates for texture sampling in the next phase.
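
The difference between the two schedules is easiest to see as loop nesting. The sketch below only records the order in which (quad, operation) pairs are visited; it models neither timing nor the superscalar batching mentioned above, and the function names and the representation of a shader as lists of strings are assumptions made for illustration:

```python
MAX_PHASES = 4
MAX_TEXTURES_PER_PHASE = 8
MAX_MATH_PER_PHASE = 128


def nvidia_order(commands, quads):
    """Command-major order: every queued quad is run through each command."""
    trace = []
    for command in commands:          # cycle of shader commands
        for quad in quads:            # cycle of all the quads in queue
            trace.append((quad, command))
    return trace


def ati_order(phases, quads):
    """Phase-major order: within a phase, each quad samples its textures
    and then executes the math commands of that phase."""
    if len(phases) > MAX_PHASES:
        raise ValueError("shader needs more than four phases")
    trace = []
    for textures, math_ops in phases:             # cycle of four phases
        if (len(textures) > MAX_TEXTURES_PER_PHASE
                or len(math_ops) > MAX_MATH_PER_PHASE):
            raise ValueError("phase exceeds the per-phase limits")
        for quad in quads:                        # cycle of all the quads
            for texture in textures:              # texture sampling first
                trace.append((quad, texture))
            for op in math_ops:                   # then the math commands
                trace.append((quad, op))
    return trace


if __name__ == "__main__":
    quads = ["q0", "q1"]
    print(nvidia_order(["tex0", "mul"], quads))
    print(ati_order([(["tex0"], ["mul"])], quads))
```

In the first ordering a quad returns to the pipeline once per command; in the second it stays in the processor for a whole phase, which is why the phase count bounds the depth of dependent texture reads.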

Which approach is better? We cannot say for sure. ATI's approach is ill-suited to complex shaders with instruction flow control or multiple dependent samples. On the other hand, calculations within each of the four phases are performed in a CPU-like way: all instructions are executed one by one for one quad, then the next quad is processed, and so on. Thus, unlike the NV40, we can use the full pool of temporary registers without any performance loss, with no penalty for using more than four registers during calculations. Besides, ATI's approach requires fewer pipeline stages. Consequently, fewer transistors are spent and potentially higher clocks can be reached (or, in other words, the yield of working chips at a fixed clock is higher). Shader performance is easily predictable. It is also easier to program for when you don't have to take care of evenly grouping texture and math commands or of the consumption of temporary registers.

Among the disadvantages are numerous limitations: a limit on the number of dependent samples, a limit on the number of commands in one phase, and the requirement to store the entire microcode of the shader for all four phases "at hand", that is, right in the pixel processor. There are also potential latencies in case of intensive successive dependent texture sampling (they are partly hidden by the set of simultaneously processed quads, but their number is not as large as in NVIDIA chips).

In fact, ATI's approach is optimal for Shaders 2.0, without dynamic flow control and with seriously limited code length. Any attempt to add unlimited shader length, and especially unlimited flexibility of texture sampling, to this pixel architecture inevitably runs into problems.

The flow chart of the pixel processor depicts the F-buffer mechanism: writing and restoring the values of temporary shader variables. This trick makes it possible to execute shaders that exceed the pixel processor limits on length or on the number of dependent (as well as regular) texture samples, at the cost of additional passes. It is not a "free of charge" solution and is far from ideal. As a shader grows more complex, the number of passes and the volume of data stored temporarily in video memory will grow as well. This will be accompanied by growing penalties compared to NVIDIA-like architectures, which are not limited by the length or complexity of shaders.
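
How the cost grows with shader length can be estimated with a rough sketch. It assumes the shader is split only by instruction count and that every boundary between passes needs one write and one restore of the live temporaries; the real splitting rules of the driver are not described here, and the limit of 512 used in the example is only the 2.0.b code length from the summary table further down, used as an illustrative value:

```python
import math


def fbuffer_passes(instruction_count: int, per_pass_limit: int) -> int:
    """Number of passes needed when a shader is cut into chunks that each
    fit the pixel processor limit (hypothetical splitting by length only)."""
    if per_pass_limit <= 0:
        raise ValueError("per_pass_limit must be positive")
    return max(1, math.ceil(instruction_count / per_pass_limit))


def spill_round_trips(passes: int) -> int:
    """Each boundary between passes stores temporaries to video memory
    and reads them back at the start of the next pass."""
    return max(0, passes - 1)


if __name__ == "__main__":
    for length in (300, 512, 1300):
        passes = fbuffer_passes(length, 512)
        print(length, passes, spill_round_trips(passes))
```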

But let's get back to architectural peculiarities of the pixel processors in R420. The processors handle data in FP24 format, but operations with texture addresses, performed when a TMU samples textures, use higher precision. There are two ALUs per pixel. Each ALU can execute two different operations (the 3+1 scheme, as in R3XX, although that GPU has only one such ALU). You can read about it in DX Current. Arbitrary masking and post-operation component rearrangement (swizzling) are not supported beyond what Shaders 2.0 and the slightly longer Shaders 2.0.b allow.

Thus, depending on shader code, the GPU can execute from one to four different FP24 operations per clock over vectors (of up to three components) and scalars, plus a single access to data that has already been sampled from a texture in this phase. Performance of this tandem directly depends on the compiler and the code. But we obviously have:

Minimum: one access to a texture sample per clock
Minimum: two operations per clock without a texture access
Maximum: four operations per clock without a texture access
Maximum: four operations per clock with a texture access
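
These bounds can be reproduced with a simplified issue model. It assumes two ALUs per pixel, each able to pair one 3-component vector operation with one scalar operation per clock, and it treats all operations as independent; dependencies and texture accesses are ignored, so this is an upper-bound sketch rather than a description of the real scheduler:

```python
import math

ALUS_PER_PIXEL = 2  # each ALU co-issues one vec3 op and one scalar op (3+1)


def clocks_needed(vector_ops: int, scalar_ops: int,
                  alus: int = ALUS_PER_PIXEL) -> int:
    """Clocks to issue the given mix when every op is independent."""
    return max(math.ceil(vector_ops / alus), math.ceil(scalar_ops / alus))


def ops_per_clock(vector_ops: int, scalar_ops: int,
                  alus: int = ALUS_PER_PIXEL) -> float:
    """Average operations issued per clock for the given mix."""
    clocks = clocks_needed(vector_ops, scalar_ops, alus)
    return (vector_ops + scalar_ops) / clocks if clocks else 0.0


if __name__ == "__main__":
    print(ops_per_clock(4, 0))   # vector-only code cannot use the scalar slots
    print(ops_per_clock(2, 2))   # perfectly paired code fills all four slots
```

Under this model, code that cannot be paired reaches two operations per clock, while a balanced mix reaches four.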

The peak variant exceeds the NV40 capacities. But let's not forget that this solution is actually less flexible (always the 3+1 scheme) from the point of view of combining commands into superscalar batches during compilation. Computing efficiency of the new pipelines has doubled in comparison with R3XX. Besides, there are twice as many of them, and they operate at a higher frequency, so this GPU has a clear advantage over the previous generation.

All new improvements, such as longer shaders and new registers, are available in new Shaders 2.0.b. Let's have a look at the summary table of various shader features:

Pixel Shader Model	2.0 (R3XX)	2.a (NV3X)	2.b (R4XX)	3.0 (NV4X/G7X, R5XX)
Texture sampling nesting up to	4	Not limited	4	Not limited
Texture sampling up to	32	Not limited	Not limited	Not limited
Shader code length	32 + 64	512	512	over 512
Shader instructions	32 + 64	512	512	over 65535
Interpolators	2 + 8	2 + 8	2 + 8	10
Predicates	not available	available	not available	available
Temporary Registers	12	22	32	32
Constant Registers	32	32	32	224
Arbitrary component rearrangement	not available	available	not available	available
Gradient instructions (DDX/DDY)	not available	available	not available	available
Nesting depth of dynamic branching	not available	not available	not available	24

Let's return to the flow chart of pixel processors. Pay attention to the bottom part. There you can see the units responsible for comparing and modifying color, transparency, Z, and stencil values, as well as for MSAA. While NV40 can generate up to four MSAA samples from a single pixel, R420 generates up to six. As in NV40, throughput for Z and stencil values is double the base fill rate: 32 values per clock. Correspondingly, 2x MSAA suffers no performance penalty, while 4x and 6x take 2 and 3 clocks respectively. However, this penalty is not noticeable and no longer plays an important role once pixel shaders are at least several commands long. Memory bandwidth becomes more important. Of course, both color and Z values are compressed in MSAA modes. In the optimal case, the compression ratio reaches the number of MSAA samples, that is, 6:1 in 6x MSAA mode.
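
The clock counts for the MSAA modes follow from the doubled Z/stencil rate: 32 values per clock against 16 pixels per clock is two samples per pixel per clock. A minimal sketch of that arithmetic, with illustrative function names:

```python
import math

PIXELS_PER_CLOCK = 16
Z_VALUES_PER_CLOCK = 32
SAMPLES_PER_PIXEL_PER_CLOCK = Z_VALUES_PER_CLOCK // PIXELS_PER_CLOCK  # 2


def msaa_clocks_per_pixel(samples: int) -> int:
    """Clocks the Z/stencil units need to produce all samples of a pixel."""
    return max(1, math.ceil(samples / SAMPLES_PER_PIXEL_PER_CLOCK))


def best_case_compression_ratio(samples: int) -> int:
    """In the optimal case the ratio equals the number of samples."""
    return samples


if __name__ == "__main__":
    for samples in (1, 2, 4, 6):
        print(samples, msaa_clocks_per_pixel(samples),
              best_case_compression_ratio(samples))
```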

Unlike NV40, which uses RGMS (rotated grid multisampling), R420 (like all R3XX chips) supports pseudo-stochastic MSAA patterns on a base 8x8 grid. As a result, antialiasing quality of edges and inclined lines in the maximum modes is objectively higher. New drivers offer the so-called Temporal AA, which consists in changing patterns from frame to frame. Thus, if your eyes or a slow-response LCD monitor have no problems averaging neighboring frames (no excessive flickering), the antialiasing quality will improve as if more MSAA samples were used. There is no performance drop, but the effect may vary depending on the monitor and the frame rate in the application.
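
The idea behind Temporal AA can be shown with a one-dimensional toy model: a vertical edge crosses a pixel, and coverage is estimated from sample positions inside the pixel. The two 2-sample patterns below are made-up positions, not the patterns used by the driver; the point is only that averaging two frames rendered with different patterns gives the same estimate as one frame with the combined 4-sample pattern:

```python
# Hypothetical horizontal sample offsets inside a unit-wide pixel.
PATTERN_A = (0.125, 0.625)
PATTERN_B = (0.375, 0.875)


def coverage(edge_x: float, sample_xs) -> float:
    """Fraction of samples lying to the left of a vertical edge at edge_x."""
    return sum(x < edge_x for x in sample_xs) / len(sample_xs)


def temporal_average(edge_x: float, patterns=(PATTERN_A, PATTERN_B)) -> float:
    """Coverage perceived when consecutive frames alternate patterns and
    the eye (or a slow display) averages them."""
    return sum(coverage(edge_x, p) for p in patterns) / len(patterns)


if __name__ == "__main__":
    edge = 0.3
    print(coverage(edge, PATTERN_A))                # one frame, pattern A
    print(coverage(edge, PATTERN_B))                # next frame, pattern B
    print(temporal_average(edge))                   # average of both frames
    print(coverage(edge, PATTERN_A + PATTERN_B))    # single 4-sample frame
```

If frames are not averaged well (low frame rate, fast display), the alternating estimates show up as flicker instead, which matches the caveat in the text.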

Technological Innovations in the R420

Here are two main innovations in R4XX versus R3XX (the increased number of temporary registers and longer shaders in the pixel processor are evolutionary rather than revolutionary changes):

The new F-buffer algorithm, which makes it possible to skip a given pass of a divided pixel shader for pixels that do not need it. It can significantly optimize performance of pixel shaders with conditions and branches in OpenGL, executed in several passes via the F-buffer.
The new 3Dc texture compression method, intended for compressing two-component normal maps. Traditional texture compression methods are intended for regular textures: their lossy compression takes into account the peculiarities of human vision. However, they are not suitable for compressing normal maps, which are essentially tables of vectors.
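
A two-component normal map stores only X and Y of a unit normal; the shader restores Z from the unit-length condition. The sketch below shows that reconstruction and the block-size arithmetic behind the 4:1 figure. The block layout used (4x4 texels, two 8-bit endpoints plus sixteen 3-bit indices per channel) is background knowledge about the format rather than something stated in this article, and the 4:1 ratio assumes a 32-bit uncompressed texel:

```python
import math


def reconstruct_normal(x: float, y: float):
    """Restore the third component of a unit normal from X and Y in [-1, 1]."""
    z_squared = max(0.0, 1.0 - x * x - y * y)   # clamp against rounding error
    return (x, y, math.sqrt(z_squared))


def block_bits_3dc(texels: int = 16, channels: int = 2) -> int:
    """Bits per compressed 4x4 block: per channel, two 8-bit endpoints
    plus one 3-bit index per texel."""
    per_channel = 2 * 8 + texels * 3
    return channels * per_channel


def compression_ratio(uncompressed_bits_per_texel: int = 32,
                      texels: int = 16) -> float:
    """Ratio of uncompressed block size to compressed block size."""
    return uncompressed_bits_per_texel * texels / block_bits_3dc(texels)


if __name__ == "__main__":
    print(reconstruct_normal(0.0, 0.0))
    print(block_bits_3dc(), compression_ratio())
```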

Details: RV410, RADEON X700 series

RV410 Specifications

Codename: RV410
Fabrication Process: 110 nm (TSMC, low-k, copper connections)
120 million transistors
FC package (flip-chip, flipped chip without a metal cap)
128 bit memory interface (dual channel controller)
Up to 256 MB of DDR/DDR2/GDDR3 memory
Built-in PCI-Express x16 bus interface
8 pixel processors, each with one texture unit
6 vertex processors
Calculating, blending, and writing up to 8 full (color, depth, stencil buffer) pixels per clock
Calculating and writing up to 16 Z-values and stencil values per clock
MSAA 2x/4x/6x, with flexibly programmed sample patterns. Compression of frame and Z buffers in MSAA modes. MSAA patterns can be changed from frame to frame (Temporal AA)
16x Anisotropic Filtering
Everything necessary to support Pixel and Vertex Shaders 2.0
Additional features of pixel shaders based on Enhanced Version 2.0 - 2.0.b
The new texture compression technique, optimized for compressing two-component normal maps (so called 3Dc, 4:1 compression ratio)
Rendering into a floating-point frame buffer, FP16 and FP32 per component, no blending
3D and FP texture formats without texture filtering
Support for a two-sided stencil buffer
MRT (Multiple Render Targets — rendering into several buffers)
2 × RAMDAC 400 MHz
2 × DVI interfaces
TV-Out and TV-In (the latter requires an interface chip)
Programmable video processing - pixel processors are used to process the video stream (compression, decompression, and post processing tasks)
2D accelerator supporting all GDI+ functions

Specifications of the Reference RADEON X700XT

Core clock: 475 MHz
Effective memory clock: 1.05 GHz (2*525 MHz)
128-bit memory bus
Memory type: GDDR3
Memory capacity: 128 or 256 MB
Memory bandwidth: 16.8 GB/sec.
Theoretical fillrate: 3.8 gigapixels per second.
Theoretical texture sampling rate: 3.8 gigatexels per second.
1 × VGA (D-Sub) and 1 × DVI-I
TV-Out
Consumes less than 70 W (that is, there is no need for an additional power connector on PCI-Express cards; the recommended power supply is 300 W or more)

As we can see, there are no special architectural differences from R420, which is not surprising: RV410 is a scaled-down solution (fewer pixel processors and memory controller channels) based on the R420 architecture. The situation here resembles the one with NV40/NV43. Architecture principles of both competitors are similar in this generation. As for the differences between RV410 and R420, they are quantitative (bold elements on the diagram) rather than qualitative: from the architectural point of view the chip remains practically unchanged.

Thus, we have six vertex processors (as in R420; they may come in handy in some applications that are limited by geometry performance) and two (instead of four in R420) independent pixel processors, each working with one quad (a 2x2 pixel fragment). PCI Express has become the native on-chip bus interface, as is the case with NV43. AGP 8x cards are equipped with an additional PCI-E -> AGP bridge (it is shown on the diagram).

Architecture of vertex and pixel processors as well as of the video processor remained the same; these elements were described in detail in the review of RADEON X800 XT. And now let's talk about potential tactical considerations as to what was cut down and why. To all appearances, vertex and pixel processors in RV410 remained the same, but the internal caches could have been reduced, at least proportionally to the number of pipelines. However, the number of transistors gives no cause for concern. Considering the not so large cache sizes, it would be more reasonable to leave them as they were (as in NV43), thus compensating for the noticeable scarcity of memory bandwidth. All techniques for saving memory bandwidth were preserved: Z buffer and frame buffer compression, Early Z with an on-chip Hierarchical Z, etc.

Interestingly, unlike NV43, which can blend and write no more than 4 resulting pixels per clock, pixel pipelines in RV410 completely correspond to R420 in this respect. In case of simple shaders with a single texture, RV410 should get almost a twofold advantage in fillrate. Unlike NVIDIA chips, which have a large ALU array (in terms of transistors) responsible for post processing, verification, Z generation, and pixel blending in floating point format, RV410 possesses modest combinators, and that's why their number was not cut down so much. However, the narrower memory bus will not allow writing a full 3.8 gigapixels per second in most cases anyway.
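
The bandwidth limit can be put in numbers using the reference RADEON X700 XT figures given earlier (16.8 GB/s, 3.8 gigapixels per second). The per-pixel traffic of 12 bytes used below is an assumption for an uncompressed worst case (a 32-bit color write plus a 32-bit Z/stencil read and write); compression and Early Z reduce it in practice:

```python
def bytes_available_per_pixel(bandwidth_gb_s: float,
                              fill_rate_gpix_s: float) -> float:
    """Memory traffic budget per pixel at the peak fill rate."""
    return bandwidth_gb_s / fill_rate_gpix_s


def bandwidth_required_gb_s(fill_rate_gpix_s: float,
                            bytes_per_pixel: float) -> float:
    """Bandwidth needed to sustain the fill rate at a given per-pixel cost."""
    return fill_rate_gpix_s * bytes_per_pixel


if __name__ == "__main__":
    budget = bytes_available_per_pixel(16.8, 3.8)
    needed = bandwidth_required_gb_s(3.8, 12)   # assumed 4 + 4 + 4 bytes
    print(round(budget, 2), "bytes per pixel available")
    print(round(needed, 1), "GB/s needed for uncompressed full pixels")
```

Under these assumptions the card has well under half the bandwidth needed to write full pixels at the peak rate, which is why the bandwidth-saving techniques matter so much for this chip.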

The idea to retain all six active vertex units is no less interesting. On one hand, it's a strong point for DCC applications. On the other hand, we know that much depends on OpenGL drivers, a traditional strong point of NVIDIA solutions.


Alexander Medvedev (unclesam@ixbt.com)

Alexei Berillo (sbe@ixbt.com)

Updated on August 6, 2007
