ATI RADEON 9700 preview or DX9=R300*NV30

Lecture 1 - introductory
Lecture 2 - characteristics of the main hero
Lecture 3 - comparison of key characteristics
Lecture 4 - memory controller
Lecture 5 - pixel pipelines and texture units
Lecture 6 - vertex shaders and higher-level languages
Lecture 7 - anti-aliasing and video capabilities
Conclusion

Lecture 1 - introductory

The epoch of flexibly programmable graphics accelerators has at last begun. Certainly, there are drawbacks! But they are being corrected. At the same time, realistic-graphics software shows that the capabilities of even the latest generation of accelerators are still meager. But the direction of development is right, and soon you will see realistic graphics on every desktop. Thank you for your attention (lengthy applause).

Today we take a preliminary look at the capabilities of ATI's recently announced new-generation chip. We will also discuss its main competitor, the still-unannounced NV30, and the prospects of hardware graphics acceleration.

Lecture 2 - characteristics of the main hero

So, ATI announced the RADEON 9700 in advance - earlier it was known as R300:

This chip opens a new generation of ATI graphics architectures, implementing in hardware the latest trends outlined by the DirectX 9 API. Some time ago we already touched upon the key requirements that DX9 sets for accelerators.

Here are the promised characteristics of the new chip and of the flagship card based on it, the RADEON 9700:

Technology: 0.15 micron;
Transistors: 107 million;
Core clock speed: 300 MHz (315/325 possible);
Memory bus: 256bit DDR (DDR II will probably come later);
Local memory: up to 256 MB;
Memory clock speed: 300 DDR (600) MHz or more, 20 GB/s bandwidth;
Interface bus: AGP 8x, 2 GB/s throughput;
Full support of basic DX9 capabilities:

Floating-point 64 and 128bit data formats for textures and frame buffer (vectors of 4 components of F16 or F32);
Pixel shaders with floating-point arithmetic (4*F32 computation format);
Pixel Shaders 2.0;
4 independent vertex pipelines;
Vertex Shaders 2.0;
Hardware tessellation of N-Patches with Displacement Mapping and, optionally, adaptive level of detail;

8 independent pixel pipelines
8 texture units (one per pixel pipeline) able to perform trilinear filtering without speed losses and (at last) combine anisotropic and trilinear filtering.
4-channel (four 64bit channels) memory controller connected to the accelerator's core and AGP through a full crossbar;
HyperZ III memory optimization technology (fast clearing and compression of the Z buffer in 8x8 blocks, hierarchical Z buffer for quick visibility determination);
Early Z test (the pixel shader is run only for visible pixels);
Hardware acceleration of MPEG 1/2 decompression and compression, possibility to process a video stream arbitrarily with pixel shaders (VIDEOSHADER technology);
2 independent CRTC;
2 integrated 10bit 400MHz RAMDACs with hardware gamma correction;
Integrated TV-Out;
Integrated DVI (TMDS transmitter) interface, up to 2048*1536.
Integrated general-purpose digital interface for connection of an external RAMDAC or a DVI transmitter or coupling with a TV tuner.
FC packaging (FlipChip).
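A quick sanity check of the bandwidth figures quoted in the list above: peak throughput is simply the bus width multiplied by the effective transfer rate. The sketch below assumes the quoted 256bit bus at 600 MHz effective; for AGP 8x it assumes the usual 32bit bus with a 66.66 MHz base clock and eight transfers per clock, which is our assumption and not part of ATI's list. The function name is ours.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_mhz: float) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9

# Local memory: 256bit bus, 300 MHz DDR = 600 MHz effective.
local_memory = peak_bandwidth_gb_s(256, 600)

# AGP 8x: assumed 32bit bus, 66.66 MHz base clock, 8 transfers per clock.
agp_8x = peak_bandwidth_gb_s(32, 66.66 * 8)

print(f"local memory: {local_memory:.1f} GB/s")
print(f"AGP 8x:       {agp_8x:.2f} GB/s")
```

So the "20 GB/s" in the list is a rounded 19.2 GB/s, and local memory is roughly nine times faster than the AGP link.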

Well, the characteristics are really impressive. Later we will comment on each item and now we are turning to

Lecture 3 - comparison of key characteristics

For comparison we have chosen the most popular gaming solutions as well as the main future competitor of the R300 - the NV30.

The NV30 specs given here are neither official nor precise - they are taken from various sources and based on rumors found on the Net. A considerable part of the parameters is inferred from the published data on the new cross-API higher-level languages C Graphics / Cine FX, which are meant to simplify programming of such flexible chips. Some assumptions are also based on the DX9 requirements:

Accelerator	R200 (RADEON 8500, 128MB)	NV25(GeForce4 Ti 4600)	RV250 (RADEON 9000 PRO)	R300 (RADEON 9700)	NV30
Technology, transistors	0.15, 62M	0.15, 68M	0.15, ~40M(?)	0.15, 107M	0.13, 120M
AGP	4x	4x	8x	8x	8x
Memory bus, bit	128 DDR	128 DDR	128 DDR	256 DDR (II)(1)	256 DDR II
Memory frequency, MHz	275	325	275	300 (?)	400+ (?)
Core frequency, MHz	275	300	275	300 (?)	400 (?)
Pixel pipelines	4	4	4	8	8
Texture units	4x2	4x2	4x1	8x1 (2)	8x2 (?)
Textures per pass, max.	6	4	6	16 (3)	16 (3)(?)
Vertex shaders	2	2	1	4	4 (?)
Fixed T&L unit	Yes	No	No	No	No (?)
N-Patches	DX8	No	DX8 (4)	DM (DX9)	DM (DX9) (?)
Vertex Shaders, version	1.1	1.1	1.1	2.0	2.0 (5)(?)
Pixel Shaders, version	1.4	1.3	1.4	2.0	2.0 (5)(?)
Memory controller	2x64	4x32	1x128	4x64	4x64 (?)
Integrated RAMDAC	1x400 MHz	2x360 MHz	2x350 MHz	2x400 MHz	2x400 MHz (?)
Memory optimization technology	Yes (HyperZ II)	Yes (LightSpeed II)	Yes (HyperZ II ?)	Yes (HyperZ III)	Yes (LightSpeed III ?)

Notes:

(1) Most likely, DDR II will be supported together with the DDR.
(2) Each texture unit can fulfill trilinear sampling itself, without performance penalty.
(3) According to the DX9 requirements, up to 16 different textures with 8 precalculated (interpolated over the triangle) 4D texture coordinates can be used in a pass. In a pixel shader it's possible to sample up to 32 values from these textures.
(4) Software emulation.
(5) To all appearances, the hardware part will have capabilities exceeding the DX requirements for vertex and especially pixel shaders 2.0.

What general conclusions can be drawn from this comparison?

At the moment the R300 is the undisputed leader among gaming accelerators (if we ignore the rumored parameters of the yet-unannounced NV30), both in architecture and in rough performance as estimated from the specs and the first results of such cards.

However, its real market position can be judged only after comparing the specs and application performance of the final versions of the R300 and NV30. And the R300 is not available yet. The potential of the new architectures can be fully revealed only with DirectX 9, which is due in autumn. The NV30 will probably be released by that time as well. In autumn we will witness a new battle of giants. That is why the R300's head start gives it no trumps except a doubtful priority in the PR sphere.

The .15-micron fab process, typical of the previous generation, allows for mass manufacture of the R300 - reportedly, production volumes on the .13-micron process won't be available to ATI until winter. Besides, the .15 process is not new to ATI, as it was used in its previous products; this can help raise the yield of working chips from the very beginning. On the other hand, this many transistors on this process can mean a low yield of good chips, high power consumption, and a high production cost with no room for price competition.

NVIDIA decided to take risks - as one of the first to get access to the .13 process, the company is in a completely different situation. The new process still has to have its imperfections ironed out, mass production may slip, and the yield of working chips can be very low in the beginning. On the other hand, once the process is tuned, NVIDIA will benefit more in production cost and clock speed (NVIDIA's traditionally higher-clocked architectures plus the finer process give 400 MHz against 300). So time works for NVIDIA; that is probably why ATI was in such a hurry with the "paper" release and may put the first cards on the market even before DirectX 9.

However that may be, betting on a king for a day is risky.

The R300 complies with the DX9 requirements and is a deliberate hardware incarnation of this API. Rumor has it that the NV30 can offer more.

The question is whether these NV30 capabilities will be included in DX (for example, as DX9.1, shaders 2.1 etc.) or will be available only as OpenGL extensions.

We are about to enjoy a tough competition between two products close in characteristics, aimed at the same market niche and probably going to be released at the same time.

Lecture 4 - memory controller

In the new product ATI uses a familiar (from NVIDIA products) approach for memory control, which includes a 4-channel memory controller and an internal switch on the chip:

Earlier, ATI preferred one- or two-channel controllers and large data blocks, while NVIDIA's caching and memory access have been based on smaller blocks ever since the NV20. Both approaches have advantages and disadvantages. NVIDIA's, for example, heats the memory more and is more sensitive to its parameters and quality; as a result, the overclocking potential is lower. ATI's approach is gentler on the memory but less efficient in complex tasks that access memory through many streams. As accelerators become more flexible, the number of streams read from memory simultaneously grows - there are several data flows for vertex shaders and 4 or 6 textures in a single pass. That is why NVIDIA's approach is more effective in modern applications, and with the R300 ATI uses it too :-).

The memory optimization technology got one more Roman numeral in its name - now it is called HyperZ III. The idea is the same - there are no new techniques, but the old ones are improved. The technology provides fast compression and clearing of the Z buffer in 8x8 blocks, and a 3-level hierarchical representation of the Z buffer for early visibility determination of whole blocks at once.

So, we have a shaded polygon (1) located close to the observer, and we want to shade polygon 2, which is located further away and is therefore partially hidden. First we look at the top level of the hierarchical Z buffer, which stores distances for the largest, 4x4 blocks, and mark the block that lies entirely within the nearer triangle (3) and doesn't need to be shaded. Thus we get rid of 16 pixels at once. Then we go one level down and discard 8 2x2 blocks. At the last level, with 1-pixel precision, we find several more pixels that needn't be shaded. Although this illustration is simplified, it is enough to convey how the hierarchical Z buffer works and what computation it saves.
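The descent through the levels can be sketched in a few lines. This is only a toy model under our own assumptions, not ATI's implementation: smaller depth means closer, the far polygon covers the whole 8x8 tile at one constant depth, and the per-block maximum is recomputed on every test where real hardware would keep it precomputed. All names are hypothetical.

```python
FAR = 1.0  # depth of the cleared buffer; smaller values are closer to the viewer

def block_max(zbuf, x0, y0, size):
    """Farthest depth stored in the size x size block starting at (x0, y0)."""
    return max(zbuf[y][x]
               for y in range(y0, y0 + size)
               for x in range(x0, x0 + size))

def cull(zbuf, incoming_z, x0, y0, size, rejected):
    """Test one block; descend to smaller blocks only when the answer is unclear."""
    if incoming_z >= block_max(zbuf, x0, y0, size):
        # Everything stored in this block is at least as close: skip it whole.
        rejected[size] += size * size
    elif size > 1:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                cull(zbuf, incoming_z, x0 + dx, y0 + dy, half, rejected)
    # size == 1 and not rejected: the pixel is visible and has to be shaded.

def hierarchical_z(zbuf, incoming_z):
    """Pixels of the incoming polygon rejected at the 4x4, 2x2 and 1x1 levels."""
    rejected = {4: 0, 2: 0, 1: 0}
    for y0 in range(0, len(zbuf), 4):
        for x0 in range(0, len(zbuf[0]), 4):
            cull(zbuf, incoming_z, x0, y0, 4, rejected)
    return rejected

# Polygon 1 (depth 0.2) already covers the upper-left triangle of the tile.
zbuf = [[0.2 if x + y <= 7 else FAR for x in range(8)] for y in range(8)]

# Polygon 2 lies behind it at depth 0.6.
rejected = hierarchical_z(zbuf, incoming_z=0.6)
print(rejected)  # {4: 16, 2: 8, 1: 12}
```

In this example 24 of the 36 hidden pixels are thrown away by block-level tests, and only 12 need a per-pixel comparison.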

Like all modern accelerators, the R300 sports an Early Z Test. The idea is simple - real color values (and hence texture samples and test results as well) are calculated only for visible pixels. Obviously, the more complicated the shaders and texturing methods, the more memory bandwidth and computational clocks this saves. On a typical scene with an overdraw factor of 2 it will reject about a quarter to a third of the pixels; at best, with front-to-back ordered rendering, 50%.

It is interesting to see what NVIDIA will call the similar technologies of its new chip - LMA III, or, unlike ATI, LMA 3? However that may be, it is clear that NVIDIA won't keep the previous name, LMA II :-).
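Where do "a quarter" and "50%" come from? A simplified model is enough, assuming (our assumption, not measured data) that every pixel is covered by the same number of layers with distinct depths, drawn either in random order or strictly front to back. In random order the k-th fragment drawn at a pixel survives the early Z test only if it is the closest of the first k, which happens with probability 1/k.

```python
from fractions import Fraction

def rejected_random_order(overdraw: int) -> Fraction:
    """Share of fragments rejected by early Z when layers arrive in random order."""
    # Expected number of fragments per pixel that pass the test and get shaded.
    shaded = sum(Fraction(1, k) for k in range(1, overdraw + 1))
    return 1 - shaded / overdraw

def rejected_front_to_back(overdraw: int) -> Fraction:
    """Share rejected when the scene is sorted: only the first layer is shaded."""
    return Fraction(overdraw - 1, overdraw)

print(rejected_random_order(2))   # 1/4
print(rejected_front_to_back(2))  # 1/2
```

With back-to-front ordering the same model gives zero: every new fragment is closer than what is stored, so nothing is rejected.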

Lecture 5 - pixel pipelines and texture units

With DX9, the requirements for the complexity of the chip's pixel pipelines will rise. The main driver of these requirements is version 2.0 of pixel shaders:

Version	1.1	1.4	2.0
Textures per pass, max.	4	6	16
Texture sampling instructions, max.	4	6*2	32
Computational instructions, max.	8	8*2	64
Data formats	I8[4]	I12[4]	F32[4]
Instruction flow management	No	No	No
Output of several values	No	No	Up to 4 values
Z buffer access	No	write	read and write
Constant registers	8	8	16
General registers	2	2	8

When describing Cine FX - an API-independent analog of the higher-level effect files of DirectX 9, compiled both for the latest versions of OpenGL and for DX9 - NVIDIA mentions pixel shaders of 1024 instructions (!) executed continuously in one pass. The pixel shader can use up to 512 constants, each counted as one instruction. It seems that in this respect the NV30 is far ahead of the DX9 requirements.

Earlier, pixel shaders were executed in stages - the number of texture stages was equal to the maximum number of textures used, and the number of computational stages was equal to the maximum number of instructions. Each computational stage had a full ALU and could execute any shader instruction. The stages were configured for their instructions and then combined into a chain. As a result, the data (the values of two general registers) passed through all the stages, each of which performed its instruction on them. An operation took one clock, hence a pipeline of 8 stages which processed up to 8 different pixels at different stages simultaneously. Clock by clock, the pipeline looked like this:

	Clock 1	Clock 2	Clock 3	Clock 4
Stage 1 (ADD)	Pixel 1	Pixel 2	Pixel 3	Pixel 4
Stage 2 (MUL)	-	Pixel 1	Pixel 2	Pixel 3
Stage 3 (MUL)	-	-	Pixel 1	Pixel 2
Result	-	-	-	Pixel 1
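The table above can be reproduced with a toy shift-register model: at every clock each pixel moves one stage further and a new pixel enters the first stage. This is only an illustration of the principle; the function name is ours and the pixels are simply numbered in the order they arrive.

```python
def simulate_stages(num_stages, num_clocks):
    """For each clock return (pixel in each stage, pixel finished at this clock)."""
    stages = [None] * num_stages
    rows = []
    for clock in range(1, num_clocks + 1):
        finished = stages[-1]            # pixel leaving the last stage
        stages = [clock] + stages[:-1]   # everyone advances, a new pixel enters
        rows.append((list(stages), finished))
    return rows

for clock, (stages, finished) in enumerate(simulate_stages(3, 4), start=1):
    print(f"clock {clock}: stages={stages} result={finished}")
```

After the first three clocks fill the chain, one finished pixel comes out at every clock, whatever the number of stages.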

But the chip makers couldn't actually afford even 8 stages per pipeline - 32 full ALUs, even integer ones, would occupy too much space on the chip. Usually each pixel pipeline was given 2 or 4 stages (the Matrox Parhelia 512 had 5), and for a longer shader the stages of 2 or 4 pipelines were combined into a chain. The number of pixels shaded per clock dropped by a factor of 2-4 in that case.

As shaders get more complex, this approach ceases to be advantageous. It would be necessary to provide at least 64 single-clock ALUs (for the stage approach), which is unrealizable, especially with floating-point data. Besides, the number of temporary registers whose values must be stored in each ALU and passed from stage to stage at every clock keeps growing. And what should we do when shaders become even longer?
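The cost of chaining is easy to put into numbers. The sketch below is a rough model under our own assumptions: one instruction per stage, whole pipelines chained together, and no multipass fallback; the function and its parameters are hypothetical.

```python
import math

def pixels_per_clock(pipelines, stages_per_pipeline, shader_instructions):
    """Pixels finished per clock when pipelines are chained for a long shader."""
    # How many pipelines must be linked so that every instruction gets a stage.
    chained = max(1, math.ceil(shader_instructions / stages_per_pipeline))
    if chained > pipelines:
        raise ValueError("shader does not fit in one pass in this model")
    return pipelines // chained

print(pixels_per_clock(4, 2, 2))  # 4: the shader fits into a single pipeline
print(pixels_per_clock(4, 2, 4))  # 2: pipelines are chained in pairs
print(pixels_per_clock(4, 2, 8))  # 1: all four pipelines form one chain
```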

Let's see what we have in the R300. There are 8 pixel pipelines, each equipped with its own pixel shader processor. This is not a set of switched stages with ALUs but a real (RISC) processor which executes one instruction per clock. The lack of instruction flow control simplifies matters. The longer the shader, the greater the expected gain. On the other hand, the amount of work done per clock is no longer so crucial: now we can build almost any scene in one or two passes, which is much more beneficial than several passes of faster but simpler shaders. The limit on the number of instructions in the new approach is quite arbitrary - nothing prevents the processor from executing 256 or 1024 instructions in turn; the only thing required is on-chip memory. Interestingly, to stay compatible with the first shader versions, the pixel pipelines of the R300 and the NV30 support calculations not only in the floating-point formats F32 and F16 but also in the integer format I12. Without such support, old shaders could cause unpleasant problems - emulating some instructions might take up to 4 operations!

Editor's note: Almost a portrait of the author of this article.

Moreover, to speed up calculations we can try a superscalar approach, even in its simplest form, as in the first superscalar RISC processors. Each ALU has several functional units - an addition and subtraction unit, a multiplication unit, a division unit, and a separate unit that moves data between registers. It's not a great problem to build a processor that simultaneously executes instructions belonging to different units, provided they are independent, i.e. the following instruction can be executed regardless of the result of the previous one. That is why accelerator developers and Microsoft recommend paying attention to dependences between neighbouring instructions and eliminating them where possible.

On the other hand, more advanced speculative execution, with instruction reordering and rollback and register renaming, makes no sense for shader processors now - it is too expensive, given the unjustified increase in the complexity of each shader processor. As usual in graphics, it is more advantageous to run shaders in parallel at the object level (the level of vertices and pixels), by increasing the number of parallel processors, than to extract parallelism at the instruction level: the algorithms are short and neighbouring instructions are too tightly bound. That is why the number of pixel and vertex pipelines has doubled compared with the R200.

In the near future pixel processors will become full counterparts (in capabilities) of vertex ones, since they share the same data format and the same arithmetic instructions; the only thing lacking is instruction flow control, but this problem can be solved. The distinction between pixel and vertex processors will vanish. In several architecture generations a graphics accelerator will turn into a set of identical general-purpose vector processors with flexible, configurable queues for asynchronous transfer of parameters between them. The processors' work will be distributed on the fly, depending on the rendering approach used and on the balance of performance required for particular tasks:
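To see why dependences between neighbouring instructions matter, here is a minimal sketch of such a simplest-case superscalar issue rule. It is a hypothetical model, not the actual logic of any chip: instructions are issued in order, at most two per clock, and two neighbours are paired only if they use different functional units and the second neither reads nor overwrites the result of the first. The instruction encoding and unit names are ours.

```python
# Which functional unit handles each (hypothetical) operation.
UNIT = {"add": "adder", "sub": "adder", "mul": "multiplier", "mov": "move"}

def can_pair(first, second):
    """True if two neighbouring instructions may be issued in the same clock."""
    op1, dst1, _ = first
    op2, dst2, srcs2 = second
    different_units = UNIT[op1] != UNIT[op2]
    independent = dst1 not in srcs2 and dst1 != dst2
    return different_units and independent

def clocks_needed(program):
    """In-order issue, at most two instructions per clock."""
    clocks, i = 0, 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2
        else:
            i += 1
        clocks += 1
    return clocks

# Each instruction is (operation, destination, sources).
chain = [                      # every instruction needs the previous result
    ("mul", "r0", ("v0", "c0")),
    ("add", "r0", ("r0", "c1")),
    ("mul", "r0", ("r0", "c2")),
    ("add", "r0", ("r0", "c3")),
]
independent = [                # same mix of operations, no dependences
    ("mul", "r0", ("v0", "c0")),
    ("add", "r1", ("v1", "c1")),
    ("mul", "r2", ("v2", "c2")),
    ("add", "r3", ("v3", "c3")),
]

print(clocks_needed(chain))        # 4
print(clocks_needed(independent))  # 2
```

The same four operations take half the clocks once neighbouring instructions stop depending on each other, which is exactly the kind of ordering the recommendation is about.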

some will be in charge of animation and tessellation (geometry generation),
some will control geometrical transformations,
some will manage shading and lighting,
some will deal with texture sampling (they will be intelligent texture units that can be programmed with arbitrary filtering methods or calculate procedural textures).

The R300 is based on the 8x1 configuration - each pixel pipeline has only one texture unit connected:

One of eight pixel pipelines of the R300

This looks like a forced economy caused by the .15-micron fab process. We can think of many real situations in a pixel shader where waiting for the result of a single texture unit significantly slows down the shader! A second texture unit could avoid such stalls, raising the speed of pixel shader processing by a factor of 1.5 or 2. Well, let's leave that to ATI and be glad that, despite having just one texture unit per pipeline, it can perform trilinear filtering without speed losses and can combine trilinear and anisotropic filtering (the lack of which was a well-known downside of the R200).

For such long shaders it makes sense to organize the texture units a bit differently. Suppose the units are not bound to a particular pixel pipeline but serve any of them as texture sampling requests arrive. We could then run the shaders on different pipelines with a time shift of several instructions, to even out the irregular interleaving of calculations and texture accesses. First, all the units would serve the pipelines that are waiting for textures while the other pipelines carry out calculations. Then the roles would swap. Downtime would be much lower in this case, and 8 shared units would be enough for 8 pipelines. It's possible that ATI follows this approach but doesn't want to reveal the details. And it's possible that NVIDIA will take this approach in one of its future chips - the idea was once discussed by engineers from 3dfx, which NVIDIA absorbed.

Lecture 6 - vertex shaders and higher-level languages

Vertex shaders haven't changed as much as pixel ones, but they are nonetheless improved by a great margin - they are now able to control the instruction flow. Now we have subroutines, loops, and conditional and unconditional jumps.

Versions	1.0	2.0
Instructions, max.	128	256
Instruction flow management	No	Yes
Data format	F32[4]	F32[4]
Constant registers	96	256
General registers	8	16

At present all decisions to change the instruction flow are based on constants passed to the shader; this makes it hard to take decisions on the fly, separately for each vertex. It's not clear why Microsoft decided on this - the ATI R300 (and NVIDIA NV30) most likely do not unroll loops and subroutines into a continuous sequence of instructions but let the next-instruction pointer move around the instruction memory inside the chip. Well, in the next DX generation this limitation will be lifted, and we will be able to call the vertex pipelines of any accelerator vertex processors. Unlike the R300, the NV30 can already control the order of instructions according to data in temporary registers - like any ordinary processor. On the other hand, the R300 can execute shaders of up to 1024 instructions, the NV30 only up to 256 (and up to 65536 instructions when loops and subroutines are unrolled).

Everything said in the previous part about the superscalar implementation applies, probably to an even greater degree, to vertex shaders. Quite lengthy shaders make us think about optimizing for combined execution of instructions.

When hardware and API developers got the ability to execute shaders of thousands of instructions, they turned to higher-level languages. It's much more pleasant to deal with some C dialect than with assembler code, which hasn't been in common use for 8-10 years already. At last the hardware has reached the required level, and now instead of thousands of constants and instructions we have hundreds, and instead of hundreds we have tens. Soon the complexity of programs for an accelerator may equal that of programs for ordinary processors, at least for the part that handles 3D graphics.

For example, NVIDIA announced its C Graphics (CG) dialect, which at first wasn't user-friendly at all; despite its disadvantages, it is a cross-API tool - shader code can be compiled in both the OpenGL and the Direct3D environments. The compiler comes with a rich set of effects and samples. There is a new CG version - for DX9 - which is handier in data binding and usage and can be called a de facto standard.

Microsoft is not in a hurry either - it is debugging its HLSL, which is essentially the same CG (or perhaps vice versa, because the development was carried out by NVIDIA and Microsoft together) but works only within DirectX. Besides, at present the HLSL works only with vertex shaders.

ATI doesn't stand idle either and announces its Render Monkey. This dialect is different. NVIDIA's CG and Cine FX (an analog of techniques and effects from DX9 and, like the CG, cross-API!) are the most convenient ones, at least thanks to export plugins for popular 3D modeling and realistic graphics packages.

Rendered Monkey :-)

Lecture 7 - anti-aliasing and video capabilities

There is no breakthrough in the anti-aliasing technique: we have the same SMOOTHVISION 2x, 4x and 6x, although it is now named SMOOTHVISION 2.0. However, while the approach to forming pseudorandom sample patterns is the same, it now uses multisampling (MSAA), which should improve performance compared with the supersampling (SSAA) SMOOTHVISION of the R200. The first one was good too, though. The speed of the MSAA version has reportedly increased - maybe because of the wider bus or an optimized algorithm. In the practical part of the R300 review we will carefully examine the performance drop when FSAA and anisotropic filtering are enabled. It should also be noted that on transparent textures (with an alpha channel) the chip switches to SSAA mode and computes all samples for each pixel of the triangle (not only for its edges).

It is interesting what NVIDIA is going to offer in its new chip: its various MSAA-based hybrids look outdated compared with SMOOTHVISION.

One more significant aspect of the R300 is the VideoShader technology. It uses the computational capabilities of the pixel pipelines for some tasks of encoding/decoding MPEG1/2 video streams, color space conversion, deinterlacing and some other video processing. The following diagram shows which tasks fall on the shoulders of the pixel shaders and which are still handled by dedicated hardware units:

In the near future the flexibility and performance of shader processors will let them handle quite complicated 2D video tasks (or, rather, the most computation-intensive parts of such tasks), up to MPEG4 decoding. It might also be possible to hand them sound compression and voice recognition! Why not use this huge power to turn an accelerator into a general-purpose coprocessor?
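The reason MSAA is expected to be cheaper can be shown with a rough count of color (shader and texture) evaluations per frame. The model is deliberately crude and the numbers are hypothetical: with SSAA the color is computed for every sample, with MSAA once per pixel per covering triangle, so only pixels on triangle edges (assumed here to be covered by two triangles) cost extra. Depth and coverage tests, which MSAA still performs per sample, are not counted.

```python
def color_evaluations(pixels, samples, mode, edge_pixels=0):
    """Rough number of color evaluations for one frame without overdraw."""
    if mode == "ssaa":
        # Every sample of every pixel gets its own color.
        return pixels * samples
    if mode == "msaa":
        # One color per pixel, plus one more for each pixel on a triangle edge.
        return pixels + edge_pixels
    raise ValueError(f"unknown mode: {mode}")

frame = 1024 * 768
print(color_evaluations(frame, 6, "ssaa"))                     # 4718592
print(color_evaluations(frame, 6, "msaa", edge_pixels=40000))  # 826432
```

The 40000 edge pixels are an arbitrary illustrative figure; the point is that the MSAA cost barely depends on the number of samples, while the SSAA cost grows with it linearly.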

Conclusion

Well, it's too early to consider ATI and its R300 winners - I'd rather say the company offered the best combination of price and capabilities with the junior chip of the 9000 line, the RV250. It's also unfair to consider the R300 a loser, because it is a competitive solution. So, let's wait for the cards and for DX9.

According to the information available now, and ignoring the still-unknown prices, I'd rank the competitors in the following order: NV30, R300. Well, friendship loses again.

Alexander Medvedev (unclesam@ixbt.com)
