ATI Technologies' Counter-Offensive: RADEON X1800 (R520), X1600 (RV530), and X1300 (RV515) Series

Part 6: Interview with Guennadi Riguer (ATI Technologies)

New technologies from ATI Technologies: interview with Guennadi Riguer.

iXBT.com: What is your name and position at ATI?

My name is Guennadi Riguer. I head a group of ATI engineers responsible for supporting developers in North America and often beyond. My team works closely with many game developers, helping them adopt new graphics technologies.

iXBT.com: Do the R520 and RV5XX chips fully support SM3, from ATI's point of view and from Microsoft's? Does Microsoft certify them (and their drivers) for SM3? What requirements must be met for this certification? Are the rumours of incomplete SM3 compatibility groundless?

All Radeon X1000 cards fully comply with SM 3.0 support requirements. From Microsoft's point of view, the cards pass WHQL certification completely. In case you didn't know, WHQL (Windows Hardware Quality Labs) tests Windows compatibility and compliance with the necessary specifications. Part of the WHQL graphics testing is the Display Compatibility Test (DCT) kit, which is available at the Microsoft web site. Sceptics may test their cards themselves and make sure that all Radeon X1000 cards pass all the necessary SM 3.0 tests. As for meeting developers' requirements, all the minimum SM 3.0 functions are there (in terms of the API specs). The minimum requirements from the DirectX® 9 documentation are Vertex Shader 3.0 and Pixel Shader 3.0. Now, about the rumours. Rumours are groundless by definition, or they would have been facts rather than rumours. I prefer to operate with facts; rumours are for old ladies sitting on a porch, or for those with plenty of free time. In my opinion, too much time has been devoted to rumours about the SM 3.0 compatibility of various video card manufacturers. Interestingly, facts get much less attention than rumours.

iXBT.com: Why is there no vertex texture fetch? What reasons led to this decision, and can it hamper full SM3 compliance certification? Will this move displease developers?

I will not deny that vertex texture fetch is a useful thing. Even so, the lack of vertex texture fetch in the X1000 series was a deliberate architectural decision. Full-fledged texture units, providing fast fetching and supporting filtering in various formats, would have taken up a lot of die area, especially considering that the X1800 chip has 8 vertex pipelines. On the other hand, in my opinion, inferior texture units significantly limit the efficiency of this interesting function. This means that vertex texture fetch would have seen little use in real games, and the corresponding part of the die would have sat idle in the majority of them. For chip manufacturers, that means wasted money. I haven't seen a Direct3D game supporting vertex texture fetch so far. The only OpenGL game supporting this feature is IL-2. By the way, we are pleased to note that it is the Russian developer 1C who is promoting advanced graphics technologies.
I doubt that developers will be disappointed, or even displeased, with our decision. We offer an alternative solution: render-to-VB. But that's another story.

iXBT.com: The chip offers many functions for handling an FP16 render buffer, even including MSAA, which the latest NVIDIA chips lack, yet it lacks even bilinear filtering of FP16 textures, which is available in the latest series from NVIDIA. What are the reasons for this asymmetry? It seems illogical from the point of view of developers, who don't consider software filtering in shaders efficient, at least on the current generation of hardware. Is FP16 filtering support planned for the future (R580, or later)? Will this decision hurt HDR, which ATI actively promotes?

When a chip is designed, you have to decide which functions are to be implemented and which are to be thrown overboard; which functions must be faster, and where performance is not so critical. That's a regular part of the design process. This analysis is always based on the key games and technologies which, in our opinion, will be most popular among developers in the near future. If we didn't do that and just put everything we wanted into a chip, it would be larger, consume more power, and be less efficient. It's like the old joke about the super-large chip invented in the Soviet Union: 40 pins and 2 handles.
I will gladly share our reasons for dropping FP16 texture filtering. When we analyzed promising methods of implementing HDR, we found that 10-bit and 16-bit integer formats and FP16 are of special interest, both for storing textures and for rendering. As for FP16, in our opinion alpha blending is the most important feature, followed by multisampling. Filtering comes only after them. Without alpha blending, it is impossible to render scenes with semitransparent objects, smoke, and other effects. Emulating blending in a shader using additional render buffers is theoretically possible, but it is taxing on both resources and performance. The lack of multisampling has a negative effect on the realism of rendered scenes, which is the main reason for using technologies such as HDR in the first place. At the same time, FP16 texture filtering is not so critical for HDR, for several reasons. Let me explain. First of all, a large range of values is impractical at this stage of our gaming industry; it's not actually necessary. As game developers usually create their products for a wide spectrum of hardware, games must run well on high-end as well as low-end cards with practically the same set of textures and other elements of artistic design. This imposes its own limitation on the range of intensity values.
There are several possible uses for formats with an extended range of values in HDR: first, textures; second, post-processing. Wide use of the FP16 format for textures is impractical. In the majority of cases, games use compressed texture formats. Switching to the FP16 format would require an 8-fold increase in memory capacity for HDR textures. Even 512 MB of video memory would be insufficient, to say nothing of the performance drop. Post-processing of a scene usually has two objectives: tonal image correction (automatic brightness adjustment for an optimal lighting range), and simulation of soft light and of light diffusing from strong light sources. Diffuse light requires no precision, as it is often done empirically just to satisfy the artistic needs of designers. In any case, the relatively narrow intensity range dictated by artistic design in current games allows scenes to be processed in 16-bit integer formats, for which filtering is supported on all Radeons starting from the 9500 model. This raises the question: "Are 16 bits sufficient?" Here is a comparison for non-believers: the dynamic range of the 16-bit integer format is 65535, while the dynamic range of a human eye adapted to its environment is about 30000. The dynamic range of a camera or monitor is much narrower. In those rare cases when the dynamic range of the integer format is insufficient, FP16 filtering can be emulated in a shader. But that is the exception rather than the rule.
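
As a rough illustration of the memory arithmetic in this paragraph, the following Python sketch compares storage per texel for a few formats. The bits-per-texel figures are the usual Direct3D 9 ones. The choice of DXT5 as the compressed baseline is an assumption made here because it reproduces the 8-fold figure quoted above (a DXT1 baseline would give 16-fold), and the 100-texture scene is purely hypothetical.

```python
# Bits per texel for a few texture formats (usual Direct3D 9 figures).
BITS_PER_TEXEL = {
    "DXT1": 4,         # 8 bytes per 4x4 block
    "DXT5": 8,         # 16 bytes per 4x4 block
    "RGBA8": 32,
    "RGBA16_INT": 64,  # 16-bit integer, four channels
    "RGBA16F": 64,     # FP16, four channels
}

# Largest value representable in an unsigned 16-bit integer channel.
UINT16_MAX = 2 ** 16 - 1


def texture_bytes(width, height, fmt, mipmaps=True):
    """Approximate size of one texture; a full mip chain adds about one third."""
    base = width * height * BITS_PER_TEXEL[fmt] // 8
    return base * 4 // 3 if mipmaps else base


def growth_factor(old_fmt, new_fmt):
    """How many times more memory new_fmt needs than old_fmt."""
    return BITS_PER_TEXEL[new_fmt] / BITS_PER_TEXEL[old_fmt]


if __name__ == "__main__":
    for baseline in ("DXT1", "DXT5"):
        print(baseline, "-> RGBA16F:", growth_factor(baseline, "RGBA16F"), "x")

    # Hypothetical scene: 100 textures of 1024x1024 with full mip chains.
    for fmt in ("DXT5", "RGBA16F"):
        total = 100 * texture_bytes(1024, 1024, fmt)
        print(fmt, "scene total:", round(total / 2 ** 20), "MiB")
```
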
We have been experimenting with HDR since the Radeon 9700, have accumulated a lot of experience, and have analyzed the field of HDR data representation. If you are interested in this issue, you may read an article from the latest ATI SDK, "HDR Texturing" (available at our web site). It reviews interesting HDR texture compression methods and options for representing data with a very large dynamic range in 16-bit integer textures. All the methods work well on a wide range of DirectX 9 cards from ATI, and their quality is often comparable to that of FP16, or sometimes better. By the way, all ATI demos, including "HDR Rendering With Natural Light", use integer formats exclusively. The latest demo for the Radeon X1800 (Toy Shop) uses 10-bit formats for HDR. The lack of FP16 filtering has absolutely no effect on our HDR promotion. On the contrary, our wide experience in this field will help developers implement HDR solutions, as our approaches are more practical and are not intended only for the latest video cards from ATI. This will allow developers to count on support from a large share of video cards and to implement HDR faster and more efficiently. As for FP16 filtering in solutions to come, I cannot comment on the functionality of unannounced products.
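
The remark that FP16 filtering "can be emulated in a shader" can be sketched as follows. This is a minimal CPU-side model in Python, not ATI's shader code: it assumes a single-channel texture stored as a list of rows, clamp-to-edge addressing, and texel centres at integer + 0.5. The function names `fetch` and `bilinear` are hypothetical. The idea is simply four point samples blended by the fractional position, which is what a shader would have to do when the hardware cannot filter the format itself.

```python
import math


def fetch(texture, x, y):
    """Point-sample one texel with clamp-to-edge addressing."""
    height = len(texture)
    width = len(texture[0])
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return texture[y][x]


def bilinear(texture, u, v):
    """Emulate bilinear filtering from four point samples.

    u and v are given in texel units; texel centres sit at integer + 0.5.
    """
    x = u - 0.5
    y = v - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0  # horizontal blend weight
    fy = y - y0  # vertical blend weight

    t00 = fetch(texture, x0, y0)
    t10 = fetch(texture, x0 + 1, y0)
    t01 = fetch(texture, x0, y0 + 1)
    t11 = fetch(texture, x0 + 1, y0 + 1)

    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy


if __name__ == "__main__":
    # Values above 1.0 stand in for HDR intensities.
    hdr = [[0.0, 1.0],
           [2.0, 3.0]]
    print(bilinear(hdr, 1.0, 1.0))
```

The cost is visible in the sketch: four fetches and three interpolations per filtered sample instead of one hardware-filtered fetch.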

iXBT.com: Is the bidirectional ring memory controller better than the classic crossbar (star) topology used in other products from ATI and NVIDIA? Will a similar controller architecture be used in future products as well? How programmable is this controller? Is it true that its flexibility allows noticeable gains if it is configured for a given application? How complex is this controller in terms of transistor count and die area?

The new memory controller is designed with headroom for the future. It can be easily adapted to a wide range of future projects. The ring topology of the memory bus, laid out along the die periphery, allows a much simpler bus layout inside the chip, very high memory frequencies in current solutions, and a guaranteed reserve for performance gains in the future.
The new memory controller has a more flexible programmable architecture than in any of our previous chips. Software-programmable memory arbitration allows us to adjust to the requirements of various applications, noticeably increasing efficiency with them. This is especially important for modern applications using programmable pipelines, where memory bandwidth requirements may vary significantly between pipeline stages depending on the task.
A programmable controller is a tad more complex than a simpler controller with fixed functionality. But the additional die area it takes up is quite justified and serves to increase the overall performance of the chip.

iXBT.com: Are there any innovations and advantages in data caching in the new series: new caching algorithms, changes in associativity, or considerably larger cache sizes? What other interesting things were implemented in terms of caching and data compression?

Many caches (texture caches, frame buffer caches, etc.) in the new X1000 series have become fully associative, which has increased their efficiency. We used to work with direct-mapped caches or caches with limited associativity. This change reduced cache misses significantly and increased peak performance by 25%.
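
To show why associativity matters, here is a minimal Python sketch that counts misses for a direct-mapped cache and a fully associative cache of the same size. It is a toy model under stated assumptions: the LRU replacement policy, the 4-line capacity, and the access trace are all chosen here for illustration, since the interview does not describe the actual hardware policies or sizes.

```python
from collections import OrderedDict


def misses_direct_mapped(trace, num_lines):
    """Each memory line can live in exactly one slot: line % num_lines."""
    slots = [None] * num_lines
    misses = 0
    for line in trace:
        idx = line % num_lines
        if slots[idx] != line:
            misses += 1
            slots[idx] = line  # evict whatever was in this slot
    return misses


def misses_fully_associative(trace, num_lines):
    """Any memory line can live in any slot; evict the least recently used."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # drop the least recently used line
            cache[line] = True
    return misses


if __name__ == "__main__":
    # Two lines that collide in a 4-line direct-mapped cache (0 % 4 == 4 % 4).
    trace = [0, 4] * 8
    print("direct-mapped misses:    ", misses_direct_mapped(trace, 4))
    print("fully associative misses:", misses_fully_associative(trace, 4))
```

With this conflict-heavy trace the direct-mapped cache misses on every access, while the fully associative one misses only on the first touch of each line. Other traces can narrow or even reverse the gap, so the sketch illustrates the mechanism rather than the 25% figure.
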
There are also improvements in data compression. 3Dc technology has been expanded with a new texture format: the compressed normal map format now has a younger brother, ATI1N. This format makes it possible to compress single-channel textures such as bump maps, monochrome shadow or light maps, etc. Together with growing video memory, the new formats allow developers to create virtual worlds with an unprecedented level of detail.
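
For readers curious how a single-channel block format of this kind can work, here is a hedged Python sketch of a decoder. It assumes that ATI1N uses the same 8-byte 4x4 block layout as the DXT5 alpha block (the layout later standardized as BC4): two 8-bit endpoints followed by sixteen 3-bit palette indices. The interview itself does not describe the layout, and the function names are hypothetical.

```python
def bc4_palette(r0, r1):
    """Eight-entry palette derived from the two 8-bit endpoints."""
    if r0 > r1:
        # Six interpolated values between the endpoints.
        return [r0, r1] + [((7 - i) * r0 + i * r1) / 7 for i in range(1, 7)]
    # Four interpolated values plus explicit minimum and maximum.
    return [r0, r1] + [((5 - i) * r0 + i * r1) / 5 for i in range(1, 5)] + [0, 255]


def decode_bc4_block(block):
    """Decode one 8-byte block into a 4x4 list of values in the 0..255 range."""
    if len(block) != 8:
        raise ValueError("a block is exactly 8 bytes")
    palette = bc4_palette(block[0], block[1])
    # Bytes 2..7 hold sixteen 3-bit indices, least significant bits first.
    bits = int.from_bytes(block[2:8], "little")
    texels = [palette[(bits >> (3 * i)) & 0x7] for i in range(16)]
    return [texels[row * 4:(row + 1) * 4] for row in range(4)]


if __name__ == "__main__":
    block = bytes([200, 40, 0x88, 0xC6, 0xFA, 0x88, 0xC6, 0xFA])
    for row in decode_bc4_block(block):
        print([round(v) for v in row])
    # 16 texels at 8 bits each shrink from 16 bytes to 8 bytes.
    print("compression ratio:", 16 / len(block))
```

Under this assumed layout, a single-channel 8-bit texture compresses 2:1, and gradients within a block are reproduced from only two stored endpoint values.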

iXBT.com: Why didn't ATI go all the way and make unified shader processors in R520 (which we already saw in Xenos), having implemented such an effective architecture with dynamic scheduling of execution modules and separate texture units? It wouldn't have required significant changes - dynamic execution is well adapted for such an architecture. This would have solved the problem with effective access to textures from vertex shaders and dynamic balancing of vertex and pixel load. When shall we see unified shader processors in PCs - in R6XX? Will the next generation be a WGF 2 solution?

It's not that simple. Highly efficient schedulers of execution threads are just a small part of the changes required to implement an architecture with unified shader processors. Considering the strict roadmaps for chip design, the standard architecture with separate shader pipelines was the most rational solution for this generation of graphics processors. But that does not mean that we will not move to unified shader processors in the future. In my opinion, it is a very promising architecture.
We cooperate closely with Microsoft in developing the next generation of the DirectX API and design hardware to support it. Unfortunately, I cannot comment on the timing, architecture, or any other details concerning our future solutions.

iXBT.com: Are the current vertex processors in the R520 capable of executing branches and how effectively can they do that (comparable to pixel processors or less efficiently)?

Yes, vertex processors in Radeon X1000 series chips support conditional branching, but their efficiency is not as high as in pixel processors. There is a good reason for that. As I have already said, we always have to choose design priorities. In our opinion, efficient branching in pixel shaders is much more important than in vertex shaders. As graphics technologies evolve, an increasing number of calculations migrate from vertex pipelines into pixel ones. Some time ago, vertex shaders reached their "functional plateau": they are basically responsible only for animation and for preparing various parameters for pixel shaders (tangent space, various light and camera vectors, etc.). You will hardly find a game where vertex shaders are the bottleneck; their functionality and efficiency are more than enough in the majority of cases, even without architectural tricks. As for conditional branching in vertex shaders, static branching is more interesting from the developer's point of view. It simplifies shader management in a program.
In modern applications all the beauty is calculated in pixel shaders, which have grown from a dozen to many hundreds of instructions in a couple of years. And the industry is not going to stop there. Efficient dynamic conditional branching in pixel shaders makes it possible to remove many bottlenecks and to implement interesting graphics algorithms that we couldn't even dream about before.

iXBT.com: What considerations led to such a strange balance between texture units and shader processors in the RV530? Isn't the 3:1 ratio sufficient for modern applications? On average, each pixel processor can execute 1.5-2 operations per clock. And many shaders actively access textures! Was this ratio planned from the very beginning? Was its efficiency confirmed by any tests?

That's a very interesting question. Not long ago we advised developers to keep the ratio between texture and arithmetic operations within 1:1 - 1:2 and to use textures as lookup tables instead of shader calculations. But very few things in the graphics industry stand still. The time will soon come to review these approaches and adjust to developers' requirements. In the days of DirectX 8 and early DirectX 9, the real ratio between operations in pixel shaders was 1:1.5 - 1:2.5. A close-to-optimal balance between the various pixel pipeline units was reached after all optimizations and filtering. If we look at the high-tech games released last year (Half Life 2, Far Cry) and this year (Age of Empires 3, FEAR, and Splinter Cell), we see the ratio between texture and arithmetic operations already within 1:3.5 - 1:5. Extrapolating to the next couple of years, we can expect an even larger ratio. That's not surprising. If we consider the performance growth of graphics pipelines, we can see that it leaves memory performance growth far behind. Here is an example. Memory bandwidth in video cards grew approximately 16 times from 1998 to 2004. Not bad, actually. But compared to pixel pipelines, whose performance grew approximately 84 times over the same period, memory is left far behind. This trend should hold for at least a couple more years. One way or another, developers will have to rely increasingly on arithmetic operations in shaders so as not to be limited by memory performance.
The Radeon X1600 is a graphics adapter for the middle price range and below. Users of these cards usually don't upgrade to the latest solutions every six months. Such cards stay in their owners' computers for at least a couple of years. In our opinion, it's important to give them a solution with headroom for the future. Yes, I agree that this solution is unusual and perhaps even a tad risky, but as they say, nothing ventured, nothing gained. We are sure that our forecast is right and that X1600 users will not be disappointed.
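
The growth figures quoted in this answer (16 times for memory bandwidth and 84 times for pixel pipelines between 1998 and 2004) can be turned into per-year rates with a few lines of Python. This is plain arithmetic on the quoted numbers; it assumes steady compound growth over the six years, which the interview does not claim.

```python
def annual_growth(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor."""
    return total_factor ** (1.0 / years)


YEARS = 2004 - 1998
MEMORY_GROWTH = 16   # memory bandwidth, as quoted
PIXEL_GROWTH = 84    # pixel pipeline performance, as quoted

memory_per_year = annual_growth(MEMORY_GROWTH, YEARS)
pixel_per_year = annual_growth(PIXEL_GROWTH, YEARS)

# How much further arithmetic throughput moved than bandwidth overall.
gap = PIXEL_GROWTH / MEMORY_GROWTH

if __name__ == "__main__":
    print(f"memory bandwidth: x{memory_per_year:.2f} per year")
    print(f"pixel pipelines:  x{pixel_per_year:.2f} per year")
    print(f"arithmetic outgrew bandwidth by a factor of {gap:.2f}")
```

Under that assumption, bandwidth grew by roughly 1.6 times a year and pixel throughput by roughly 2.1 times a year, leaving about five times more arithmetic capacity per byte of bandwidth in 2004 than in 1998. That is consistent with the shift toward higher arithmetic-to-texture ratios described above.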

iXBT.com: Why did ATI mention GDDR4 in the list of supported memory formats for R520? Is it a mistake, or is this support planned in a future standard, which is not ready yet?

ATI has been cooperating closely with memory manufacturers on developing new standards for many years. So there is nothing surprising about some chips supporting standards that are still in development. It means that these chips are used to verify and fine-tune new technologies. As soon as the technology is available, we will be able to introduce it into real products quickly and painlessly. The flexibly programmable architecture of the memory controller allows us to support any existing GDDR standard, as well as new standards still in development.

iXBT.com: Will the R520-based All-in-Wonder support audio-out via DMI?

I'm afraid you'll have to wait for the launch of this card to learn its capabilities.

iXBT.com: What innovations are planned for this architecture in future? Will the number of pixel and texture processors grow in the next product based on this architecture?

We keep tabs on graphics requirements, and the current architecture is flexible enough to be adapted to whatever requirements arise. I promise that future solutions will be faster and better, but I cannot reveal how we are going to achieve that.

iXBT.com: Is there support for the OpenGL extension GL_ARB_texture_non_power_of_two?

This extension is part of the OpenGL 2.0 standard. All cards, starting from the Radeon 9700, support this extension in hardware at least partially. But even X1000 cards don't support some features of this extension in hardware. This may lead to the driver temporarily switching to software emulation, as all OpenGL functionality must be supported regardless of performance. This mode change may have a negative effect on performance, but that's not a big problem. We cooperate with many developers. They know how to squeeze maximum performance and functionality from our products.

iXBT.com: What is the number of texture indirections? And why does the ARB_fragment_program extension report "Max. native texture indirections" as 4?

I'll start with the second part of the question. There are several ways to write shaders. It all started with extensions for programming in a special shader assembler with an instruction set close to the real hardware instructions. Such an approach was optimal for the relatively simple and short shaders used at the dawn of the shader era. It allowed a good balance between programming effort and the resulting performance. ARB_fragment_program is an example of such an extension. As shaders grew more complex, the need arose for higher-level languages to simplify the development process. In OpenGL, this language is called GLSL, and it is an integral part of version 2.0 of the API. Higher-level languages for programming modern shaders are very promising. That's why the majority of our driver development effort goes into improving the GLSL compiler instead of adding new features to conceptually obsolete extensions. For example, the ARB_fragment_program extension lacks a lot of the GLSL programming features available on the Radeon X1000 series: branching, specifying a mip-map level for a texture fetch, polygon orientation info, etc. Assembler extensions are preserved for backward compatibility. But spending effort on them, knowing that there will be no new projects requiring new functionality, is just not rational.
As for the number of texture indirections, there are no limitations on the dependence level in the Radeon X1000 series. You can implement as many dependence levels as you can fit into 512 instructions.

iXBT.com: How do ToyShop & ParallaxMapping demos (fresh SDK version) demonstrate such a large performance gain compared to GF7800? Is this the "fault" of dynamic branching only?

As I have already said, efficient dynamic branching in pixel shaders can help optimize performance significantly. Besides, dynamic branching helps implement such interesting technologies as Parallax Occlusion Mapping. The Toy Shop demo uses dynamic branching for both purposes: optimizations and special effects. Together with texture compression, branching helps us leave our competitors far behind in performance. By following our advice on how to optimize and implement graphics effects, developers can unlock the full potential of the platform and demonstrate what our cards are really capable of.
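
The role of dynamic branching in an effect like Parallax Occlusion Mapping can be sketched with a simplified ray march through a height field. The Python model below is an illustration only, not the Toy Shop shader: it uses a 1D height map, nearest-texel lookups, a fixed step count, and hypothetical function names. The point is the early exit: with a dynamic branch the loop stops at the first intersection, while without one every step is still sampled.

```python
def height_at(heights, u):
    """Nearest-texel lookup into a 1D height map, with u clamped to [0, 1]."""
    u = min(max(u, 0.0), 1.0)
    index = min(int(u * len(heights)), len(heights) - 1)
    return heights[index]


def march(heights, u_start, du_total, steps, early_out=True):
    """Step a view ray from height 1.0 down to 0.0 across the height field.

    Returns (hit_u, samples_taken); hit_u is None if nothing was hit.
    """
    hit_u = None
    samples = 0
    for i in range(steps + 1):
        t = i / steps
        u = u_start + du_total * t
        ray_height = 1.0 - t
        surface = height_at(heights, u)  # one height map fetch per step
        samples += 1
        if hit_u is None and surface >= ray_height:
            hit_u = u
            if early_out:
                break  # dynamic branch: skip the remaining steps
    return hit_u, samples


if __name__ == "__main__":
    bumps = [0.2, 0.2, 0.9, 0.2]
    print("with early exit:   ", march(bumps, 0.0, 1.0, 8, early_out=True))
    print("without early exit:", march(bumps, 0.0, 1.0, 8, early_out=False))
```

Both calls find the same intersection, but the early-exit version takes fewer height map samples. On real hardware the saving depends on how coherently neighbouring pixels branch, which this sketch does not model.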

iXBT.com: How well is the hardware tessellation unit implemented in the Xbox chip? Can we get more details about it? How is this functionality emulated on a regular PC?

I advise you to address all your questions about Xbox to Microsoft.

iXBT.com: What did you fail to implement in the X1000 chips from the initial plans? Should we expect these features within the next year?

We have implemented all the functions we planned, keeping what we consider a reasonable balance between functionality and performance. I wouldn't count on any radical changes in the architecture before the release of DirectX 10.

iXBT.com: Do you plan to grant public access to the hardware-accelerated DS filter for video decoding/processing?

Yes and no. It depends on the interpretation of "public access". We have a special SDK called "Cobra", which grants access to DirectShow filters with hardware encoding/decoding. These filters can be used to develop a transcoder that works faster than real time. For now, we have no plans to distribute this SDK to developers from our web site, but we grant access to it to seriously interested developers on request.

iXBT.com: X1000 was planned to have H.264 decoding. What's the current situation?

Hardware H.264 decoding is built into Radeon X1000 series chips. We demonstrated a working sample of this solution long ago (in May, if memory serves). Our developers are about to finish fine-tuning and optimizing the drivers. Users will soon have an opportunity to take advantage of hardware decoding. After the release of our drivers with H.264 support, we also expect software developers to produce various multimedia applications accelerated on our cards.

iXBT.com: Do you plan to launch professional modifications of X1000?

We have launched professional solutions after gaming cards for many generations of video cards. I see no reason not to do the same with the Radeon X1000 series.

iXBT.com: Do you plan on adding SM 3.0 to integrated video?

I think we'll come to integrated solutions with SM 3.0 support in due course, but I cannot really say when that will happen. The fact is that SM 3.0 support in chipsets is currently just a checkmark in a list of marketing excesses. Executing long SM 3.0 shaders with various bells and whistles in real time is simply impossible, given the inherently very low performance of integrated video compared to discrete video. PS 2.0 and even PS 1.x are quite sufficient for the shaders that modern chipsets can really execute. So why waste extra transistors on what cannot actually be used? Such a solution would just raise chipset prices and thus reduce profitability and competitiveness.

We express our thanks to Guennadi for finding time in his tight schedule to give us detailed answers. We also express our gratitude to Nikolay Radovsky (ATI Russia) for his help in arranging this interview.

Andrey Vorobiev (anvakams@ixbt.com)
Alexander Medvedev (unclesam@ixbt.com)

November 22, 2005.
