Asymmetric Repartee. New technology: ATI CrossFire

On May 24, at the height of a hot spring, ATI held a conference in Moscow devoted to the technology reviewed in this article, details of the new Xbox 360, and other equally useful things. It was a great event, and I'd like to thank Nikolai Radovski and the other company representatives for the useful information and competent answers to our questions!
And now let's proceed straight to the heart of the matter:

ATI CrossFire is the official name of the Canadian answer to NVIDIA SLI, which was rumoured on technical forums all over the Internet as far back as six months ago. Are there any differences? Certainly. Are there advantages? To all appearances, yes, and quite significant ones. We shall soon publish our tests and an analysis of image quality; for now we are going to review the theoretical and architectural aspects and try to forecast trends and consequences.

General architecture of CrossFire.

The main objective of this technology is to make two video accelerators render as a team. The architecture must be not only effective (high efficiency, low cost of additional circuitry, availability to common users and enthusiasts) but also convenient to use (compatibility with existing programs and even with existing hardware, transparency, simplicity, and reliability). That is a lot of requirements, but, running a few paragraphs ahead, I want to congratulate ATI on its thorough and well thought-out approach to these tasks. So, we are offered the following architecture:

Several accelerators (the consumer version comprises two cards) each render their own part of the image and output it via TMDS transmitters in the conventional DVI format. The data then goes to a black box (the red box in the diagram) called the Composing Engine. This device combines the rendering results into the aggregate image. The red box also outputs a standard DVI signal, but this time it is the final frame, assembled from the two portions of data calculated by the two VPUs. To resolve synchronization problems, the Composing Engine contains its own buffer storage, which lets it accumulate data asynchronously and then assemble and output the resulting frame once both accelerators are ready. In other words, there is no need for precise synchronization of VPU operations. There are only two conditions: each VPU must know which part of the data to render, and each VPU must finish transferring its rendered data into the Composing Engine. After that, the frame is sent to the output device in DVI format or, if an analog signal is needed, to an external graphics DAC that converts the DVI stream into a standard analog VGA signal.
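The buffering idea can be illustrated with a small Python sketch. This is only a toy model under our own assumptions: the class name ComposingEngine, the submit method, and the representation of a frame as a list of (row index, row data) pairs are invented for illustration and say nothing about how the real hardware is organized.

```python
class ComposingEngine:
    """Toy model of the 'red box': buffers the parts of a frame delivered
    by each VPU and releases the frame only when all parts have arrived."""

    def __init__(self, vpu_count=2):
        self.vpu_count = vpu_count
        # frame number -> {vpu id -> list of (row index, row data)}
        self.pending = {}

    def submit(self, frame_no, vpu_id, rows):
        """Accept one VPU's rows. Return the assembled frame (row data in
        top-to-bottom order) or None if some parts are still missing."""
        parts = self.pending.setdefault(frame_no, {})
        parts[vpu_id] = rows
        if len(parts) < self.vpu_count:
            return None  # the other VPU has not finished yet
        del self.pending[frame_no]
        merged = sorted(
            (row for part in parts.values() for row in part),
            key=lambda item: item[0],
        )
        return [data for _, data in merged]


engine = ComposingEngine()
# The VPUs may finish in any order; here the lower half arrives first.
early = engine.submit(0, vpu_id=1, rows=[(2, "ground"), (3, "ground")])
frame = engine.submit(0, vpu_id=0, rows=[(0, "sky"), (1, "sky")])
print(early)  # None
print(frame)  # ['sky', 'sky', 'ground', 'ground']
```

The point of the sketch is that neither VPU waits for the other: each simply hands over its part, and only the buffer decides when a frame is complete.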

And now the most vital question – how are VPUs going to share the data to be calculated? A little theory:

The main interaction algorithms of accelerators.

We can easily single out three main algorithms used nowadays for this purpose in various consumer and professional solutions:

Scissor (aka Slicing). This solution is used in modern NVIDIA SLI and in many specialized solutions, such as flight simulators (several simulator windows, an aircraft model), large multi-display information walls, etc.

With two VPUs, the final frame is split into two zones, an upper and a lower one. Interestingly, the boundary between the zones need not divide the frame in the middle; it can be moved dynamically, depending on the complexity of each image fragment. Roughly speaking, the upper part of an image (sky) may contain fewer objects than the lower part, so one of the accelerators would sit idle; this can be compensated by enlarging its responsibility zone. Such dynamic balancing is not a trivial task: it requires scene analysis, which is not always convenient. This method suits applications that are balanced in terms of geometry calculation and fill, because ideally (with correct adaptive division of the frame into responsibility zones) it distributes both geometric and pixel load equally between the two accelerators.
Tiling. This is the most convenient and transparent method to organize: the accelerators calculate alternating lines (as in SLI from 3dfx, where odd and even lines take turns), or pixels in a chess-board pattern (which is practically the same), or neighboring AA samples within the same resulting pixel. The fill load is thus distributed evenly, no matter what a given scene looks like. But the geometric load has to be replicated: both accelerators calculate the same geometry. So if an application is not limited by the geometric performance of an accelerator (which is true of nearly all games these days), this method can provide a serious gain in fill performance, up to twofold (given a twofold reserve of idle geometric performance). We thus distribute pixel operations evenly, at almost 100% efficiency, without any noticeable compatibility problems or difficulties in balancing and splitting the data stream. This method requires minimal driver intervention, is transparent to applications, and currently seems the optimal solution for consumer gaming, especially considering the growing number of applications with heavy pixel load and shader special effects. Moreover, this method can be used for efficient FSAA based on averaging samples calculated by different accelerators; in addition to the MSAA implemented in each VPU, this provides supersampling (SSAA), which can resolve some problems that MSAA does not handle very efficiently.
Alternate Frame Rendering. Here the accelerators render whole frames in turn. This method has been familiar since the very first multi-chip ATI solution in the consumer segment, the RAGE Fury MAXX. It is good for applications that are limited by the geometric performance of an accelerator and are not sensitive to uneven frame pacing. That is rare in games these days, but it can occur in DCC/CAD/CAM/CAE applications (for example, when you interactively edit models in applications for creating realistic 3D graphics).
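To make the difference between the three schemes concrete, here is a minimal Python sketch of how work might be assigned to VPU 0 and VPU 1 in each case. The function names, the tile size parameter, and the 8x8 test frame are our own assumptions chosen for illustration; real drivers and hardware are certainly more elaborate.

```python
def scissor_owner(x, y, split_row):
    """Scissor: VPU 0 owns every row above split_row, VPU 1 the rest."""
    return 0 if y < split_row else 1


def tiling_owner(x, y, tile=1):
    """Tiling: chess-board of tile x tile squares (tile=1 means single
    pixels). The tile size is an illustrative parameter."""
    return (x // tile + y // tile) % 2


def afr_owner(frame_no):
    """Alternate Frame Rendering: whole frames take turns."""
    return frame_no % 2


def pixel_share(owner, width, height):
    """Count how many pixels of a width x height frame each VPU receives."""
    counts = [0, 0]
    for y in range(height):
        for x in range(width):
            counts[owner(x, y)] += 1
    return counts


# A split line near the top gives VPU 0 a small zone (cheap sky, say).
print(pixel_share(lambda x, y: scissor_owner(x, y, split_row=2), 8, 8))  # [16, 48]
# The chess-board always divides the pixels evenly, whatever the scene.
print(pixel_share(lambda x, y: tiling_owner(x, y, tile=2), 8, 8))  # [32, 32]
print([afr_owner(n) for n in range(4)])  # [0, 1, 0, 1]
```

Note that the sketch only counts pixels: in Tiling both VPUs would still process the full geometry of the scene, which is exactly the cost the text describes.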

So, let's sum up pros and cons of the above methods:

Method	Pros	Cons
Scissor (Slicing)	Distributes both geometric and pixel load; high degree of VPU asynchronism; each accelerator completely owns its image responsibility zone	Requires on-the-fly zone balancing to distribute the load evenly; there may be problems with AA at zone boundaries; requires considerable driver intervention, with a high probability of glitches in some applications
Tiling + SuperAA	Distributes pixel load equally; accurate load balancing between VPUs; can be used for new AA methods (SSAA); transparent to applications and requires almost no driver modifications, so glitches in applications are unlikely	Does not distribute geometric load and thus requires a significant reserve of geometric performance; requires synchronous operation of the accelerators and therefore no differences in their performance and other characteristics
Alternate Frame Rendering	Distributes both pixel and geometric load; geometry data transferred over the bus is not duplicated, since different accelerators get different data sets; each accelerator is fully responsible for its frame, so there are no seams even with complex post-processing and no limitations on the rendering method	Uneven frame pacing and load distribution; efficiency depends heavily on the CPU and the system as well as on the scene, and drops as FPS increases; considerable lag between the frame we see and the frame currently being rendered
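The on-the-fly zone balancing listed among the drawbacks of Scissor can be approximated by a simple feedback rule. The Python sketch below is our own assumption about how such balancing might work, not a description of any vendor's driver: it measures how long each VPU spent on the previous frame, estimates the split row that would have equalized the two times, and moves part of the way toward it. The function name and the gain parameter are hypothetical.

```python
def rebalance_split(split_row, height, time_top, time_bottom, gain=0.5):
    """Return a new split row for the next frame.

    split_row   -- rows [0, split_row) belong to the top VPU
    time_top    -- time the top VPU spent on the last frame
    time_bottom -- time the bottom VPU spent on the last frame
    gain        -- fraction of the correction applied per frame (0..1)
    """
    rows_bottom = height - split_row
    denom = time_top * rows_bottom + time_bottom * split_row
    if denom == 0:
        return split_row  # no timing information, keep the split
    # Split that equalizes the times if cost per row stays constant
    # within each zone (a crude assumption).
    ideal = time_bottom * split_row * height / denom
    new_split = round(split_row + gain * (ideal - split_row))
    return max(1, min(height - 1, new_split))  # keep both zones non-empty


# Top half (sky) took 4 ms, bottom half took 12 ms on a 1200-row frame:
print(rebalance_split(600, 1200, 4.0, 12.0))  # 750
```

A gain below 1 damps the correction so that the boundary does not jump back and forth when the scene changes from frame to frame; the real difficulty, as noted in the table, is that per-row cost is far from constant.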

Which one did ATI's specialists choose? Stay with us; we shall return to this question later. And now let's proceed to the hardware specifics of CrossFire. How did ATI implement the above-mentioned "red box" in practice? Like this:

CrossFire specifics.

So, we have two video cards, a regular ATI card and a special ATI card with CrossFire technology, installed in a system with two PCI-Express graphics slots:

That's the reason why we entitled the article "Asymmetric repartee" ;-) It turns out that ATI engineers decided to place the above-mentioned "red box" (C Engine on the diagram) on a single card, the main one, and to feed it data from the second card via the regular external DVI connector. Thus they designed a solution that is compatible with existing cards manufactured before CrossFire! Isn't it great: if you already have a PCI-Express video card from ATI with a DVI output, all you need to get a super system is to buy an additional special CrossFire card, connect the DVI output of the old card to the new card with the special bundled cable, and off you go. The new card will output the image, assembled by the Composing Engine from the results calculated by both cards, in DVI or analog VGA format.

The CrossFire card is equipped with a special connector, which resembles DVI but has more pins. It's marked as DMS on the diagram. This connector accepts the DVI signal from the first video card; it also outputs the resulting image, assembled by the red box, as DVI and analog VGA. Besides, the original card still has a vacant second output (DVI+VGA or just VGA) as well as TV-Out, and the CrossFire card also has a second DVI+VGA output. All the outputs not involved in rendering the image can certainly be used for additional monitors and other standard purposes in "peace-time", when no games are played. But they cannot output the image calculated by both accelerators in CrossFire mode; that image is available only on the DMS connector.

And now the most interesting question. Attention, please. What algorithm did ATI choose for its "red box" to split the image?

The correct answer is any of the three described above!

The red box on a CrossFire card is not a special chip with a hardcoded operating algorithm, but a small all-purpose chip with a programmable gate array. This small chip contains a flexibly configured circuit of logic elements plus buffer storage for intermediate results; its algorithm is dictated by the drivers, which upload the corresponding interconnection scheme. ATI has currently implemented all three methods described above. But that doesn't mean there will be no improved or hybrid load distributions in the future. All we'll ever need is a driver update. Again I cannot help congratulating ATI engineers on this smart solution: their approach considerably reduced the design and introduction costs of CrossFire, it makes it possible to choose the optimal mode (among those available) for each application in terms of efficiency, and it thus protects our investment in a multi-chip solution against the whims of games and applications.
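The per-application choice of mode could be sketched in Python as a profile lookup with a safe default. Everything named here is hypothetical: the Mode enumeration, the PROFILES table, and the executable names are invented for illustration and do not reproduce ATI's driver or its actual application list. The fallback to Tiling for unknown applications follows the driver behaviour described in this article.

```python
from enum import Enum


class Mode(Enum):
    SCISSOR = "scissor"
    TILING = "tiling"
    AFR = "alternate frame rendering"


# Hypothetical profile table: application -> mode found optimal in testing.
PROFILES = {
    "flight_sim.exe": Mode.SCISSOR,
    "cad_viewer.exe": Mode.AFR,
}


def select_mode(executable, user_override=None, profiles=PROFILES):
    """Pick a rendering mode: an explicit user choice wins, then a tested
    profile, then Tiling as the most application-transparent default."""
    if user_override is not None:
        return user_override
    return profiles.get(executable.lower(), Mode.TILING)


print(select_mode("Flight_Sim.exe").name)         # SCISSOR
print(select_mode("unknown_game.exe").name)       # TILING
print(select_mode("cad_viewer.exe", Mode.TILING).name)  # TILING
```

Since the Composing Engine is reprogrammed by the driver, adding a new mode in such a scheme would amount to a new enumeration member and a new uploaded configuration, with no change to the cards themselves.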

Thus, when you use CrossFire:

You can keep using the old video card already installed in your system. But you have to buy a second, CrossFire card and a motherboard with two PCI-Express graphics slots (if you don't have one already).
You can choose the optimal interaction method between the accelerators for each application. The choice can be left to the driver: it will consult the list of applications already tested by ATI, for which optimal settings have been selected, or, if the application is unknown, it will set Tiling, the most reliable method in terms of transparency to applications. Or you can choose the mode yourself, having experimented with a given application, with an eye on either efficiency or maximum image quality.
You may get new modes and interaction methods in the future.
You can enable and disable CrossFire, as well as change its modes, on the fly, without rebooting the system.
You get new AA methods, in which 2x SSAA (averaging of results in the Composing Engine) is added to the 2-, 4-, or 6-sample MSAA in each chip. The result is a hybrid formula already familiar from NVIDIA products. In ATI's case there are (so far) two new modes, SS2x(MS4x) and SS2x(MS6x), which for some reason are called "10xAA" and "14xAA". That's not quite correct ;-) They should have been called "2*4xAA" and "2*6xAA". It goes without saying that in these modes the MSAA samples must be located differently on the first and the second accelerator; only then does this antialiasing make sense. But as we know, the sampling pattern in ATI chips is flexibly specified on a 4x4 grid, so two sets of 6 samples can be placed there without overlapping.
You can pair video cards from different manufacturers (for example, ASUS and Sapphire in tandem)!
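The hybrid AA modes can be illustrated with a little arithmetic in Python. The sketch assumes the 4x4 sample grid mentioned above; the two 6-sample patterns are made up for the example (they are NOT ATI's real sample positions), and coverage of a pixel by a primitive stands in for the full colour calculation. Each accelerator resolves its own 6 samples, and the Composing Engine is modelled as a plain average of the two results.

```python
GRID = 4  # candidate sample positions form a GRID x GRID lattice in a pixel

# Hypothetical, mutually disjoint 6-sample patterns (one per accelerator).
PATTERN_A = frozenset({(0, 0), (2, 0), (1, 1), (3, 2), (0, 3), (2, 3)})
PATTERN_B = frozenset({(1, 0), (3, 1), (0, 2), (2, 2), (1, 3), (3, 3)})


def coverage(pattern, inside):
    """Fraction of the pattern's samples that fall inside a primitive.
    `inside(u, v)` tests a point with pixel-relative coordinates 0..1."""
    hits = sum(
        1 for gx, gy in pattern
        if inside((gx + 0.5) / GRID, (gy + 0.5) / GRID)
    )
    return hits / len(pattern)


def super_aa(inside):
    """Average of the two accelerators' resolved results."""
    return (coverage(PATTERN_A, inside) + coverage(PATTERN_B, inside)) / 2


def edge(u, v):
    """A diagonal edge cutting the pixel roughly in half."""
    return u + v < 1.0


print(PATTERN_A.isdisjoint(PATTERN_B))   # True
print(len(PATTERN_A | PATTERN_B))        # 12
print(coverage(PATTERN_A, edge), coverage(PATTERN_B, edge), super_aa(edge))
```

Because the two sets do not overlap and are the same size, averaging the two resolved values gives the same coverage estimate as a single 12-sample pattern would, which is why the sample positions must differ between the cards for the mode to make sense.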

The current limitations of this technology:

At first, this technology will be available only for X800 and X850 cards. Note that regular X800 cards require an X800 CrossFire card, and X850 cards correspondingly require an X850 CrossFire card.
Any cards from these series can be used together (any X800 with an X800 CrossFire and any X850 with an X850 CrossFire), but the number of pipelines is limited to the common minimum: if one card has 12 pipelines, the second card, even if it has 16, will operate in CrossFire mode with 12. This is done to balance performance.
The cooperative rendering mode outputs only to a single monitor.
So far, the company guarantees (!) compatibility only with motherboards based on ATI Xpress 200 chipsets. But as the technology is tested and refined, motherboards from other manufacturers will also be certified. In theory there should be no problems with such combinations, but specific incompatibilities may turn up.
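The pairing rules from this list are simple enough to write down as a check. The Python sketch below is just a restatement of the two rules given above (same series for both cards, pipelines limited to the common minimum); the function name and the way cards are described are our own assumptions.

```python
SUPPORTED_SERIES = {"X800", "X850"}  # the series named for the initial launch


def crossfire_pipelines(regular_series, regular_pipes, master_series, master_pipes):
    """Return the number of pipelines each card would use in CrossFire mode,
    or None if the two cards cannot be paired under the stated rules."""
    if master_series not in SUPPORTED_SERIES:
        return None
    if regular_series != master_series:
        return None  # an X800 needs an X800 CrossFire card, and so on
    return min(regular_pipes, master_pipes)


print(crossfire_pipelines("X800", 12, "X800", 16))  # 12
print(crossfire_pipelines("X800", 16, "X850", 16))  # None
```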

Prospects of this technology:

It can easily be adapted to other existing (X700 and the like) and future ATI solutions. In fact, any new flagship card from ATI may be released together with a modification supporting this technology.
New motherboards with two graphics slots will be tested and certified as compatible, including ones based on chipsets from Intel and probably even NVIDIA.
It looks like this technology can be scaled up. It's no secret that, by analogy with processors, multi-core or multi-chip accelerators in a single package may appear in a couple of years, which would allow 2*2 schemes (two cards with two accelerators each).

Prices, dates, forecast

And now a few words about down-to-earth specifics. Prices and availability for a start:

CrossFire cards should appear in stores as early as late June or early July.

ATI provides the following performance data on solutions with two video cards, CrossFire X850 XT compared to NVIDIA SLI 6800 Ultra (attention: two cards are used in both cases):

[Performance chart: 1600x1200, 4xAA, 8xAF]

We shall refrain from comment until we obtain our own results on the performance and quality of this technology. For now, we can note that SLI works with only a limited (greatly limited) number of games, so CrossFire heavily outscores it here. SLI also requires buying two new video cards, which again compares poorly with CrossFire. ATI's technology can (potentially) be used by almost a million owners of X800 and X850 series cards, with no need to sell the old card.

The two most vital questions. Will ATI manage to retain this technological leadership? The next generation of NVIDIA products may, one way or another, add the best discoveries of the Canadian experts to its arsenal. And why is this technology called CrossFire: isn't it a reference to the Chrysler model of the same name? ;-)

It goes without saying that much will depend on the price/performance ratio (in specific games), as well as on image quality and compatibility issues. We are going to analyze all these aspects in the near future, but for now let's draw a preliminary conclusion:

ATI engineers have designed an advantageous, flexible, and convenient multi-chip rendering architecture intended for end users and games. On paper, the prospects of CrossFire look better than those of NVIDIA SLI, and the architectural solution can (and should) be recognized as more refined and better thought-out. Its assets include compatibility with existing video cards and all applications, as well as a flexible choice of collaboration mode. Of course, this technology is aimed at a narrow segment of enthusiasts and will not yield super profits. But we shouldn't forget that the overall leadership CrossFire can bring will certainly have a positive effect on mainstream ATI sales. And technological leadership in such a field is a no less tangible and valuable contribution to the company's image.

Alexander Medvedev (unclesam@ixbt.com)

May 31, 2005.
