DX Current: The Present State of Hardware Graphic Acceleration

What makes this article special is that it has no introduction or conclusion. Perhaps it symbolises the continuity of technical progress in this sphere. Or perhaps it doesn't symbolise anything.

Three commandments

There are three factors that have the primary influence on the architecture of modern hardware graphic solutions:

Highly parallel graphic algorithms.
Streaming image-building.
Graphic algorithms focusing on calculations, not on decision making.

Although these sentences may sound trite, I suggest you meditate on them as it is there that the understanding of modern graphic architectures and their development is hidden.

Meditation on them

Most primary data modern computer graphics deals with (vertices, vectors, colour values) are vectors. Interestingly, their dimension almost never exceeds 4. And it is no less interesting that statistically, most operations executed by an accelerator are usually vector ones, 3D or 4D. That is why modern accelerators almost exclusively consist of 4D vector ALUs that execute one operation on four components of this or that format. To illustrate it:

We need to add two colours. Each of them is a 4-component vector: R, G, B, A — red, green, blue, and optional alpha coefficients of transparency degree.

	R	G	B	A
Colour1	0	255	0	0
Colour2	100	0	100	0
Operation +	+	+	+	+
Result	100	255	100	0
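The table can be sketched as a single vector operation. This is only an illustration in Python: the function name vec4_add is ours, and the loop models the result of the operation, not the fact that the hardware performs all four additions in one step.

```python
def vec4_add(a, b):
    """Add two 4-component vectors component by component.

    A vector ALU performs the four additions in the same step;
    here we only reproduce the result.
    """
    return tuple(x + y for x, y in zip(a, b))


colour1 = (0, 255, 0, 0)      # R, G, B, A
colour2 = (100, 0, 100, 0)
result = vec4_add(colour1, colour2)
print(result)  # (100, 255, 100, 0)
```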

Obviously, because the data do not depend on each other during these calculations, we can execute them in parallel, that is, in one step. And we don't need four ordinary ALUs to do it: we can use just one vector ALU (the so-called SIMD: single instruction, multiple data) that has common control logic and executes one operation on four sets of source data. But it's all much more complicated in reality. First, the data may begin to interact. For example, we want:

Result.R = Colour1.R + Colour2.G

that is, to compute the red component of the result as the sum of the red component of the first colour and the green component of the second. For this, the ALU must be able to perform an arbitrary commutation (rearrangement) of vector components before they are processed:

	R	G	B	A
Colour1	0	255	0	0
Colour2	100	0	100	0
Commutation 1	Colour1.R (0)	Colour1.G	Colour1.B	Colour1.A
Commutation 2	Colour2.G (0)	Colour2.G	Colour2.B	Colour2.A
Operation +	+	+	+	+
Result	0	255	100	0

Thus, we save the time of data transfers. And although we can make a separate operation to rearrange the vector components instead of making the ALU so functional, our approach is more rational in terms of performance.
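A minimal sketch of such a commutation (often called a swizzle) is given below. The helper names swizzle and vec4_add and the pattern strings are our own illustration, not the syntax of any particular shader language.

```python
INDEX = {"R": 0, "G": 1, "B": 2, "A": 3}


def swizzle(v, pattern):
    """Rearrange the components of v, e.g. "GGBA" puts G into the R slot."""
    return tuple(v[INDEX[c]] for c in pattern)


def vec4_add(a, b):
    """Component-wise addition of two 4-component vectors."""
    return tuple(x + y for x, y in zip(a, b))


colour1 = (0, 255, 0, 0)
colour2 = (100, 0, 100, 0)

# Result.R = Colour1.R + Colour2.G; the other components are added as usual.
result = vec4_add(swizzle(colour1, "RGBA"), swizzle(colour2, "GGBA"))
print(result)  # (0, 255, 100, 0)
```

The rearrangement is part of the operand fetch here, so no separate instruction is spent on moving components around.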

But let us not stop our meditation here, as it's just the beginning. In real graphic algorithms, we may be discontented with the fact that all the components are subjected to one and the same operation. For example, the A component is often processed by different rules:

	R	G	B	A
Colour1	0	255	0	255
Colour2	100	0	100	0
Operation +	+	+	+	*
Result	100	255	100	0

Here we must make up our mind. We can either split this operation into two successive ones or make our ALU execute 3+1 (RGB+A) operations, which is, in fact, equal to having two ALUs: a three-component one and a scalar one. Of course, this is more complex, but it often brings a performance gain, as two operations are executed practically in parallel, within one step.

The next step is to let the operations be assigned to arbitrary components, tied to neither the 3+1 nor the 2+2 scheme:

	R	G	B	A
Colour1	0	255	0	255
Colour2	100	0	100	0
Operation +	+	+	*	*
Result	100	255	0	0

In real tasks, we'll come across the situations where we'll have to process only 2D vectors or scalar values (especially in the chip's pixel pipelines and pixel algorithms), and then we'll be able to optimise our calculations and execute two operations simultaneously. And it will be nice if even in this case half of ALU transistors aren't idle but are busy with some other operations.

Thus, modern graphic ALUs are vector units of up to 4 components that allow an arbitrary rearrangement of components before calculations and can execute different operations on the 3+1 and 2+2 schemes.
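The 3+1 and 2+2 tables above can be reproduced with a sketch in which every component gets its own operation. The function vec4_op and the way operations are passed as a tuple are assumptions made for illustration only.

```python
import operator


def vec4_op(a, b, ops):
    """Apply a separate binary operation to each of the four components."""
    return tuple(op(x, y) for op, x, y in zip(ops, a, b))


ADD, MUL = operator.add, operator.mul

colour1 = (0, 255, 0, 255)
colour2 = (100, 0, 100, 0)

# 3+1 scheme: RGB are added, A is multiplied.
three_plus_one = vec4_op(colour1, colour2, (ADD, ADD, ADD, MUL))
# 2+2 scheme: RG are added, BA are multiplied.
two_plus_two = vec4_op(colour1, colour2, (ADD, ADD, MUL, MUL))

print(three_plus_one)  # (100, 255, 100, 0)
print(two_plus_two)    # (100, 255, 0, 0)
```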

But component parallelism is just the first level, and it limits us to 4 components. The most attractive thing about graphic algorithms is that the objects processed in the graphic pipeline are usually independent of each other. Take, for example, triangle vertices. All three vertices are processed by the same algorithm, and the order of processing is of no importance (while operating on the second vertex, we are never interested in the results or the course of processing of the first). Therefore, nothing prevents us from processing several vertices simultaneously. That is why modern graphic accelerators can have 3 or 4 vertex processors. The picture with pixels is even more optimistic: as we know, there are many more pixels to fill than vertices to transform. So, modern accelerators can process 4 to 8 pixels simultaneously, and future ones will surely process more than that.

There is one important thing here. If the processing algorithm provides for no check of conditions during the execution and the operations on all vertices or pixels are always the same, then it's all simple as that:

We have a control device that follows the program instruction by instruction and prepares a set of ALUs for an operation on several objects (e.g., pixels or vertices) processed in parallel. The device configures the ALUs to execute an instruction, and all the data go through them in parallel, step by step. Then the device configures the next instruction and so on, until the program is exhausted. Such programs can be very long in modern accelerators; they are called pixel or vertex shaders (depending on the objects they deal with).

What is nice about all this is that there can be any amount of data and any number of ALUs (within sensible limits), which enables us to increase the power of the accelerator by simply cloning the ALUs. And we still have one control device, which allows us to use fewer transistors. However, the new standard DirectX shaders (pixel and vertex shaders 3.0) introduce dynamic conditions. Processing can differ depending not only on the constants (which is, in fact, nothing but an illusion of choice, as all vertices and pixels are processed identically) but also on the source data. It means that we can no longer use one control device for all ALUs. Thus, we need to create truly parallel processors that would be guided by a common program but would execute it asynchronously.

And it's here that troubles appear, such as synchronisation issues. Some vertices or pixels can be processed in fewer operations than others. So, what is better: to wait until all objects are processed or to start processing new objects as processors become free? The latter approach makes better use of the hardware but requires more complex control logic and, consequently, more transistors.
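The difference between the two approaches can be estimated with a toy model. Everything in it is an assumption for illustration: the per-object costs are invented, and the two function names are ours. The first function waits for the slowest object of each batch, the second hands a new object to whichever processor becomes free first.

```python
import heapq


def lockstep_time(costs, processors):
    """Process objects in batches; a batch lasts as long as its slowest object."""
    total = 0
    for i in range(0, len(costs), processors):
        total += max(costs[i:i + processors])
    return total


def refill_time(costs, processors):
    """Give the next object to the processor that becomes free first."""
    free_at = [0] * processors  # a list of equal values is already a valid heap
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


costs = [10, 2, 2, 2, 2, 10, 2, 2]  # operations needed by each object (invented)
print(lockstep_time(costs, 4))  # 20
print(refill_time(costs, 4))    # 12
```

The model ignores the cost of the extra control logic, which is exactly the price the text mentions.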

However, a compromise is also possible here. We still have one control device, and each object passes through the same set of operations. But depending on whether a certain condition was fulfilled, the result of each operation is either accepted or discarded without changing the state of the object. Thus, we execute all possible condition branches (both "yes" and "no"), with part of our actions being idle. But we don't lose synchronisation, as all the objects are processed simultaneously, controlled by one device:

Steps:	Object 1	Object 2	Object 3	Object 4
Condition check	Yes	No	Yes	No
If yes +	+	Idle	+	Idle
If no -	Idle	-	Idle	-
If yes *	*	Idle	*	Idle
+	+	+	+	+

The convenience of this approach depends on the share of conditional operations. In our case, 6 operations out of 20 were idle. As graphic data processing programs become more flexible, the profit will diminish, and eventually fully independent processors will be preferable. But for the time being, the compromise seems quite to the point.
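The table can be modelled with a short sketch. The operands (+1, -1, *2, +10) and the starting values are invented; only the pattern of guards follows the table. Every object executes every step, and a result is committed only when the guard matches the object's condition.

```python
def run_predicated(values, conditions, program):
    """Run all steps on all objects in lockstep, discarding masked results.

    program is a list of (guard, op): guard is True (commit when the
    condition holds), False (commit when it does not) or None (always).
    Returns the final values and the number of idle (discarded) slots.
    """
    values = list(values)
    idle = 0
    for guard, op in program:
        for i, cond in enumerate(conditions):
            result = op(values[i])           # every object executes the step
            if guard is None or guard == cond:
                values[i] = result           # result accepted
            else:
                idle += 1                    # result discarded: an idle slot
    return values, idle


program = [
    (True, lambda v: v + 1),    # If yes +
    (False, lambda v: v - 1),   # If no -
    (True, lambda v: v * 2),    # If yes *
    (None, lambda v: v + 10),   # + (unconditional)
]
conditions = [True, False, True, False]
values, idle = run_predicated([1, 1, 1, 1], conditions, program)
# The condition check itself is one more step for every object.
total_slots = (len(program) + 1) * len(conditions)
print(values, idle, total_slots)  # [14, 10, 14, 10] 6 20
```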

The deeper we get into the meditation the more we understand that even object-level parallelism is not the last step to performance increase. We can introduce a rather paradoxical notion of time parallelism and show that trading bad for worse can sometimes bring considerable profit.

This is a typical succession of actions during pixel filling:

Texture 1 value selection.
...
Texture N value selection.
Calculations.
Writing results.

Texture selection (and also filtering, texture coordinate preparation, MIP level estimation and other actions) can take over a hundred steps and last more than a hundred clocks. Of course, we can't be contented with this state of affairs, but it is due to the independence of pixels that are processed in parallel and will be processed later that we can create a more-than-a-hundred-stage-long pipeline which would only deal with texture selection and provide a result per clock if everything is all right. In CPUs, ~20 stages is considered a lot and data processing itself takes not more than 2..3 stages. But the independence of pixels enables an effective use of such long pipelines in graphics, hiding huge latencies of such operations as texture selection. And this is what can be called "time parallelism" :-)

However, modern graphic algorithms become increasingly flexible. Now we have the notion of a dependent texture selection, where selection coordinates are specified separately for each pixel, probably right in the pixel processor, based on previous selections:

Texture 1 value selection.
Estimation of coordinates for a new selection.
Texture 2 value selection based on the estimated coordinates.
More calculations.
Writing results.

Even a very long pipeline won't save us in this situation. Having started to select the second texture value, we can't execute the shader till the selection is completed. So we'll have to wait about a hundred clocks during which we won't even be able to start the second part of the calculations ("More calculations") as they are probably dealing with a value selected from the second texture.

So, what do we do?

Let us get back to time parallelism. We can create a more-than-a-hundred-pixel-long queue of objects prepared for processing beforehand. Then we can take one shader instruction, run ALL our pixels through it, and store the resulting data (for this, we'll need a large pool where intermediate data for the whole queue will be stored). Then we'll take a second instruction and run all our pixels through it and so on, until the shader is completed. Thus, we'll have a situation where an instruction can be executed in a hundred clocks but it will be supplied with a very long pipeline that will give out a result per clock, irrespective of whether other shader instructions are waiting for the previous results or not. This means that texture selection latency, as well as latency of any other operation will cease to exist for us, as we can deal with over a hundred pixels within one instruction (operation) before the next one.
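A rough clock count shows what this buys us. The model below is a deliberate simplification with invented latencies (100 clocks for a texture selection, 1 clock for arithmetic): in the first case each pixel goes through the whole shader and waits for every result, in the second case each instruction is a pipeline that accepts a new pixel every clock.

```python
def one_pixel_at_a_time(latencies, pixels):
    """Each pixel waits for the full latency of every instruction."""
    return pixels * sum(latencies)


def instruction_at_a_time(latencies, pixels):
    """Run the whole queue through one instruction, then through the next.

    A pipeline of depth `lat` delivers the first result after `lat` clocks
    and one more result on every following clock.
    """
    return sum(lat + pixels - 1 for lat in latencies)


# texture 1, coordinate calculation, dependent texture 2, more calculations
latencies = [100, 1, 100, 1]
pixels = 128

print(one_pixel_at_a_time(latencies, pixels))    # 25856
print(instruction_at_a_time(latencies, pixels))  # 710
```

With four instructions, the ideal would be 4 clocks per pixel (512 clocks for the queue); the second figure is close to it, and the gap shrinks further as the queue grows.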

This is the scheme we have:

We'll come back to it later. What is paramount for us now is that independence of pixels/vertices enables us to exploit different variants of parallel processing, making graphic accelerators more effective. To the extent that they can give out multiple new results per each clock even in the case of relatively complex procedures.

The last point we'll dwell upon during our meditation concerns provision of the source data. It is a truly burning issue for GPUs. As we know, memory technologies do not develop as swiftly as computing power, and there is only one pleasant fact (the streaming character of graphic algorithms) that enables us to provide modern accelerators with data. A GPU is like a funnel: a lot of various data at the entrance and one resulting image at the exit. All the incoming data are streaming; they are read serially or almost serially in a more or less convenient way. Consequently, they can be cached, preselected, queued, etc., so that the memory subsystem is not idle and its efficiency is increased. Luckily, modern accelerators have virtually no random access to the memory, as it "kills" caching effectiveness. Thus, in contrast to CPUs, GPU caches are relatively small, separate, and mostly read-only. This makes them increasingly effective: even a frame buffer cache can be divided into two parts, one of which will be read-only, and the other one will be an ordinary write queue.

And this is where our meditation smoothly merges with reality.

Reality and our conceptions of it

Using modern DX9 accelerators of the NV3X and R3XX families as illustrations, we'll see how the principles from our meditation are (or can be) brought to life. Many architecture details mentioned in this article are nothing but our guesses, as accelerator developers are not too willing to share them with anybody. So, read this at your own risk: we don't guarantee anything.

First of all, to understand HOW it all works, let's look at the logic structure of the accelerator (a graphic pipeline):

The way the image is being built:

Vertex data are selected from the memory and get into a preliminary vertex cache. The process is not so easy as it may seem. Modern accelerators support several topologies that store geometry (triangles) in the memory, such as strips, fans, index buffer (where indices of each triangle's three vertices from the common array are specified).
Besides, there is hardware support of a flexible data format: the memory can store not only "classical data" of each vertex (such as coordinates or the normal vector) but any other attribute sets of possible vector and scalar types. There is hardware support of operations with several data flows, when different parts of the vertex's attributes are stored in different arrays. Selection from the memory must be accomplished in several flows in such a case. Geometry selection unit is responsible for all this.
Then each vertex gets into the vertex processor. We'll discuss it in detail a bit later.
After the vertex processor, where they are transformed and lit (that is, processed by a vertex shader or a fixed T&L block), the vertices get into a small intermediate buffer (about 32 vertices in modern architectures) called the "Post T&L vertex buffer". It plays a double role: first, it collects the results that are ready to go to the next pipeline stages and thus reduces the chance that accelerator units sit idle waiting for data. Second, it saves a vertex from a second transformation and processing if it is soon used again. This is often the case with neighbouring triangles: despite different approaches to geometry description, one vertex of each of them will be used several times.
Then the vertices are grouped in threes according to the triangles they belong to, and move to the triangle installation unit where the data necessary for filling the whole triangle undergo a preliminary preparation. It is also here that invisible triangles (those beyond the screen or the specified clipping plane) and the reverse-side ones (if this option is enabled) are removed.
Then the triangle is divided into fragments, some of which are found to be invisible and are removed during a fragment-level Z test (HSR, as we call it). As a rule, the ultimate result of the process is a set of visible (or partially visible) 2x2 pixel fragments, the so-called quads, that are subject to filling. They are the most convenient fragments for quick pixel filling (mostly for mathematical reasons related to the interpolation of texture coordinates). We'll discuss it in detail:
This process is divided into two stages in modern accelerators, and it is largely due to the technologies of fast operations with the Z buffer, such as Z compression and hierarchical visibility control. Modern accelerators mostly use two levels: 4x4 blocks (16 pixels), convenient for storing and compressing the Z buffer in the memory, and 2x2 quads that are to be filled in the pixel processor.

The scheme illustrates how a triangle is divided into fragments. First of all, we divide it into large blocks that must cover it all, even if only one pixel of a block belongs to the triangle. For each block, we calculate beforehand the nearest Z coordinate among its points. Then we try to figure out if we can remove a whole 4x4 block. For this, we use a special hierarchical Z buffer, a small buffer that only stores the maximal (the farthest) Z value for the whole 4x4 block. We compare this value with the nearest Z coordinate of the whole block. If we see that all the points of the 4x4 block that we're going to fill are farther, we can call the block invisible and exclude it. Otherwise, we divide it into 4 quads and calculate Z coordinates for each of them. Now it's time to address the full Z buffer and read and unpack the 4x4 block from it (in case it is not yet stored in our cache). Then we compare the Z values, remove fully invisible quads (if there are any), and send the rest of them to the pixel processor for installation and filling. We supply them with the calculated Z values and a special bit mask that indicates which of the quad pixels are visible and which should be ignored. In the end, we arrive at the following algorithm:
check Z for a 4x4 block of a small auxiliary Z buffer;
remove if the whole block is farther (not visible);
otherwise read and unpack the block from the main Z buffer;
check each quad;
remove if the whole quad is farther (not visible);
otherwise check each quad pixel, execute operations with the stencil buffer, and supply the pixels with the visibility flags;
figure out which of them should be written, and which should be ignored;
send the quad to the pixel processor for installation.
Not very simple, eh?
Although it all depends on the application, at least half of the pixels on average are removed before the filling. Therefore, Z check and calculation should be more productive than filling. Which is exactly the case with modern accelerators as they can check (remove) over 16 points per clock and certainly calculate more Z values than filled pixels.
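The two-level test can be sketched for a single 4x4 block as follows. This is our simplified reading of the algorithm, not a description of any real chip: the function name and data layout are invented, smaller Z is assumed to mean nearer, and the stencil operations and Z compression are left out.

```python
def hier_z_test(tri_z, covered, z_block, coarse_max_z):
    """Test one 4x4 block of a triangle against the Z buffer.

    tri_z        -- 4x4 list of the triangle's Z values (smaller = nearer)
    covered      -- 4x4 list of booleans: the pixel belongs to the triangle
    z_block      -- 4x4 list of Z values stored in the main Z buffer
    coarse_max_z -- farthest stored Z of the block (from the mini Z buffer)
    Returns (quad_x, quad_y, mask) for every quad that must be filled.
    """
    pixels = [(x, y) for y in range(4) for x in range(4) if covered[y][x]]
    if not pixels:
        return []
    nearest = min(tri_z[y][x] for x, y in pixels)
    if nearest >= coarse_max_z:
        return []  # whole block is hidden; the main Z buffer is not even read
    quads = []
    for qy in (0, 2):
        for qx in (0, 2):
            mask = [covered[y][x] and tri_z[y][x] < z_block[y][x]
                    for y in (qy, qy + 1) for x in (qx, qx + 1)]
            if any(mask):  # drop quads with no visible pixel
                quads.append((qx, qy, mask))
    return quads


# Left half of the block already holds near geometry, right half is far.
z_block = [[0.2, 0.2, 0.5, 0.5] for _ in range(4)]
covered = [[True] * 4 for _ in range(4)]
near_tri = [[0.3] * 4 for _ in range(4)]
far_tri = [[0.7] * 4 for _ in range(4)]

print(hier_z_test(near_tri, covered, z_block, 0.5))  # two quads, at x = 2
print(hier_z_test(far_tri, covered, z_block, 0.5))   # []
```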

For example, NV30/NV35/38 can calculate, check, and write 8 Z and stencil buffer values per clock, provided no operation with colour is being done (i.e. the pixel processor is not functioning). It visibly speeds up some image-building algorithms that require separate passes for a preliminary Z and stencil buffer calculation. But pixel processor bandwidth is limited to one quad (4 pixels) per clock. Besides, we shouldn't forget about the MSAA mode, when we need to calculate two or four times more Z values while there's still only one colour value to calculate.
Another interesting thing about the implementation is the small Z buffer we use for a preliminary calculation of 4x4 block visibility. The same buffer is used to clear the Z buffer quickly: we only have to write a special value into it that marks the contents of the main buffer in the local memory as invalid.

It is stored fully on the chip in the R3XX family, which is why some time ago you could buy cheaper card variants with chips that had this function disabled. Thus, some rejected dies were put to use (remember RADEON 9500 and the attempts to make it into RADEON 9500 PRO/9700 and a similar situation with RADEON 9800SE). So, it would be logical to suspect (though there are no official confirmations of it) that some NV3X chips, too, have this buffer on their die, fully or in rather massive pieces. This probably explains why NV34 can do with fewer transistors despite the 4x1 formula, similar to NV31/NV36.
As we know, R3XX fills 2 quads at a time (i.e. can provide up to 8 pixels per clock) and, according to some information, removes up to 64 invisible points per clock (that is, four 4x4 blocks). But in contrast to NV3X, it can't write more than eight Z and stencil buffer values.
Then 2x2 fragments (quads) are sent for the installation of fragments. It is also here that many necessary parameters (texture coordinates, MIP level, installation anisotropic parameters, etc.) are calculated (interpolated) for each of them. And it is also the place where the fact of the 2x2 block plays its optimising role: only base parameter values for the whole block, as well as special dx and dy coefficients are calculated. After it, one parameter set turns into four:

But why is it so complicated? The thing is, as pixel shaders gain complexity, the number of parameters sent and interpolated for each point grows too. There can't be too many interpolators in the chip: the operation is rather costly, and the interpolation of a typical parameter set can take more than one clock, thus slowing down the filling. The use of quads significantly increases the effectiveness of the process.
Interestingly, in practical realisations, parameter interpolation can take place, fully or partially, in the pixel processor, using its special or common resources (ALUs).
After the parameters are installed and interpolated, the fragments are being filled. We'll dwell on it a bit later, only mentioning now that texture selection and filtering are an important part of the process.
When the colour values are calculated, the pixel processor blends (if a corresponding mode is enabled) or simply writes the resulting Z and colour values into the frame buffer. At this stage, additional operations can take place, such as gamma correction or calculation of the farthest Z value of the whole 4x4 block for a correct renewal of mini Z buffer, compression of Z coordinates, etc.
After the image is built, an additional pass can be executed to average the results of a full-screen AA. Sometimes, this process is united with displaying.
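The quad installation step described above, where base parameter values and the dx and dy coefficients turn one parameter set into four, can be sketched like this. The function name and the sample numbers are ours, and perspective correction is left out for brevity.

```python
def expand_quad(base, dx, dy):
    """Turn one parameter set (base value, d/dx, d/dy) into four.

    base holds the parameters at the top-left pixel of the quad; dx and dy
    are the per-pixel increments along each screen axis.
    Order of results: top-left, top-right, bottom-left, bottom-right.
    """
    return [tuple(b + ox * gx + oy * gy for b, gx, gy in zip(base, dx, dy))
            for oy in (0, 1) for ox in (0, 1)]


# One pair of texture coordinates (u, v) for the whole quad.
uv = expand_quad(base=(0.25, 0.5), dx=(0.125, 0.0), dy=(0.0, 0.25))
print(uv)  # [(0.25, 0.5), (0.375, 0.5), (0.25, 0.75), (0.375, 0.75)]
```

Only the base values and the two gradients need a full interpolation; the other three pixels cost a few additions each.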

Now it's time we dwelt on vertex and pixel processors. But before that, let's discuss some peculiarities of practical realisation. This is a schematic of a modern accelerator:

It is the multi-channel memory controller that catches the eye first of all. Instead of one very wide 128/256-bit memory bus, it uses two or four fully independent ones with a 32 and 64 bit width, respectively. What was it done for? Let's take a closer look at the data flows that pass while the accelerator is working. As a rule, textures and geometry are only read, the frame buffer (colour) is only written, and the Z buffer is both read and written. So we have 4 continuous data flows. If we manage to place them (partially, at least) in different controllers, we'll get a substantial gain in data access latency. We won't have to switch the memory back and forth (from read to write modes) and "go bufferhopping". And that, in turn, means that we can make very small and effective caches. Their typical sizes (presumed or found experimentally) would be as follows:

Vertex cache: ~50..100..200 vertices;
Post vertex cache: ~16..32 vertices;
Texture cache: ~32..64 KB (~512 4x4 blocks);
Z and frame buffer cache: 16..32 KB (~256 4x4 blocks);
Mini Z buffer: ~256 KB maximum, but probably less in reality;

And that is an important difference between GPUs and CPUs. The streaming and predictable character of the data enables us to do with small but very effective caches. The data (frame buffer, Z buffer, textures) are often stored and selected in rectangular blocks, which makes memory operations more effective. Most transistors are spent on multiple ALUs and long pipelines, which puts considerable restraints on clock frequencies: a synchronous work of a specialised complex pipeline over 100 stages long requires substantial deviations from an even signal spread. That is why several-GHz frequencies typical for CPUs are yet unachievable for GPUs.
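To see how much even a tiny cache can give on streaming data, here is a toy model of the post T&L vertex buffer mentioned earlier. The FIFO replacement policy, the function name, and the test geometry are our assumptions; real chips may behave differently.

```python
from collections import deque


def count_transforms(indices, cache_size):
    """Count vertex transformations for an indexed triangle list.

    A vertex found in the FIFO cache is reused; otherwise it is
    transformed and stored, pushing out the oldest entry.
    """
    cache = deque(maxlen=cache_size)  # the oldest entry drops out automatically
    transforms = 0
    for idx in indices:
        if idx not in cache:
            transforms += 1
            cache.append(idx)
    return transforms


# 100 triangles forming a strip, written as an index list: (0,1,2), (1,2,3), ...
indices = [v for i in range(100) for v in (i, i + 1, i + 2)]

print(count_transforms(indices, 16))  # 102: every vertex is transformed once
print(count_transforms(indices, 0))   # 300: no cache, every index is transformed
```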

Vertex processor

Here is a sort of summary of our knowledge about modern vertex processors:

	DX9 2.0	R3XX	NV3X
Shader version	2.0	2.0	2.X*
Static branchings	Yes	Yes	Yes
Dynamic branchings	No	No	Yes
Nested loops and subprograms	No	No	Yes
Input registers (vertex attributes)	16	16	16
Constant floating-point registers	256	256	256
Constant integer registers	16	16	**
Constant logical registers	16	16	**
Loop counter	1	1	**
Temporary registers	12	16	32
Address registers	1	1	2
Predicates	No	No	1
Output parameters (texture coordinates)	8	8	8
Clipping plane setup	No	No	Yes
Second facet colour setup	No	No	Yes
Shader code, size, assembler instructions	256	256	256
Assembler instructions executed, maximum	65536	65536	65536

*) in terms of DX9, 2.X means 2.0 + additional capabilities;
**) for these purposes, NV3X allows any constant floating-point register to be used with no restrictions and thus not only emulates this functionality but provides more.

Comments on the table:

Shader version: the one officially supported in DX9 (some additional NV3X functionality, such as a second address register, is only available in OpenGL).
Static branchings: branchings, loops, and subprograms, that only depend on constants specified from outside of the shader.
Dynamic branchings: branchings, loops, and subprograms, based on decisions made right during shader execution (like in the CPU).
Nested loops and subprograms: the possibility to execute them.
Input registers: floating-point vector 4D registers which receive source information about the vertex in processing, selected from the memory by the accelerator.
Constant registers: can be specified from outside, from the application, but can't be changed during shader execution.
Loop counter: a vector register that stores the minimal, maximal, and current values of the loop iteration. In NV3X chips, it can be represented by any constant register, and there are no restrictions on the loop nesting; consequently, several counters can be used simultaneously.
Temporary registers: general-purpose floating-point vector registers for intermediate calculations.
Predicates: a kind of dynamic condition, a flag set in advance as the result of some comparison, which can then influence the execution of specially marked instructions. When an instruction is marked with a predicate, its result is accepted or discarded depending on the flag status. Thus, we can implement small conditions without interrupting the flow of instructions (which is more profitable if there's a pipeline). To illustrate this, let's take the following algorithm:
if a>1 then b=a*2 otherwise b=a;
Using predication, it can be written as:
identify predicate if a>1;
predicate(true) b=a*2;
predicate(false) b=a;
where each line corresponds to only one vertex processor instruction, as the predicate condition is part of it.
Output parameters: eight 4D floating-point registers that are interpolated across the surface of the triangle being filled (see quad installation) and then go into the pixel shader as input parameters during each pixel filling.
NV3X has a useful capability to specify six clipping planes for each vertex individually in the vertex shader, and can specify not only two main colours interpolated across the triangle surface, but two different colours for the reverse facet of the triangle as well.
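The predication example from the comments above can be traced in a few lines. This is a model rather than shader code: the names are ours, and the Python "if" statements only imitate the write mask that accepts or discards the result of an instruction that has already been executed.

```python
def with_branch(a):
    """The original algorithm: a real jump in the instruction flow."""
    if a > 1:
        return a * 2
    return a


def with_predicate(a):
    """The same algorithm as three straight-line predicated instructions."""
    b = None
    p = a > 1           # identify predicate if a>1
    result = a * 2      # predicate(true) b=a*2: always executed...
    if p:
        b = result      # ...but written only when the flag is set
    result = a          # predicate(false) b=a: always executed...
    if not p:
        b = result      # ...but written only when the flag is clear
    return b


for a in (0, 1, 1.5, 3):
    print(a, with_branch(a), with_predicate(a))
```

Both functions return the same value for every input; the second one never changes the order in which instructions are issued.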

In general, it is obvious that while an R3XX vertex processor implements standard 2.0 almost "to a T", an NV3X one all but reaches vertex shaders 3.0. We'll dwell on shaders 3.0 in our DX Next article; as for the current one, we'll only mention that the main difference which somehow wasn't implemented in hardware (or included in 3.0 specs) is the possibility to access textures from the vertex shader.

Other differences in realisation include such subtle things as hardware support of SINCOS or EXP calculations (replaced with macros of several instructions in the base DX9), but we won't go that much into details. The main difference remains dynamic control of shader execution in NV3X.

Evidently, to implement an R3XX vertex processor, we can choose a scheme with common control logic but several parallel ALUs that process several vertices simultaneously, based on the same instruction. There are no dynamic jumps and no predicates; the constants aren't changeable and can be used together. This is what a vertex processor of senior R3XX models looks like:

Apart from common control logic and separated constants, we'll also note two ALUs: a 4D vector one and a scalar one. They can work in parallel, executing up to two different mathematical operations per clock. This strange configuration (5 ALUs on a 4+1 scheme instead of 4 ALUs on a 3+1 scheme) was probably predetermined by the need to execute a scalar product quickly and then generalised to a full-value superscalar execution. Junior RV3XX chips have two sets of ALUs and registers instead of four.

In principle, NV3X must contain separate, independent vertex processors; at least, the presence of dynamic execution control is a good reason to suggest it. However, there are a lot of things that indicate quite another picture. An NV3X vertex processor is close to what we've seen in the latest 3dlabs chip. Let's try to draw its presumed scheme:

Now we'll voice our presumptions. Actually, we have a very wide array of independent scalar floating-point ALUs, a very wide register pool, and a shader instruction written as a VLIW microcode that specifies actions for each next clock for all ALUs. What are the merits and demerits of this approach? On one hand, we can make an optimal distribution of computation resources, processing several vertices, including a maximal use of all ALUs by means of different tricks with the microcode compilation where different operations for different vertices are executed simultaneously. If the shader is less complex, we can process more vertices simultaneously as more registers become free. It's easier for us to include specialised units (e.g. in order to accelerate a fixed T&L) into this scheme. On the other hand, what to do with flexibility? We can't manage all these devices separately, in several flows, depending on dynamic conditions. There are two ways out here: first, we can build all dynamic conditions on the predicates, supplying each line of the VLIW microcode with a set of 4 or even more predicates and creating corresponding registers. Second, we can go farther and realise several (e.g. 2) instruction counters (flows) in the control device. But in most shaders that have no jumps or dynamic control of execution, we can use just one instruction flow although with the maximal efficiency and on the maximal number of vertices. We don't know the truth, but at least, we shared our ideas with you. Vertex processors in junior NV3X chips can be scaled in a very subtle way: starting to narrow down the register pool and reduce the number of ALUs and functional units.

NVIDIA officials say that NV31 and NV34 vertex processors are 2.5 times slower than those of senior models. The NV36 vertex processor has been taken from NV35 without changes; its performance allows no compromises. What was it done for? Obviously, its performance is superfluous for a mainstream gaming card. But don't forget that the same chips serve as a basis for professional NVIDIA Quadro FX solutions, where vertex performance is often crucial. On the other hand, in contrast to the fragment processor, the vertex one uses fewer transistors, so the most effective variant can be installed in a middle-price solution.

Pixel (fragment) processor

We'll start with a table once again:

	DX9 2.0	R3XX	NV3X
Shader version	2.0	2.0	2.X*
Hardware-supported formats	FP	FP24	I12 FP16 FP32
Different textures (samplers)	16	16	16
Input registers (texture coordinates)	8	8	8
Input registers (colour values)	2	2	2
Constant floating-point registers	32	32	512*
Temporary registers	12	32	32
Predicates	No	No	1
Output registers (colour)	1**	4	1***
Shader code, size, assembler instructions	96	160	1024
Texture access instructions executed, maximum	32	32	1024
Arithmetic instructions executed, maximum	64	64+64****	1024

*) In fact, the constants are stored in the shader code and each used constant reduces the available number of instructions.
**) From 1 to the number of MRT rendering buffers supported simultaneously (i.e. up to 4).
***) As a matter of fact, an output register is one of the temporary registers, in contrast to R3XX where it exists separately. Thus, NV3X has a lower effective number of temporary registers.
****) Up to 64 vector and 64 scalar instructions, but with certain conditions (see further).
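
Footnote * can be put into numbers. The sketch below assumes the figures this article gives for NV3X: 1024 microcode slots in total, with each constant taking the place of two instructions. The function and constant names are made up for illustration.

```python
NV3X_SLOTS = 1024        # shader code size from the table above
SLOTS_PER_CONSTANT = 2   # assumption taken from the article: 1 constant = 2 slots


def nv3x_max_instructions(constants_used: int) -> int:
    """Instruction slots left once the constants have been embedded in the code."""
    max_constants = NV3X_SLOTS // SLOTS_PER_CONSTANT
    if not 0 <= constants_used <= max_constants:
        raise ValueError(f"constants_used must be in 0..{max_constants}")
    return NV3X_SLOTS - SLOTS_PER_CONSTANT * constants_used
```

For instance, a shader that uses 32 constants (the DX9 2.0 minimum) would still have 960 slots left for instructions under this model.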

Shader version: the one officially supported in DX9 (again, some additional NV3X functionality, such as simultaneous execution of floating-point and integer operations, is only available in OpenGL).
Hardware-supported formats: data formats that the fragment processor ALUs actually operate on.
Different textures (samplers): the total number of different textures that can be accessed from one shader. Note that even having fewer interpolated texture coordinates, we can select data from different textures using one coordinate vector or calculate all necessary coordinates right in the shader.
Input registers: values of two colours and eight texture coordinates (including a perspective correction), interpolated for each pixel.
Constant floating-point registers: constants loaded into the pixel shader from the application. NV3X has no separate hardware registers for constants; they are stored in the shader microcode, occupying slots reserved for instructions. As one constant occupies the place of two instructions, their number can't exceed 512. Interestingly, the pixel shader code itself is stored in the local memory of NV3X chips and is loaded from there as execution proceeds. The vertex shader code, by contrast, is always kept inside the accelerator.
Temporary registers: general-purpose floating-point vector registers used for intermediate calculations.
Predicates: NV3X pixel processors support predication, which means we can make decisions dynamically even in pixel shaders.
Output registers: ATI supports MRT, so results can be written simultaneously into four identical colour buffers. NVIDIA doesn't support it (one of the most disappointing points), although special instructions allow several values to be packed into one resulting structure of arbitrary data, which mustn't exceed 128 bits per pixel (FP32[4]). However, packing and unpacking cost pixel shader instructions, which is not ideal for practical tasks.
Number of instructions: NV3X can execute up to 1024 instructions of any kind, in any order, with any number of texture accesses and any nesting depth of dependent selections (when a value fetched from a texture is used to calculate the coordinates for the next texture access). In ATI's case, the hardware is much simpler, hence the software side is much more complicated. In total, a shader can make 32 texture accesses, but the nesting depth mustn't exceed 4. It can use up to 64 vector arithmetic instructions and 64 scalar ones, but the additional scalar calculations are merged with the vector ones on a 3+1 scheme, so no 4D vector instruction can be merged with a scalar one.
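
The 3+1 rule for R3XX can be sketched as a small counting exercise. This is a simplified model, not ATI's scheduler: it assumes the instructions are independent of each other and may be paired freely, and the function name is hypothetical. A vector instruction that writes at most three components can share an issue slot with one scalar instruction; a full 4D instruction cannot.

```python
def r3xx_issue_slots(vector_widths, scalar_count):
    """Issue slots needed on a 3+1 ALU under ideal pairing.

    vector_widths: number of components (1..4) written by each vector instruction.
    scalar_count:  number of scalar instructions available for pairing.
    """
    if any(not 1 <= w <= 4 for w in vector_widths):
        raise ValueError("vector width must be between 1 and 4")
    if scalar_count < 0:
        raise ValueError("scalar_count must not be negative")
    # Only vector instructions using 3 components or fewer leave the scalar lane free.
    pairable = sum(1 for w in vector_widths if w <= 3)
    paired = min(pairable, scalar_count)
    # Every vector instruction takes a slot; unpaired scalars take one each.
    return len(vector_widths) + (scalar_count - paired)
```

Under this model two 3D operations, one 4D operation and two scalar operations fit into three slots, whereas two 4D operations and two scalar operations need four.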

So, NV3X looks like the clear leader in specifications and flexibility. But we know all too well that in practice, the picture for pixel shader 2.0 performance is the opposite. The complexity is paid for with long debugging and optimisation of the shader compiler in the drivers (the latest visible increase in shader speed came a few months ago) and with peak performance too.

And now we'll show you the most interesting thing. This is a schematic of an NV35 fragment processor pipeline (we aren't giving you many details as they add nothing to the description of the process):

How does it all work? Several hundred 2x2 quads are processed simultaneously. As far as the fragment processor is concerned, a quad is a data structure that contains the following information for each of its four pixels:

Pixel activity flag (as not all quad points are visible);
Pixel predicate value flag;
Z and maybe stencil buffer values;
Two temporary vector FP32[4] registers that can be divided into four FP16[4]'s.
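
As an illustration, the quad record listed above might be modelled as follows. The field names and the Python representation are ours; the real hardware layout is not known to us. The two helper functions only show why two FP32[4] registers can be traded for four FP16[4] ones: both occupy the same number of bytes.

```python
from dataclasses import dataclass, field
from typing import List

FP32_BYTES = 4
FP16_BYTES = 2
COMPONENTS = 4  # every temporary is a 4-component vector


@dataclass
class Pixel:
    active: bool = True      # pixel activity flag (is this quad point visible?)
    predicate: bool = False  # pixel predicate value flag
    z: float = 0.0           # depth value
    stencil: int = 0         # stencil value (possibly carried along)
    # Two FP32[4] temporaries per pixel.
    temps: List[List[float]] = field(
        default_factory=lambda: [[0.0] * COMPONENTS for _ in range(2)]
    )


@dataclass
class Quad:
    pixels: List[Pixel] = field(default_factory=lambda: [Pixel() for _ in range(4)])


def temp_bytes_per_pixel(fp32_regs: int = 2) -> int:
    """Bytes of temporary storage one pixel carries through the pipeline."""
    return fp32_regs * COMPONENTS * FP32_BYTES


def fp16_regs_in_same_space(fp32_regs: int = 2) -> int:
    """How many FP16[4] registers fit into the space of the FP32[4] ones."""
    return temp_bytes_per_pixel(fp32_regs) // (COMPONENTS * FP16_BYTES)
```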

All the quads pass one by one through this long pipeline consisting of an ALU, two texture modules and two more ALUs. The pipeline is over 200 clocks long; most of them (~170) are needed for texture selection and filtering and are hidden in the texture units. In normal mode, the pipeline is capable of delivering one quad per clock.
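
These figures explain why so many quads must be in flight at once. As a back-of-the-envelope sketch (taking the roughly 200-clock length and the rate of one quad per clock from the text; the function names are ours): a pipeline that accepts one item per clock and takes N clocks to complete holds N items when it is full.

```python
def quads_in_flight(latency_clocks: int, quads_per_clock: float = 1.0) -> int:
    """Quads needed to keep a pipeline of the given length fully occupied."""
    return int(latency_clocks * quads_per_clock)


def pixels_in_flight(latency_clocks: int, quads_per_clock: float = 1.0) -> int:
    """Same figure in pixels: each quad is a 2x2 block."""
    return 4 * quads_in_flight(latency_clocks, quads_per_clock)
```

With a 200-clock pipeline this gives 200 quads, or 800 pixels, which matches the "several hundred quads" mentioned above.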

So, one turn of this giant round can execute (or not execute) the following operations for each of the four quad pixels:

Full-size ALU: one 4D vector operation of any complexity from an available set (exceeds PS 2.X specs) in FP32 or FP16 formats
Selection and filtering of two texture values
Simpler ALUs (marked with * in the scheme): two simple vector operations (addition, multiplication) in floating-point or I12 formats, as well as any operation corresponding to the functions of FFP combiners (stages).

Then, if two texture selections and three operations aren't enough, the quad can go round through the quad buffer and re-enter the pipeline. In modes compatible with older applications (FFP and shaders up to 1.3), the integer capabilities of the mini ALUs are used, while the floating-point ALU helps to calculate and interpolate texture coordinates. In this case, two temporary registers per pixel are always enough (a limit set by the specifications).
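
A rough lower bound on the number of turns follows directly from these per-turn limits. The sketch assumes ideal packing (no dependency stalls, every operation simple enough for whichever ALU happens to be free, no pressure on temporary registers), so real shaders will often need more turns; the function name is hypothetical.

```python
import math

TEXTURE_UNITS_PER_TURN = 2  # two texture modules in the pipeline
ALUS_PER_TURN = 3           # one full-size ALU plus two mini ALUs


def min_turns(texture_fetches: int, alu_ops: int) -> int:
    """Lower bound on pipeline turns for one quad under ideal packing."""
    if texture_fetches < 0 or alu_ops < 0:
        raise ValueError("counts must not be negative")
    return max(
        1,  # even an empty shader passes through the pipeline once
        math.ceil(texture_fetches / TEXTURE_UNITS_PER_TURN),
        math.ceil(alu_ops / ALUS_PER_TURN),
    )
```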

Now let's examine how shaders 2.0 (and 1.4) are executed.

Pixel shader microcode instructions are fetched one by one from the local memory of the card. One microcode line configures the whole pipeline, including the two texture selection and filtering modules and the three ALUs, identically for each of the four quad pixels. Then all the quads go through the pipeline. After this, the pipeline is configured for a new set of operations and the whole thing repeats. Thus, the shader compiler has its work cut out: it must try to gather shader instructions into packs that use the pipeline resources as fully as possible. Take the following shader code:

Calculations
Texture 1 selection
Calculations with its results
Texture 2 selection
Calculation of coordinates based on the result of the second selection
Texture 3 selection based on the found coordinates
Result calculations
Result calculations
Result write

It will be grouped into two microcode lines (the brackets contain the number of the source shader line):

      ALU: Calculations (1)
      TEX1: Texture 1 selection (2)
      TEX2: Texture 2 selection (4)
      Mini ALU1: Calculations with TEX1 results (3)
      Mini ALU2: Idle (not fit for the complex operation 5)

      ALU: Calculation of the coordinates based on the second selection results (5)
      TEX1: Texture 3 selection based on the found coordinates (6)
      TEX2: Idle
      Mini ALU1: Result calculations (7)
      Mini ALU2: Result calculations (8)
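
The grouping above can be reproduced with a toy greedy packer. This is only a sketch of the idea, not NVIDIA's compiler: the dependency lists, the three operation kinds and the stage ordering (ALU, then both texture units, then mini ALU 1, then mini ALU 2) are our assumptions, and the final result write is left out because it occupies no ALU or texture slot.

```python
from dataclasses import dataclass
from typing import Tuple

# (slot name, pipeline stage, operation kinds the slot accepts)
SLOTS = [
    ("ALU", 0, {"complex", "simple"}),
    ("TEX1", 1, {"tex"}),
    ("TEX2", 1, {"tex"}),
    ("Mini ALU1", 2, {"simple"}),
    ("Mini ALU2", 3, {"simple"}),
]


@dataclass
class Op:
    line: int                    # number of the source shader line
    kind: str                    # "complex", "simple" or "tex"
    deps: Tuple[int, ...] = ()   # source lines whose results this op reads


def pack(ops):
    """Greedily group ops into microcode lines; returns one row per line."""
    done = set()                 # ops finished in earlier microcode lines
    pending = list(ops)
    rows = []
    while pending:
        placed = {}              # op line -> stage, for the current microcode line
        row = []
        for _name, stage, kinds in SLOTS:
            choice = None
            for op in pending:   # first op in program order that fits the slot
                ready = all(
                    d in done or placed.get(d, stage) < stage for d in op.deps
                )
                if op.kind in kinds and ready:
                    choice = op
                    break
            if choice is not None:
                pending.remove(choice)
                placed[choice.line] = stage
            row.append(choice.line if choice else None)
        if not placed:
            raise ValueError("no remaining operation can be scheduled")
        done.update(placed)
        rows.append(row)
    return rows


# The nine-line shader from the text (line 9, the result write, is omitted).
EXAMPLE = [
    Op(1, "simple"),
    Op(2, "tex", (1,)),
    Op(3, "simple", (2,)),
    Op(4, "tex"),
    Op(5, "complex", (4,)),
    Op(6, "tex", (5,)),
    Op(7, "simple", (3, 6)),
    Op(8, "simple", (7,)),
]
```

With these assumptions, `pack(EXAMPLE)` yields two rows, `[1, 2, 4, 3, None]` and `[5, 6, None, 7, 8]`, where each row lists the slots in the order ALU, TEX1, TEX2, Mini ALU1, Mini ALU2 and `None` marks an idle unit.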

The effectiveness is amazing: a nine-line shader has been packed into just two clocks. But there are some pitfalls here as well:

Mini ALUs can execute only some of the operations; otherwise they stay idle. This is often the case with real shaders 2.0 and much more rarely with shaders 1.4, where all the ALUs are usually busy, which explains the visible difference in test results.
Textures should be selected in separate pairs, otherwise one of the texture modules will be idle.
Pairs of texture selections should preferably alternate with mathematical operations, in 2+1, 2+2, or 2+3 patterns depending on the complexity.
Temporary variables have a significant influence on performance.

The last point needs explaining. Temporary registers are used during calculations. Formally there are 32 of them, but in practice a given operation normally uses only 2 to 4. And that makes a great difference for NV3X. If we want our instructions to be combined into one microcode line and executed within one clock, we must make sure that together they use no more than two FP32[4] (or four FP16[4]) temporary variables. This is what makes FP16 and FP32 performance on NV3X so different: it's not about the ALUs, it's all about the number of available temporary variables. If we exceed this limit, we'll have to make another turn, splitting the pack of instructions in two (which takes one more clock). Or else we'll have to use two quad structures to represent a real one, which again doubles the number of clocks. That is why NV3X is so sensitive to manual optimisation of the shader code and to the quality of the built-in compiler. Effective shaders can be written only with a thorough understanding of the fragment processor architecture. ATI R3XX has a much simpler architecture (see further), and a standard set of rules is enough to write an effective shader for it. But then again, it can execute fewer operations per clock.
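
The register budget can be expressed as a simple check. The sketch counts storage in FP16[4] units, so that one FP32[4] temporary costs two units and a quad offers four per pixel, as described above. The function names are ours, and the assumption that extra temporaries are always covered by whole additional quad structures is a simplification.

```python
import math

QUAD_BUDGET_FP16_UNITS = 4  # two FP32[4] or four FP16[4] temporaries per pixel


def temp_units(fp32_temps: int, fp16_temps: int = 0) -> int:
    """Temporary storage used by a pack of instructions, in FP16[4] units."""
    if fp32_temps < 0 or fp16_temps < 0:
        raise ValueError("register counts must not be negative")
    return 2 * fp32_temps + fp16_temps


def fits_one_turn(fp32_temps: int, fp16_temps: int = 0) -> bool:
    """True if the pack can stay in a single microcode line."""
    return temp_units(fp32_temps, fp16_temps) <= QUAD_BUDGET_FP16_UNITS


def quad_structures_needed(fp32_temps: int, fp16_temps: int = 0) -> int:
    """Quad records needed to hold all temporaries of one real quad."""
    units = temp_units(fp32_temps, fp16_temps)
    return max(1, math.ceil(units / QUAD_BUDGET_FP16_UNITS))
```

Under this model a pack using three FP32 temporaries no longer fits, while the same pack rewritten with three FP16 temporaries does, which is the whole FP16 versus FP32 story in miniature.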

Interestingly, the above scheme of the NV35 (and, consequently, NV38) fragment processor was a sort of a correction of past mistakes. This was the supposed scheme of NV30:

There were only integer mini ALUs! They were, however, more capable: floating-point arithmetic requires more pipeline stages, and the integer mini ALUs made use of the stages thus left free, so they could execute a MAD operation (multiplication and addition at once). On the one hand, this sometimes gave higher results in FFP and PS 1.1. On the other hand, results in PS 2.0 were nearly always lower. Evidently, NV3X developers mostly focused on performance in DX8 applications, thinking (not without grounds) that PS 2.0 would spread slowly in real applications (as opposed to tests). But the negative response from analysts and enthusiasts, largely caused by the low performance of shaders 2.0, was an impetus to "correct the mistakes", so to speak, and replace the integer ALUs with floating-point ones.

Now let's speak about shortened chip versions. This is, presumably, what NV31 looks like:

In most cases, the chip works as 2x2, i.e. operates on so-called "half-quads", at least while they pass through the pipeline, which has half as many ALUs and texture units. However, there is a special 4x1 commutation mode that is available only when no further turn through the pipeline is needed:

It suits simple single-texture fill tasks that demand a lot of fillrate and little intelligence from the pixel processor. In fact, this is the main class of tasks where the 4x1 configuration looks better than 2x2.

So, NV36 differs from NV31 in its mini ALUs, just as NV35/38 differs from NV30.

Now we'll show you a presumed scheme of an R3XX pixel processor. In fact, the chip has two separate quad processors. They work asynchronously, and moreover, one of them can be switched off (the cut-down chip versions known as 9500 were the result of disabling one quad processor, whether working or defective). Well, this is one of the two R3XX processors:

The main difference is that the fragment processor really looks like a processor, not like a big loop, as was the case with NV3X. It executes shader operations, up to three per clock on four pixels: two arithmetic ones (3+1) and a texture value request. But even if arithmetic operations are executed quickly, we can't simply wait over a hundred clocks for texture selection and filtering. Therefore, the number of quads processed simultaneously is large here too, though it's not so evident. The shader is divided into four independent phases (their microcode is stored right in the pixel processor, which is what limits the depth of dependent texture selections), and each phase is executed separately. During execution, requests to the texture processor accumulate. Thus, a certain number of simultaneously processed quads (though processed consecutively from the pixel processor's point of view) make up to four accesses to the texture processor, each with a packet of tasks. To all appearances, this number is considerably lower than the one we saw in NV, and as a result the queue is much shorter too. The cost is potential latency during dependent texture selections. The gains are a much simpler pipeline scheme, the possibility to build more pipelines, and the absence of serious problems with temporary registers.
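
The four-phase limit can be checked on a shader's dependency graph. The sketch below is a simplified model of our own (the instruction representation and the function names are hypothetical, and it assumes one level of dependent fetches per phase): each texture fetch whose coordinates depend, directly or through arithmetic, on an earlier fetch sits one level deeper, and the deepest chain must not exceed four.

```python
R3XX_MAX_DEPTH = 4  # four phases, assumed to allow one fetch level each


def texture_dependency_depth(instrs):
    """Longest chain of dependent texture fetches.

    instrs: list of (result_name, is_texture_fetch, [names read]) in program
    order; each result name is assumed to be written only once.
    """
    level = {}   # value name -> fetches on the longest chain that produced it
    deepest = 0
    for name, is_tex, reads in instrs:
        inherited = max((level.get(r, 0) for r in reads), default=0)
        level[name] = inherited + 1 if is_tex else inherited
        deepest = max(deepest, level[name])
    return deepest


def fits_r3xx_phases(instrs) -> bool:
    """True if the shader's dependent fetches fit into the four phases."""
    return texture_dependency_depth(instrs) <= R3XX_MAX_DEPTH
```

For a shader with three fetches where only the third depends on the second, the depth is 2, well within the limit; a chain of five fetches, each computed from the previous one, would not fit.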

It can be debated which is better. Some will choose good performance on almost any shader that complies with general DX9 principles (ATI). Others will prefer higher performance on a shader specially optimised for specific hardware (NV), although they run the risk of getting several times lower performance on code that is bad from the compiler's point of view. All in all, given the eight R3XX pixel pipelines, ATI has won this stage of the battle, at least as far as pixel shaders 2.0 are concerned, which the test results confirm. As for 1.4, and certainly 1.X, NV3X looks much better.

Obviously, future generations (NV4X and R4XX) will consider the lessons of the present. First of all, the following problems are likely to be solved: a low number of temporary registers in NV and a low number of ALUs in ATI.

Alexander Medvedev (unclesam@ixbt.com)

01.06.2004
