Code Name | G70 (NV47) |
NV45 |
NV40 |
NV42 |
NV41 NV41M |
NV43 |
NV44A |
NV44 |
Baseline Article | ||||||||
Process Technology (nm) | 110 |
130 |
110 |
130 |
110 |
110 |
110 |
|
Transistors (M) | 300 |
222 |
190 |
143 |
77 |
|||
Pixel Processors | 24 |
16 |
12 |
8 |
4 |
|||
Texture Units | 24 |
16 |
12 |
8 |
4 |
|||
Blending Units | 16 |
12 |
4 |
2 |
||||
Vertex Processors | 8 |
6 |
3 |
|||||
Memory Bus | 256 (64x4) |
128 (64x2) |
64 (32x2) |
|||||
Memory Types |
DDR, GDDR2, GDDR3
|
|||||||
System Bus | PEG 16x |
AGP 8x |
PEG 16x |
AGP 8x |
PEG 16x |
|||
RAMDAC | 2 x 400 MHz |
|||||||
Interfaces | TV-Out TV-In (a video capture chip is required) 2 x DVI (external interface chips are required) HDTV-Out (only in G70) |
|||||||
Vertex Shaders | 3.0 |
|||||||
Pixel Shaders | 3.0 |
|||||||
Precision of pixel calculations | FP16 FP32 |
|||||||
Precision of vertex calculations | FP32 |
|||||||
Texture component formats | FP32 (without filtering) FP16 I8 DXTC*, S3TC 3Dc (emulation) |
|||||||
Rendering formats | FP32 (without blending and MSAA) FP16 (without MSAA) (NV44 has no blending) I8 |
|||||||
MRT |
available |
|||||||
AA | TAA (only G70 offers AA for transparent polygons) 2x and 4x RGMS SS (in hybrid modes) |
|||||||
Z generation | 2x without color |
|||||||
Stencil buffer | Double sided |
|||||||
Shadow technologies | Hardware shadow maps Geometry shadow optimizations |
Card | Chip, bus | PS/TMU/BLD/VS units | Core frequency (MHz) | Memory frequency (MHz) |
GeForce 6800 Ultra | NV40, AGP | 16/16/16/6 | 400 | 550 |
GeForce 6800 | NV40, AGP | 12/12/12/6 | 325 | 350 |
GeForce 6800 GT | NV40, AGP | 16/16/16/6 | 350 | 500 |
GeForce 6800 LE | NV40, AGP | 8/8/8/4 | 320 | 350 |
GeForce 6600 | NV43, PEG16x | 8/8/4/3 | 300 | 350 |
GeForce 6600 GT | NV43, PEG16x | 8/8/4/3 | 500 | 500 |
GeForce 6800 GTO | NV45, PEG16x | 12/12/12/5 | 350 | 450 |
GeForce Go 6800 | NV41M, PEG16x | 12/12/12/6 | 275 | 300 |
GeForce 6800 | NV41, PEG16x | 12/12/12/6 | 325 | 350 |
GeForce 6600 GT | NV43, AGP | 8/8/4/3 | 500 | 450 |
GeForce 6800 GT | NV45, PEG16x | 16/16/16/6 | 350 | 500 |
GeForce 6800 Ultra | NV45, PEG16x | 16/16/16/6 | 400 | 550 |
GeForce 6200 32TC | NV44, PEG16x | 4/4/2/3 | 350 | 350 |
GeForce 6200 64TC | NV44, PEG16x | 4/4/2/3 | 350 | 350 |
GeForce Go 6200 | NV44, PEG16x | 4/4/2/3 | 300 | 300 |
GeForce 6800 LE | NV41, PEG16x | 8/8/8/4 | 325 | 350 |
GeForce 6600 | NV43, AGP | 8/8/4/3 | 300 | 275 |
GeForce 6200 | NV43, PEG16x | 4/4/4/3 | 300 | 275 |
GeForce Go 6600 | NV43, PEG16x | 8/8/4/3 | 375 | 350 |
GeForce Go 6800 Ultra | NV42, PEG16x | 12/12/12/5 | 450 | 530 |
GeForce 6800 Ultra | NV45, PEG16x | 16/16/16/6 | 400 | 525 |
GeForce 6200 A | NV44A, AGP | 4/4/2/3 | 350 | 250 |
GeForce Go 6800 | NV42, PEG16x | 12/12/12/5 | 450 | 550 |
GeForce 7800 GTX | G70, PEG16x | 24/24/16/8 | 430 | 600 |
Card | Memory capacity (MB) and type | Memory bandwidth, GB/s (bus width, bits) | Texel rate (Mtex/s) | Fill rate (Mpix/s) |
GeForce 6800 Ultra | 256 GDDR3 | 35.2 (256) | 6400 | |
GeForce 6800 | 128 DDR | 22.4 (256) | 3900 | |
GeForce 6800 GT | 256 GDDR3 | 32.0 (256) | 5600 | |
GeForce 6800 LE | 128 DDR | 22.4 (256) | 2560 | |
GeForce 6600 | 128 DDR | 11.2 (128) | 1200 | 2400 |
GeForce 6600 GT | 128 GDDR3 | 16.0 (128) | 2000 | 4000 |
GeForce 6800 GTO | 256 GDDR3 | 28.8 (256) | 4200 | |
GeForce Go 6800 | 256 GDDR3 | 19.2 (256) | 3300 | |
GeForce 6800 | 128 DDR | 22.4 (256) | 3900 | |
GeForce 6600 GT | 128 GDDR3 | 14.4 (128) | 2000 | 4000 |
GeForce 6800 GT | 256 GDDR3 | 32.0 (256) | 5600 | |
GeForce 6800 Ultra | 256 GDDR3 | 35.2 (256) | 6400 | |
GeForce 6200 32TC | 32 GDDR | 2.8 (32) | 700 | 1400 |
GeForce 6200 64TC | 64 GDDR | 5.6 (64) | 700 | 1400 |
GeForce Go 6200 | 16 GDDR | 2.4 (32) | 600 | 1200 |
GeForce 6800 LE | 128 DDR | 19.2 (256) | 2600 | |
GeForce 6600 | 128 DDR | 8.8 (128) | 1200 | 2400 |
GeForce 6200 | 128 DDR | 8.8 (64) | 1200 | |
GeForce Go 6600 | 128 DDR | 11.2 (128) | 1500 | 3000 |
GeForce Go 6800 Ultra | 256 GDDR3 | 33.9 (256) | 5400 | |
GeForce 6800 Ultra | 512 GDDR3 | 33.6 (256) | 6400 | |
GeForce 6200 A | 128 GDDR | 4.0 (64) | 700 | 1400 |
GeForce Go 6800 | 128 GDDR3 | 35.2 (256) | 5400 | |
GeForce 7800 GTX | 256 GDDR3 | 38.4 (256) | 10320 | 6880 |
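Most bandwidth figures in the table above can be cross-checked from the memory frequency and the bus width. The following is a minimal sketch, assuming double data rate memory (two transfers per clock) and decimal gigabytes; the function name is ours and purely illustrative.

```python
def memory_bandwidth_gb_s(memory_clock_mhz: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s (decimal), assuming DDR signalling."""
    transfers_per_second = memory_clock_mhz * 1e6 * 2  # two transfers per clock (DDR)
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


# Memory clocks and bus widths taken from the tables above.
for card, clock_mhz, bus_bits in [
    ("GeForce 6800 Ultra", 550, 256),
    ("GeForce 6600 GT (PEG16x)", 500, 128),
    ("GeForce 7800 GTX", 600, 256),
]:
    print(f"{card}: {memory_bandwidth_gb_s(clock_mhz, bus_bits):.1f} GB/s")
```

For these three cards the formula gives 35.2, 16.0 and 38.4 GB/s, matching the table.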
Theoretical materials and reviews of video cards, which concern functional properties of the GPU ATI R4XX and NVIDIA NV4X
Here is a flow chart of the NV40 vertex processor:
The processor itself is shown as a yellow rectangle; the other units are included to complete the picture. NV40 is declared to have six independent vertex processors (picture the yellow unit copied six times), each executing its own instructions with its own flow control, so different processors can take different conditional branches on different vertices simultaneously. Per clock, an NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access. It supports integer and floating point texture formats and mip-mapping. A single vertex shader may use up to four different textures. There is no filtering, however: only the simplest point sampling of the nearest value at the specified coordinates.
Here is a summary table with the NV40 vertex processor parameters from the point of view of DX9 vertex shaders compared to R3XX and NV3X families:
Vertex Shader Model | 2.0 (R3XX) | 2.a (NV3X) | 3.0 (NV40) |
Instructions in shader code | | | |
The number of executed instructions | | | |
Predicates | | | |
Temporary Registers | | | |
Constant Registers | | | |
Static Branching | | | |
Dynamic Branching | | | |
Nesting depth of dynamic branching | | | |
Texture Sampling | | | |
Let's analyze the pixel architecture of the NV40 in the order of the data flow.
We shall dwell on the most interesting facts. First, while the NV3X had only one quad processor, handling a block of four pixels (2x2) per clock, there are now four such processors. They are completely independent, and each of them can be disabled (for example, to create a light version of the chip with three processors if one of them is defective). Each processor still has a queue for the quad "carousel" (see DX Current). Consequently, the approach to pixel shader execution remains similar to that of the NV3X: more than a hundred quads are run through one setting (operation), and then the setting is changed according to the shader code. But there are noticeable differences as well, first of all in the number of TMUs: there is now only one TMU for each pixel of a quad. There are four quad processors in all, each with 4 TMUs, 16 in total.
The new TMUs support 16x anisotropic filtering (NV3X offered only 8x), and they can finally apply all kinds of filtering to floating point texture formats, though only at 16-bit component precision (FP16). Filtering is still unavailable for FP32, but even FP16 is noticeable progress: floating point textures become a full alternative to integer textures in any application, especially as FP16 texture filtering comes with no performance loss (however, the increased data flow may well affect performance in real applications).
Note the two-level organization of texture caching: each quad processor has its own L1 texture cache. It is needed for two reasons: (1) the fourfold increase in the number of simultaneously processed quads (the quad queue in a processor did not grow longer, but there are four processors now) and (2) competing access to the texture cache from the vertex processors.
There are two ALUs for each pixel, and each can perform two different (!) operations on differently sized groups of arbitrarily selected vector components (up to four in total). That is, the available schemes include 4, 1+1, 2+1, 3+1 (as in the R3XX) and the new 2+2 configuration, previously unavailable. For more details read DX Current. Arbitrary masking and post-operation component rearrangement (swizzling) are supported. Besides, an ALU can normalize a vector as a single operation, which may have a significant effect on the performance of some algorithms. Hardware SIN and COS calculations were removed from the new NVIDIA architecture. Practice showed that the transistors spent on this feature had been wasted: a lookup in a simple table (a 1D texture) gives better performance, especially considering that ATI does not offer this support.
Thus, depending on the code, from one to four different FP32 operations can be performed on vectors and scalars per clock. You can see on the diagram that the first ALU is used for overhead operations during texture sampling. A single clock can therefore be spent either on fetching one texture sample while the second ALU performs one or two operations, or on using both ALUs if no texture sample is fetched during this pass. The performance of this tandem depends directly on the compiler and the code. But we obviously have:
Minimum: one texture sample per clock
Minimum: two operations per clock without texture sampling
Maximum: four operations per clock without texture sampling
Maximum: one texture sample and two operations per clock
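The bounds listed above can be captured in a toy model. This is only a sketch of the per-pixel issue rules as the text describes them (the first ALU is busy when a texture sample is fetched; each free ALU issues one or two operations); the function and its parameters are illustrative and say nothing about the real scheduler.

```python
from typing import Tuple


def per_clock_throughput(texture_fetch: bool, ops_per_alu: int) -> Tuple[int, int]:
    """Return (texture samples, arithmetic operations) for one pixel in one clock.

    ops_per_alu is 1 or 2, depending on how well the compiler pairs operations.
    """
    if ops_per_alu not in (1, 2):
        raise ValueError("each ALU issues one or two operations per clock")
    free_alus = 1 if texture_fetch else 2  # ALU 1 assists the texture fetch
    samples = 1 if texture_fetch else 0
    return samples, free_alus * ops_per_alu


print(per_clock_throughput(texture_fetch=False, ops_per_alu=1))  # two operations
print(per_clock_throughput(texture_fetch=False, ops_per_alu=2))  # four operations
print(per_clock_throughput(texture_fetch=True, ops_per_alu=2))   # one sample, two operations
```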
We have information that the number of temporary registers for each quad was doubled, so there are now 4 temporary FP32 registers or 8 temporary FP16 registers per pixel. This should boost the performance of complex shaders. Besides, all hardware limitations on the length of pixel shaders and the number of texture samples are lifted; the limits are now set by the API only. The most important improvement is dynamic flow control.
Here is a summary table of features:
Pixel Shader Model | 2.0 (R3XX) | 2.a (NV3X) | 2.b (R420) | 3.0 (NV40) |
Texture sampling nesting up to | | | | |
Texture sampling up to | | | | |
Shader code length | | | | |
Shader instructions | | | | |
Interpolators | | | | |
Predicates | | | | |
Temporary Registers | | | | |
Constant Registers | | | | |
Arbitrary transposition of components | | | | |
Gradient instructions (DDX/DDY) | | | | |
Nesting depth of dynamic branching | | | | |
And now let's get back to our diagram and look at its bottom part. There you can see a unit responsible for comparing and modifying color, transparency, Z, and stencil values. There are 16 such units in total. As the comparison and modification task is rather homogeneous, this unit can be used in two modes:
Standard mode (the following operations are completed per clock):
Turbo mode (the following operations are completed per clock):
It goes without saying that the latter mode is possible only when no calculated color value is being written. That is why the specifications state that without color the chip can fill 32 pixels per clock, calculating both the Z value and the stencil value. Such a turbo mode comes in handy first of all for accelerating shadows based on the stencil buffer (as in Doom III) and for a rendering pre-pass that fills only the Z buffer (this technique often saves time on long shaders, as the overdraw factor then drops to 1).
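The 32 pixels per clock figure follows directly from the unit count. Below is a small sketch, assuming 16 units that each complete one pixel per clock with color or two pixels per clock with Z/stencil only; the 400 MHz clock is the GeForce 6800 Ultra value from the table earlier, and the function names are ours.

```python
ROP_UNITS = 16  # comparison/blending units in NV40, per the text


def pixels_per_clock(color_writes: bool) -> int:
    """One pixel per unit with color, two per unit in Z/stencil-only (turbo) mode."""
    return ROP_UNITS * (1 if color_writes else 2)


def fill_rate_mpix_s(core_clock_mhz: int, color_writes: bool) -> int:
    """Peak fill rate in millions of pixels per second."""
    return pixels_per_clock(color_writes) * core_clock_mhz


print(pixels_per_clock(color_writes=False))       # Z/stencil-only pixels per clock
print(fill_rate_mpix_s(400, color_writes=True))   # standard mode at 400 MHz
print(fill_rate_mpix_s(400, color_writes=False))  # turbo mode at 400 MHz
```

These are theoretical peaks; memory bandwidth limits what is reachable in practice.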
The annoying lack of MRT support (Multiple Render Targets, rendering into several buffers) in the NV3X family has finally been fixed: one pixel shader can calculate and write up to four different color values into different buffers (of the same size). The absence of this feature in the NV3X was a serious argument for the R3XX among developers; now it is available in the NV40. Another important difference from previous generations is the extensive support for floating point arithmetic in this unit. All operations (comparing, blending, writing colors) may be performed in FP16 component format. So we finally get the so-called full (orthogonal) support for 16-bit floating point operations, in filtering and texture sampling as well as in frame buffer operations. FP32 is next in line, but that is probably up to the next generation.
There is another interesting fact: MSAA support. Like its predecessors (NV2X and NV3X), NV40 is capable of 2x MSAA without performance loss (two Z values are generated and compared per pixel). For 4x MSAA, one penalty clock is added (in practice there is no need to calculate all four values per clock: it would be hard to write them all into the Z and frame buffers in one clock anyway, as memory bandwidth is limited). MSAA higher than 4x is not supported; as in the previous family, all more complex modes are hybrids of 4x MSAA and subsequent SSAA of one size or another. But RGMS (Rotated Grid Multisampling) is finally supported:
This separate programmable NV40 unit is responsible for video stream processing:
This processor contains four functional units (an INT ALU, an INT SIMD ALU with 16 components, a load/write data unit, and a branch unit) and thus can execute up to four different operations per clock. The data format is integer, probably with 16-bit or 32-bit precision (we don't know for sure, but 8 bits wouldn't be enough for some algorithms). The processor includes special functions for sampling, dispatching, and writing data streams. Classic video decoding and encoding tasks (IDCT, deinterlacing, color model conversion, etc.) can be performed without loading the CPU. But CPU management is still required: data preparation and selection of conversion parameters are still performed by the CPU, especially in the case of complex compression algorithms that include decompression as one of the intermediate steps.
This processor can significantly offload the CPU, especially at high video resolutions such as the increasingly popular HDTV formats. We don't know whether its capabilities are used for 2D acceleration, especially for some complex GDI+ functions, where it would be logical to use it; we have no information on this yet. In any case, NV40 meets the highest requirements for 2D hardware acceleration: all necessary calculation-intensive GDI and GDI+ functions are executed in hardware.
We didn't find any special architectural differences from NV40, which is not surprising: NV43 is a scaled-down solution (with fewer vertex and pixel processors and memory controller channels) based on the NV40 architecture. The differences are quantitative (bold elements on the diagram) rather than qualitative; the chip remains practically unchanged from the architectural point of view.
Thus, we have 3 (there were 6) vertex processors and 2 (there were 4) independent pixel processors, each working with one quad (2x2 pixel fragment). Interestingly, this time PCI Express has become a native (i.e. on-chip) bus interface, and AGP 8x cards will have an additional bidirectional PCI-E <-> AGP bridge (shown with dotted line), which has been already described.
Besides, note an important limiting factor - a two-channel controller and a 128-bit memory bus - we'll analyze and discuss this fact later on.
The architecture of vertex and pixel processors remained the same - these elements were described in detail in the NV40/NV45 section.
Vertex and pixel processors in NV43 remained the same, but the internal caches could have been reduced in proportion to the number of pipelines. However, the transistor count gives no cause for concern. Given the modest cache sizes, it would be more reasonable to leave them as they were in NV40, thus compensating for the noticeable shortage of memory bandwidth. It's quite possible that the ALU array responsible for post-processing, verification, Z generation, and pixel blending before the results are written to the frame buffer, which takes a rather large number of transistors, was also reduced in each pipeline relative to NV40. The reduced memory bandwidth will not allow writing 4 full gigapixels per second anyway, and the fill rate potential (8 pipelines at 500 MHz) will be fully used only with more or less complex shaders with more than 2 textures and the attendant shader calculations.
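A rough estimate shows why 4 gigapixels per second is out of reach for this memory subsystem. The sketch below assumes 4 bytes per color write and, for the second case, an additional 4-byte Z read and 4-byte Z write per pixel, with no compression or caching; these per-pixel byte counts are our assumptions, not figures from the source. The 16.0 GB/s figure is the GeForce 6600 GT (PEG16x) bandwidth from the table earlier.

```python
def required_bandwidth_gb_s(pixels_per_second: float, bytes_per_pixel: int) -> float:
    """Frame buffer traffic in GB/s (decimal) for a given pixel rate."""
    return pixels_per_second * bytes_per_pixel / 1e9


peak_pixel_rate = 8 * 500e6  # 8 pipelines at 500 MHz = 4 Gpix/s
available_gb_s = 16.0        # GeForce 6600 GT (PEG16x), from the table

color_only = required_bandwidth_gb_s(peak_pixel_rate, 4)           # assumed 32-bit color write
color_and_z = required_bandwidth_gb_s(peak_pixel_rate, 4 + 4 + 4)  # plus assumed Z read and Z write

print(f"color only:  {color_only:.1f} GB/s of {available_gb_s:.1f} GB/s available")
print(f"color and Z: {color_and_z:.1f} GB/s of {available_gb_s:.1f} GB/s available")
```

Even the color writes alone would consume the entire bus under these assumptions, leaving nothing for textures, Z, or geometry.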
There are no global architectural differences from NV40 and NV43, there are just some innovations in the pixel pipeline aimed at effective operations with system memory as a frame buffer.
On the whole, NV44 is a scaled-down solution (with a reduced number of vertex and pixel processors and memory controller channels) based on the NV40 architecture. The differences are quantitative (bold elements on the diagram) rather than qualitative; the chip remains practically unchanged from the architectural point of view, with one exception: no FP16 blending.
We have 3 vertex processors, as in NV43, and one (instead of two) independent pixel processor operating with one quad (2x2 pixel fragment). PCI Express has become a native (i.e. on-chip) bus interface as in case with the NV43. AGP 8x cards based on this chip (TurboCache modification) are not manufactured, as the idea of efficient usage of system memory for rendering requires the adequate bidirectional throughput of the graphics bus.
Besides, note an important limiting factor: a two-channel controller and a 64-bit (!) memory bus, which we'll analyze and discuss later on. Judging from the chip package and the number of pins, 64 bits is the hardware limit for the NV44, and there will be no 128-bit cards based on this design; the 128-bit cards in the 6200 family are based on the NV43.
The architecture of vertex and pixel processors as well as of the video processor remained the same — these elements have been described in detail above. Except for the declared updates for effective addressing of system memory from texture and blending units. But that's only what is said out loud — we have solid reasons to think that all these features, not so critical and most likely based on Common Cache and Crossbar manager, were included into the NV4X family from the very beginning. There were just no reasons to use them (on the level of drivers) in senior cards with faster and wider local memory of a larger capacity. There is also no point in this technology in cards with the AGP interface, which will inevitably become the bottleneck (because of the low write speed into the system memory, comparable to PCI).
That's how NVIDIA explains the differences in its articles:
… regular architecture and NV44 with TurboCache:
You can obviously see the difference due to data feed for textures and the additional way to write frame data (blending) into the system memory. However, the initial architecture of the chip with a crossbar, treating the graphics bus almost as the fifth channel of the memory controller, may be initially capable of this (starting from the NV40 and even earlier). It's hard to tell whether the NV44 has architectural changes as far as writing and reading data is concerned or these features are just implemented on the driver level.
On the other hand, we shall not deny that it would be optimal to have some paging MMU and dynamic swapping of data from system to local memory, which would be treated as L3 Cache. In case of such architecture everything falls to its place. The efficiency will be noticeably higher than discrete allocation of objects and minor hardware revisions will be justified. Especially as having tested this paging unit, one can use it in future architectures, which to all appearances will be equipped with such units without fail (because of WGF requirements).
Continuity towards the previous flagships based on NV40 and NV45 is quite noticeable. Let's note the key differences:
So, the designers obviously pursued two objectives in creating the new accelerator: to reduce power consumption and to drastically increase performance. As Shader Model 3.0 was already implemented in the previous generation of NVIDIA accelerators and the next rendering model (WGF 2.0) is not yet worked out in detail, this looks quite logical and expected. The good news is that the pixel processors have not only grown in number but also become more efficient. We have just one question: why is there no filtering during texture sampling in the vertex processors? This step seems quite logical. But it would probably have taken too many resources, so NVIDIA engineers decided to spend them differently: to reinforce the pixel processors and increase their number. The next generation of accelerators will comply with WGF 2.0 and will finally get rid of the disappointing asymmetry in texture unit capabilities between vertex and pixel shaders. Another objective is the large-scale introduction of HDTV support as a new (eventually universal) standard.
The key differences of this diagram from NV45 are 8 vertex processors and 6 quad processors (in all, 4*6=24 pixels are processed) instead of 4, with more ALUs in each processor. Note the AA, blending, and writing unit, located outside the quad processor on the diagram. Even though the number of pixel processors is increased by a factor of 1.5, the number of modules responsible for writing the results remains the same: 16. That is, the new chip can calculate shaders much faster, for 24 pixels simultaneously, but it still writes up to 16 full pixels per clock. This is actually quite enough, as memory wouldn't cope with more pixels per clock. Besides, modern applications spend several dozen instructions on calculating a single pixel before writing it. That's why increasing the number of pixel processors while retaining the same number of write modules looks like a balanced and logical solution. Such solutions were previously used in low-end NVIDIA chips (e.g. GeForce 6200), which had a full-fledged quad processor but cut-down write modules (fewer units and no FP16 blending).
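The effect of this 24-to-16 ratio can be put in numbers. The sketch below derives the peak texel and pixel-write rates for the GeForce 7800 GTX from the unit counts and the 430 MHz core clock in the table earlier, and estimates how long a shader must run per pixel before the write units stop being the limit. The function names are ours, and the model ignores memory bandwidth entirely.

```python
def peak_rates(pixel_pipelines: int, write_units: int, core_clock_mhz: int) -> dict:
    """Peak per-second rates in millions, from unit counts and core clock."""
    return {
        "texels_mtex_s": pixel_pipelines * core_clock_mhz,  # one TMU per pixel pipeline
        "pixels_written_mpix_s": write_units * core_clock_mhz,
    }


def min_clocks_per_pixel_to_hide_writes(pixel_pipelines: int, write_units: int) -> float:
    """Average shader clocks per pixel at or above which the write units keep up."""
    return pixel_pipelines / write_units


g70 = peak_rates(pixel_pipelines=24, write_units=16, core_clock_mhz=430)
print(g70)
print(min_clocks_per_pixel_to_hide_writes(24, 16))
```

The two rates come out at 10320 Mtex/s and 6880 Mpix/s, the same values the table lists for this card. Any shader that averages 1.5 clocks or more per pixel produces pixels no faster than the 16 write units can accept them.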
Here is the architecture of the pixel section:
Have a look at the yellow unit of the pixel processor (quad processor). One can say that the architecture used in NV40/45 has been "turboed" — two full vector ALUs, which can execute two different operations over four components, were supplemented with two scalar mini ALUs for parallel execution of simple operations. Now ALUs can execute MAD (simultaneous multiplication and addition) without any penalty.
Adding small, simplified, special-purpose ALUs is an old NVIDIA trick; the company has resorted to it several times to gain noticeable performance in pixel units with only a slight increase in transistor count. For example, even the NV4X had a special unit for normalizing FP16[4] vectors (it is connected to the second main ALU and labeled FP16 NORM on the diagram). The G70 continues the tradition: such a unit gives a considerable performance gain in pixel shaders thanks to free vector normalization each time a quad passes through the processor pipeline. Interestingly, the normalization operation is coded in shaders as a sequence of several instructions; the driver must detect it and substitute a single call to this special unit. In practice this detection is rather efficient, especially if the shader was compiled from HLSL. Thus, NVIDIA's pixel processors don't spend several clocks on vector normalization as ATI's do (but don't forget the format limitation: FP16, that is, half precision).
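The "sequence of several instructions" for normalization is typically a dot product, a reciprocal square root, and a multiply (`dp3`, `rsq`, `mul` in DirectX 9 shader assembly). Here is the same three-step sequence sketched in Python for illustration only; the helper names mirror the assembly mnemonics, and how the driver actually recognizes the pattern is not described in the source.

```python
import math


def dp3(a, b):
    """Three-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rsq(x):
    """Reciprocal square root."""
    return 1.0 / math.sqrt(x)


def mul(v, s):
    """Multiply each component of v by the scalar s."""
    return tuple(c * s for c in v)


def normalize(v):
    """Three dependent steps that a dedicated unit can replace with one."""
    return mul(v, rsq(dp3(v, v)))


print(normalize((3.0, 0.0, 4.0)))
```

Each step depends on the previous one, so without a dedicated unit the three instructions cannot be issued in the same clock for a given pixel.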
As for texture units, everything remains the same: one unit per pixel (that is, four units in a quad processor), a dedicated L1 cache in each quad processor, and texture filtering in integer or FP16 component formats with up to 4 components (FP16[4]). Texture sampling in FP32 component format is possible only without hardware filtering: you will either have to do without it or program it in a pixel shader, spending a dozen instructions or more. This situation is not new, though; full support for FP32 components will probably be introduced only in the next generation of architectures.
The array of six quad processors is followed by the dispatch unit, which redistributes calculated quads among 16 Z, AA, and blending units (to be more exact, among 4 clusters of 4 units, each cluster processing an entire quad; geometric coherence must not be lost, as it is needed for writing and compressing the color and Z buffers). Each unit can generate, check, and write two Z values or one Z value and one color value per clock. Double-sided stencil buffer operations are supported. Besides, each unit executes 2x multisampling "free of charge", while 4x mode requires two passes through the unit, that is, two clocks. But there are exceptions. Let's sum up the features of such units:
There are so many conditions because of the many hardware ALUs needed for MSAA operations, Z generation, and color comparison and blending. NVIDIA tries to optimize transistor usage and employs the same ALUs for different purposes depending on the task. That's why floating point formats exclude MSAA and FP32 excludes blending. The large number of transistors is one of the reasons for keeping 16 units instead of upgrading to 24 to match the number of pixel processors. In any case, most of the transistors in these units may (and will) sit idle in modern applications with long shaders, even in 4xAA mode. Memory, whose bandwidth has hardly grown compared to the GeForce 6800 Ultra, would not allow writing even 16 full pixels into the frame buffer per clock anyway. As these units work asynchronously to the pixel processors (they calculate Z values and blend while the shaders calculate colors for the next pixels), 16 units are a justified, even obvious solution. Some restrictions due to FP formats are disappointing, but they are typical of our transition period on the way to symmetric architectures, which will allow all operations on all available data formats without performance loss, as flexible modern CPUs do in most cases.
Vertex Pipeline Architecture
Everything is familiar from the NV4x family; only the number of vertex processors is increased, from 6 to 8.