iXBT Labs - NVIDIA GeForce GTX 460 Graphics Card - Page 2: Architecture, secondary features

GF104 architecture

The GF104 codename denotes a mid-range chip based on the Fermi architecture. As you may already know, NVIDIA's new architecture supports all features of the DirectX 11 API, including hardware tessellation and DirectCompute. In general, as we have already stated, GF104 is very much like GF100; the differences are mostly quantitative.

GF104 uses the same kind of Streaming Multiprocessors, but with more CUDA cores in each of them than even the top-class GF100 has. But the most important change in all the new GPUs is the improved geometry pipeline. NVIDIA's new architecture has much higher peak geometry performance to meet the needs of DirectX 11.

The graphics pipeline of GF104 can provide high performance in applications that use tessellation and handle lots of geometry data. The new architecture uses several PolyMorph and Raster Engines that work in parallel. It also has a new memory subsystem with full-fledged L1 and L2 caches that provide quick data access.

Like GF100, GF104 consists of Graphics Processing Clusters (GPC). Each GPC contains several Streaming Multiprocessors (SM), and each SM, in turn, contains several stream processors.

GF104 has 2 GPCs, 8 SMs and four 64-bit memory controllers, each of which is tied to the L2 cache and has 8 ROPs. As stated above, so far there is only one model based on GF104, but it comes with either 768MB or 1GB of memory.

GF104 has 384 stream processors in total, organized as 8 Multiprocessors with 48 processors in each. But GeForce GTX 460 has one SM disabled, so there are only 336 active stream processors.
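
The arithmetic behind these figures is simple enough to sketch in a few lines. The snippet below is only an illustration: the function and constant names are our own, while the numbers (8 SMs, 48 cores per SM, one SM disabled) come from the text.

```python
# Figures from the text: a full GF104 has 8 SMs with 48 CUDA cores each,
# and GeForce GTX 460 ships with one SM disabled.
CORES_PER_SM = 48
TOTAL_SMS = 8


def active_cores(enabled_sms: int, cores_per_sm: int = CORES_PER_SM) -> int:
    """Return the number of active stream processors (hypothetical helper)."""
    if not 0 <= enabled_sms <= TOTAL_SMS:
        raise ValueError("enabled_sms must be between 0 and TOTAL_SMS")
    return enabled_sms * cores_per_sm


full_gf104 = active_cores(TOTAL_SMS)      # all 8 SMs enabled
gtx_460 = active_cores(TOTAL_SMS - 1)     # one SM disabled
print(full_gf104, gtx_460)
```

With one SM switched off, the card loses exactly one eighth of its stream processors.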

The full-fledged variant of the GPU (which is not present on the market yet) features a PCI Express interface, a GigaThread engine, 2 GPCs, 4 memory controllers with wide ROPs, and 384KB/512KB of L2 cache connected to the ROPs.

Unlike the top-class GF100, which has six 64-bit memory controllers, GF104 has only four, for a 256-bit bus in total. But thanks to GDDR5 memory the bandwidth remains high enough. The junior GeForce GTX 460 768MB has one memory controller and its wide ROP block disabled, so it only has a 192-bit bus and 24 ROPs.
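
To see how the two GeForce GTX 460 variants follow from the number of enabled memory controllers, here is a rough sketch. The class and constant names are hypothetical. The 64-bit width and 8 ROPs per controller come from the text, while the 128KB of L2 and 256MB of memory per controller are our assumptions, inferred from an even split of 512KB/384KB and 1GB/768MB.

```python
from dataclasses import dataclass

# Per-controller figures. The first two are stated in the text; the last two
# are assumptions inferred from an even split (512KB / 4 and 1024MB / 4).
BUS_BITS_PER_CONTROLLER = 64
ROPS_PER_CONTROLLER = 8
L2_KB_PER_CONTROLLER = 128      # assumption: L2 is split evenly
MEMORY_MB_PER_CONTROLLER = 256  # assumption: memory is split evenly


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical description of one GF104 memory configuration."""
    enabled_controllers: int

    @property
    def bus_width_bits(self) -> int:
        return self.enabled_controllers * BUS_BITS_PER_CONTROLLER

    @property
    def rops(self) -> int:
        return self.enabled_controllers * ROPS_PER_CONTROLLER

    @property
    def l2_kb(self) -> int:
        return self.enabled_controllers * L2_KB_PER_CONTROLLER

    @property
    def memory_mb(self) -> int:
        return self.enabled_controllers * MEMORY_MB_PER_CONTROLLER


gtx460_1gb = MemoryConfig(enabled_controllers=4)
gtx460_768mb = MemoryConfig(enabled_controllers=3)

for name, cfg in (("1GB", gtx460_1gb), ("768MB", gtx460_768mb)):
    print(name, cfg.bus_width_bits, cfg.rops, cfg.l2_kb, cfg.memory_mb)
```

Under these assumptions, disabling one controller removes a quarter of the bus width, ROPs, L2 cache and memory at once, which matches the figures quoted for the 768MB card.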

Each Graphics Processing Cluster has 4 SMs and a separate Raster Engine that performs triangle setup, handles rasterization and removes hidden surfaces. Each GPC also has its own PolyMorph Engines that handle vertex attributes and tessellation. Those are connected to each SM in a cluster. GF104 has 8 PolyMorph Engines in total, but only 7 of those are enabled in GeForce GTX 460.

Each SM now has 48 CUDA cores, 1.5 times as many as an SM in GF100 has. Each stream processor has execution units for integer (INT) and floating-point (FPU) calculations. Each SM has 16 load/store units (LD/ST or LSU) that calculate source and destination addresses for 16 threads per clock.

The number of Special Function Units (SFU) has also increased. SFUs handle complex operations like sine, cosine, square root, etc. Each SM in GF104 has 8 of those, while each SM in GF100 has 4. Theoretically, this may increase performance in some situations.

To supply enough data to all those stream processors, the number of Dispatch Units has been doubled for each SM -- each SM has 2 Warp Schedulers, but 4 Dispatch Units. This allows each SM to run two instructions per clock on each of 2 Warps, which makes the total of 4 instructions per clock per SM. Theoretically, this should increase the efficiency of stream processors.
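
The peak issue rate described above can be written down as a tiny model. This is a sketch only: the function name is our own, and the GF100 figure of 2 Dispatch Units per SM is derived from the statement that GF104 doubled their number.

```python
def peak_instructions_per_clock(warp_schedulers: int, dispatch_units: int) -> int:
    """Peak instructions issued per clock by one SM (hypothetical model).

    Assumes the dispatch units are shared evenly between the warp schedulers
    and that each dispatch unit issues one instruction per clock.
    """
    if warp_schedulers <= 0 or dispatch_units % warp_schedulers != 0:
        raise ValueError("dispatch units must divide evenly between schedulers")
    per_warp = dispatch_units // warp_schedulers  # instructions per warp per clock
    return warp_schedulers * per_warp


gf104_sm = peak_instructions_per_clock(warp_schedulers=2, dispatch_units=4)
gf100_sm = peak_instructions_per_clock(warp_schedulers=2, dispatch_units=2)
print(gf104_sm, gf100_sm)
```

This is a theoretical peak. Whether two instructions from the same warp can actually be issued together depends on the code being executed.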

The number of texture mapping units is one of the key GPU features. As you can see on the diagram, each SM has 8 TMUs, unlike GF100 that has 4 TMUs per SM. Each TMU determines address and fetches data for 4 texture fetches per clock.

In all other respects, TMUs have remained the same as those in GF100. Their total number (enabled + disabled) has remained the same as well, at 64, while the number of other units has decreased. This indicates a different GPU balance, and perhaps NVIDIA wasn't quite right to give GF100 just 64 TMUs total. Later we'll see how the number of TMUs affects GeForce GTX 460 performance.
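
The change in balance is easy to quantify. The sketch below uses only figures from the text, plus two derived values that we label as such: 16 SMs for GF100 (64 TMUs at 4 per SM) and 32 CUDA cores per GF100 SM (48 divided by 1.5). All names are our own.

```python
from fractions import Fraction

# GF104 figures from the text.
GF104_SMS = 8
GF104_TMUS_PER_SM = 8
GF104_CORES_PER_SM = 48

# GF100 figures: TMUs per SM and the total are from the text,
# the SM count and cores per SM are derived from them.
GF100_TMUS_PER_SM = 4
GF100_TOTAL_TMUS = 64
GF100_SMS = GF100_TOTAL_TMUS // GF100_TMUS_PER_SM   # derived: 16
GF100_CORES_PER_SM = 48 * 2 // 3                    # derived: 48 / 1.5 = 32

gf104_total_tmus = GF104_SMS * GF104_TMUS_PER_SM
gtx460_enabled_tmus = (GF104_SMS - 1) * GF104_TMUS_PER_SM  # one SM disabled

# TMUs per CUDA core, the "balance" the text refers to.
gf104_ratio = Fraction(GF104_TMUS_PER_SM, GF104_CORES_PER_SM)
gf100_ratio = Fraction(GF100_TMUS_PER_SM, GF100_CORES_PER_SM)

print(gf104_total_tmus, gtx460_enabled_tmus, gf104_ratio, gf100_ratio)
```

In this model GF104 has one TMU per 6 CUDA cores against one per 8 in GF100, so relative to its shader power the smaller chip is noticeably richer in texture units.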

As expected, the memory subsystem remained the same. Each SM of GF104 has 64KB of on-chip memory that can be configured as 48KB shared and 16KB L1 cache or, vice versa, 16KB shared and 48KB of L1 cache. Besides, GF104 has 512KB of unified L2 cache that handles all load/store data requests and texture fetches.
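
The two permitted splits of the 64KB on-chip memory can be modelled as follows. This is only an illustration with hypothetical names, not a real driver or CUDA interface (in CUDA itself, the preference is requested with `cudaFuncSetCacheConfig`).

```python
ON_CHIP_KB = 64

# The only two splits described in the text, as (shared_kb, l1_kb).
VALID_SPLITS = {(48, 16), (16, 48)}


def configure_on_chip(prefer_shared: bool) -> dict:
    """Pick one of the two valid shared/L1 splits (hypothetical helper)."""
    shared_kb, l1_kb = (48, 16) if prefer_shared else (16, 48)
    # Sanity checks: the split must be one of the two modes and use all 64KB.
    if (shared_kb, l1_kb) not in VALID_SPLITS or shared_kb + l1_kb != ON_CHIP_KB:
        raise ValueError("unsupported shared/L1 split")
    return {"shared_kb": shared_kb, "l1_kb": l1_kb}


compute_friendly = configure_on_chip(prefer_shared=True)
cache_friendly = configure_on_chip(prefer_shared=False)
print(compute_friendly, cache_friendly)
```

Kernels that manage their own data reuse tend to benefit from the larger shared memory, while code with less predictable access patterns may prefer the larger L1 cache.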

Other innovations of GF104

Now let's say a few words about the secondary innovations of GeForce GTX 460. According to NVIDIA, the new GPU supports Dolby TrueHD and DTS-HD bitstreaming via HDMI to external receivers. This may come in handy for HTPC applications. In other words, NVIDIA has finally eliminated one of the drawbacks that users criticized in its previous GPUs.

One of the key changes is improved power gating, which allows unused functional units to be disabled. NVIDIA gives no other details, but we're sure that GF104 is better than GF100 in this respect: solutions based on the new GPU consume much less power.

Finally, there's support for 3D Vision Surround. This technology provides stereo image output to three monitors at once. Actually, this is not something new, because GF104 supports this feature at the driver level. It's just that the previous drivers were beta versions, while the ones released with the rollout of GeForce GTX 460 are a full-fledged release.

Conclusions on theory and architecture

Obviously, GF104 is based on the improved architecture that first saw the light with the rollout of GF100. The new mid-range GPU features improvements aimed primarily at graphical calculations, such as the increased number of TMUs per SM. But there are non-graphical improvements as well, including twice as many instructions issued per clock per SM and more ALUs per SM.

GF104 has all the advantages of the Fermi architecture. The changes to the graphics pipeline are the most important. The new GPU has 8 tessellation engines -- though only 7 of those are enabled in GeForce GTX 460 -- and two Raster Engines. This will come in handy in DirectX 11 applications. Compared with the previous solutions, GTX 460 should also provide high geometry performance. But we'll check that out in the tests section.

Importantly, the new GPU doesn't have the potential bottleneck of GF100. In the corresponding review we mentioned that the number of TMUs in GF100 was insufficient in some situations, leading to decreased performance in certain tests sensitive to texture performance. Despite having fewer stream processors and ROPs, GF104 has the same total number of TMUs as GF100. This should help increase texture and rendering performance in games.

As for actual GeForce GTX 460 graphics cards, these solutions should be a success thanks to balanced features and affordability. The 1GB variant is especially attractive: it has a full-fledged 256-bit bus and more ROPs. The 768MB variant, on the other hand, may sometimes be bottlenecked by lower memory bandwidth and lower fillrate, especially where high resolutions and fullscreen antialiasing are involved.

Is this worth $30 (12.5% of the price)? Well, you decide. But it's not quite clear why they had to give the two variants the same name. These two solutions should have a 10-20% performance difference (even more in heavy modes). This may confuse regular customers who don't care much about ROPs and whatnot.

We shall be waiting for the next product in the series that will hopefully have the full-fledged GF104 GPU -- with all those stream processors, TMUs, tessellation engines, memory controllers and ROPs enabled. Given that GPU/memory clock rates will be higher as well, the product will be able to compete with GeForce GTX 470 based on the cut-down GF100. That's probably the reason why NVIDIA hasn't rolled it out yet -- they need to sell the remaining GF100 chips suitable for GeForce GTX 470 graphics cards.
