Graphics Processing Clusters
As we have already mentioned, the GF100 architecture consists of 4 Graphics Processing Clusters (GPCs), each containing 4 Streaming Multiprocessors (SMs) and a dedicated Raster Engine.
The new GPC differs from previous designs in two key ways. Firstly, it has its own scalable Raster Engine that handles triangle setup, rasterization and z-cull. Secondly, a GPC now also contains dedicated PolyMorph Engines that fetch vertex attributes and handle tessellation. Note that the Raster Engine belongs to the GPC as a whole, while every Streaming Multiprocessor in a GPC has its own PolyMorph Engine.
A GPC features all primary GPU units, excluding ROPs. Essentially, it's a standalone graphics chip. NVIDIA's previous GPUs had Multiprocessors and TMUs grouped into Texture Processing Clusters, while in GF100 every SM has 4 dedicated TMUs. More information on this is provided below.
The third generation of NVIDIA's Streaming Multiprocessors introduces a number of innovations aimed at improving performance, programmability and flexibility.
Each SM has 32 CUDA cores -- 4 times as many as in GT200 (though don't forget that the total number of SMs has decreased). The cores remain scalar and are thus highly efficient in any application, not just in specially optimized ones. For example, Z buffer operations (1D) and texture access (2D) can fully load the GPU's execution units, which is not the case for the ALUs of less efficient superscalar architectures.
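To put the per-cluster figures together, here is a small sketch that multiplies out the unit counts quoted above. It describes a fully enabled chip; the constant and function names are ours, and shipping cards may have some units disabled.

```python
# Unit counts quoted in the text above (fully enabled GF100).
GPCS = 4                # Graphics Processing Clusters per chip
SMS_PER_GPC = 4         # Streaming Multiprocessors per GPC
CORES_PER_SM = 32       # CUDA cores per SM
TMUS_PER_SM = 4         # texture mapping units per SM
POLYMORPH_PER_SM = 1    # one PolyMorph Engine per SM
RASTER_PER_GPC = 1      # one Raster Engine per GPC


def full_chip_totals():
    """Multiply the per-GPC and per-SM counts out to whole-chip totals."""
    sms = GPCS * SMS_PER_GPC
    return {
        "sms": sms,
        "cuda_cores": sms * CORES_PER_SM,
        "tmus": sms * TMUS_PER_SM,
        "polymorph_engines": sms * POLYMORPH_PER_SM,
        "raster_engines": GPCS * RASTER_PER_GPC,
    }


if __name__ == "__main__":
    for name, count in full_chip_totals().items():
        print(f"{name}: {count}")
```

With these inputs a full chip comes to 16 SMs, 512 CUDA cores and 64 TMUs.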
Each stream processor (CUDA core) features an integer ALU and an FPU. GF100 complies with the new IEEE 754-2008 floating-point standard and supports fused multiply-add (FMA) for both single- and double-precision computing.
Unlike multiply-add (MAD), FMA performs the multiplication and the addition with a single rounding at the end. This avoids the precision loss caused by rounding the intermediate product and minimizes rendering errors in some situations, e.g., with overlapping triangles.
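The effect of the single rounding is easy to reproduce on a CPU. The sketch below is only an illustration, not GPU code: it uses Python's double-precision floats, and it emulates FMA by computing the exact result with rational arithmetic and rounding once. The function names and the input values are our own choice.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply-add with two roundings: the product is rounded, then the sum."""
    product = a * b       # first rounding (to double precision)
    return product + c    # second rounding


def fma_emulated(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)   # the single rounding happens here


# Inputs chosen so that the exact product, 1 - 2**-60, does not fit
# into the 53-bit significand of a double and gets rounded up to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

if __name__ == "__main__":
    print("MAD:", mad(a, b, c))
    print("FMA:", fma_emulated(a, b, c))
```

Here MAD loses the small term completely and returns zero, while the fused version keeps it and returns -2**-60.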
The new integer ALU introduced in GF100 supports full 32-bit precision for all instructions, as programming languages require. It is also optimized for efficient 64-bit operations. Each SM has 16 load/store units (LD/ST or LSU) that can calculate source and destination addresses for 16 threads per clock.
Four Special Function Units (SFUs) perform more complex operations, including sines, cosines, square roots, etc. These units are also used for interpolation of graphics attributes. Each SFU executes 1 instruction per thread per clock, meaning that a warp of 32 threads is processed in 8 clocks. The SFU pipeline is decoupled from the dispatch unit, enabling the latter to issue instructions to other execution units while the SFUs are busy.
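The 8-clock figure follows directly from the unit count. Here is a minimal sketch of that arithmetic, assuming the simplest possible model in which every unit handles one thread per clock and nothing else stalls the pipeline; the helper name is ours.

```python
WARP_SIZE = 32  # threads per warp


def clocks_per_warp(units, warp_size=WARP_SIZE):
    """Clocks needed to push one warp through `units` parallel units,
    assuming each unit handles one thread per clock (ceiling division)."""
    return -(-warp_size // units)


if __name__ == "__main__":
    print("4 SFUs:    ", clocks_per_warp(4), "clocks per warp")
    print("16 LD/ST:  ", clocks_per_warp(16), "clocks per warp")
```

Under the same model, the 16 load/store units need 2 clocks to cover a full warp.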
It's important to note that while double-precision computing is greatly sped up in the Fermi architecture, the actual gaming products based on GF100 are deliberately slowed down. For example, GeForce GTX 480 performs 64-bit computing at only a quarter of its potential peak speed -- at 168 GFLOPS instead of 672 GFLOPS.
This seems logical, because double-precision computing is not something graphics cards need very much, and the restriction should help sales of Fermi-based Tesla products. GeForce cards need neither highly efficient 64-bit computing nor ECC. These features will be offered where they are needed most -- in Tesla.
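For readers who want to see where the 672 and 168 GFLOPS figures come from, here is a rough sketch. The core count (480), the shader clock (1401 MHz) and the half-rate double-precision ratio are not given in this article; we take them as assumptions about GeForce GTX 480 and Fermi. An FMA is counted as two floating-point operations.

```python
# Assumptions, not taken from the article text:
CUDA_CORES = 480           # enabled CUDA cores on GeForce GTX 480
SHADER_CLOCK_GHZ = 1.401   # shader (core) clock in GHz
DP_RATE = 0.5              # architectural double-precision rate vs. single
GEFORCE_DP_LIMIT = 0.25    # "a quarter of peak speed", as stated in the text

FLOPS_PER_FMA = 2          # one multiply plus one add


def peak_gflops(cores, clock_ghz, rate=1.0):
    """Theoretical peak: cores x clock x 2 operations per FMA x rate."""
    return cores * clock_ghz * FLOPS_PER_FMA * rate


sp_peak = peak_gflops(CUDA_CORES, SHADER_CLOCK_GHZ)
dp_peak = peak_gflops(CUDA_CORES, SHADER_CLOCK_GHZ, DP_RATE)
dp_geforce = dp_peak * GEFORCE_DP_LIMIT

if __name__ == "__main__":
    print(f"single precision:        {sp_peak:.0f} GFLOPS")
    print(f"double precision, full:  {dp_peak:.0f} GFLOPS")
    print(f"double precision, GTX:   {dp_geforce:.0f} GFLOPS")
```

Under these assumptions the numbers round to the 672 and 168 GFLOPS quoted above.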
Double Warp Scheduler
As we have mentioned, Multiprocessors handle threads in groups of 32 called warps. Each SM has two Warp Schedulers and two Instruction Dispatch Units. This allows each SM to issue and execute 2 warps concurrently.
The double Warp Scheduler selects 2 warps and issues 1 instruction from each to a group of 16 computing cores, 16 LSUs or 4 SFUs. Since warps are processed independently of each other, the GPU's scheduler doesn't have to check the instruction stream for dependencies. This dual-issue model allows performance close to the theoretical peak.
Most instructions can be issued in pairs: two integer, two floating-point, or a mix of integer, floating-point, load/store and SFU instructions. But this is only applicable to single-precision instructions. A double-precision instruction cannot be issued simultaneously with any other instruction.
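The pairing rule can be expressed as a toy model. The sketch below is our own simplification, not NVIDIA's scheduler: it encodes only the one restriction stated above (a double-precision instruction issues alone) and ignores any other resource limits the real hardware has. All names are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    INT = "integer"
    FP32 = "single-precision float"
    FP64 = "double-precision float"
    LDST = "load/store"
    SFU = "special function"


def can_dual_issue(a, b):
    """Simplified rule from the text: any pair may issue together
    unless one of the two instructions is double-precision."""
    return a is not Kind.FP64 and b is not Kind.FP64


def issue_cycles(warp_a, warp_b):
    """Count issue slots for two independent warps' instruction lists,
    pairing the next instruction of each warp whenever the rule allows."""
    i = j = cycles = 0
    while i < len(warp_a) or j < len(warp_b):
        both_ready = i < len(warp_a) and j < len(warp_b)
        if both_ready and can_dual_issue(warp_a[i], warp_b[j]):
            i += 1
            j += 1
        elif i < len(warp_a):
            i += 1          # warp A issues alone
        else:
            j += 1          # warp B issues alone
        cycles += 1
    return cycles


if __name__ == "__main__":
    a = [Kind.FP32, Kind.INT, Kind.FP64]
    b = [Kind.INT, Kind.SFU, Kind.FP32]
    print("issue slots needed:", issue_cycles(a, b))
```

In this example the single FP64 instruction costs one extra issue slot: the two three-instruction warps need 4 slots instead of 3.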
Texture Mapping Units
The number of TMUs and their capabilities are very important for any GPU. As you can see on the Streaming Multiprocessor diagram, each SM has 4 TMUs. Each of them computes a texture address and fetches 4 texture samples per clock. The result can be returned unfiltered (for Gather4) or with bilinear, trilinear or anisotropic filtering applied -- with a corresponding slowdown.
The Texture Mapping Units of GF100 have not changed radically compared with previous products. According to NVIDIA, the changes to GF100 TMUs are primarily aimed at improving texture fetch performance. Advantages include the move of TMUs into the Streaming Multiprocessors, as well as higher caching efficiency and TMU clock rates.
In the previous GPU, GT200, up to 3 Streaming Multiprocessors shared one large texture block consisting of 8 TMUs. The new architecture provides every SM with its own TMUs and texture cache. Theoretically, this should boost efficiency, but we'll check it in tests just to be sure.
NVIDIA promises an especially big texturing performance boost in shadow mapping and algorithms like screen space ambient occlusion. Both techniques utilize Gather4, a standard DirectX feature that allows simultaneous fetching of 4 values per clock.
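To show why a four-value fetch suits shadow mapping, here is a CPU-side sketch of percentage-closer filtering over a single 2x2 footprint. It is only an illustration under our own assumptions: the function names are hypothetical, the texture is a plain list of rows, and the texel order and addressing of the real Gather4 instruction are not modelled.

```python
def gather4(texture, x, y):
    """Return the 2x2 block of texels whose top-left corner is (x, y),
    clamping to the texture edge. One call stands in for one fetch."""
    height, width = len(texture), len(texture[0])
    x1 = min(x + 1, width - 1)
    y1 = min(y + 1, height - 1)
    return (texture[y][x], texture[y][x1], texture[y1][x], texture[y1][x1])


def pcf_shadow(shadow_map, x, y, fragment_depth):
    """Fraction of the four gathered depth samples that leave the
    fragment lit (fragment is not farther than the stored depth)."""
    samples = gather4(shadow_map, x, y)
    lit = sum(1 for depth in samples if fragment_depth <= depth)
    return lit / len(samples)


if __name__ == "__main__":
    shadow_map = [
        [0.5, 0.9],
        [0.3, 0.8],
    ]
    print("lit fraction:", pcf_shadow(shadow_map, 0, 0, 0.6))
```

All four depth comparisons are served by one fetch instead of four separate point samples, which is where the promised speedup comes from.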
What's more important, GF100 has a more efficient dedicated L1 texture cache. Together with the unified L2 cache, it provides three times as much cache for textures as GT200 had. Nevertheless, GT200 has more TMUs, so it remains to be seen whether the new GPU delivers high texturing performance in real applications.
Other changes introduced into TMUs include support for the new BC6H and BC7 compression formats, which were added in DirectX 11 and are intended for HDR textures and render targets.