iXBT Labs - Computer Hardware in Detail






NVIDIA GeForce GTX 480 Graphics Card

Architecture, synthetic tests and games.

April 3, 2010

<< Previous page

     Next page >>

Parallel geometry processing

Let's get back to the key innovations of GF100. All the previous GPU generations used one unit for fetching, setup and rasterization of triangles. This kind of graphics pipelines provided limited performance and could often be a general bottleneck.

Another reason was the complexity of parallelization, given the lack of corresponding API improvements. Such pipelines, with single raster units, used to provide sufficient performance. But the growing complexity and large-scale use of geometry computing made rasterization the key bottleneck of increasing geometry complexity of 3D scenes.

For example, active use of tessellation changes the load on various GPU units. With tessellation, the density of triangles grows by factors of ten. This puts heavy load on previously serial parts of the graphics pipeline like triangle setup and rasterization. To provide high tessellation performance, they needed to solve this issue by introducing changes to the architecture and rebalancing the entire GPU graphics pipeline.

To achieve high geometry processing performance, NVIDIA developed a scalable geometry processing unit dubbed PolyMorph Engine. Each of 16 PolyMorph units GF100 has contains own vertex fetch unit and a Tessellator. This significantly improves geometry processing performance.

Besides, GF100 features four Raster Engines. Those work in parallel and allow setting up up to 4 triangles per clock. Together, these units provide a nice boost to triangle processing, tessellation and rasterization performance.

PolyMorph Engine has five stages of operation: Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output. Results of each stage are transferred to a Streaming Multiprocessor. The latter executes a shader program, returning the data for the next PolyMorph Engine operation stage. After all stages are completed, the results are transferred to Raster Engines.

The first stage starts with fetching vertices from a global vertex buffer. The fetched vertices are sent to the SM for vertex shading and hull shading. During these two phases vertices are converted from object space coordinates to world space coordinates. Besides, parameters required by tessellation (e.g., tessellation factor) are calculated and sent to the Tessellator.

During the second stage, the PolyMorph Engine reads the tessellation parameters and breaks the patch (smooth surface defined by control points), outputting the resulting mesh. These new vertices are send to the SM, where domain and geometry shaders are executed.

The Domain Shader calculates the final position of every vertex based on data from the Hull Shader and the Tessellator. This stage is when a displacement map is usually applied to added detail to the patch. The geometry shader does additional processing, adding or removing vertices or primitives if necessary.

During the last stage, the PolyMorph Engine performs viewport transformation and perspective correction. Then attributes are set up, while vertices may be output to memory for further processing by means of stream output.

The previous architectures, had similar fixed function operations performed by only one pipeline. Theoretically, GF100 should have both fixed function and programmable operations parallelized. This, in turn, should boost performance if it's bottlenecked by such operations.

Raster Engines

After primitives have been processed by a PolyMorph Engine, they are sent to a Raster Engine (GF100 has 4 of those). Raster Engines also work in parallel to achieve high geometry processing performance.

A Raster Engine performs three pipeline stages. During edge setup stage, it fetches positions of vertices and calculates triangle edge projections. Reversed triangles are discarded as invisible (back face culling). Each edge setup units processes a single dot, line or triangle per clock.

The Raster Engine uses edge projections for every primitive and calculates pixel coverage. If antialiasing is enabled, coverage for every color and coverage fetch is calculated. Each of the four Raster Engines outputs 8 pixels per clock, e.g., 32 rasterized pixels per clock in total.

From a Raster Engine pixels are sent to a Z-cull unit. The latter compares the depth of pixels from a tile with the depth of pixels in the screen buffer and discards ones that are positioned behind the pixels in the screen buffer. Z-culling saves resources by eliminating unnecessary pixel calculations.

We believe that the new GPC architecture is the most important innovation in GF100's geometry pipeline. Because tessellation requires much higher performance of triangle setup and rasterization units. 16 PolyMorph Engines considerably speed up triangle fetch, tessellation and Stream Out, while 4 Raster Engines provide a high speed of triangle setup and rasterization.

In the next parts of the review we'll see if our tessellation estimations are correct. Dedicated Tessellators in every Streaming Multiprocessor and Raster Engine of every GPC should provide up to 8x geometry performance boost compared with GT200. We'll see if this is so.

Memory subsystem

An efficiently organized memory subsystem is very important for a GPU today. All the more so because general-purpose computing are drawing more attention. So, NVIDIA's new GPU features a more advanced memory subsystem. GF100 has a dedicated L1 cache in every Streaming Multiprocessor.

The cache works together with Multiprocessor's shared memory, complementing it. The shared memory speeds up memory access from predictable-access algorithms, while the L1 cache speeds up access from irregular algorithms in which addresses of requested data is not known by default.

Each Multiprocessor in GF100 has 64KB of onchip memory that can be configured as 48KB shared + 16KB L1 or 16KB shared and 48KB L1.

For graphics GF100 uses the first mode, cache working as a register buffer. For calculations, the cache and shared memory let streams of one unit exchange data, working together, thus reducing bandwidth requirements. Besides, the shared memory itself allows using many algorithms efficiently.

Besides, GF100 has 768KB of unified L2 cache that handles all requests for loading and storing data, as well as texture fetches. The L2 cache enables efficient and high-speed data exchange across the entire GPU. So algorithms in which data requests are unpredictable (physical computing, ray tracing, etc.) get a considerable performance boost from hardware cache. Besides, postprocessing filters in which several Multiprocessors read the same data get sped up due to fewer data calls from external memory.

Unified cache is more efficient than different caches used for different needs. In the latter case there can be situations when one of the caches is used completely, but it's impossible to use other idling caches at that. This may make caching efficiency lower than it's theoretically possible. But the unified L2 cache of GF100 allocates space for different calls dynamically, thus providing higher efficiency.

In other words, one L2 cache has replaced texture L2 caches, ROP caches and onchip buffers of the previous-generation GPUs. GF100's L2 cache is used to both read and write data and is fully coherent, while that of GT200, for example, is only used to read data.

In general, the new GPU enables more efficient data exchange between pipeline stages and can considerably save external memory bandwidth, increasing the efficiency of use of GPU execution units.

New ROPs and improved antialiasing

ROPs, as well as blending & antialiasing subsystem, of GF100 has also undergone significant changes aimed at improving their efficiency. One ROP section in GF100 contains 8 ROPs -- twice as much as in the previous generation. Each ROP can output a 32-bit integer per clock, an FP16 pixel per two clocks, or an FP32 pixel per four clocks.

The key ROP-related drawback of the previous GPUs is low 8x MSAA performance. NVIDIA has considerably improved this mode in GF100, having increased buffer compression efficiency, as well as ROPs efficiency in rendering of small primitives. The latter innovation is also important, because tessellation increases the number of smaller triangles, thus increasing requirements to ROP performance.

Image quality is another thing we're interested in. The new GTX 400 series introduce a new antialiasing algorithm 32x CSAA (Coverage Sampling Antialiasing) that enables very high antialiasing of both geometry and semitransparent textures using alpha-to-coverage. In this case 32 stands for 8 fair multisampling fetches + 24 pixel coverage fetches.

The previous-generation GPUs only used 4 or 8 fetches that couldn't eliminate aliasing completely and resulted in banding. The new 32x CSAA mode uses 32 coverage fetches that minimize all aliasing artefacts.

TMAA (Transparency Multisampling Antialiasing) also benefits from the new CSAA mode. TMAA is usually used in older DirectX 9 applications that do not use alpha-to-coverage, because it's not available in this API. In this case alpha test technique is used that provides sharp edges to semitransparent textures.

The left image shows the best TMAA mode GT200 is capable of: 16xQ with 8 multisampling and 8 coverage fetches. The right image shows TMAA on GF100: 32x CSAA with 8 multisampling and 24 coverage fetches.

Using coverage fetches doesn't increase memory bandwidth and size requirements much. In case of GF100 the performance of the 32x CSAA mode is similar to that of the usual 8x MSAA mode, the difference being only 10% or less. Considering there's little difference between 4x and 8x modes, 32x CSAA becomes the best mode in terms of performance/quality ratio. Especially, in case of solutions as powerful as GTX 470 and GTX 480.

Write a comment below. No registration needed!

<< Previous page

Next page >>

blog comments powered by Disqus

  Most Popular Reviews More    RSS  

AMD Phenom II X4 955, Phenom II X4 960T, Phenom II X6 1075T, and Intel Pentium G2120, Core i3-3220, Core i5-3330 Processors

Comparing old, cheap solutions from AMD with new, budget offerings from Intel.
February 1, 2013 · Processor Roundups

Inno3D GeForce GTX 670 iChill, Inno3D GeForce GTX 660 Ti Graphics Cards

A couple of mid-range adapters with original cooling systems.
January 30, 2013 · Video cards: NVIDIA GPUs

Creative Sound Blaster X-Fi Surround 5.1

An external X-Fi solution in tests.
September 9, 2008 · Sound Cards

AMD FX-8350 Processor

The first worthwhile Piledriver CPU.
September 11, 2012 · Processors: AMD

Consumed Power, Energy Consumption: Ivy Bridge vs. Sandy Bridge

Trying out the new method.
September 18, 2012 · Processors: Intel
  Latest Reviews More    RSS  

i3DSpeed, September 2013

Retested all graphics cards with the new drivers.
Oct 18, 2013 · 3Digests

i3DSpeed, August 2013

Added new benchmarks: BioShock Infinite and Metro: Last Light.
Sep 06, 2013 · 3Digests

i3DSpeed, July 2013

Added the test results of NVIDIA GeForce GTX 760 and AMD Radeon HD 7730.
Aug 05, 2013 · 3Digests

Gainward GeForce GTX 650 Ti BOOST 2GB Golden Sample Graphics Card

An excellent hybrid of GeForce GTX 650 Ti and GeForce GTX 660.
Jun 24, 2013 · Video cards: NVIDIA GPUs

i3DSpeed, May 2013

Added the test results of NVIDIA GeForce GTX 770/780.
Jun 03, 2013 · 3Digests
  Latest News More    RSS  

Platform  ·  Video  ·  Multimedia  ·  Mobile  ·  Other  ||  About us & Privacy policy  ·  Twitter  ·  Facebook

Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.