iXBT Labs - Computer Hardware in Detail






Pixel Shader 2.0 precision


In this article I would like to look into the current state of the second version of pixel shaders in DirectX 9. But first, a look at the history.

John Carmack

As early as 2000, John Carmack pointed out that graphics cards needed floating-point numbers not only in the geometry pipeline but in the pixel pipeline as well. Let's quote his words:



We need more bits per color component in our 3D accelerators.

I have been pushing for a couple more bits of range for several years now, but I now extend that to wanting full 16 bit floating point colors throughout the graphics pipeline. A sign bit, ten bits of mantissa, and five bits of exponent (possibly trading a bit or two between the mantissa and exponent). Even that isn't all you could want, but it is the rational step.


There are other more subtle issues [due to limited precision - editor], like the loss of potential result values from repeated squarings of input values, and clamping issues when you sum up multiple incident lights before modulating down by a material. Range is even more clear cut. There are some values that have intrinsic ranges of 0.0 to 1.0, like factors of reflection and filtering. Normalized vectors have a range of -1.0 to 1.0. However, the most central quantity in rendering, light, is completely unbounded. We want a LOT more than a 0.0 to 1.0 range. Q3 hacks the gamma tables to sacrifice a bit of precision to get a 0.0 to 2.0 range, but I wanted more than that for even primitive rendering techniques. To accurately model the full human sensable range of light values, you would need more than even a five bit exponent.


64 bit pixels. It is The Right Thing to do. Hardware vendors: don't you be the company that is the last to make the transition.

The whole text of Carmack's plan can be found here: http://www.bluesnews.com/cgi-bin/finger.pl?id=1&time=20000429013039

The gist of the excerpt quoted above is the need for floating-point operations in the pixel pipeline. A bit later I'll refer to other parts of Carmack's .plan.


In February 2001 Microsoft presented its DirectX9 architecture vision (very close to what we've finally got in the ATi R300). At the presentation it was announced that the next pixel shader version in DirectX9, known as "PS 2.0", would operate on single-precision floating-point numbers and would be functionally much closer to vertex shaders.

Floating point representation of numbers in PS 2.0 was implemented in DirectX9 released in December 2002.

What are floating-point numbers?

There are several ways to represent real numbers on computers.

1) Fixed point places the radix point at a fixed position among the digits and is equivalent to using integers that count portions of a unit. For example, one may count in 1/100ths of a unit; with four decimal digits you could represent 10.82 or 00.01.
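This scheme can be sketched in a few lines of Python (a hypothetical illustration; the names `SCALE`, `to_fixed` and `from_fixed` are mine, not from any real API):

```python
# Hypothetical sketch of fixed-point storage: values are kept as plain
# integers counting 1/100ths of a unit, so the "radix point" never moves.
SCALE = 100

def to_fixed(x):
    # Encode a real number as a scaled integer.
    return round(x * SCALE)

def from_fixed(n):
    # Decode the scaled integer back to a real number.
    return n / SCALE

print(to_fixed(10.82))    # stored as the integer 1082
print(from_fixed(1))      # the smallest representable step, 0.01
```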

2) Rational is another approach where a number is represented as a ratio of two integers.

3) Floating-point representation - the most common solution - basically represents reals in scientific notation, like this one - 1.45 x 10^19. We will have a closer look at it below.

Floating-point representation

The scientific notation represents numbers as a base number and an exponent. For example, 123.456 could be represented as 1.23456 x 10^2. In the hexadecimal system, the number 123.abc can be represented as 1.23abc x 16^2.

Floating-point representation solves a number of problems. Fixed-point numbers have a fixed range of representation, which limits them from representing very big or very small numbers. Also, fixed-point numbers may lose precision when two large numbers are divided.

Floating-point numbers, on the other hand, employ a kind of a "sliding window" of precision depending on the scale of the number. This easily allows representing numbers from 1,000,000,000,000 to 0.0000000000000001.
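This "sliding window" is easy to observe in Python; the helper below (an illustrative `as_f32`, not part of any library API) simply round-trips a value through 32-bit storage:

```python
import struct

def as_f32(x):
    # Round-trip a Python float through IEEE single-precision storage.
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Both huge and tiny magnitudes fit into the same 32-bit format...
print(as_f32(1e12), as_f32(1e-16))

# ...but the precision window slides with the magnitude: near 1e12 the
# spacing between adjacent representable singles is 65536, so adding 1.0
# changes nothing.
print(as_f32(1e12 + 1.0) == as_f32(1e12))   # True
```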

In this article I will focus only on the main differences between integer and floating-point numbers - ranges and precision - and compare the currently available CPU implementations with the GPU ones.

But now a bit of the history again.

Intel's way to do floating point operations

Today the IEEE-754 floating-point standard is the most common representation of real numbers on computers, including Intel-based PC's, Macintoshes, and most Unix platforms. But how was it formed?

In 1976 Intel began to design a floating-point co-processor for its i8086/8 and i432 microprocessors. Dr. John Palmer (manager of Intel's floating-point effort), who had known him at Stanford ten years earlier, recruited William Kahan as a consultant for the upcoming i8087 coprocessor for the i8086/8. Rumors about the i8087 soon spread through Silicon Valley, and the worried developers founded a committee to work on a floating-point arithmetic standard for microprocessors. In 1977, after several committee meetings, Professor Kahan, his student Jerome Coonen at U.C. Berkeley, and visiting Prof. Harold Stone prepared a draft specification in the format of an IEEE standard and brought it to the IEEE p754 meeting. This draft was called "K-C-S" until p754 adopted it. By 1985, when IEEE Standard 754 was formally ratified, it had already become a de facto standard.

Modern x86 compatible microprocessors support 32, 64 and 80 bit floating point formats.

Storage Layout

IEEE floating-point numbers have three basic components: sign, exponent, and mantissa. The mantissa is composed of a fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and doesn't need to be stored.

The following figure shows the layout for single (32-bit), double (64-bit), quadruple (128-bit) and extended (80-bit) precision floating-point values. The number of bits for each field is indicated (bit ranges are in square brackets):


                      Sign      Exponent      Mantissa       Bias
Single Precision      1 [31]    8 [30-23]     23 [22-00]     127
Double Precision      1 [63]    11 [62-52]    52 [51-00]     1023
Quadruple Precision   1 [127]   15 [126-112]  112 [111-00]   16383
Extended Precision    1 [79]    15 [78-63]    64 [63-00]     16383


One of the common representations of floating point numbers is "sXXeYY" where XX represents the number of mantissa bits and YY represents the number of exponent bits. Here: single - s23e8; double - s52e11; extended - s64e15; quadruple - s112e15.
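These layouts are easy to inspect in Python; the sketch below (an illustrative helper, not from the article's test utility) unpacks a single-precision value into its s23e8 fields:

```python
import struct

def fields_f32(x):
    # Reinterpret the 32 bits of an IEEE single and slice out its fields.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign     = bits >> 31             # 1 bit  [31]
    exponent = (bits >> 23) & 0xFF    # 8 bits [30-23], stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23 bits [22-00], fraction only
    return sign, exponent, mantissa

print(fields_f32(1.0))    # (0, 127, 0): +1.0 x 2^(127-127)
print(fields_f32(-2.0))   # (1, 128, 0): -1.0 x 2^(128-127)
```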

Here is how the bits are ordered in memory:

  sign   exponent   mantissa

Let's see what's stored in these fields:

The Sign Bit

There are two possible values: 0 denotes a positive number, 1 a negative one.

The Exponent

The exponent field must represent both positive and negative exponents. For this purpose, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73.
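In code the bias arithmetic is a one-liner; a trivial sketch for single precision:

```python
BIAS = 127  # bias for the s23e8 (single precision) exponent field

def stored_exponent(actual):
    # The field holds the actual exponent shifted up by the bias.
    return actual + BIAS

def actual_exponent(stored):
    return stored - BIAS

print(stored_exponent(0))     # 127, as in the example above
print(actual_exponent(200))   # 73
```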

The Mantissa

The mantissa represents precision bits of the number. It is composed of an implicit leading bit and fraction bits.

To find out the value of the implicit leading bit we should take into account that any number can be expressed in scientific notation in many different ways. For example, the number five can be represented as any of these:

  • 5.00 x 10^0
  • 0.05 x 10^2
  • 5000 x 10^-3

In order to maximize the quantity of representable numbers, floating-point numbers are stored in the normalized form. This basically puts the radix point after the first non-zero digit. In the normalized form, five is represented as 5.0 x 10^0.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1.
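Putting the three fields back together, with the implicit leading 1 made explicit, reconstructs the stored value. A Python sketch for normalized singles (zero and denormals are deliberately ignored; the helper name is mine):

```python
import struct

def decode_f32(x):
    # Rebuild a normalized IEEE single from its fields by hand.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign     = -1.0 if bits >> 31 else 1.0
    exponent = ((bits >> 23) & 0xFF) - 127       # remove the bias
    fraction = (bits & 0x7FFFFF) / 2.0**23       # 23 stored fraction bits
    # The leading 1 is implicit in storage, so add it back here.
    return sign * (1.0 + fraction) * 2.0**exponent

print(decode_f32(6.5))     # 6.5   (= +1.625 x 2^2)
print(decode_f32(-0.75))   # -0.75 (= -1.5 x 2^-1)
```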

Ranges and precision of Floating-Point Numbers

                                 Single       Double        Quadruple      Extended
Decimal digits of precision,
  p / log2(10)                   7.22         15.95         34.01          19.26
Emax                             +127         +1023         +16383         +16383
Emin                             -126         -1022         -16382         -16382
Range magnitude maximum,
  ~2^(Emax+1)                    3.4028E+38   1.7976E+308   1.1897E+4932   1.1897E+4932
Range magnitude minimum          1.1754E-38   2.2250E-308   3.3621E-4932   3.3621E-4932
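The figures in this table follow directly from the field widths. A Python sketch for single and double (quadruple's maximum would overflow a native Python float, so it is left out):

```python
import math

# (total mantissa bits p, including the implicit leading 1; Emax; Emin)
formats = {
    'single': (24, 127, -126),
    'double': (53, 1023, -1022),
}

for name, (p, emax, emin) in formats.items():
    digits  = p / math.log2(10)                  # decimal digits of precision
    maximum = (2 - 2.0**(1 - p)) * 2.0**emax     # largest finite magnitude
    minimum = 2.0**emin                          # smallest normalized magnitude
    print(f'{name}: {digits:.2f} digits, max {maximum:.4E}, min {minimum:.4E}')
```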

A closer look at PS 2.0 standard and current PS 2.0 capable hardware

At the presentation of PS 2.0, and later with the release of the first DirectX9 beta, Microsoft established unified requirements for the minimal range and precision of floating-point numbers used in PS 2.0. Ideally, floating-point arithmetic should comply with s23e8 (32-bit single precision) numbers. Later, apparently after some lobbying from NVIDIA, the PS 2.0 standard was extended with "partial precision" execution of floating-point operations. Note that this "partial precision" flag is only a hint to the video card's driver that an operation does not need a fully precise result; the driver may ignore the flag and execute the PS command in normal/full precision mode.

Below you can see a part of the current specification of PS 2.0 standard concerning the floating-point precision:

Internal Precision

- All hardware that supports PS 2.0 needs to set MaxTextureRepeat,
  which is required to be at least (-128, +128).
- Implementations vary precision automatically based on precision of
  inputs to a given op for optimal performance.
- For ps_2_0 compliance, the minimum level of internal precision for
  temporary registers (r#) is s16e7.
- The minimum internal precision level for constants (c#) is s10e5.
- The minimum internal precision level for input texture coordinates (t#) is s16e7.
- Diffuse and specular (v#) are only required to support [0-1] range, and high-precision is not required.

As we can see, only the r#, c# and t# registers require high-precision representation, while colors (both diffuse and specular) can be represented with the same fixed-point registers as in DirectX8 PS 1.x.

So we have reached the central part of our article: determining the precision of floating-point numbers used in the current generation of video chips. For this purpose we developed a special test utility. The utility stores the test results in a log file formatted the following way:

PixelShader 2.0 precision test. Version 1.3
Copyright (c) 2003 by ReactorCritical / iXBT.com
Questions, bug reports send to: clootie@ixbt.com


Device: RADEON 9500 SERIES
Driver: ati2dvag.dll
Driver version:


Registers precision:
Rxx = s16e7 (temporary registers)
Cxx = s16e7 (constant registers)
Txx = s16e7 (texture coordinates)


Registers precision in partial precision mode:
Rxx = s16e7 (temporary registers)
Cxx = s16e7 (constant registers)
Txx = s16e7 (texture coordinates)

In this log file you can see six values reflecting the precision of floating-point numbers in the video chip: one for each register type, in two different operation execution modes.
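The utility itself determines precision with pixel shader stencils executed on the GPU, but the underlying idea can be sketched on the CPU: probe for the point at which 1 + 2^-n becomes indistinguishable from 1 after a round trip through the storage format. Here it is in Python against IEEE single precision (the helper names are mine, not the utility's):

```python
import struct

def f32(x):
    # Round-trip through single-precision storage; this stands in for the
    # GPU register that the real test writes to and reads back.
    return struct.unpack('<f', struct.pack('<f', x))[0]

def mantissa_bits(roundtrip):
    # Largest n such that 1 + 2^-n still survives the round trip.
    n = 0
    while roundtrip(1.0 + 2.0**-(n + 1)) != 1.0:
        n += 1
    return n

print(mantissa_bits(f32))   # 23, i.e. s23e8
```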

Below is a summary of the results obtained on NVIDIA and ATI video cards. A link to the test utility can be found at the end of the article. The program package also contains the pixel shader stencils used to determine register precision.

Registers                ATI R3x0   NVIDIA NV30/31/34   NVIDIA NV35
rXX                      s16e7      s10e5               s23e8
cXX                      s16e7      s10e5               s23e8
tXX                      s16e7      s10e5               s23e8
rXX partial precision    s16e7      s10e5               s10e5
cXX partial precision    s16e7      s10e5               s10e5
tXX partial precision    s16e7      s10e5               s23e8

If it were not for the NV35's results, the numbers in the table wouldn't differ much, right? It's well known that ATi chips use 24-bit floating-point numbers internally in the R300 core, and this precision is not influenced by the partial precision modifier. But it's remarkable that the other NVIDIA chips use 16-bit floating-point numbers irrespective of the operation precision requested(!), even though the partial precision term was introduced at NVIDIA's request, NV3x GPUs support 32-bit floating-point precision under the OpenGL NV_fragment_program extension, and NVIDIA advertised its new-generation video chips as capable of TRUE 32-bit floating-point rendering!

The NV35 demonstrates the most varied, and the most correct, behavior among NVIDIA's video chips. Calculations are performed with 32-bit precision in the standard mode, in line with the Microsoft specifications; when partial precision is requested, temporary and constant registers drop to 16-bit precision while texture registers keep 32-bit precision, even though according to the Microsoft specification texture registers may also use 16-bit precision.

Note that the NV3x results were obtained with WHQL-certified drivers, and I'm very sorry that Microsoft does not enforce compliance with its own DirectX specifications. Also note that the 16-bit floating-point format used by NVIDIA is identical to the one suggested by John Carmack in 2000.

Let's analyze the results. Below are the properties of 16-bit and 24-bit floating-point numbers, with standard 32-bit numbers for reference.


                                s10e5      s16e7        s23e8
Size (bits)                     16         24           32
Mantissa (bits)                 10         16           23
Exponent (bits)                 5          7            8
Decimal digits of precision,
  p / log2(10)                  3.31       5.11         7.22
Mantissa distinct values        1024       65536        8388608
Emax                            +15        +63          +127
Emin                            -14        -62          -126
Range magnitude maximum,
  ~2^(Emax+1)                   65536      1.8446E+19   3.4028E+38
Range magnitude minimum         0.000061   2.1684E-19   1.1754E-38
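The s10e5 column can be checked against IEEE binary16, which has the same layout and is available in Python through struct's 'e' format (the helper name `f16` is mine):

```python
import struct

def f16(x):
    # Round-trip through 16-bit (s10e5) storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(f16(65504.0))           # largest finite s10e5 value, just under 2^16
print(f16(2.0**-14))          # smallest normalized value, ~6.1E-05
print(f16(1.0 + 2.0**-10))    # last distinguishable step above 1.0
print(f16(1.0 + 2.0**-11))    # prints 1.0: the 11th fraction bit is lost
```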

It's clear that the s10e5 format lags behind the other formats in most respects. It may look like a paradox, but it's more correct to compare s10e5 numbers with the fixed-point numbers used in PS 1.x. The precision of PS 1.x numbers, even on the NV30, is 12 bits, which matches the precision of s10e5 FP numbers (if we take into account the sign bit and the implicit leading bit). And the advantage of the s10e5 format shows precisely in comparison with fixed-point numbers: a much bigger maximum absolute value (65536 versus 1, 2 or 8, depending on the chip) and, at the same time, much smaller representable absolute values.

If you remember, John Carmack indicated the area where he would like to use s10e5 numbers: lighting. The extended range allows overbright lighting, for emulating very bright light sources without losing detail in the shadows.

But precision is the area where programmers should be very careful with s10e5 numbers. Obviously, 16-bit precision is not enough to build a correct raytracer like the one demonstrated by ATi, but even the calculation of texture coordinates in pixel shaders may lead to undesirable results. The precision of s10e5 numbers isn't even enough to correctly address textures larger than 1024 pixels along one dimension with bilinear filtering enabled. NVIDIA understands these limitations perfectly and has already started training game developers to find the areas where the insufficient precision of s10e5 numbers leads to incorrect results. NVIDIA also encourages moving all high-precision calculations into vertex shaders.
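The texture addressing limit is easy to demonstrate with a binary16 round trip (struct's 'e' format has the same s10e5 layout; the helper name is mine): once texel coordinates pass 1024, adjacent representable s10e5 values are a whole texel apart, so the sub-texel fraction that bilinear filtering needs is lost.

```python
import struct

def f16(x):
    # Round-trip through 16-bit (s10e5) storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Below 1024 texels, half-texel positions still survive...
print(f16(512.5))     # 512.5

# ...but between 1024 and 2048 the spacing of representable values is a
# whole texel, so the half-texel offset disappears and bilinear weights
# come out wrong.
print(f16(1500.5))    # 1500.0
```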

What's next?

In this article I've described floating-point numbers, the formats currently used in microprocessors, and the kind of floating-point support provided by video chip companies today. I must say that 16-bit floating-point numbers are not sufficient for general mathematical computations. But I hope that NVIDIA will let game developers choose when to use 32-bit floating-point numbers and when the 16-bit version with limited precision is enough. Moreover, such a choice should be available not only on NVIDIA's flagship, the NV35, but also on the other members of the GeForce FX family.

Probably all video chips of the next generation supporting pixel shaders 3.0 will also support full-precision 32-bit floating-point numbers. But programmers who use floating-point numbers in their work know well that one should be careful even with 32-bit single-precision numbers: range overflow and precision loss happen quite often. So what, should we wait for the next step, double-precision (64-bit) floating-point numbers? It seems they won't come soon. Here is one more quote regarding the usage of floating-point numbers in RenderMan rendering software packages.

In article <875uvp$t23$1@nnrp1.deja.com>, <rminsk@my-deja.com> wrote:
>I noticed that the binary RIB file specification does not support double
>precision arrays only double precision values. Is there anywhere in
>PRMan or BMRT where values are stored as double precision? Should I
>ever output double precision values in my binary RIB?

The Ri routines are all single precision (so all input is parsed and
put into floats), and thus both BMRT and PRMan are almost completely
float on the inside. Of course, both use doubles occasionally as
temporaries for intermediate calculations in certain parts of the
renderers where that last little bit of precision is vital. But it's
almost correct to say that both renderers are just single precision.

Larry Gritz, Pixar Animation Studios, lg@pixar.com, Richmond, CA

Original quote can be found at: http://groups.google.com.ru/groups?hl=ru&lr=&ie=UTF-8&oe=UTF-8&selm=87a9n5%2482b%241%40sherman.pixar.com

What does this mean for us? Probably that we should not expect much benefit from double-precision numbers in DirectX. And when they are finally introduced, they won't be a basic type but just an additional type for programmers to use. Single-precision (32-bit) floating-point numbers will remain the basic type of the DirectX API for a long time yet.

Links to Pixel Shader precision test utility



Alexey Barkovoy (clootie@ixbt.com)

Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.