Pixel Shader 2.0 precision

Foreword

In this article I would like to look at the current state of version 2.0 pixel shaders in DirectX9. Let's start with some history.

John Carmack

Back in 2000, John Carmack already pointed out the need for floating-point numbers in the pixel pipeline of graphics cards, in addition to those in the geometry pipeline. Let's quote his words:

4/29/00

-------

We need more bits per color component in our 3D accelerators.

I have been pushing for a couple more bits of range for several years now, but I now extend that to wanting full 16 bit floating point colors throughout the graphics pipeline. A sign bit, ten bits of mantissa, and five bits of exponent (possibly trading a bit or two between the mantissa and exponent). Even that isn't all you could want, but it is the rational step.

………………

There are other more subtle issues [due to limited precision - editor], like the loss of potential result values from repeated squarings of input values, and clamping issues when you sum up multiple incident lights before modulating down by a material. Range is even more clear cut. There are some values that have intrinsic ranges of 0.0 to 1.0, like factors of reflection and filtering. Normalized vectors have a range of -1.0 to 1.0. However, the most central quantity in rendering, light, is completely unbounded. We want a LOT more than a 0.0 to 1.0 range. Q3 hacks the gamma tables to sacrifice a bit of precision to get a 0.0 to 2.0 range, but I wanted more than that for even primitive rendering techniques. To accurately model the full human sensable range of light values, you would need more than even a five bit exponent.

……………………

64 bit pixels. It is The Right Thing to do. Hardware vendors: don't you be the company that is the last to make the transition.

The whole text of Carmack's plan can be found here: http://www.bluesnews.com/cgi-bin/finger.pl?id=1&time=20000429013039

The point of the excerpt quoted above is the need for floating-point operations in the pixel pipeline. A bit later I'll come back to other parts of Carmack's plan.

Microsoft

In February 2001 Microsoft presented their vision of the DirectX9 architecture (very close to what we finally got in the ATi R300). At the presentation they announced that the next pixel shader version in DirectX9, known as "PS 2.0", would operate on single-precision floating-point numbers and would be functionally closer to vertex shaders.

Floating-point numbers in PS 2.0 became available with DirectX9, released in December 2002.

What are floating-point numbers?

There are several ways to represent real numbers on computers.

1) Fixed point places a radix point somewhere in the middle of digits, and is equivalent to using integers that represent portions of a unit. For example, one may represent 1/100ths of a unit; if you have four decimal digits, you could represent 10.82, or 00.01.

2) Rational is another approach where a number is represented as a ratio of two integers.

3) Floating-point representation, the most common solution, basically represents reals in scientific notation, for example 1.45*10¹⁹. We will take a closer look at it below.
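As a rough illustration of the three approaches, here is a minimal Python sketch. The helper names `to_fixed` and `from_fixed` and the 1/100 scale are chosen only for this example and are not taken from any particular library.

```python
from fractions import Fraction

SCALE = 100  # fixed point: store hundredths of a unit as an integer


def to_fixed(value):
    """Convert a real number to an integer count of 1/100ths."""
    return round(value * SCALE)


def from_fixed(raw):
    """Convert the stored integer back to a real number."""
    return raw / SCALE


fixed = to_fixed(10.82)         # fixed point: kept as a plain integer
rational = Fraction(1082, 100)  # rational: kept as a ratio of two integers
floating = 1.45e19              # floating point: significand and exponent

print(fixed, from_fixed(fixed))
print(rational)
print(floating)
```

The fixed-point value can never be finer than the chosen scale, the rational value is exact but its numerator and denominator may grow without bound, and the floating-point value trades exactness for range.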

Floating-point representation

Scientific notation represents a number as a base number and an exponent. For example, 123.456 can be represented as 1.23456 x 10². In the hexadecimal system, the number 123.abc can be represented as 1.23abc x 16².

Floating-point representation solves a number of problems. Fixed-point numbers have a fixed range of representation, which limits them from representing very big or very small numbers. Also, fixed-point numbers may lose precision when two large numbers are divided.

Floating-point numbers, on the other hand, employ a kind of a "sliding window" of precision depending on the scale of the number. This easily allows representing numbers from 1,000,000,000,000 to 0.0000000000000001.
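The "sliding window" can be made visible by measuring the gap between a number and its nearest larger neighbor at different magnitudes. The sketch below does this for the 32-bit single-precision format discussed later in the article; the function names are made up for this example, and the bit trick assumes a positive, finite input that is exactly representable in 32 bits.

```python
import struct


def next_float32(x):
    """Return the next representable float32 above a positive finite x."""
    # Reinterpret the float32 bit pattern as an integer and add one.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits + 1))[0]


def spacing32(x):
    """Gap between x and the next larger float32 value."""
    return next_float32(x) - x


for x in (1.0, 1024.0, 2.0 ** 40):
    print(x, spacing32(x))
```

The gap grows with the magnitude of the number: the relative precision stays roughly constant while the absolute precision slides with the exponent.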

In this article I will focus only on the main differences between integer and floating-point numbers, range and precision, and compare the currently available CPU and GPU implementations.

But now a bit of the history again.

Intel's way to do floating point operations

Today the IEEE-754 floating-point standard is the most common representation of real numbers on computers, including Intel-based PCs, Macintoshes, and most Unix platforms. But how did it come about?

In 1976 Intel began to design a floating-point coprocessor for its i8086/8 and i432 microprocessors. Dr. John Palmer, manager of Intel's floating-point effort, recruited William Kahan, whom he knew from Stanford ten years earlier, as a consultant for the upcoming i8087 coprocessor. Rumors about the i8087 soon spread through Silicon Valley, and other developers were worried enough to form a committee to work on a standard for floating-point arithmetic on microprocessors. In 1977, after several committee meetings, Professor Kahan, his student Jerome Coonen at U.C. Berkeley, and visiting Prof. Harold Stone prepared a draft specification in the format of an IEEE standard and brought it back to the IEEE p754 meeting. This draft was called "K-C-S" until p754 adopted it. By 1985, when IEEE Standard 754 was made official, it had already become a de facto standard.

Modern x86 compatible microprocessors support 32, 64 and 80 bit floating point formats.

Storage Layout

IEEE floating-point numbers have three basic components: sign, exponent, and mantissa. The mantissa is composed of a fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and doesn't need to be stored.

The following figure shows the layout for single (32-bit), double (64-bit), quadruple (128-bit) and extended (80-bit) precision floating-point values. The number of bits for each field is indicated (bit ranges are in square brackets):

	Sign	Exponent	Mantissa	Bias
Single Precision	1 [31]	8 [30-23]	23 [22-00]	127
Double Precision	1 [63]	11 [62-52]	52 [51-00]	1023
Quadruple Precision	1 [127]	15 [126-112]	112 [111-00]	16383
Extended Precision	1 [79]	15 [78-64]	64 [63-00]	16383

One of the common representations of floating point numbers is "sXXeYY" where XX represents the number of mantissa bits and YY represents the number of exponent bits. Here: single - s23e8; double - s52e11; extended - s64e15; quadruple - s112e15.

Here is how the bits are ordered in memory, from the most significant bit to the least significant:

sign | exponent | mantissa

Let's see what is stored in each of these fields:

The Sign Bit

There are two possible values: 0 denotes a positive number; 1 denotes a negative number.

The Exponent

The exponent field must represent both positive and negative exponents. For this purpose, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73.

The Mantissa

The mantissa represents precision bits of the number. It is composed of an implicit leading bit and fraction bits.

To find out the value of the implicit leading bit we should take into account that any number can be expressed in scientific notation in many different ways. For example, the number five can be represented as any of these:

5.00 x 10⁰
0.05 x 10²
5000 x 10⁻³

In order to maximize the quantity of representable numbers, floating-point numbers are stored in the normalized form. This basically puts the radix point after the first non-zero digit. In the normalized form, five is represented as 5.0 x 10⁰.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1.
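The three fields can be pulled out of a real single-precision value with a few bit operations. The following is a minimal sketch using Python's standard `struct` module; the function names `decode_float32` and `rebuild` are chosen for this example, and `rebuild` handles normalized numbers only (it ignores zeros, denormals, infinities and NaNs).

```python
import struct


def decode_float32(x):
    """Split a number, stored as IEEE single precision, into its fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31                  # bit 31
    stored_exponent = (bits >> 23) & 0xFF  # bits 30-23, includes the bias
    fraction = bits & 0x7FFFFF         # bits 22-0, leading 1 not stored
    return sign, stored_exponent, fraction


def rebuild(sign, stored_exponent, fraction):
    """Recompute the value of a normalized single-precision number."""
    significand = 1 + fraction / 2 ** 23   # restore the implicit leading 1
    exponent = stored_exponent - 127       # remove the bias
    return (-1) ** sign * significand * 2.0 ** exponent


for value in (1.0, 5.0, -0.15625):
    fields = decode_float32(value)
    print(value, fields, rebuild(*fields))
```

For example, five is 1.25 x 2², so the stored exponent is 2 + 127 = 129 and only the fractional part .25 of the significand ends up in the mantissa field.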

Ranges and precision of Floating-Point Numbers

	Single	Double	Quadruple	Extended
Decimal digits of precision p / log₂(10)	7.22	15.95	34.01	19.26
E_max	+127	+1023	+16383	+16383
E_min	-126	-1022	-16382	-16382
Range Magnitude Maximum 2^(E_max + 1)	3.4028E+38	1.7976E+308	1.1897E+4932	1.1897E+4932
Range Magnitude Minimum 2^E_min	1.1754E-38	2.2250E-308	3.3621E-4932	3.3621E-4932

A closer look at PS 2.0 standard and current PS 2.0 capable hardware

At the presentation of PS 2.0, and later with the release of the first DirectX9 beta, Microsoft established unified requirements for the minimum range and precision of floating-point numbers used in PS 2.0. Ideally, floating-point arithmetic should comply with the s23e8 (32-bit single-precision) format. Later, obviously after some lobbying from NVIDIA, the PS 2.0 standard was extended with "partial precision" execution of floating-point operations. Note that the partial precision flag on a PS operation is only a hint to the videocard's driver that the operation does not need a fully precise result. The driver may ignore the flag and execute the instruction in the normal (full precision) mode.

Below you can see a part of the current specification of PS 2.0 standard concerning the floating-point precision:

Internal Precision

- All hardware that support PS2.0 needs to set D3DPTEXTURECAPS_TEXREPEATNOTSCALEDBYSIZE.

- MaxTextureRepeat is required to be at least (-128, +128).

- Implementations vary precision automatically based on precision of inputs to a given op for optimal performance.

- For ps_2_0 compliance, the minimum level of internal precision for temporary registers (r#) is s16e7.

- The minimum internal precision level for constants (c#) is s10e5.

- The minimum internal precision level for input texture coordinates (t#) is s16e7.

- Diffuse and specular (v#) are only required to support [0-1] range, and high-precision is not required.

As we can see, only the r#, c# and t# registers require a high-precision representation, and colors (both diffuse and specular) can be represented with the same fixed-point registers as in DirectX8 PS 1.x.

So we have reached the central part of the article: determining the precision of the floating-point numbers used in the current generation of videochips. For this purpose we developed a special test utility. The utility stores the test results in a log file formatted as follows:

PixelShader 2.0 precision test. Version 1.3
Copyright (c) 2003 by ReactorCritical / iXBT.com
Questions, bug reports send to: clootie@ixbt.com

Device: RADEON 9500 SERIES
Driver: ati2dvag.dll
Driver version: 6.14.1.6292

Registers precision:
Rxx = s16e7 (temporary registers)
Cxx = s16e7 (constant registers)
Txx = s16e7 (texture coordinates)

Registers precision in partial precision mode:
Rxx = s16e7 (temporary registers)
Cxx = s16e7 (constant registers)
Txx = s16e7 (texture coordinates)

In this log-file you can see six values reflecting precision of floating-point numbers in videochips, one for each register type in two different op execution modes.

Below is a summary of the results obtained on NVIDIA and ATI videocards. The link to the test utility can be found at the end of the article. The program package also contains the pixel shader templates used to determine the precision of the registers.

Registers	ATI R3x0/RV350	NVIDIA NV30/NV31/NV34	NVIDIA NV35
rXX	s16e7	s10e5	s23e8
cXX	s16e7	s10e5	s23e8
tXX	s16e7	s10e5	s23e8
rXX partial precision	s16e7	s10e5	s10e5
cXX partial precision	s16e7	s10e5	s10e5
tXX partial precision	s16e7	s10e5	s23e8

Without the NV35 results the table would look far more uniform. It is well known that ATi chips use 24-bit floating-point numbers internally in the R300 core, and that this precision is not affected by the partial precision modifier. What is interesting is that the NV30, NV31 and NV34 use 16-bit floating-point numbers regardless of the precision requested(!). This happens even though the partial precision mode was introduced at NVIDIA's request, NV3x GPUs support 32-bit floating-point precision under the OpenGL NV_fragment_program extension, and NVIDIA advertised its new-generation videochips as capable of TRUE 32-bit floating-point rendering!

The NV35 shows the most varied, and the most correct, behavior among NVIDIA's videochips. In the standard mode calculations are performed with 32-bit precision, in line with the Microsoft specification. When partial precision is requested, temporary and constant registers use 16-bit precision while texture registers keep 32-bit precision, although according to the Microsoft specification texture registers can also use 16-bit precision.

Note that the NV3x results were obtained with WHQL-certified drivers, and I'm very sorry that Microsoft does not enforce the implementation of its own DirectX specifications. Also note that the 16-bit floating-point format used by NVIDIA is identical to the one suggested by John Carmack in 2000.

Let's analyze the results. The table below lists the properties of the 16- and 24-bit floating-point formats, with the standard 32-bit format for reference.

	s10e5	s16e7	s23e8
Size (bits)	16	24	32
Mantissa (bits)	10	16	23
Exponent (bits)	5	7	8
Decimal digits of precision p / log₂(10)	3.31	5.11	7.22
Mantissa distinct values	1024	65536	8388608
E_max	+15	+63	+127
E_min	-14	-62	-126
Range Magnitude Maximum 2^(E_max + 1)	65536	1.8446E+19	3.4028E+38
Range Magnitude Minimum 2^E_min	0.000061	2.1684E-19	1.1754E-38
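The rows of this table follow directly from the mantissa and exponent bit counts. The sketch below recomputes them; `format_properties` is a name invented for this example, and it assumes that every format reserves the lowest and highest exponent codes the way IEEE 754 does, which is what the E_max and E_min rows imply but which real hardware is not guaranteed to do.

```python
import math


def format_properties(mantissa_bits, exponent_bits):
    """Derive the table rows for an sXXeYY format (normalized numbers only)."""
    precision_bits = mantissa_bits + 1        # stored bits plus implicit leading 1
    e_max = 2 ** (exponent_bits - 1) - 1      # assumes IEEE-style reserved codes
    e_min = 1 - e_max
    return {
        "decimal_digits": precision_bits / math.log2(10),
        "mantissa_values": 2 ** mantissa_bits,
        "e_max": e_max,
        "e_min": e_min,
        "max_magnitude": 2.0 ** (e_max + 1),  # upper bound, as used in the table
        "min_magnitude": 2.0 ** e_min,        # smallest normalized value
    }


formats = {"s10e5": (10, 5), "s16e7": (16, 7), "s23e8": (23, 8)}
for name, (m, e) in formats.items():
    print(name, format_properties(m, e))
```

Note that the maximum row is an upper bound: the largest finite value of each format lies just below 2^(E_max + 1).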

It is clear that the s10e5 floating-point format lags behind the other formats in most respects. It may seem paradoxical, but it is more appropriate to compare s10e5 numbers with the fixed-point numbers used in PS 1.x. Fixed-point precision in PS 1.x, even on the NV30, is 12 bits, which equals the precision of s10e5 numbers (counting the sign bit and the implicit leading bit). The advantage of s10e5 shows up precisely in this comparison: it reaches much larger absolute values, up to 65536 versus 1 (or 2 or 8, depending on the chip) for fixed point, and at the same time much smaller absolute values.

If you remember, John Carmack named the area where he would like to use s10e5 numbers: lighting. The extended range allows overbright lighting, which is needed to emulate very bright light sources, and keeps details from getting lost in shadows.
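The clamping problem Carmack describes, summing several incident lights before modulating down by a material, can be shown with plain arithmetic. This is only a sketch: the light intensities, the material factor and both function names are made up for the example, and ordinary Python floats stand in for the pixel pipeline.

```python
def shade_clamped(lights, material, limit=1.0):
    """Sum the lights, clamp to the [0, limit] range, then apply the material."""
    return min(sum(lights), limit) * material


def shade_unclamped(lights, material):
    """Sum the lights and apply the material with no intermediate clamp."""
    return sum(lights) * material


lights = [0.8, 0.8]   # two fairly bright lights hitting the same pixel
material = 0.5        # surface reflects half of the incoming light

print(shade_clamped(lights, material))
print(shade_unclamped(lights, material))
```

With the [0, 1] clamp the second light contributes almost nothing, because the sum is cut off before the material is applied. With an extended range the intermediate sum survives and the final color comes out brighter.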

But precision is where programmers working with s10e5 numbers have to be very careful. Obviously, 16-bit precision is not enough for a correct raytracer like the one demonstrated by ATi, but even calculating texture coordinates in a pixel shader may lead to undesirable results. With bilinear filtering enabled, s10e5 precision is not even enough to correctly address textures larger than 1024 pixels in one dimension. NVIDIA understands these limitations perfectly well and has already started training game developers to find the places where the insufficient precision of s10e5 numbers leads to incorrect results. NVIDIA also pushes for moving all high-precision calculations into vertex shaders.
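The 1024-pixel limit can be checked on a CPU. Python's `struct` module supports the IEEE 754 half-precision format (format code `'e'`), which has the same s10e5 layout; using it as a stand-in for the GPU format is an assumption, since the rounding behavior of real hardware may differ. The helper name `to_half` and the choice of unnormalized texel coordinates are also just for illustration.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision (s10e5) value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


# Texel coordinates with a fractional part, as bilinear filtering needs them.
for coord in (511.25, 1023.5, 1024.25, 1024.75, 2048.5):
    print(coord, '->', to_half(coord))
```

Below 1024 the fractional part of the coordinate still fits into the mantissa, although with only one or two bits near the top of the range. From 1024 upward adjacent s10e5 values are a whole texel apart, so the fraction that bilinear filtering uses as its blend weight is lost entirely.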

What's next?

In this article I've described floating-point numbers, the formats currently used in microprocessors, and the kind of floating-point support videochip companies provide today. I must say that 16-bit floating-point numbers are not sufficient for general mathematical computations. But I hope that NVIDIA will let game developers choose when to use 32-bit floating-point numbers and when to use the 16-bit version with its limited precision. Moreover, this choice should be available not only on NVIDIA's flagship, the NV35, but also on the other members of the GeForce FX family.

Probably, all videochips of the next generation that support pixel shaders 3.0 will also support full-precision 32-bit floating-point numbers. But programmers who work with floating-point numbers know well that even 32-bit single-precision numbers must be handled carefully, because range overflow and precision loss happen quite often. So what? Should we wait for the next step, double-precision (64-bit) floating-point numbers? It seems they won't arrive soon. Here is one more quote, regarding the use of floating-point numbers in RenderMan rendering software packages.

In article <875uvp$t23$1@nnrp1.deja.com>, <rminsk@my-deja.com> wrote:
>I noticed that the binary RIB file specification does not support double
>precision arrays only double precision values. Is there anywhere in
>PRMan or BMRT where values are stored as double precision? Should I
>ever output double precision values in my binary RIB?

The Ri routines are all single precision (so all input is parsed and
put into floats), and thus both BMRT and PRMan are almost completely
float on the inside. Of course, both use doubles occasionally as
temporaries for intermediate calculations in certain parts of the
renderers where that last little bit of precision is vital. But it's
almost correct to say that both renderers are just single precision
throughout.
--
Larry Gritz, Pixar Animation Studios, lg@pixar.com, Richmond, CA

Original quote can be found at: http://groups.google.com.ru/groups?hl=ru&lr=&ie=UTF-8&oe=UTF-8&selm=87a9n5%2482b%241%40sherman.pixar.com

What does this mean for us? Probably, we should not expect much benefit from double-precision numbers in DirectX. When they are finally introduced, they will not be the basic type but an additional type for programmers to use. Single-precision (32-bit) floating-point numbers will remain the basic type of the DirectX API for a long time to come.

Links to Pixel Shader precision test utility

PSPrecision13.zip (150 Kb)

Alexey Barkovoy (clootie@ixbt.com)
