Matrox Parhelia-512

By Alexander Medvedev

On the 14th of May 2002 Matrox Graphics Inc., already forgotten by 3D visualization enthusiasts, is to announce the Parhelia-512 chip, belonging to the new generation of graphic accelerator architectures. I'd rather characterize it as the "Generation X9". The new chip from Matrox is the second specimen of this generation as 3Dlabs has the priority (with some stipulations), having already announced its flexibly-programmable P10 VPU.

Before we get closer to the new brainchild of Matrox, this article's top hero, let's discuss another symbolic one:

The Ninth Mister X (DirectX 9)

Let us quote the key features (from the angle of accelerators) of the popular API's future version. Confidential style.

The improved data precision

There are new texture and frame buffers formats. Each of the four components (RGBA) can now be presented as a 32- or 16-bit floating point value in the standard IEEE F32 and F16 formats. Each pixel will occupy 128 or 64 bits, this will considerably increase the requirements to the accelerator memory capacity. Though there's still no need to use these formats everywhere. Their main intent is the realization of various effects and lighting models (for example, the storage of pixel shader tables). The floating point content of frame buffers can't be output directly to the monitor via RAMDAC or digital interface, they are only for internal usage. There's the new format for improved integer data representation on-screen: 10:10:10:2 (RGBA), where each color component has the 10-bit precision. This exceeds the capabilities of current display devices and considerably extends the dynamic range and the resulting image quality.

Displacement Mapping

The Displacement Mapping technology, licensed from Matrox Graphics Inc. (included into the Parhelia-512, of course) enables to increase considerably the realism and detail of bumped surfaces. Unlike the traditional relief variations, layered to triangles' surfaces at rendering and not affecting the visibility of pixels (the traditional bump mapping affects only pixel luminance, not its actual location), the displacement mapping enables to create geometrically correct bumps, that do not look perfectly straight, intersecting. Actually this technology modifies the location of triangle vertexes, pushing them towards the surface normal by value, proportional to another value of a special texture (a displacement map):

The displacement mapping is responsible for the "rough" bumps, pushing triangle vertexes to this surface perpendicular. One can use a traditional bump map along with the displacement map to create more precise "by-pixel" details, if needed.

Of course, there must be enough triangles to show all nuances of a rough bump, set by displacement mapping. In order to create these triangles automatically (without bothering programmers, CPU, and AGP bus), we use the already familiar N-Patches:

I.e. the technology of model detail improving based on the additional triangle tesselation. The displacement mapping can't be used without the N-Patches hardware support - from the angle of DX9 this technology is just an important addition to N-Patches algorithm:

It's interesting that DX9 features considerably improved patches. For example, the detail level (granularity value) can now be chosen automatically depending on the distance from a triangle to an observer. Thus we get more optimal, scene granularity close to even (from observer's viewpoint) as all triangles now have about the same visible size:

Besides, you can use mip-mapping for the displacement mapping as well as for usual textures:

As a result we get something nice "without serious troubles":

Of course, the displacement mapping can be used for 3D models as well as for flat surfaces:

In comparison with traditional bump maps (including by-pixel shader bumps), the displacement mapping obviously consume more resources - it's no joke to represent all bump details as triangles. But they produce more realism as well:

Of course, they'll need to combine these technologies reasonably to get dynamics and realism in actual game applications.

Vertex Shader 2.0

The brief of new vertex program features:

256 assembler instructions
The minimum of 256 constants
12 general registers
Execution management: conditions, cycles, jumps, subroutines.
Important limitations: the management is constant value-based. Only forward conditional jumps. Only backward cycles. No nested subroutine calls.
1 cycle index register
16 registers for cycle index setting constants
16 registers for logical jump constants
The opportunity to write shaders on a C-like higher-level language, automatically compiled to assembler code.

Pixel Shader 2.0

The brief of new pixel program features:

Only floating point calculations. The result can be transformed into integer at memory write, if needed.
The opportunity to write shaders on a C-like higher-level language, if needed, automatically compiled to assembler code.
Up to 8 texture iterators (i.e. the corresponding texture coordinates calculated for every triangle pixel).
Up to 16 texture samples, including dependent samples based on coordinates, previously calculated in the shader. Up to 4 levels of sample nesting.
The shader can include up to 32 assembler instructions for texture addressing and up to 64 assembler arithmetical instructions in random-order.
Floating point calculation instructions (exp, log, rsq, etc.)
12 general temporary vector registers
32 constants
Value gamma correction. Improves the realism of color nuances, in synthesized image shaders in particular. The direct gamma correction at writing shader results into frame buffer and backward correction at reading data from frame buffer or texture; gamma value control.

Besides, there's the very important DXVC format of 3D-texture compression, licensed from NVIDIA.

The bottom line. Much is staked at accelerator's programmability (as well as in case of OpenGL 2.0). The higher-level language and considerably improved programmable block flexibility will do their good. The dream of many graphic chip developers - to offload the most difficulties (algorithms) from hardware to programmers - comes true faster than one could have imagined. Now all the data inside accelerator pipelines feature high precision due to the floating point format.

The stable MS DX9 is expected in the end of summer, and its official release seems to be an X-mas surprise.

And now let me introduce the new chip from Matrox Graphics in all its beauty:

Parhelia-512 GPU

Flowchart:

Specifications:

80 million transistors
0.15 die technology
up to 250 MHz (?) core and up to DDR 325 (650) MHz memory clock rate
Full 256-bit (!) DDR memory bus
About 20 GBytes/s local memory bandwidth
64/128/256 MBytes memory
AGP 2x/4x/8x including SBA and FastWrites
4 pixel pipelines
4 texture units per pipeline (!)
Fillrate: up to 1 Gpixel and 4 Gtexels
Vertex Shader 2.0, four parallel execution blocks
Pixel Shader 1.3; 4 texture + 5 combination stages for each pixel pipeline, with the opportunity of pipeline coupling (to get 2 pipelines with 10 combination stages)
EMBM and DOT3 bump mapping
Fixed DX8 T&L (including improved features of matrix blending and skinning). A special vertex shader actually.
10-bit color component rendering, textures, storage and monitor output (!). 10-bit GigaColor technology
Two 400 MHz integrated 10-bit/channel RAMDACs, supporting UltraSharp technology
Full 10-bit -> 10-bit table for output gamma correction
DVD and HDTV video decoder with 10-bit output precision
Up to 2048x1536x32bpp@85 Hz output support
Integrated TV-Out with 10-bit signal precision
Two digital TDMS interfaces for digital outputs or external RAMDACs. Up to 1920x1200x32bpp resolutions supported.
Two fully independent CRTC
The opportunity of dual- or even triple-monitor (!) output. For example, to 2 integrated and one external RAMDACs, or to both integrated RAMDACs and TV-Out. 3840x1024x32bpp resolution total in the triple-mode. TripleHead Desktop, Surround Gaming, and DualHead-HighFidelity (HF) technologies
Adaptive supersampling (not the multisampling!) - 16x Fragment SSAA with up to 16 samples. Activated on polygon edges only.
N-Patches hardware support with adaptive tesselation (!) and displacement mapping.
Glyph Antialiasing - font hardware edge antialiasing and gamma correction (!)
Microsoft DirectX 8 and OpenGL 1.3. Some DirectX 9 features potentially.

The large number of transistors at such technological norm means both great capabilities, and high cost price of Parhelia-512. It's strange that this complex GPU will be produced according to 0.15-micron technological process; there's still no data about the heat emission. The complete 256-bit DDR memory bus is the even more expensive feature of all latest-generation GPUs (P10, R300, NV30).

Theoretically, such a technological jump improves GPU performance by the factor of two. And only due to the improved cost price, without any new architecture technologies. It seems such solution is economically better now, than any architecture surprises, as the prices for memory and circuit boards enable to use a "wide bus" widely.

Besides, DX9, OpenGL 2.0 and further API versions will store more and more data in GPU memory in the floating point (complex to compress) format. Still more data will represent the geometry and (and other non-graphical) data. Having adopted such a wide memory bus, one needn't invest into the development and debugging of complex tile rendering and compression technologies, saving the local memory bandwidth (and considerably increasing the cost and developing time of chip & drivers). Speaking of Parhelia-512 development cost, it's the time to think over the reasons that made Matrox release this GPU. Noone we've been waiting from this company anything like this for a year already. Surprise!

According to one of Matrox engineers: "We've made this chip to show everybody that we are still capable to develop the most up-to-date solution". Another worker said something less serious: "Just for fun". Obviously this expensive solution won't affect the market in the near future, but it'll improve Matrox reputation for sure. Even if the company sells less than 10000 Parhelia-512 cards. The announce of this unique chip alone can draw the fixed attention.

So, the local video memory bandwidth is twice as wide in comparison with the former-generation GPUs. But the speed is not the main feature of Matrox products. Parhelia-512 surprisingly doesn't support any memory bandwidth saving technologies. This is strange - such solutions are present in all latest products of ATI and NVIDIA. But it seems that Matrox decided to speed up and reduce the price of chip development, the chip itself being a symbol more than a market takeover attempt.

So, unlike ATI and NVIDIA, Matrox hasn't supplied its latest chip with Z-buffer compression or Z-occlusion culling technologies. It features fast Z-clear only. The chip is very complex anyway: the amount of transistors requires the finer .13 technology, but it doesn't seem to be fully available for Matrox at the required cost price. One might forecast the release of the .13-micron updated chip version fully compatible with DX9 in some time (in the beginning of 2003 or later).

The tests of engineer samples with very raw drivers show only 20...30% advantage of Parhelia-512 comparing to NVIDIA GeForce 4 Ti4600. Even in case successful software tuning, the twice advantage isn't that possible. 1.5 - that's the limit. Obviously, usual consumers should buy cards on the new Matrox GPU due to the quality and unique features, but not the speed. On the other hand, Matrox can't pretend to be successful on the professional market without certified OpenGL drivers (they might appear along with the planned professional Parhelia-512 cards line). The card is very expensive for the "just the nice 2D fans" niche. So, initially this is for enthusiasts, semi-professionals, brand fans and us - videocard specialists :-).

It's interesting that the GPU is only "partially" DX9-compatible. Parhelia-512 supports Vertex Shaders 2.0. But at that Parhelia-512 doesn't correspond to the second generation of Pixel Shaders! 1.3 version (up to 4 textures per pass) is the chip's maximum. Let's look at the detailed flowchart of chip's pixel pipelines:

Pixel Shaders 2.0 are already impossible due to obviously insufficient pixel pipeline stages and texture loopbacks absence (you can't use more than 4 texture values per pass). It's interesting that Matrox's papers state 36 stages, but for the total of 4 pixel pipelines. There are two configurations possible:

All 4 pixel pipelines are used, each providing 4 texture and 5 pixel (arithmetic) shader stages.
Two pipelines are used, two pixels are rendered per cycle. 4 texture and 10 pixel stages are available on each.

As a result, pixel shaders with 5 instructions and more will be executed twice slower. For the sake of justice I'd note that typical Pixel Shaders 1.3 (and lower) often consist of 5 and fewer instructions.

Actually, Matrox marketing specialists want to confuse the potential buyers by the considerable number of 36 stages, that are useless for a GPU with 4 textures per pass. The actual maximum is 4 texture and 10 pixel stages. Applications will use only 8 of 10 - clearly corresponding to Shaders 1.3 limitations.

The same Matrox's papers mention 64 (!) texture samples per cycle of Parhelia-512, but at closer inspection it's obviously the total for all 16 texture units again. Unpleasant, is it? They try to confuse us with large numbers around the generally accepted norms :-(.

So, Parhelia-512 is not a complete DX9 GPU. There's no sense in hoping to get its unique features (between DX8 and DX9) supported by developers - the Matrox GPU won't just occupy enough market.

On the other hands, Parhelia-512 fully supports previously described displacement mapping and adaptive tesselation N-Patches. Both strictly according to DX9. The chip works with 10-bit frame buffer introduced with DX9, and has some other DX9-specific features. It also supports Vertex Shaders 2.0 unit, capable of handling 4 vertexes simultaneously:

it's not the first time a chip borns at the turn of APIs - hardware evolves almost twice as faster as software. Let's hope that one of this chip's intents is the creation and debugging of a core for the future mainstream products from Matrox, possibly fully compatible with DX9 with the "floating" Pixel Shaders 2.0 unit.

Except the wide bus and improved data precision, the new-generation chips are bound to feature the high-quality and "performance-inexpensive" antialiasing (AA) and anisotropic filtering. Parhelia-512 is good at this, excluding the marketing trick with 64. We actually have four bilinear sampling units and flexibly-programmable interpolator per each pixel pipeline (or rather a single flexibly-configurable texture unit, capable of handling 4 bilinear samples per cycle (including different textures)). Each of these units can choose and interpolate 16 discrete samples per cycle (the total of 64 for all four pixel pipelines). Below is the example of actual single-pass four-texture mapping:

Depending on filtering we get four bilinear-filtered, two trilinear- or anisotropic-filtered (8 samples) or one anisotropic-filtered (16 samples) textures per cycle. So the old "dual-texture" applications will feature lossless 8-tap anisotropic filtering. The anisotropic filtering technology is close to NVIDIA's, but we won't see it almost for free (based on the RIP mapping) like featured by ATI products. Below is the screenshot of Quake III with anisotropy:

We'll describe anisotropic filtering quality further. But let's note that the performance drop will be smoothed by the twice number of sampled texture data per cycle. But only in old applications, not using 4 textures per cycle.

The antialiasing technology of Parhelia-512 is currently unique. So, it should be praised, being close to the ideal. Essentially, it's the supersampling, up to 16 (4x4) samples per one screen pixel. But it's performed ONLY (!) for polygon edge pixels (3..5% of a typical scene):

Let's compare with popular FSAA methods:

The main advantage is obvious: unlike multisampling, the surplus data isn't stored in memory and are not sent over the bus! The total frame buffer size increases slightly, not more then by two, even at the maximum 16x setting. The special fast rendering pass is used to determine the edge pixels: GPU marks edge pixels of polygons in a separate buffer without texture value calculations and intermediate texel rendering. Besides, there's no texture sharpness drop, featured by FSAA and some hybrid MSAA techniques:

Let's compare the screenshots:

On the left: 3D Mark 2001 - Without FAA-16x; On the right: 3D Mark 2001 - With FAA-16x

On the left: Seaplane - Without FAA-16x; On the right: Seaplane - With FAA-16x

However this very intellectual AA technique may occasionally cause artefacts. Besides, it can't correctly antialias edges overlayed by semi-transparent polygons (like clouds, fog, glass, fire). A user might want to use the familiar classical 4å (2å2) MSAA supported by the chip as well.

The interface richness is the undoubtful advantage of the chip. There are two complete, traditionally high-quality 400 MHz RAMDACs, two TDMS transmitters, integrated TV-Out and two CRTC, providing the opportunity of simultaneous dual-screen image output:

There's also the complete set of dual-display features supported:

As well as the new feature of partial image output to (!) receivers simultaneously. For example, to three monitors, using the third external RAMDAC):

We expect many various games to be showed off at E3 this year in the wide (180-degree view) mode on three monitors at once. As you might have already guessed - on Parhelia-512-based card. There's the special utility supplied with the software, enabling to play many current games in the wide dual- or triple-screen mode.

Feel the difference as they say (the usual mode to the left, Surround Gaming to the right):


Return to Castle Wolfenstein


Soldier of Fortune II


Haegemonia: Legions of Iron


Jedi Knight II: Jedi Outcast

So, there's only one simple problem to solve: where a usual player might obtain three monitors? And where should he put them? :-)

And about the RAMDAC quality, by the way. Matrox has always been presenting high-quality 2D - the company even states the results (!) of its RAMDAC frequency characteristic comparison with the main rivals. The primary:

And the secondary RAMDAC:

The board features the high-quality 5th-degree output filter:

Besides, Parhelia-512 features the hardware support of DVD and HDTV decoder. You can now watch DVD at 10-bit quality. There's just a question of its practicability for signal originally 8-bit stored and compressed.

Bright! Warm?

Let's finally provide the useless enough table:

Chip	Parhelia-512	3Dlabs P10	GeForce4 Ti	RADEON 8500
Memory bus, bits	256 DDR	256 DDR	128 DDR	128 DDR
Bandwidth	20 GBytes	20 GBytes	10.4 GBytes	9.6 GBytes
Maximum memory capacity	256 MBytes	256 MBytes	128 MBytes	128 MBytes
Vertex Shaders	2.0 (4 blocks)	2.0 (4 blocks)	1.1 (2 blocks)	1.1
Pixel Shaders	1.3	1.3, 2.0(?)	1.3	1.4
Pixel pipes	4	4	4	4
Textures per pass, up to	4	8	4	6
Texture units	4	2	2 (2 loopbacks)	2 (4 loopbacks)
Anisotropy	8 (trilinear), 16 (bilinear)	?	8, 16, 32	RIP mapping
Integrated RAMDACs	2 (10 bit!)	2 (10 bit?)	2	2
Integrated TV-Out	1 (10 bit!)	n/a	n/a	n/a
CRTC amount	2 (+ the triple screen mode)	2	2	2
FSAA	16x FAA (fragmented)	8x MSAA	4x MSAA	6x pattern MSAA
N-Patches	DX9 (adaptive)	DX9 (?)	n/a	DX8
Displacement mapping	Available	n/a (?)	n/a	n/a

It's useless as there's no sense in drawing premature conclusions about Parhelia-512's technological leadership - let's wait for the announces and samples of new-generation game chips from rivals - NVIDIA NV30 and ATi R300 DX9 chips.

I wonder if the story about GF2 and Radeon pixel shaders, "cancelled" by Microsoft as the result of the long evolution of DX8 key features including the period after chip development, continues. Will these main rivals be fully compatible with DX9? Their creators answer positively, but we'll be able to check it only in fall.

We shouldn't also forget the professional P10 chip, announced by 3Dlabs. It's the physical incarnation of still unadopted OpenGL 2.0 standard (from 3Dlabs point of view), evolving similarly to DX9. Though slower, but more reasonably and consistently. The architecture, required for OpenGL 2.0, will potentially (possibly) correspond to DX9 requirements as well. Not the vice versa.

So, 64, 128, and 256 MBytes Parhelia-512 cards are to be announced soon (avail. in July). The senior will cost about $500, 128 MBytes one - about $400. All of them will feature the full 256-bit memory bus - the difference is in memory clock rate and capacity only. Parhelia-512 cards will be released only by Matrox itself. The partnership with Gigabyte was a big mistake, according to the company's representatives.

We should also await the professional line of Parhelia-512 cards. They'll appear for sale in the end of summer, at about the same time as NV30/R300 cards. Matrox product will obviously be competing with them, not the former-generation GeForce4 Ti4600 and especially RADEON 8500. This somehow changes the picture, doesn't it?

The parhelia will shine on the 14th of May. Will it be warm?

Write a comment below. No registration needed!