General information

We are starting with what is probably the most complicated material in the history of our 3D-Video section. Why? The answer will come a little later, and it can be given in just a couple of words.

What is XGI? Officially, the company was founded in 2003 from the employees of Trident's graphics department, which SIS had bought, together with SIS's own graphics department. In other words, Trident sold its graphics business to SIS, which in turn spun it off into the newly founded XGI. XGI was effectively created by SIS, but unlike S3 Graphics, which belongs to VIA and is not independent, XGI is an independent firm. Moreover, SIS didn't even hand over its graphics trademarks (Xabre and others). But that doesn't mean that XGI's new products were developed from scratch. Look at the collage above: it clearly shows that Volari is essentially identical to Xabre II, the product SIS's roadmap had scheduled for 2003 back when the company was still trying to promote the Xabre. But the Xabre did poorly, and card makers refused to use it because it had too many weak sides. Chaintech brought out a lot of cards based on this processor and is still trying to empty its overfilled stocks; many companies had to take losses trying to sell such cards at any possible price (hence Xabre400 cards retailing at $35). SIS had to decide either to close the graphics department or to let it exist independently, and it chose the second option.

It takes at least 1.5-2 years to develop a new graphics product from scratch, even when the developers have solid experience and previous generations at hand (as at ATI and NVIDIA). Here the company launched its product half a year after it was founded - and a dual-chip DirectX 9 product at that! There is no doubt that previous SIS solutions were reused, and it looks like the changes on both the software and the hardware side were minimal. If you look at the driver screenshots below, you will notice traces of SIS Xabre: even the control panels haven't changed their look.

The worst problem of all second-echelon manufacturers (Trident, SIS, XGI) is that they do not properly interact with game developers before games are released. They make gaming products, which implies that their 3D power must be fully realized in games, yet game developers rarely test their titles on such cards. As a result, the card makers' programmers have to debug their products after the applications ship, and in most cases they simply lower rendering quality or disable certain functions. Maybe that was acceptable in 1998, but not in 2003-2004: users do not need blurry images these days.

The second problem is naivety regarding the latest technologies, in particular effective memory access, memory optimization, caching and calculation optimization. ATI and NVIDIA produce very complicated chips containing 110-130 million transistors in which these technologies are widely used, most of them polished over several chip generations. Unfortunately, Trident's slogan (under which the company eventually sold its graphics department) - "30M transistors in our products yield the same performance as 100M of NVIDIA and ATI" - gave no rest to Trident, SIS and, evidently, VIA. But there are no miracles. It's not enough to give the core 4-8 rendering pipelines and 8 texture units...
They must make them work correctly, eliminate idle cycles, think about caching and tile-based calculation optimizations, and create drivers and shader compilers that can use all hardware capabilities optimally. That takes time and resources - many man-years of work by large teams of experts. Besides, once shaders are used, the promised 8 rendering pipelines become a fiction, because there are two (or maybe four) times fewer shader pipelines. That is why SIS's products fall far behind their competitors in shader applications, and the legacy of those SIS solutions is noticeable here: XGI's products likewise have half as many shader pipelines. Let's see what XGI offers (from the weakest solution to the top one):
This solution looks more advanced than the Xabre; at least it features hardware support for vertex shaders, which the Xabre lacked (the latest drivers added software emulation). But that is only the specification - let's see what the tests show. Have a look at the block diagram of the single V8, and note how the memory bus is organized: a 4-channel memory controller, 32 bits per channel. That is a positive point, though it's not the 256 bits typical of the latest solutions from ATI and NVIDIA.

Now look at the Duo. Obviously, dual-chip operation is based on the old master/slave principle. The chips are connected by a special internal bus marked as X-Link (also known as BitFluent). It is essentially an internal AGP 2x: 32 bits wide, running at 133 MHz (a rough estimate of its throughput is given below). The processors work similarly to those in the ATI RAGE FURY MAXX - each renders its own frame - but unlike ATI's product, the chips do not wait for each other to synchronize when rendering is completed. The disadvantage is the low throughput of the internal bus, which carries data from the application to the slave processor, while the slave also constantly reports its results back to the master. Besides, in such a configuration most data are duplicated in both memory buffers, and each chip does its caching separately and less efficiently than a single chip with a 256-bit memory bus would. Dual-processor video cards have been tested many times before, and each had its own peculiarities: the 3dfx Voodoo5, for instance, had an excellent implementation of combined chip operation, but its production cost was too high. It is clear that in ordinary gaming applications that stay within the bounds of modern games and top solutions from ATI or NVIDIA, a dual-chip configuration with 128-bit memory buses will always be less efficient (especially in terms of price/performance) than an advanced single-chip solution with a 256-bit memory bus and a more complex, higher-clocked core. So the only explanation for a dual-chip configuration is that the developers could not create a single chip of efficiency and/or complexity adequate to the best products from ATI and NVIDIA.

Below are the comparison features of the XGI Volari Duo V8 Ultra. We took the NVIDIA GeForce FX 5900 Ultra and the ATI RADEON 9800 PRO 128MB for comparison, as they have a similar price of $400. The RADEON 9800 PRO can currently be bought for $320-330, but there are no other products from that company priced at $400.
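To put the X-Link figures above into perspective, here is a back-of-the-envelope sketch. It assumes, as stated, a 32-bit link with an effective 133 million transfers per second; the per-chip memory number is purely illustrative (the exact memory clocks are not given here), so treat the comparison as an order-of-magnitude check rather than a specification.

```python
# Rough estimate of X-Link (BitFluent) throughput vs. local memory bandwidth.
# Assumption: "AGP 2x, 32 bits, 133 MHz" means 133 million 32-bit transfers/s.

link_width_bits = 32
transfers_per_second = 133e6                   # assumed effective transfer rate

xlink_bw = link_width_bits / 8 * transfers_per_second
print(f"X-Link peak bandwidth:        {xlink_bw / 1e9:.2f} GB/s")   # ~0.53 GB/s

# Hypothetical per-chip 128-bit DDR bus at 500 MHz effective (illustrative only):
local_bw = 128 / 8 * 500e6
print(f"Per-chip local memory (est.): {local_bw / 1e9:.1f} GB/s")   # 8.0 GB/s
```

Even under these optimistic assumptions the inter-chip link is an order of magnitude slower than each chip's local memory, which is exactly why duplicated buffers and separate per-chip caching hurt so much.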
For a more detailed list of the chip features supported by the DirectX drivers, see:

Card
It's obvious that the PCB itself is very complicated. Although each chip has only a 128-bit memory bus, together they form a 256-bit DDR-II memory subsystem, which requires a 10-layer design.

Here are the processors. They are built on a 0.13-micron process at UMC's fabs. The package is an expensive FCBGA, but it's impossible to produce such complicated chips in any other package. Each chip officially consumes 20W, yet the main power consumer is the DDR-II memory: the card draws around 120W(!) in total, which is more than the power consumption of the fastest cards from ATI and NVIDIA. That's why it needs additional power through two connectors (and a power supply unit that is up to the task). By the way, the coolers have pleasant lighting.

It's interesting to compare this card's length with that of the widely known dual-processor Voodoo5 5500. As you can see, the newer product is shorter than the older one :).

The Volari Duo V8 Ultra supports VIVO. While the TV-out is based on the SIS 301, Video-In is implemented with the Conexant BT835. Our colleague Alexey Samsonov, who studies VIVO and TV tuners, has already seen this codec and looks forward to testing it :). Finally, I must note that the card doesn't block any PCI slots, though it's not small.

Accessories:
Retail package:
Overclocking

It's not easy to overclock such a complicated device, but the card was able to work at 370/1000 MHz (a rough estimate of the memory bandwidth at these clocks is given below).
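Taking the 1000 MHz figure above as the effective DDR-II data rate on each chip's 128-bit bus (our assumption; it may instead be the physical clock), a minimal sketch of the resulting bandwidth:

```python
# Per-chip and combined memory bandwidth at the overclocked settings,
# assuming 1000 MHz is the effective (DDR) data rate of each 128-bit bus.

bus_width_bits = 128
effective_rate = 1000e6                          # assumed effective data rate, Hz

per_chip = bus_width_bits / 8 * effective_rate
print(f"Per chip:  {per_chip / 1e9:.1f} GB/s")        # 16.0 GB/s
print(f"Combined:  {2 * per_chip / 1e9:.1f} GB/s")    # 32.0 GB/s
```

The combined figure looks impressive on paper, but the two memory pools are not shared between the chips, so in practice each GPU works with only half of it.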
Setup and drivers

Testbed:

Video cards used for comparison:
Driver settings
The driver settings panels are identical to the Xabre's.

Test results

2D graphics

Up to 1600x1200@75Hz there are no complaints about 2D quality. Remember that quality depends on the particular sample and on how well the card and the display work together; first of all, pay attention to the quality of the display and the cable. Note that we test 2D quality on a ViewSonic P817-E monitor with a Barco BNC cable.

D3D RightMark Synthetic Tests (DirectX 9)

In this review we show the results obtained with our flexibly configurable synthetic tests for the DirectX 9 API. The synthetic suite from RightMark 3D currently includes the following tests:
The philosophy of the synthetic benchmarks and their descriptions are given in the NV30 review. If you want to try the RightMark 3D synthetic tests or measure the performance of your own accelerators, please download the latest version of our suite, available as D3D RightMark Beta 3; the tests have a common shell and flexible export of scores. You can also download all the test settings used in this review. If you have any comments, ideas or bug reports, please e-mail unclesam@ixbt.com.

Pixel Filling
With 0 and 1 textures the scores are almost equal, which means the benchmark runs correctly and texture sampling doesn't affect the result: we measure only the fillrate and frame-buffer writes. We can see that NVIDIA has two texture units per pixel pipeline while ATI has one unit per pipeline but more pipelines. Volari's effective fillrate seems to be limited by some factor, presumably the memory write speed: in fact, the fillrate does not start dropping until the number of textures reaches 4 per pixel. That is possible in two cases - either the real V8 configuration is 4x2 instead of 8x1, and/or the effective memory bandwidth of each chip is not sufficient to write out the data of all pipelines without slowing down. Both explanations are discouraging, because there are real applications where either factor can have a strong effect - for example, sky rendering in flight simulators, or the depth-only pre-pass for shadows in Doom III. It is also possible that the chip is built as one wide super-pipeline with 8 texture units and several parallel ALUs that process a varying number of pixels per clock depending on the algorithm's complexity (hence the lower number of pipelines quoted in the specs for pixel shaders). The main conclusion is that the chip cannot cope effectively even with this simple task, except in the 4-texture case where it outscores its competitors thanks to its total of 16 texture units. The frame buffer does not work all that effectively, and the 16x1 configuration loses to the competitors' 8x1 and 4x2 in most cases. The theoretical limit of 5600 Mpixels/s is unreachable for two V8 chips because of the inefficient chip and memory system architecture (a quick sanity check on that number is given at the end of this section).

Earlier we used fixed functions (supported since DirectX 7) for estimating the fillrate. Let's see how it changes for different shader versions. While the competitors process simple shaders without speed losses (except for 2.0 on NVIDIA with 2 textures, though that is a quirk of the optimizing compiler rather than a flaw of the architecture), XGI loses speed, turning this advertised feature into nothing more than a line in the specs. Note that the fillrate drops with every additional texture used: the computational resources of the texture units and shader pipelines are probably shared (i.e. the same ALU array is used both for interpolating texture coordinates and for shader calculations).

With small textures the Volari looks competitive, but as texture size grows the scores drop, probably because of ineffective caching and memory operation. The card falls behind its counterparts even at 128x128, not to mention 256x256, though these are the most popular sizes in real applications! The number of pipelines remains just a number if you cannot feed all of them in time. While Volari uses all its power to cope with a 256x256 texture, NVIDIA's and ATI's cards deliver the quality and detail of 2048x2048(!) textures at the same speed, even though their physical memory bandwidth is narrower. Bilinear filtering is almost free for Volari (as for any proper modern chip), and mip levels make caching more effective, which is especially beneficial with many textures. Trilinear filtering is not free: the texture units get paired, and the speed drops most with 4 textures; earlier the difference was hidden by the low effectiveness of frame-buffer operation. Terrible.
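The 5600 Mpixel/s ceiling mentioned above corresponds to two chips with eight pixel pipelines each at a 350 MHz core clock; the clock is inferred from that figure rather than taken from an official datasheet. A minimal sketch of the arithmetic, including what a 4x2 layout would imply:

```python
# Theoretical fillrate for the two pipeline layouts discussed above.
# The 350 MHz core clock is inferred from the quoted 5600 Mpixel/s ceiling.

core_clock = 350e6
chips = 2

for pipes, tmus_per_pipe, label in [(8, 1, "8x1"), (4, 2, "4x2")]:
    pixel_rate = chips * pipes * core_clock
    texel_rate = pixel_rate * tmus_per_pipe
    print(f"{label}: {pixel_rate / 1e6:.0f} Mpixel/s, {texel_rate / 1e6:.0f} Mtexel/s")

# 8x1: 5600 Mpixel/s, 5600 Mtexel/s
# 4x2: 2800 Mpixel/s, 5600 Mtexel/s
```

Note that both layouts give the same texel rate, which is why the card only shines in the 4-texture case where all 16 texture units can be kept busy, while the measured pixel rate behaves more like a 4x2 design.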
Geometry Processing Speed

Now let's estimate the geometry processing speed.

This is not so bad yet. As complexity grows, ATI (which has no special hardware for fast emulation of the old fixed T&L) begins losing to Volari. Still, XGI's peak throughput is not that high and would look better only against previous-generation cards. NVIDIA remains the leader in all tests; fixed T&L remains its strong point thanks to dedicated hardware (a special additional vertex-shader instruction) that quickly performs the complex lighting calculations. Well, this is the first synthetic test where Volari doesn't look awful. On the other hand, the ceiling of ~60M triangles and such a weak dependence on task complexity make us suspect software T&L emulation, or a very soft hardware implementation. Such behaviour could be explained by the AGP bus, and 60M triangles per second is not unachievable for modern SSE2 processors (a rough estimate of the AGP bandwidth this would require is given after the HSR results below). In the looping tests Volari is out of the competition again. This is NVIDIA's weak point (its indirect register indexing is not fast and makes loops slow in almost all popular algorithms), which brings it closer to Volari; yet even with this well-known disadvantage, XGI cannot beat its competitor - it is half as fast as NVIDIA(!), not to mention ATI.

Hidden Surface Removal

We measure the maximal HSR efficiency versus scene complexity (HSR performance), and the averaged difference between processing a chaotic scene and an optimally sorted one (HSR algorithm quality), on a texture-free scene. The results with texture sampling are not shown because Volari's performance is so low in that case that its real HSR efficiency can only be measured without textures. ATI has the highest maximal theoretical efficiency thanks to its hierarchical Z-buffer. Volari shows a kind of spike on the scene of average complexity. NVIDIA takes last place here, but Volari's bottlenecks elsewhere nullify this potential advantage. Probably Volari's HSR has more than one culling level. As for algorithm quality, the chips look approximately equal; most of them are well balanced for scenes of average complexity.
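As promised above, here is a rough check on whether the ~60M triangles/s ceiling could be explained by the AGP bus or CPU-side vertex processing. The vertex size and the assumption of roughly one new vertex per triangle in indexed meshes are ours, not measured values:

```python
# Could ~60M triangles/s be limited by geometry streamed over AGP?
# Assumptions: indexed meshes with roughly one new vertex per triangle,
# ~32 bytes per vertex (position, normal, one set of texture coordinates).

triangles_per_s = 60e6
bytes_per_vertex = 32            # assumed vertex format
vertices_per_triangle = 1.0      # assumed vertex reuse in indexed meshes

traffic = triangles_per_s * vertices_per_triangle * bytes_per_vertex
print(f"Required vertex traffic: {traffic / 1e9:.2f} GB/s")   # ~1.92 GB/s

agp8x_peak = 2.1e9               # theoretical AGP 8x peak, bytes/s
print(f"AGP 8x theoretical peak: {agp8x_peak / 1e9:.2f} GB/s")
```

Under these assumptions the required traffic is already close to the theoretical AGP 8x peak, so either bus-fed geometry or CPU (SSE2) vertex processing could plausibly produce the flat ~60M triangles/s ceiling seen in the test.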
Pixel Shading

Performance of hardware processing of pixel shaders 2.0:

This is a sad story, especially with one shader: Volari loses to everyone. NVIDIA loses to ATI, though with the latest drivers the difference between 16-bit and 32-bit precision is not that noticeable. ATI keeps the lead in pixel calculations (but not in texture sampling). In the second test there are no great changes; NVIDIA gains a small advantage. Volari's behaviour is closer to ATI's, though the absolute scores are incomparable - Volari drags far behind.

Point Sprites

Point sprites vs. size: ATI and NVIDIA cope with this task excellently, while on Volari this function is implemented at the bare-minimum level because of the inefficient frame buffer.

3D graphics, 3DMark03 v.3.40 - synthetic tests

All measurements in 3D tests were carried out at 32-bit color depth.

Fillrate

Multitexturing:

The data are close to the D3D RightMark results, though this time Volari's scores look better with one texture (they are still several times lower than its theoretical limit), while ATI and NVIDIA show lower scores here. The two suites take different approaches to these synthetic tests, but we are inclined to trust our own fillrate measurement technique, whose sources are available at the D3D RightMark site.

Pixel shader

The scores match the ones above well.

Vertex shaders
The scores match those obtained before.

Summary on the synthetic tests

The conclusion looks pretty sad:
These facts do not let us hope for acceptable speed in real applications. Moreover, Volari's results are not comparable even to those of middle-end cards that cost 2-3 times less. The chip architecture is poorly designed, and the developers clearly lack experience and expertise.

3D graphics, 3DMark03 - game tests

3DMark03, Game1

Wings of Fury:
3DMark03, Game2

Battle of Proxycon:
3DMark03, Game3

Trolls' Lair:
3DMark03, Game4

Mother Nature:
Even the cheats do not help XGI's card. Below you will see the quality this card demonstrates. Look at the shader speed: is it enough for a dual-processor accelerator priced at $400?

[ Part 2 ]
Andrey Vorobiev (anvakams@ixbt.com)
Alexander Medvedev (unclesam@ixbt.com)