3Dlabs P10 VPU

General information on the P10 and its positioning
Line of products
P10's specification
P10's architecture
Peculiarities of the 3Dlabs Wildcat VP870 128MB video card
Test system configurations and driver settings
Test results: briefly on 2D and extreme tests from DirectX 8.1 SDK
Test results: 3DMark2001 SE synthetic tests
Test results: 3DMark2001 SE game tests
3D quality in games
Test results: Professional tests: SPECviewperf 7.0
Test results: Professional tests: Discreet 3DS MAX 4.26
Conclusion

The P10 can be considered the starting point of a growing family of flexibly programmable GPUs. This chip occupies a special position: it is nothing less than a hardware incarnation of the draft OpenGL 2.0 standard as 3Dlabs sees this API. We have written about the OpenGL 2.0 proposal in detail and touched upon the peculiarities of DirectX 9 (in our analytical articles on the Matrox Parhelia-512 and ATI RADEON 9700). As you may have noticed, although the ideas behind flexible programming of a graphics accelerator are shared, it is easy to find differences in how they are realized, even at the API level. Now that we have a 3Dlabs P10 based card in our lab, we can trace the differences at the hardware level.

Positioning

You shouldn't compare this card directly (in terms of specs, frequencies, or speed in game tests and applications) with the latest gaming solutions, because the chip was developed as a professional accelerator for OpenGL applications. Most games still need a higher fillrate and texturing speed, while professional applications need higher transform and lighting speed. Besides, games (excluding future shader-based titles such as the next Doom) do not require extraordinary additional capabilities. Moreover, edge anti-aliasing and exotic methods of high-quality multisampling and AA are not even on the agenda there. As a rule, average users do not enable the maximum anisotropic filtering settings.

The developers position this chip as a professional accelerator. Obviously, Creative, which swallowed up 3Dlabs, pays more attention to the mass market, i.e. games (just remember what happened to EMU, the developer of professional sound solutions that Creative also absorbed). Undoubtedly the P10 will also appear on that scene a bit later with some hardware modifications (architecturally this GPU scales easily in any direction, including fillrate). It's also possible that at first Creative will just tune the drivers for games. In our opinion such a scenario is possible only if game solutions based on the P10 are not expensive: for reasons we will turn to later, it will be quite difficult to compete against the top models of NVIDIA and ATI, no matter how well the drivers are optimized.

And now let's see how the P10 based cards are positioned inside the family of the 3Dlabs professional accelerators:

Line of products

Wildcat VP970: 128 MB of 256-bit DDR SDRAM; 225M Vertices/Sec; 42G AA Samples/Sec; meant for demanding rendering in CAD/DCC applications of all kinds.
Wildcat VP870: 128 MB of 256-bit DDR SDRAM; 188M Vertices/Sec; 35G AA Samples/Sec; also meant for CAD/DCC applications (less demanding ones).
Wildcat VP760: 64 MB of 256-bit DDR SDRAM; 165M Vertices/Sec; 23G AA Samples/Sec; also meant for CAD applications, with a good price/performance ratio.

The whole line is put into the center of a big pyramid which shows how various sectors of the professional market are captured by the 3Dlabs products.

Although this is a revolutionary product, it is not a High-End one (compared with the latest solutions from NVIDIA or ATI). The top position is taken by the Wildcat III cards, the real lions of the professional 3D graphics market. At the same time, even the mid-range Wildcat VP870 is able to fight successfully against NVIDIA's High-End Quadro4 900XGL, though 3Dlabs positions it as a competitor of the Quadro4 750XGL.

Now let's take a gander at the specs of the P10.

Specification

Here are traditional performance characteristics of the accelerator and the card based on it - VP870:

Technology: 0.15-micron;
Transistors: over 76 million;
Core clock frequency: unknown (presumably 200-250 MHz);
Memory bus: 256bit DDR;
Local memory: up to 256 MB;
Local memory on the tested card: 128 MB;
Memory clock speed: unknown (presumably 250-300 DDR MHz), 17-20 GB/s;
Interface bus: AGP 4x, 1 GB/s;
Full support of all capabilities of the project of the OpenGL 2.0 standard from 3Dlabs;
Drivers optimized for professional applications;
16 scalar floating-point (F32) flexibly configurable vertex processors (a more flexible analog of 4 vector 4D processors of R300 or NV30);
64 floating-point (F32) processors for generation of texture coordinates;
64 non-programmable units for sampling and filtering texture values;
Trilinear and anisotropic filtering supported;
64 integer (fixed point) processors for pixel shaders;
The last stages of the pipeline, which control reading and writing of frame buffer values, anti-aliasing and multisampling, can be programmed arbitrarily (!);
The frame buffer can accept (without taking multisampling into account) no more than 4 completely calculated pixels per clock;
Partial (!) support of the DX9 features (pixel pipelines work only with integer values at the shader stage):

Pixel Shader 1.4;
Vertex Shader 2.0 (?);
Multisampling up to 8x inclusive;
Hardware tessellation of N-Patches with Displacement Mapping and, optionally, an adaptive level of detail;

Multithread command processing - simultaneous rendering of images for several applications and windows with hardware management of command streams;
Memory optimization technology based on the block triangle shading (8x8 blocks);
HSR - early removal of hidden surfaces at the 8x8 block level and Early Z Test at the pixel level;
Two independent CRTC;
Two integrated 10bit 400 MHz RAMDACs with hardware gamma correction;
One (two?) integrated DVI interface (TMDS transmitter);
Integrated general-purpose digital video port.
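
The bandwidth figures in the list above follow from bus width and clock. Below is a minimal sketch of that arithmetic. The function name is ours, and the 32-bit, 66.67 MHz base figures for AGP are an assumption taken from the general AGP design, not from 3Dlabs. With the presumed 250-300 MHz DDR clock it gives roughly 16-19 GB/s, a little below the 17-20 GB/s quoted above, so the quoted range is presumably rounded up or assumes slightly higher clocks.

```python
def bandwidth_gb_per_s(bus_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Local memory: 256-bit bus, DDR (2 transfers per clock), presumed 250-300 MHz.
memory_low = bandwidth_gb_per_s(256, 250, 2)
memory_high = bandwidth_gb_per_s(256, 300, 2)

# AGP 4x: assumed 32-bit bus, 66.67 MHz base clock, 4 transfers per clock.
agp_4x = bandwidth_gb_per_s(32, 66.67, 4)

print(f"memory: {memory_low:.1f}-{memory_high:.1f} GB/s, AGP 4x: {agp_4x:.2f} GB/s")
```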

Accelerator	R200(128 MB)	NV25(Ti 4600)	R300	NV30 (1)	Parhelia 512	P10(VP870)
Technology; transistors, M	0.15; 62	0.15; 68	0.15; 107	0.13; 120	0.15; 96	0.15; 72
AGP	4x	4x	8x	8x	4x	4x
Memory bus, bits	128 DDR	128 DDR	256 DDR (II) (2)	256 DDR II	256 DDR	256 DDR
Memory frequency, MHz	275	325	>300	>400	275	250...300 (?)
Core frequency, MHz	275	300	300	400	220	200...250 (?)
Pixel pipelines	4	4	8	8	4	64 (9)
Texture modules	4x2	4x2	8x1 (3)	8x2	4x4	64 (10)
Textures/pass	6	4	16 (4)	16 (4)	4	8 (5)
Vertex pipelines	2	2	4	4	4	16 (7)
Fixed T&L unit	Yes	No	No	No	No	No
N-Patches	DX8	No	DM (DX9)	DM (DX9)	DM (DX9)	DM (DX9)
Vertex shaders	1.1	1.1	2.0	2.0 (6)	2.0 (?)	2.0 (?)
Pixel shaders	1.4	1.3	2.0	2.0 (6)	1.3	1.2 (?)
Memory controller	2x64	4x32	4x64	4x64	1x256	?
RAMDAC, MHz	400	400	2*400	2*400 (?)	2*400	2*400
Optimization technologies	Yes (HyperZ II)	Yes (LightSpeed II)	Yes (HyperZ III)	Yes (LightSpeed 3 ?)	Only early Z test	8x8 units (8)

Notes:

(1) Compiled from official data and rumors.
(2) Most likely, DDR II will be supported together with the DDR.
(3) Each texture unit can fulfill trilinear sampling itself.
(4) According to the DX9 requirements, up to 16 different textures with 8 precalculated (interpolated over a triangle) 4D texture coordinates can be used in a pass. In a pixel shader it's possible to sample up to 32 values from these textures.
(5) Up to 8 textures with precalculated or interpolated full texture coordinates can be used. In a pixel shader it's possible to sample up to 16 values from these textures.
(6) To all appearances, the hardware part will have capabilities exceeding the DX requirements for vertex and pixel shaders 2.0.
(7) 16 scalar floating-point processors are combined into groups of 2, 3 or 4 to process vector values. I.e. 4 complete 4D vector instructions can be executed per clock, as on the R300 or NV30; more scalar and 2D/3D vector instructions can be executed per clock, depending on the combination of instructions waiting to be executed.
(8) Shading of triangles in 8x8 blocks for optimization of caching and preliminary HSR on the block and pixel levels.
(9) The number of parallel 32-bit integer processors for pixel shaders. The processors can be flexibly reconfigured to support a particular calculation format, for example the R10G10B10A2 or R16G16B16 integer formats. In the latter case the number of pixels processed in parallel is halved (two processors are used for one pixel). Besides, there is a severe limitation - the chip can write to the frame buffer (i.e. physically shade) up to 4 pixels per clock. Because of the considerable number of processors and the limitations of the 0.15-micron technology, 3Dlabs couldn't provide support for the floating-point format, but they hope to do so in future chips.
(10) The real number of normal texture units is likely to be lower - 16 or even 8.
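
Note (9) can be illustrated with a few lines of arithmetic. This is only a sketch under our own assumption that a pixel format occupies a whole number of 32-bit processors, rounded up; the constant and function names are hypothetical.

```python
PIXEL_PROCESSORS = 64   # from the specification above
PROCESSOR_BITS = 32     # each pixel processor is a 32-bit integer unit


def pixels_in_parallel(bits_per_pixel: int) -> int:
    """Pixels shaded in parallel for a given integer pixel format.

    Assumption: a format wider than 32 bits takes whole extra processors.
    """
    processors_per_pixel = -(-bits_per_pixel // PROCESSOR_BITS)  # ceiling division
    return PIXEL_PROCESSORS // processors_per_pixel


r10g10b10a2 = pixels_in_parallel(10 + 10 + 10 + 2)  # 32 bits: one processor per pixel
r16g16b16 = pixels_in_parallel(16 + 16 + 16)        # 48 bits: two processors per pixel
print(r10g10b10a2, r16g16b16)
```

Whatever the format, the limit of 4 pixels written per clock mentioned in the same note still applies downstream.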

P10's architecture

The P10 has a very specific architecture - fixed blocks alternate with programmable ones many times, and the programmable blocks are usually implemented as wide arrays of simple processors that can be flexibly configured into groups for particular tasks.

Here is a block-diagram of the P10:

There is a VGA compatible graphics core, two CRTCs and a special digital interface for importing (capturing) video data.

An interesting feature useful for professional applications is the P10's ability to execute competing command streams from different applications concurrently. This is controlled by the command processor:

It controls the 3D pipeline, providing a graphics analog of the multitasking mode (based on context switching) that is widespread in modern CPUs, and also executes priority commands such as video page flipping.

The P10 supports virtual texturing, managing block caching of large textures in the accelerator's memory:

This is also similar to modern CPUs equipped with an MMU supporting page-based virtual memory. Here, rectangular texture blocks cached in the accelerator's memory act as pages. The chip swaps textures in through AGP DIME or PCI DMA automatically when necessary. This provides not only effective caching of large amounts of texture data but also the possibility, important for professional graphics, of working with individual textures larger than the accelerator's memory (!).
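
The page-based scheme described above can be sketched as a small cache of texture blocks. This is an illustration only: the class and parameter names are hypothetical, the `fetch_page` callback stands in for the AGP/PCI transfer, and the least-recently-used replacement policy is our assumption, since 3Dlabs does not describe the one the chip actually uses.

```python
from collections import OrderedDict


class TexturePageCache:
    """Toy model of virtual texturing: texture blocks act as pages in local memory."""

    def __init__(self, capacity_pages, fetch_page):
        self.capacity = capacity_pages
        self.fetch_page = fetch_page   # called on a miss, e.g. a transfer over AGP
        self.pages = OrderedDict()     # key -> block data, oldest entry first
        self.misses = 0

    def get(self, texture_id, block_x, block_y):
        key = (texture_id, block_x, block_y)
        if key in self.pages:
            self.pages.move_to_end(key)     # mark as most recently used
            return self.pages[key]
        self.misses += 1
        data = self.fetch_page(key)         # "page fault": bring the block in
        self.pages[key] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used block
        return data


# A texture with more blocks than the cache holds can still be sampled.
cache = TexturePageCache(capacity_pages=2, fetch_page=lambda key: f"block {key}")
for block in [(0, 0), (1, 0), (0, 0), (2, 0), (1, 0)]:
    cache.get("big_texture", *block)
print(cache.misses)
```

The point of the sketch is that the texture itself never has to fit in local memory; only the blocks in use at a given moment do.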

Now let's look at the entire diagram of the 3D graphics pipeline of the P10 which deals with 3D imaging:

As is typical of 3D graphics, data streams pass successively through the functional units of the accelerator.

First of all commands get into the vertex processor:

The vertex processor can not only read processed vertex parameters from memory but also write them back, unlike many other programmable chips. Such flexibility makes it possible to program almost any tessellation algorithm for spline surfaces or other forms of HOS and SS (Subdivision Surfaces), including N-Patches. The vertex processor can read not only vertices and their attributes but also texture values, which allows various algorithms for perturbing and generating geometry from height maps, normal maps and other texture data. One example is DM (Displacement Mapping). In this respect the P10's compatibility with DX9 raises no concern. Unlike the chips of NVIDIA, Matrox and ATI, it has 16 scalar processors instead of 4 vector ones. The VP Manager schedules these processors according to the vertex shader instructions, combining them when necessary for vector processing of 2D, 3D or 4D values. Compared with the usual 4 vector processors, this solution delivers equal or higher performance at the same frequency, assuming the ALUs performing the atomic operations are similarly fast.
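
The "equal or higher" claim can be checked with a toy model. The sketch below assumes every instruction takes one clock, ignores data dependencies, and packs instructions greedily in order; the actual scheduling policy of the VP Manager is not documented, so the function names and the policy are ours.

```python
from math import ceil

SCALAR_LANES = 16   # P10: 16 scalar processors
VECTOR_UNITS = 4    # conventional design: 4 vector (4D) processors


def clocks_scalar_array(widths):
    """Clocks needed when instructions of width 1..4 are packed into 16 scalar lanes."""
    clocks, free_lanes = 0, 0
    for width in widths:
        if not 1 <= width <= 4:
            raise ValueError("instruction width must be between 1 and 4")
        if width > free_lanes:      # no room left in this clock: start a new one
            clocks += 1
            free_lanes = SCALAR_LANES
        free_lanes -= width
    return clocks


def clocks_vector_units(widths):
    """Clocks needed when each instruction occupies a whole vector unit, whatever its width."""
    return ceil(len(widths) / VECTOR_UNITS)


for name, mix in [("four 4D", [4] * 4), ("five 3D", [3] * 5), ("sixteen scalar", [1] * 16)]:
    print(name, clocks_scalar_array(mix), clocks_vector_units(mix))
```

In this model the two designs tie on pure 4D code, and the scalar array pulls ahead as soon as narrower instructions appear in the mix.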

Vertex coordinates and attributes obtained from the vertex processor array are sent on for shading. First, dedicated hardware units discard back faces and triangles that are not visible (see also the Cull and Clip blocks in the general diagram):

After culling, triangles go to Setup and are divided into 8x8 tiles for further rasterization. This is again handled by dedicated hardware units - this rational alternation of programmable and fixed units runs through the whole P10.
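
As a rough illustration of dividing a triangle into 8x8 tiles, the sketch below lists the tiles touched by a triangle's bounding box. This is a conservative simplification of ours: real setup hardware would test the tiles against the triangle edges, and the function name is hypothetical.

```python
from math import floor

TILE = 8  # tile size in pixels, as in the P10


def tiles_touched(triangle):
    """Return (tile_x, tile_y) for every 8x8 tile overlapped by the triangle's bounding box."""
    xs = [x for x, _ in triangle]
    ys = [y for _, y in triangle]
    tx0, tx1 = floor(min(xs) / TILE), floor(max(xs) / TILE)
    ty0, ty1 = floor(min(ys) / TILE), floor(max(ys) / TILE)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


print(tiles_touched([(0, 0), (7, 0), (0, 7)]))          # fits in a single tile
print(len(tiles_touched([(10, 10), (30, 12), (12, 20)])))
```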

Each tile is shaded according to the following scheme:

First, visibility of a whole region and its pixels is calculated, then required texture values are sampled, then final values of pixels are calculated. Let's take a closer look at this process.

The following picture shows how it is defined whether certain blocks and pixels need to be shaded (see also the general scheme):

The yellow circle marks the queue of 8x8 tiles (fragments) sent for rasterization. Dedicated hardware units determine the visibility first of each tile and then of its pixels by reading depth and stencil buffer values and making the necessary comparisons. After that, the individual visible pixels are placed in the queue for shading.
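
The two-level visibility check can be sketched as follows. The assumptions are ours: a "less than" depth comparison with smaller values closer to the viewer, no stencil test, and a per-tile farthest depth that real hardware would presumably keep precomputed rather than scanning for it each time.

```python
TILE = 8


def visible_pixels(stored_depths, incoming_depths):
    """Return (x, y) of pixels in an 8x8 tile that pass the depth test.

    Both arguments are 8x8 lists of depth values; smaller means closer.
    """
    farthest_stored = max(max(row) for row in stored_depths)
    nearest_incoming = min(min(row) for row in incoming_depths)
    if nearest_incoming >= farthest_stored:
        return []   # tile-level rejection: nothing in the tile can be visible
    # Pixel-level test for tiles that survive the coarse check.
    return [(x, y)
            for y in range(TILE) for x in range(TILE)
            if incoming_depths[y][x] < stored_depths[y][x]]


stored = [[0.5] * TILE for _ in range(TILE)]
hidden = [[0.9] * TILE for _ in range(TILE)]
partly = [row[:] for row in hidden]
partly[2][3] = 0.1   # a single pixel in front of the stored surface

print(visible_pixels(stored, hidden))
print(visible_pixels(stored, partly))
```

Only the pixels returned here would go on to the shading queue, which is where the savings come from.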

64 pixels are shaded in parallel, i.e. a whole 8x8 block (!). Each of the 64 pixels takes the following path through fixed and programmable hardware units:

The floating-point processor for coordinate generation and the integer pixel processor (Shader) are programmable. The Load and Filter units, which sample and filter textures, are implemented in fixed hardware and are capable of bilinear and trilinear filtering. If necessary, their results can be sent back to the coordinate generation processor, probably for anisotropic filtering or more complicated texture sampling methods. The programmable texture coordinate processor also lets us work with 3D textures and with cube and sphere environment maps. Unfortunately, the pixel processor is integer-only, which prevents it from being compatible with DX9 pixel shaders 2.0.

When the final value has been calculated by the pixel shader, it is sent to one of the dedicated programmable pipelines:

They support different methods of filling the frame buffer, including various AA and multisampling methods and writing rendering results into several buffers simultaneously.

Finally, before we turn to the performance tests let me show you a list of DirectX 8 capabilities supported in the current drivers:

Texture size - up to 2048x2048, nonsquare textures are possible
Anisotropy degree - up to 8
Light sources - up to 16
Textures in a pass - up to 8
Clipping surfaces - 6
Sprite scaling - up to 127
Primitives per one call - up to 1073741823 (a lot)
Vertex buffer - 65536
Vertex streams - up to 8
Vertex Shader 1.1
Vertex Shader constants - 128
Pixel Shader 1.2
Pixel shader value - up to 8
Multisampling mode: No, 2, 4, 8 samples
Final buffer formats:

D3DFMT_A8R8G8B8
D3DFMT_X8R8G8B8
D3DFMT_R5G6B5
D3DFMT_X1R5G5B5
D3DFMT_A1R5G5B5

Depth Buffer formats:

D3DFMT_D32
D3DFMT_D24S8
D3DFMT_D16
D3DFMT_D24X8

Texture formats:

D3DFMT_A8R8G8B8
D3DFMT_X8R8G8B8
D3DFMT_R5G6B5
D3DFMT_X1R5G5B5
D3DFMT_A1R5G5B5
D3DFMT_A4R4G4B4
D3DFMT_A8
D3DFMT_L8
D3DFMT_A8L8
D3DFMT_A4L4
D3DFMT_V8U8
D3DFMT_L6V5U5
D3DFMT_X8L8V8U8
D3DFMT_Q8W8V8U8
D3DFMT_DXT1
D3DFMT_DXT2
D3DFMT_DXT3
D3DFMT_DXT4
D3DFMT_DXT5

Cube texture formats:

D3DFMT_A8R8G8B8
D3DFMT_X8R8G8B8
D3DFMT_R5G6B5
D3DFMT_X1R5G5B5
D3DFMT_A1R5G5B5
D3DFMT_A4R4G4B4
D3DFMT_DXT1
D3DFMT_DXT2
D3DFMT_DXT3
D3DFMT_DXT4
D3DFMT_DXT5

3D texture formats:

D3DFMT_A8R8G8B8
D3DFMT_X8R8G8B8
D3DFMT_R5G6B5
D3DFMT_X1R5G5B5
D3DFMT_A1R5G5B5
D3DFMT_A4R4G4B4
D3DFMT_A8
D3DFMT_L8
D3DFMT_A8L8
D3DFMT_A4L4
D3DFMT_DXT1
D3DFMT_DXT2
D3DFMT_DXT3
D3DFMT_DXT4
D3DFMT_DXT5

Simple texture filtering modes:

D3DPTFILTERCAPS_MINFPOINT
D3DPTFILTERCAPS_MINFLINEAR
D3DPTFILTERCAPS_MINFANISOTROPIC
D3DPTFILTERCAPS_MIPFPOINT
D3DPTFILTERCAPS_MIPFLINEAR
D3DPTFILTERCAPS_MAGFPOINT
D3DPTFILTERCAPS_MAGFLINEAR
D3DPTFILTERCAPS_MAGFANISOTROPIC

Cube texture filtering modes:

D3DPTFILTERCAPS_MINFPOINT
D3DPTFILTERCAPS_MINFLINEAR
D3DPTFILTERCAPS_MIPFPOINT
D3DPTFILTERCAPS_MIPFLINEAR
D3DPTFILTERCAPS_MAGFPOINT
D3DPTFILTERCAPS_MAGFLINEAR

3D texture filtering modes:

D3DPTFILTERCAPS_MINFPOINT
D3DPTFILTERCAPS_MAGFPOINT

Well, everything looks decent; up to 16 light sources and 8 textures per pass is attractive. A full range of stencil buffer operations and a full range of depth calculation modes are certainly supported. Surprisingly, there is no filtering (not even bilinear) for 3D textures, but this is probably a matter of drivers. In any case, all basic DX8 capabilities are supported, including some that exceed those of the NV25. Clearly DX9 support won't please us much, at least because of the pixel shaders, but the situation will become clear only after DX9 drivers for the P10 are released.

At last, here is a list of OpenGL extensions supported at the moment:


Matrox, ICD for Parhelia version 1.2	NVIDIA, GeForce4 Ti 4400/AGP/SSE2, version 1.3.1	3Dlabs, Wildcat VP870, version: 1.2.0
GL_ARB_multitexture	GL_ARB_imaging	GL_ARB_multitexture
GL_ARB_point_parameters	GL_ARB_multisample	GL_ARB_texture_env_add
GL_ARB_texture_compression	GL_ARB_multitexture	GL_ARB_texture_env_combine
GL_ARB_texture_cube_map	GL_ARB_texture_border_clamp	GL_ARB_texture_env_crossbar
GL_ARB_texture_env_add	GL_ARB_texture_compression	GL_ARB_texture_border_clamp
GL_ARB_texture_env_combine	GL_ARB_texture_cube_map	GL_ARB_texture_cube_map
GL_ARB_texture_env_dot3	GL_ARB_texture_env_add	GL_ARB_texture_env_dot3
GL_ARB_transpose_matrix	GL_ARB_texture_env_combine	GL_EXT_bgra
GL_S3_s3tc	GL_ARB_texture_env_dot3	GL_EXT_blend_subtract
GL_ATI_element_array	GL_ARB_transpose_matrix	GL_EXT_blend_minmax
GL_ATI_vertex_array_object	GL_S3_s3tc	GL_EXT_compiled_vertex_array
GL_EXT_bgra	GL_EXT_abgr	GL_EXT_polygon_offset
GL_EXT_blend_color	GL_EXT_bgra	GL_EXT_rescale_normal
GL_EXT_blend_func_separate	GL_EXT_blend_color	GL_EXT_separate_specular_color
GL_EXT_blend_logic_op	GL_EXT_blend_minmax	GL_EXT_secondary_color
GL_EXT_blend_minmax	GL_EXT_blend_subtract	GL_EXT_texture3D
GL_EXT_blend_subtract	GL_EXT_compiled_vertex_array	GL_EXT_texture_object
GL_EXT_secondary_color	GL_EXT_separate_specular_color	GL_EXT_texture_edge_clamp
GL_EXT_compiled_vertex_array	GL_EXT_fog_coord	GL_EXT_texture_env_add
GL_EXT_draw_range_elements	GL_EXT_multi_draw_arrays	GL_EXT_texture_env_combine
GL_EXT_element_array	GL_EXT_packed_pixels	GL_EXT_texture_env_dot3
GL_EXT_fog_coord	GL_EXT_paletted_texture	GL_EXT_texture_cube_map
GL_EXT_multi_draw_arrays	GL_EXT_point_parameters	GL_EXT_texture_filter_anisotropic
GL_EXT_packed_pixels	GL_EXT_rescale_normal	GL_EXT_multi_draw_arrays
GL_EXT_point_parameters	GL_EXT_clip_volume_hint	GL_SGIS_multitexture
GL_EXT_rescale_normal	GL_EXT_draw_range_elements	GL_SGIS_texture_border_clamp
GL_EXT_secondary_color	GL_EXT_shared_texture_palette	GL_SGIS_texture_lod
GL_EXT_separate_specular_color	GL_EXT_stencil_wrap	GL_NV_register_combiners
GL_EXT_stencil_wrap	GL_EXT_texture3D	GL_NV_vertex_program
GL_EXT_subtexture	GL_EXT_texture_compression_s3tc	GL_NV_texgen_reflection
GL_EXT_texture3D	GL_EXT_texture_edge_clamp	GL_WIN_swap_hint
GL_EXT_texture_compression_s3tc	GL_EXT_texture_env_add	GL_KTX_buffer_region
GL_EXT_texture_cube_map	GL_EXT_texture_env_combine	-
GL_EXT_texture_edge_clamp	GL_EXT_texture_env_dot3	-
GL_EXT_texture_env_add	GL_EXT_texture_cube_map	-
GL_EXT_texture_filter_anisotropic	GL_EXT_texture_filter_anisotropic	-
GL_EXT_texture_lod_bias	GL_EXT_texture_lod	-
GL_EXT_vertex_array	GL_EXT_texture_lod_bias	-
GL_EXT_vertex_array_object	GL_EXT_texture_object	-
GL_EXT_vertex_shader	GL_EXT_vertex_array	-
GL_EXT_texture_env_combine	GL_EXT_vertex_weighting	-
GL_EXT_texture_env_dot3	GL_HP_occlusion_test	-
GL_KTX_buffer_region	GL_IBM_texture_mirrored_repeat	-
GL_MTX_fragment_shader	GL_KTX_buffer_region	-
GL_NV_texgen_reflection	GL_NV_blend_square	-
GL_SGIS_multitexture	GL_NV_copy_depth_to_color	-
GL_SGIS_texture_lod	GL_NV_evaluators	-
WGL_EXT_swap_control	GL_NV_fence	-
-	GL_NV_fog_distance	-
-	GL_NV_light_max_exponent	-
-	GL_NV_multisample_filter_hint	-
-	GL_NV_occlusion_query	-
-	GL_NV_packed_depth_stencil	-
-	GL_NV_point_sprite	-
-	GL_NV_register_combiners	-
-	GL_NV_register_combiners2	-
-	GL_NV_texgen_reflection	-
-	GL_NV_texture_compression_vtc	-
-	GL_NV_texture_env_combine4	-
-	GL_NV_texture_rectangle	-
-	GL_NV_texture_shader	-
-	GL_NV_texture_shader2	-
-	GL_NV_texture_shader3	-
-	GL_NV_vertex_array_range	-
-	GL_NV_vertex_array_range2	-
-	GL_NV_vertex_program	-
-	GL_NV_vertex_program1_1	-
-	GL_SGIS_generate_mipmap	-
-	GL_SGIS_multitexture	-
-	GL_SGIS_texture_lod	-
-	GL_SGIX_depth_texture	-
-	GL_SGIX_shadow	-
-	GL_WIN_swap_hint	-
-	WGL_EXT_swap_control	-

Now let's turn to the card.

Card

This is not a preproduction sample but a production card.

It is equipped with an AGP 2x/4x interface and 128 MB of local DDR SDRAM in 8 chips located on both sides of the PCB.

The card uses Samsung memory chips in BGA packages with a 3.3 ns access time, which corresponds to 300 (600) MHz. The memory presumably works at 250-300 MHz.
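
The relation between the 3.3 ns rating and the 300 (600) MHz figure is simply the reciprocal of the cycle time. A small sketch (the function name is ours):

```python
def rated_clock_mhz(cycle_time_ns: float) -> float:
    """Clock frequency in MHz for a memory cycle time given in nanoseconds."""
    return 1000.0 / cycle_time_ns


clock = rated_clock_mhz(3.3)   # a little over 300 MHz
ddr_effective = 2 * clock      # DDR transfers data twice per clock
print(f"{clock:.0f} MHz real, {ddr_effective:.0f} MHz effective")
```

If the memory really runs at 250-300 MHz, the chips work at or below their rating.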

3Dlabs Wildcat VP870

With cooler

The design is very unusual; no doubt the high-speed 256-bit bus is what made it so complicated. First of all, there is a shield protecting against interference:

Unlike the Matrox Parhelia 128MB, the PCB of the Wildcat VP870 is almost empty - there are few additional and buffer elements. The developers decided to show that the PCB consists of 8 layers and made a window with the layers numbered:

There is a DVI-out, so you need a DVI-to-D-Sub adapter to connect two CRT monitors. A TV-out (S-Video) is also provided. On the whole, the PCB is quite expensive, though less so than Matrox's. I think it's possible to make a relatively cheap card on such a PCB.

The memory chips are located around the VPU but at different distances from it. Besides, the distance between the processor and the chips is much shorter, which is why the card looks quite empty - some of the chips are hidden under the cooler. Now look at the VPU:

Although the VPU has a 256-bit memory interface, it comes in an ordinary package, only slightly larger. Despite its very complicated architecture, the processor doesn't heat up much, because the transistor count and process technology are comparable to those of the NV25 and the clock speeds are not very high.

But anyway, such a powerful chip needs an efficient cooler. Take a look at its shape and dimensions.

This is a closed heatsink with a fan offset from the chip's center. Such coolers are installed on GeForce4 Ti cards; in particular, they are typical of MSI and Triplex (differing only in the shape of the cover).

Test system and drivers

Testbed:

Pentium 4 based computer (Socket 478):

Intel Pentium 4 2200 (L2=512K);
ASUS P4T-E (i850) mainboard;
512 MB RDRAM PC800;
Quantum FB AS 20GB HDD;
Windows XP.

Pentium III 1000 MHz based computer:

Intel Pentium III 1000EB;
Chaintech 6OJV2 (i815E);
256 MB SDRAM PC133;
IBM DPTA 20GB;
Windows XP.

The test systems were coupled with ViewSonic P810 (21") and ViewSonic P817 (21") monitors.

In the tests we used 3Dlabs drivers 4.23. VSync was off.

For comparison we used the following cards:

ASUS V8460Ultra (GeForce4 Ti 4600, 300/325 (650) MHz, 128 MB, driver 29.42);
Matrox Parhelia 128MB (220/275 (550) MHz, 128 MB, driver 2.31);
Gigabyte MAYA AP128DG-H RADEON 8500 Deluxe (275/275 (550) MHz, 128 MB, driver 6.118);
Hercules 3D Prophet 9000 Pro (RADEON 9000 Pro, 275/275 (550) MHz, 128 MB, driver 6.118);
ATI RADEON 7500 (290/230 (460) MHz, 64 MB, driver 6.118);
Joytech Apollo Blade Monster Xabre 400 (250/250 (500) MHz, 128 MB, driver 3.03);
Leadtek Winfast A170V (GeForce4 MX 440, 270/200 (400) MHz, 64 MB, driver 29.42);
NVIDIA Quadro4 750XGL (275/275 (550) MHz, 128 MB, driver 28.32(ViewPerf7),29.42(3DS MAX));
NVIDIA Quadro4 900XGL (300/325 (650) MHz, 128 MB, driver 28.32(ViewPerf7),29.42(3DS MAX));
ATI FireGL 8800 (RADEON 8800, 250/300 (600) MHz, 128 MB, driver 3.036).

Driver settings

Here is the main menu of the settings (a brief data panel).

This tab shows both driver versions and supported extensions (OpenGL) and capabilities of the DirectX driver (CAPS).

The OpenGL settings are the richest, because this driver suite is optimized for professional packages working in OpenGL (by the way, below you can choose to optimize the driver for a particular package - the choice is vast). The Direct3D settings are scarce: you can only adjust VSync and enable optimization for games (though it has no effect).

This tab is the most mysterious. The tests will show that the slider's position is very important, especially in games. Shift it to the right (acceleration of the geometry unit) and game performance, already quite low, falls by a further 15-20%; shift it to the left (acceleration of the pixel pipelines) and performance drops a little in professional applications and grows considerably in games.

Test results

2D graphics

Together with the ViewSonic P817 monitor and BNC Bargo cable the card showed excellent quality at the following resolutions and frequencies:


3Dlabs Wildcat VP870	1600x1200x85Hz, 1280x1024x100Hz, 1024x768x120Hz

Such cards are produced only by 3Dlabs, so the usual caveat that 2D quality depends on the particular sample hardly applies. But even if quality does not vary from sample to sample, the pairing of card and monitor, and above all the quality of the monitor and the cable, has a strong effect.

3D graphics, MS DirectX 8.1 SDK - extreme tests

To test various extreme characteristics of the chips we used examples from the latest version of the DirectX SDK (8.1, release), modified for better convenience and control. Let's carry out the tests that are well known to our readers:

Optimized Mesh

This test measures the real maximum triangle throughput of an accelerator. For this purpose it displays several models simultaneously, each consisting of 50,000 triangles. No texturing is used. The dimensions are minimal - each triangle takes just one pixel. Note that the results of this test are unachievable in real applications, where triangles are much larger and textures and lighting are used. The results are given for 3 rendering methods: Optimized - a model optimized for the best output speed (taking into account the size of the chip's internal vertex cache), Unoptimized - the original model, and Strip - the unoptimized model displayed as a single Triangle Strip. Besides, values obtained with software emulation of the vertex pipeline are given to estimate the efficiency of geometry transfer from the processor to the GPU:

With the optimized model, when the memory subsystem has a minimal effect, we measure almost pure transform and triangle setup performance. The Ti 4600 is the leader: 65M triangles/sec is almost twice the result of the RADEON 8500 and Parhelia. The P10 takes second position, a little behind the NV25 because of its lower core clock speed. But we also expect the NV30 to arrive soon, with geometry performance twice that of the NV25. With software geometry calculation forced, the P10 has a considerable advantage, especially with the unoptimized model. NVIDIA's solutions used to be unbeatable here, and now they are smashed to pieces! The main limiting factors in geometry transfer are the AGP bus and the algorithm of accelerator/processor interaction. The P10 probably uses some new methods of optimized transfer, such as lower precision for vertex coordinates and attributes or various geometry compression techniques. With the Strip model the gap doubles again - it seems that each vertex takes up half as much transfer bandwidth.

Vertex shader unit performance

This test allows determining the maximum performance of the vertex shader unit. It uses a complex shader which deals with both type-transformation and geometrical functions. The test is carried out in the minimal resolution in order to minimize the shading effect:

The Ti 4600 is far ahead again. The P10 shares the second position with the RADEON 8500.

Vertex matrix blending

This T&L feature is used for realistic animation and model skinning. We tested blending with two matrices both in the "hardware" version and with a vertex shader that implements the same function. Besides, we obtained results in the software T&L emulation mode:

For some reason the P10 is unable to perform hardware blending based on the vertex shader, probably because of the drivers, since the hardware has everything needed to implement vertex shaders. With fully hardware blending (which a GPU lacking a fixed T&L unit also executes as a shader) the P10 remains in second position.

EMBM

In this test we measure the performance drop caused by environment mapping and EMBM (Environment Mapped Bump Mapping). We set 1280x1024 because this is the resolution where the difference between cards and texturing modes is most discernible:

Well, the test results show that the P10's performance balance is vertex-oriented. It loses consistently in the shading tests, mostly because it lacks memory optimization technologies. It falls behind even with a 256-bit bus! The P10 suffers most of all from EMBM. Matrox's chip is only a bit faster than the P10 in the usual shading modes, but with EMBM it goes far ahead.

Pixel Shader performance

Again we used a modified MFCPixelShader example, measuring the cards' performance in high resolution with 5 shaders of different complexity and bilinear-filtered textures:

The P10 is ahead with the simplest shader thanks to its 64 processors and wider bus. But flexibility has its cost - we have processors that execute commands one after another instead of a pipelined array that delivers a new result on every clock. As shader complexity grows, the performance of the other chips falls in steps, as pipelines are combined, and depends not on the complexity of the shader but on the number of stages in it. The speed of the P10 falls with each new command, i.e. faster than that of competitors with staged pixel pipelines. It also depends on the complexity of the instructions in the shader. So, while leading on the simplest shader, on the most complicated one the P10 outscores only the R200, which is known to be quite slow in shader operations.

So, let's draw the first intermediate conclusion. In the DX 8.1 SDK tests the VP870 looks confident in geometry processing (compared with the DX8 generation of game accelerators) but much weaker in shading. It is clearly designed for professional use. The card will hardly become a strong competitor to the DX9 generation (R300 and NV30), even in geometry.

But we will return to these issues in autumn when it will be possible to test shaders 2.0 and other capabilities of the DirectX 9.0.

3D graphics, 3DMark2001 SE - synthetic tests

All measurements in all 3D tests were done in 32-bit color.

Fillrate

The theoretical limit in this test is 880M pixels/sec for the Parhelia, 1100M for the RADEON 8500 and 1200M for the Ti 4600. The Parhelia comes closest to its peak value thanks to the 256-bit memory bus. The exact core frequency of the P10 is unknown, but given that the chip can write up to 4 pixels per clock, I suppose its limit is very close to that of the Parhelia 512. As the resolution grows, the fill efficiency of the Parhelia falls a bit, which implies that its memory controller is far from perfect.

Remember that the peak values for this test are 3520 (1760) M texels/sec for the Parhelia (the value in parentheses is for 4 pipelines with two texture units each), 2200M for the RADEON 8500 and 2400M for the Ti 4600. In multitexturing the chip's balance is the most important factor. Now the Ti 4600 is the closest to its peak value, and the RADEON 8500 is second. The P10 is far behind all the cards, as it doesn't have several texture units per fill pipeline, so we should expect results comparable to the previous test. I wonder whether having just one texture unit per pipeline will have such a strong effect on the R300 as well?
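
The theoretical limits quoted for both fillrate tests are products of the core clock, the number of pixel pipelines and the number of texture units per pipeline from the comparison table. A minimal sketch of that arithmetic follows; the function names are ours, and the P10 line uses the presumed 200-250 MHz core clock together with the limit of 4 pixels written per clock.

```python
def peak_mpixels(core_mhz, pipelines):
    """Theoretical fillrate in millions of pixels per second."""
    return core_mhz * pipelines


def peak_mtexels(core_mhz, pipelines, texture_units):
    """Theoretical texturing rate in millions of texels per second."""
    return core_mhz * pipelines * texture_units


# (core MHz, pixel pipelines, texture units per pipeline), from the comparison table
cards = {
    "Parhelia": (220, 4, 4),
    "RADEON 8500": (275, 4, 2),
    "Ti 4600": (300, 4, 2),
}

for name, (mhz, pipes, tmus) in cards.items():
    print(name, peak_mpixels(mhz, pipes), peak_mtexels(mhz, pipes, tmus))

# P10: up to 4 pixels written per clock at a presumed 200-250 MHz
p10_range = (peak_mpixels(200, 4), peak_mpixels(250, 4))
print("P10 (estimate)", p10_range)
```

The estimated 800-1000M pixels/sec for the P10 brackets the Parhelia's 880M, which is consistent with the supposition above that the two limits are close.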

Scene with a large number of polygons

In this test you should pay more attention to the minimal resolution, where the fillrate has almost no effect:

With one light source the Ti 4600 is the absolute leader. It performs almost twice as well as the Parhelia and almost reaches the maximum triangle throughput obtained with the Optimized Mesh test from the DX8.1 SDK. The RADEON 8500 is also near the peak value it showed in the SDK test. The results of the P10, and especially of the Parhelia, are far from ideal.

With 8 light sources the Parhelia performs better: as the number of sources grows, its performance falls more slowly than that of the RADEON 8500. But the Ti 4600 is still the leader, and the P10 is an outsider.

Does the problem lie in the drivers or in the low performance of the pool of vertex processors?

Bump mapping

Look at the result of the synthetic EMBM scene:

The poor scores match those of the SDK tests. And now the DP3:

All the same.

Vertex shaders

A clean defeat, in spite of such promising results in the maximum triangle throughput test and in spite of 16 scalar vertex processors. Either the problem is in the drivers, or the vertex hardware is too OpenGL-oriented to compete against DX accelerators, or it is simply too inefficient.

Pixel shader

Taking into account that too low resolutions are limited by geometry and too high ones by the memory bandwidth let's take a look at 1024x768 and 1280x1024:

In case of unsophisticated pixel shaders the P10 manages to have a decent performance, outscoring the Parhelia. The scores obtained in the SDK tests of pixel shaders prove that. But now let's take a look at the Advanced Pixel Shader test.

The situation is different. The lengthier the shader, the worse the P10 performs. It is interesting what the R300 and NV30 will show here.

Sprites

The P10 yields even to the Parhelia. It's obvious that sprites are emulated with standard triangles. On the other hand, such speedy systems of "flat particles" are used mostly for games.

So, the second intermediate conclusion. In the synthetic tests the 3Dlabs P10 loses to its competitors. But that was expected, as the accelerator is not meant for games, and 3DMark2001 is a game benchmark, even in its synthetic tests. Besides, the DX8 drivers are not a priority for the developers of the P10 and are still too weakly optimized to fight against the ATI and NVIDIA solutions.

3D graphics, 3DMark2001 - game tests

3DMark2001, 3DMARKS

In general, in the DirectX 8.1 game tests the 3Dlabs Wildcat VP870 sits between the ATI RADEON 7500 and the RADEON 9000 Pro. But that is because Game4 runs on it, as the card supports pixel shaders. Later we will see that it can be slower than the GeForce4 MX 440 (which has the worst overall score because it lacks Game4 results).

3DMark2001, Game1 Low details

Test characteristics:

Rendered triangles per frame (min/avg/max): 19773/33753/143422
Rendered textures per frame with 16 bit textures (min/avg/max): 7.5/8.8/16.5 MB
Rendered textures per frame with 32 bit textures (min/avg/max): 15.1/17.7/30.3 MB
Rendered textures per frame with texture compression (min/avg/max): 10.7/12.2/21.0 MB

The P10 falls into the last position, even behind the GeForce4 MX 440.

3DMark2001, Game2 Low details

Test characteristics:

Rendered triangles per frame (min/avg/max): 46159/51440/147828
Rendered textures per frame with 16 bit textures (min/avg/max): 8.0/8.8/10.1 MB
Rendered textures per frame with 32 bit textures (min/avg/max): 15.6/17.2/19.8 MB
Rendered textures per frame with texture compression (min/avg/max): 9.3/10.9/13.5 MB

The performance is similar to the GeForce4 MX 440.

3DMark2001, Game3 Low details

Test characteristics:

Rendered triangles per frame (min/avg/max): 16681/21746/39890
Rendered textures per frame with 16 bit textures (min/avg/max): 2.8/4.1/4.7 MB
Rendered textures per frame with 32 bit textures (min/avg/max): 5.7/8.2/9.4 MB
Rendered textures per frame with texture compression (min/avg/max): 5.0/7.2/8.4 MB

The speed of the P10 grows a little bit, but still, it is too low.

3DMark2001, Game4

Test characteristics:

Rendered triangles per frame (min/avg/max): 55601/81714/180938
Rendered textures per frame with 16 bit textures (min/avg/max): 14.9/17.4/20.7 MB
Rendered textures per frame with 32 bit textures (min/avg/max): 28.4/33.5/40.0 MB
Rendered textures per frame with texture compression (min/avg/max): 28.4/33.5/40.0 MB

In this test, with its realistic scene rich in effects, the P10 is finally not the last :-). This time the Matrox Parhelia is the laziest.

On the whole, 3DMark2001 confirms the predictions: the 3Dlabs Wildcat VP870 and its drivers are not tuned for DirectX game applications at all. I don't even know what is to blame: the drivers or the balance of capabilities of the P10. Let's wait for the game accelerator on the P10 promised by Creative Labs.

And now let me console professional designers: they can play their favorite games on this card (though they will have to reduce the resolution for better comfort).

3D graphics, game tests

There are not many flaws. But the Morrowind game, which is the first that forms a water surface through pixel shaders, showed us the following pictures on the 3Dlabs Wildcat VP870 (the screenshots of the RADEON 8500 are given for comparison):

P10

RADEON 8500

The game was updated with the latest patch up to the version 1.02.0722.

There were some distortions in other games (though not critical). Quality will be examined more closely in the next (August) 3Digest (the gallery of screenshots will get pictures from many games obtained on the P10).

Professional tests, SPECviewperf 7.0

So, we have studied the operation of the VP870 in games and DirectX 8. But the main conclusion on the performance of a professional card must be drawn from professional tests. We chose two tests: the new SPECviewperf 7.0 and 3DS MAX 4.26. The new version of SPECviewperf is an excellent professional synthetic test, and 3DS MAX is an excellent example of a DCC application. The first test will show us how balanced the professional capabilities of the card are, and the second one will demonstrate its real advantages in real operation. Detailed descriptions of the tests and test techniques can be found on our site.

So, we carried out the tests the following way: we installed the OS, then drivers of the video cards and then started the tests and made the measurements. The VP870 passed all the tests from the SPECviewperf suite without any problems - no hang-up, no quality losses. Well, this is what we expect from professional, carefully tested and certified drivers. Let's take a close look at the results:

The first test, based on the 3DS MAX engine, makes the VP870 the leader, especially when the geometry optimization mode is set in the drivers, which again proves that the P10's balance is shifted toward geometry. The VP870 is faster than its closest competitor (the 900XGL) by almost 7%. But the latter is positioned in a higher niche and should be compared with the next 3Dlabs card. Certainly, this is not a great advantage, especially considering the excellent scalability of the Quadro4 cards. If a higher-frequency Quadro4 is released, the P10's advantage may disappear. But at present, given the current prices, the P10 looks attractive. 3DS MAX is going to prove or disprove it.

This test mostly deals with shading, and the VP870 shows weak scores. It lags behind the leader by almost 35%, irrespective of the optimization mode. Taking into account that 3Dlabs positions this card as a competitor to the Quadro4 750XGL, the VP870 can be considered the loser in this synthetic test, as it is far behind even the Quadro4 750XGL.

The new test shows an opposite picture. The VP870 outdoes the leading Quadro4 by 6%. The difference between the texture and geometrical optimization modes gets clearer - while in the shading-oriented tests the texture optimization doesn't help much (a fillrate depends not only on textures), in the geometrical ones the geometry optimization has a great effect. That is why in any real applications it makes sense to enable only geometry optimization.

This well-balanced test contains a lot of textures and complex geometry, which is why there is almost no difference between the modes; in general, the VP870 scores the top marks again. The gap between the VP870 and the Quadro4 is 8%. This is a good sign for most real, balanced applications.

In this test the layout changes, and we will compare not VP870 vs Quadro4 but VP870 vs FireGL 8800, because the FireGL 8800 is faster than the Quadro4 900XGL, and, in its turn, the VP870 is more efficient than the FireGL 8800. The VP870 outpaces the FireGL 8800 by 7.6%, and outscores the Quadro4 900XGL by 14.4%. The test carries a miscellaneous load, and the difference between the geometrical and texture optimization modes is just 2%.

In this test the VP870 takes the palm again. In the geometry optimization mode the VP870 outmatches the Quadro4 900XGL by 37%, FireGL 8800 by 47% and Quadro4 750XGL by 56%. And in the texture optimization mode the difference is the following: VP870 vs Quadro4 900 XGL = 26%, VP870 vs FireGL 8800 = 35%, VP870 vs Quadro4 750 XGL = 43%. The optimization modes differ by 8% (in favor of the geometry optimization).
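
The quoted gap between the two optimization modes can be cross-checked from the leads themselves: if the same competitor is beaten by g% in one mode and by t% in the other, the ratio of the two VP870 scores is (1 + g/100) / (1 + t/100). A minimal sketch follows; the function name is made up for illustration, and the inputs are the rounded percentages from the paragraph above.

```python
def mode_gap_percent(geometry_lead_pct, texture_lead_pct):
    """Gap between two modes of one card, derived from its lead over
    the same competitor in each mode (all values in percent)."""
    ratio = (1 + geometry_lead_pct / 100) / (1 + texture_lead_pct / 100)
    return (ratio - 1) * 100


if __name__ == "__main__":
    # (geometry-mode lead, texture-mode lead) over each competitor
    leads = {
        "Quadro4 900XGL": (37, 26),
        "FireGL 8800": (47, 35),
        "Quadro4 750XGL": (56, 43),
    }
    for name, (geom, tex) in leads.items():
        print(name, round(mode_gap_percent(geom, tex), 1))
```

All three pairs give a gap between 8% and 10%, so the figure quoted for the difference between the modes is consistent with the individual leads once rounding of the published percentages is allowed for.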

Now the summary on the SPECviewperf 7.0 tests. The P10-based card from 3Dlabs leads in most tests. As a rule, it works more efficiently in the geometry optimization mode. That is because SPECviewperf focuses mostly on geometry processing in most tests (or rather, real professional applications and tasks do, and the benchmark reflects their typical needs). Besides, it makes sense to leave the geometry optimization mode enabled permanently: it never does harm and often brings a gain.

With the current prices and frequencies the VP870 card is the best choice in its niche (on the SPECviewperf 7.0). Now look at the scores in real applications:

Professional tests, Discreet 3DS MAX 4.26

After the examination in SPECviewperf, which is still a synthetic test (though it excellently emulates the accelerator loads typical of real applications), we are going to study the VP870 in 3DS MAX. In the 3DS MAX tests we left out the modeling of a designer's work, but we are sure this aspect will be clear from the scores. As all test scenes are rendered and displayed correctly (thanks to the drivers' certification), it makes no sense to include all the screenshots in the review. We are going to show only the anti-aliasing operation.

Let's start with the tests on standard demo scenes which come with the 3DS MAX. "Special driver" stands for a driver from the video card maker which is meant only for operation in the 3DS MAX. The ATI's drivers are called MAXIMUM, the NVIDIA's drivers are called MAXTREME, and the 3Dlabs' driver for the VP870 has no name because it has no specific management functions and uses only standard 3DS MAX settings. "OpenGL" stands for operation through a standard OpenGL driver. As you remember, in the SPECviewperf 3dmax-01 based on the 3DS MAX engine the VP870 is the first and, therefore, is the most obvious candidate for a rank of the best card for this 3D modeling system. So, let's dot the "i's" and cross the "t's".

The 4views scene shows us simultaneous rendering in 4 projection areas. Here the VP870 performs on the level of its competitors from NVIDIA, but only when the special driver is used: in the different modes it edges out the Quadro4 750XGL and falls a bit behind the Quadro4 900XGL. Under OpenGL the card loses to the Quadro4 line. Well, the NVIDIA solutions support a greater number of OpenGL extensions (see the table of OpenGL extensions).

The test of geometrical capabilities proves that the VP870 remains a leader in the sphere of complex geometry processing. It has a small advantage in case of the OpenGL and a big gain with the special driver. Well, 3Dlabs worked well on the special driver - it is not just a wrapper but it improves image displaying considerably.

The growing complexity of geometry demonstrates that the VP870 is a leader till a certain level only; after that the card is limited by the AGP bus's throughput or some other aspects of interaction with the processor.

The tests with multiple SPOT light sources show that this kind of lighting is not advantageous for the VP870, and although the absolute speed is high enough for comfortable handling of these sources, the card is still in last position. New drivers will probably help, because the card handles other light sources skillfully (they are obviously implemented as special shaders).

The ATI's card copes best of all with a great number of DIRECT light sources in the OpenGL (as you remember from the DirectX synthetic tests this chip is equipped with a hardware unit of fixed T&L, which provides a certain advantage as the number of sources grows up). The VP870 is close on the heels of the leader. But with the special driver the situation changes - it allows the 3Dlabs's solution to regain the crown. It is interesting that the different optimization modes do not affect the results. The difference of 1-2 fps can be considered just as measuring inaccuracy.

The last kind of light sources used in the 3DS MAX is OMNI. In the OpenGL mode all the cards perform equally, and with the special driver the difference is inconsiderable as well.

Exactly in this mode optimization of the drivers for different operating modes of the VP870 card matters a lot - the chip's programmability becomes apparent. Under the OpenGL the results are expected, but under the special driver the scores are really striking. The advantage is over 40%! Note that the other cards have a bit higher scores in the OpenGL mode. Well, the successful realization of the special driver by 3Dlabs uses wide capabilities of flexible programming of the chip to the full. This test reveals the benefit of the flexibly programmable architecture despite a strong dependence of this test on the fillrate (which is not a trump of the P10).

Now the texturing tests. This time the scores depend heavily on the optimization mode of the VP870! The special driver allows the 3Dlabs card to climb to the top. The standard OpenGL drivers are not so brilliant: 3DS MAX uses them without accounting for the chip's special character.

This test uses a lot of textures and very complex geometry. The card shows nothing extraordinary though it copes with its duties decently.

This last texture mode contains a lot of textures and complex geometry as well. Nevertheless, proportionality is evident, and the VP870 card is an absolute leader in the texture optimization mode. The scores in the geometry optimization mode are worse, especially under the OpenGL, but in this test we estimate texturing which looks nice.

An important part of the testing is the ability of the card to draw a lot of straight lines correctly and quickly, i.e. the wireframe mode. Under both OpenGL and the special driver the outcome doesn't depend on the optimization mode, and the scores under OpenGL are quite low. Under the special driver, which uses a somewhat different algorithm better suited to the P10, the speed grows several times. This gain allows the card to reach the top.

Now let's turn to Line Antialiasing.

As you can see, the speed doesn't fall down. And the situation doesn't differ much from the previous test. The line antialiasing is a pleasant and free add-on for the new 3Dlabs' solution.

Below are screenshots of the anti-aliasing in different operating modes.

First of all, look at the screenshot with the line antialiasing disabled.

Geometry optimization, OpenGL:

Geometry optimization, special driver:

Absolutely no visual differences. Now let's enable anti-aliasing:

Geometry optimization, OpenGL:

Geometry optimization, special driver:

Well, the screenshots show that the antialiasing looks better under the OpenGL than under the special driver (here it looks rougher). The chip fulfills anti-aliasing quite fast - the problem was successfully solved by the engineers, and you can always use it at no cost.

The card has an enormous potential which will probably be made use of with new drivers.

Summary on the 3DS MAX. The tests in SPECviewperf pleased us, and I hoped to get a new absolute leader. But the real applications show it's not that simple. The card is of high quality and efficient, but its leadership is ambiguous. In different tests the results depend on many surrounding conditions, and an overwhelming advantage is not possible. At the very least, we should admit that the special driver is good and that scores get markedly higher when it is used.

Conclusion

First of all, this card is a perfect solution for professional designers working in the sphere of 3D modeling. Priced the same as the NVIDIA Quadro4 750XGL ($580 in August, according to www.pricewatch.com), the 3Dlabs Wildcat VP870 doesn't fall behind in many tests and even outshines the more expensive Quadro4 900XGL.
The card was originally developed as a professional solution, and it performs excellently in modes and scenes where geometry processing speed prevails. The same applies to the modes with anti-aliasing, especially line AA.
The pixel and vertex shaders implemented in the P10 are not widely used in professional graphics, but traces of the chip's flexible programmability are noticeable in the highly optimized driver for the 3DS MAX. Be that as it may, the potential of this card has not been fully revealed; we should wait for OpenGL 2.0, DirectX 9 and the various high-level languages for programming graphics accelerators.
We will continue studying the 3Dlabs Wildcat VP870 in different professional applications in Professional Cards Roundups (in particular, this autumn is going to bring a lot of new interesting materials).
In game applications the 3Dlabs Wildcat VP870 is not that powerful, but that is to be expected from a professional card. The drivers optimized for professional packages (just look at the list of ICD OpenGL extensions), the weak (unoptimized) DirectX 8 driver and the peculiarities of the P10 itself, designed for non-game applications, do not allow for greater speed in games.
We consider that less than 50% of the gaming potential of this card is being used, which is why we have no choice but to wait for a gaming card on the P10 or for a special game-oriented modification of it with higher shading performance. I hope such a card will have drivers more suitable for games.

It should also be noted that the card's low DirectX performance in comparison with the other contestants may rise significantly with the release of DirectX 9.0, which is why we do not close this subject but just put it off.

Andrey Vorobiev (anvakams@ixbt.com)
Alexander Medvedev (unclesam@ixbt.com)
Alexander Kondakov (kondalex@ixbt.com)
