This article was originally intended to continue our analysis of shader units and to show how their number affects NVIDIA G8x performance in modern games. We planned to change the parameters of a G80-based graphics card to make it resemble the mid-range G84-based products. The plan was to use RivaTuner by Alexei Nikolaychuk to leave only 32 unified processors active in the GeForce 8800 and to change the GPU and memory clock rates.
The GPU clock rate was to be reduced to the level of the G84, and the video memory clock rate had to be adjusted to scale the memory bandwidth of the GeForce 8800 down to the level of the GeForce 8600 (the memory clock must be three times lower, because the memory bus is 384/128 = 3 times wider). Theoretically, such cards would have differed only in the number of ROPs (24 versus 8), as well as in on-die caches and other optimizations.
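The clock scaling above follows from the usual relation between peak bandwidth, bus width and effective memory clock. Here is a minimal Python sketch of that arithmetic. The function names are hypothetical, and the 1800 MHz effective clock is an illustrative assumption, not a figure from our testbed:

```python
def bandwidth_gb_s(bus_width_bits, effective_clock_mhz):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_clock_mhz * 1e6 / 1e9


def matched_clock_mhz(wide_bus_bits, narrow_bus_bits, narrow_clock_mhz):
    """Clock at which the wider bus delivers the same bandwidth as the narrower one."""
    return narrow_clock_mhz * narrow_bus_bits / wide_bus_bits


G80_BUS_BITS = 384  # from the text
G84_BUS_BITS = 128  # from the text
ASSUMED_G84_CLOCK_MHZ = 1800.0  # hypothetical effective memory clock

scaled_clock = matched_clock_mhz(G80_BUS_BITS, G84_BUS_BITS, ASSUMED_G84_CLOCK_MHZ)
print(f"384-bit bus must run at {scaled_clock:.0f} MHz "
      f"to match {bandwidth_gb_s(G84_BUS_BITS, ASSUMED_G84_CLOCK_MHZ):.1f} GB/s")
```

Whatever clock is assumed for the 128-bit card, the 384-bit card needs one third of it to reach the same bandwidth.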
Our objective was to determine how much the different number of ROPs affects performance, and how much the shader units of the G84 and G80 differ. We wanted to analyze the performance of modern games using NVIDIA PerfKit. Unfortunately, our plans never got off the ground: the hardware performance counters of the G80, which this analysis requires, do not work when some shader units are disabled, and the analysis would have been useless without them. So we decided not to throw away our test results and to use them in a brief comparative analysis of performance and some other parameters of modern 3D games.
One more idea was discarded in the process: stream_out_busy readings. This counter proved to be practically useless, even though most of our tests are Direct3D 10 applications. The counter indicated that stream output units were used in only one game, Lost Planet: Extreme Condition. Moreover, these units were loaded at less than 1%, so we decided to discard this counter as well.
Testbed configuration and settings
We used the following testbed configuration:
We used only one video mode with the most popular resolution, 1280x1024 (or 1280x960 for games that do not support the former), with 4x MSAA and 16x anisotropic filtering. Both features were enabled in game options; nothing was changed in the control panel of the video driver.
Our set of game tests includes recent projects. We gave preference to games that support Direct3D 10 or use interesting new 3D techniques. Here is the full list: Call of Juarez DX10 benchmark, Company of Heroes, S.T.A.L.K.E.R.: Shadow of Chernobyl, Lost Planet: Extreme Condition DX10 benchmark, Colin McRae Rally: DiRT, PT Boats: Knights of the Sea DX10 benchmark, SEGA Rally Revo, Clive Barker's Jericho. Several of these games have no built-in way to run demos, so we had to test them manually, which gives only approximate results. Additional software: NVIDIA PerfKit 5, RivaTuner 2.05, and Microsoft PIX for Windows from the DirectX SDK.
Unfortunately, our tests did not include such interesting applications as World in Conflict, BioShock, Medal of Honor: Airborne, Test Drive Unlimited, TimeShift, Call of Duty 4: Modern Warfare, Half-Life 2: Episode 2, etc. Some demos and games did not arrive in time, and two projects, BioShock and Test Drive Unlimited, failed to run under the PIX debugger.
Lost Planet: Extreme Condition
We start our analysis with one of the most technically advanced games launched in 2007. Lost Planet: Extreme Condition came to the PC from the Xbox 360, and its engine received many changes for Direct3D 10 GPUs. Compared to the console version (which is a high-tech game as well), it has some new features: an FP16 frame buffer, higher quality motion blur and depth of field, fur, more samples and improved shadow map filtering, ambient occlusion, soft particles, advanced parallax mapping, etc.
Lost Planet was tested with a built-in demo consisting of two game levels. They differ considerably: there are not many objects in the first level, so its performance is limited mostly by the graphics card, while the second level contains a lot of objects, so its performance is limited by both the CPU and the GPU.
In our tests we used the DirectX 10 demo of the game with a built-in benchmark, which does not reflect the real performance of the release version with all patches applied. That's why the frame rate can be evaluated only in relative terms; the rendering speed of the latest release is much higher. We can see that the G84 is heavily outperformed by the G80. Let's locate the bottleneck. First, the GeForce 8600 may be held back by its video memory size, which is half of what the game uses. Second, judging by the results, rendering speed is affected by the number of shader units and TMUs: the difference in performance is proportional to the difference in units.
Let's have a look at the interesting results. The average number of draw calls does not reflect the real picture, because it depends on the level: there are not many calls in the first scene, while in the second scene this number exceeds 4000, which is a lot even for state-of-the-art games. The amount of geometry in this game is above average, but I don't quite understand the difference between primitive count and setup triangle count. Both GPUs were constantly busy, so performance does not depend on the CPU. Geometry and raster units are not loaded much, while texture and shader units are working at full capacity. We haven't seen such active usage of shader units before. That's what I call good optimization: performance depends on the GPU only.
PT Boats: Knights of the Sea
This benchmark was released recently, and we are interested in many of its parameters. Unlike most of our tests, this is not an FPS. Besides, the demo reveals the innovations used in the game. It has one of the best (if not the very best) visualizations of water surfaces: a geometry surface with excellent dynamics, reflections and refractions. Other visuals are also very good: well-filtered soft shadows, active post processing, and complex geometry models of vehicles. The demo is written for the Direct3D 10 API and uses some of its features, such as geometry shaders.
In the course of our tests we ran the DirectX 10 benchmark with maximum quality settings for a given resolution. Interestingly, these settings affect LOD and do not deteriorate render quality much.
Performance of the GeForce 8600 dropped very low in this case. Even the GeForce 8800 is slow with maximum settings. If we consider the minimum frame rate, the G84 is a catastrophe. Of course, we take into account that we use a special driver, which reduces performance a little together with the PIX debugger. But such low performance cannot be blamed solely on the driver. I think that the main factor here was the high video memory requirement, because the demo requires up to 800 MB with maximum settings! Other characteristics of GPUs and cards also affect performance, of course. From time to time, performance may be limited by computing and texturing power.
Interestingly, the number of draw calls is not that large. Perhaps the program was optimized after all. The amount of processed geometry is large even for a modern game, which may be the effect of LOD being almost disabled. Besides, the GeForce 8800 did not work at full capacity: it was idle almost half of the time, and the total performance was apparently limited by the CPU. The demo loads TMUs and ALUs well, but the results do not reach 100%, so performance is not limited by a single component. Judging by the results, the load on geometry units is not high, despite the number of triangles.
Colin McRae Rally: DiRT
This is quite a new game whose performance depends primarily on the graphics card, but it still needs some optimizations. The latest game of the CMR series demonstrates complex geometry, high quality particle and reflection effects, and powerful post processing effects. It's a multiplatform game, but it still uses advanced technologies, because the latest generation of consoles has powerful GPUs on the level of the previous generation of PC GPUs.
As the game does not have a built-in benchmark and does not allow recording gameplay, we had to drive the same track under the same conditions to analyze performance. Such measurements are highly inaccurate, so each graphics card was tested three times, and the results were averaged.
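The averaging itself is trivial, but it helps to look at the spread between runs as well, since it shows how much a manual drive can be trusted. A minimal Python sketch follows; the function name and the FPS values are hypothetical and are not our measured results:

```python
from statistics import mean


def summarize_runs(fps_runs):
    """Return (average FPS, relative spread) for repeated manual runs."""
    avg = mean(fps_runs)
    # Spread is the gap between the best and worst run, relative to the average.
    spread = (max(fps_runs) - min(fps_runs)) / avg
    return avg, spread


# Hypothetical average FPS from three manual drives of the same track.
runs = [30.0, 33.0, 27.0]
avg, spread = summarize_runs(runs)
print(f"average: {avg:.1f} fps, spread between runs: {spread:.0%}")
```

A large spread relative to the gap between two cards would mean that more runs are needed before comparing them.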
The frame rate of this game with maximum settings is irritatingly low. You can play it on the G80-based card, although the game can be too slow sometimes. But if you have the GeForce 8600 GT, you'll have to reduce image quality in the game settings to make it run smoothly. The G84 is approximately three times slower than the G80 in this game, which is the usual situation. DiRT performance with the GeForce 8800 is sometimes limited by the processor, while the GeForce 8600 is always loaded to the brim. The low performance of the latter is caused by fewer shader units and by insufficient video memory (for high quality settings).
In our opinion, the game depends on the processor because of its many draw calls, up to 3600 in extreme cases. In the case of the G84, the high CPU requirements of the game are hidden by the slow graphics card. But the G80-based card clearly shows that the game depends on the CPU: the GPU is idle about 15% of the time, and its units are not fully loaded. There is a lot of geometry in this game, although geometry units are not heavily loaded. The same goes for raster units. This cannot be said about texture and shader units, which work at 40-80% of their capacity, depending on their type and the GPU. In this case the load on shader units is higher than on texture units.
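The way we read the idle time and unit load counters here and in the other games can be summed up as a rough rule of thumb. The Python sketch below is only an illustration of that reasoning: the function, the unit labels and both thresholds are our assumptions, not PerfKit names or values, and the sample loads are merely picked from the 40-80% range mentioned above.

```python
def classify_bottleneck(gpu_idle, unit_load, idle_threshold=0.10, busy_threshold=0.90):
    """Rough reading of averaged counters, all given as fractions in [0, 1].

    gpu_idle  -- fraction of time the GPU was waiting for work
    unit_load -- dict mapping a unit label to its average load
    Both thresholds are arbitrary illustrative values.
    """
    busiest = max(unit_load, key=unit_load.get)
    if gpu_idle >= idle_threshold:
        # A GPU that waits this much is being starved by the CPU.
        return f"CPU-limited part of the time; busiest GPU unit: {busiest}"
    if unit_load[busiest] >= busy_threshold:
        return f"GPU-limited by {busiest} units"
    return "GPU-limited, no single unit saturated"


# Illustrative values: 15% idle time, shader load above texture load.
print(classify_bottleneck(0.15, {"shader": 0.60, "texture": 0.40, "rop": 0.20}))
# A card that is never idle and has nearly saturated shader units.
print(classify_bottleneck(0.0, {"shader": 0.95, "texture": 0.60, "rop": 0.20}))
```

Real counter readings vary from frame to frame, so such a classification makes sense only for averages over a whole test run.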
Call of Juarez
This game is older, but it has a high-tech engine, especially considering that we used a benchmark of the updated Direct3D 10 version of the game. This version is technologically different from the initial release and offers many new features: a lot of objects and geometry, an improved particle system that uses geometry shaders, high quality soft shadows with improved shadow map filtering, new textures, advanced parallax mapping, and alpha to coverage.
We used a stand-alone benchmark for our tests, whose technologies and optimizations are similar to those of the game with the latest patch. Performance of Call of Juarez is almost completely limited by the graphics card. The game contains a lot of geometry and pixel processing, which is done by unified processors. The CPU could limit performance only if there were many draw calls, but the game is optimized to reduce their number.
The updated engine and support for new effects made this game even slower. That's the effect of maximum settings and a relatively high resolution, but less than 30 fps on a high-end graphics card is a bit much... However, the game was modified in cooperation with AMD, so solutions from this company perform better in it, while the updated code is too slow on NVIDIA cards. Perhaps such geometry algorithms are better suited for the AMD architecture. The G84-based card is approximately three times slower than the G80-based card. In both cases performance does not depend on the CPU, only on the graphics card.
The amount of video memory used by the game reaches 530-540 MB, which contributed to the low performance of the GeForce 8600. There are not many draw calls, which can be explained by certain optimizations. The number of processed geometry primitives is large: the number of polygons in a frame may reach almost three million! ROPs and geometry units are not heavily loaded, and the average load on texture units is about one quarter. But we can clearly see that performance is limited by shader units. This shows that rendering speed in this game depends primarily on the number and frequency of unified processors, which process vertex, pixel, and geometry shaders.
Company of Heroes
This is another game that is not a shooter: it's a real-time strategy. Unfortunately, we used the Direct3D 9 version of the game, without the Direct3D 10 features added with patches. The game cannot boast many modern effects, but it still offers the most popular features: soft shadows, post processing, bump mapping, high quality lighting and textures, and particles.
We've mentioned many times that the benchmark built into Company of Heroes does not reflect gaming performance, because it uses a scripted scene, not gameplay. But it's still interesting to see the difference in cinematics rendering speed on various graphics cards.
The game is rather old (we haven't tested patches with Direct3D 10 support), so the frame rate is relatively high. The low minimum FPS results can be explained by dynamically loaded content and peculiarities of the engine. The G84 is only two times slower than the G80. Judging by the gpu_idle readings, the latter was idle more than one third of the time, that is, it was limited by the CPU.
As for the other counters, we are surprised to see so few vertices and triangles per second. On the other hand, video memory usage is high: all textures and models seem to be loaded at once. There are not many batches, and the small amount of geometry is an indirect sign of that. As usual, ROPs, the input assembler, and geometry units are not loaded much, while texture and shader units are very active, especially in the G84. Perhaps performance of the scripted scene in this benchmark is limited sometimes by texture units and sometimes by shader units. At times the ALU load was almost full even on the G80.
S.T.A.L.K.E.R.: Shadow of Chernobyl
We included this game in our tests because of its popularity and its technically distinctive engine. It uses some interesting technical solutions: deferred shading, a lot of per-pixel light sources, filtered soft shadows from several sources, simple parallax mapping on many surfaces, active post processing, etc.
Fortunately for testers, the developers added an option to record and play demos, as well as a benchmark in which gameplay is not recorded and the camera simply flies around the level. It's not an ideal test, but it's better than nothing. The game does not allow multisampling because of deferred shading, so we had to restrict ourselves to the usual 1280x1024 mode without it.
Interestingly, the performance difference between such different graphics cards was not that large, although the 3:1 frame rate ratio is preserved once again. Besides, the rendering speed of the GeForce 8800 GTX in S.T.A.L.K.E.R. was limited by the CPU: the GPU was idle almost one quarter of the time. Judging by the FPS values, you can play the game on the G80-based card, but the G84 is not powerful enough for maximum settings. Game performance depends mostly on the CPU and on the shader and texture units of the GPU.
There were quite a few D3D draw calls: 2000 on average, 3500 at most. The amount of geometry processed per frame is average for these days, but the input assembler is loaded more than usual. This may indicate that other units of the GPU fetch a lot of data from memory. The heavier load on the input assembler in the G80 compared to the G84 can be explained by the larger volume of input data at a higher frame rate.
Interestingly, the game actively uses both texture and shader units, but the latter have more work and a greater effect on the overall rendering speed. Even though the Direct3D 9 engine of the game does not allow multisampling, the ROP load is above average, which points to active post processing and several render buffers.
SEGA Rally Revo
This is another rally game on our list. Unlike Colin McRae Rally, it does not have that many interesting technologies, but it is a typical example of an average multiplatform game. I cannot say that this engine is very simple: it supports shadow maps, dynamic reflections, bump mapping, and post effects. But it's a plain engine compared to other games, so it is all the more interesting to see how various graphics cards cope with it.
The game does not allow running tests or recording gameplay. So we had to use the same scheme as with DiRT: we drove one track several times and then averaged the results.
We can see that the 3D engine of this game is easier to process than in most previous cases. The GeForce 8800 demonstrates a comfortable frame rate (the low minimum FPS is not a problem in this case), while the GeForce 8600 is slower, but by less than usual: a little over two times. This happens because game performance is limited by the CPU. Judging by gpu_idle, the G80 was idle over 30% of the time. When the speed does depend on the GPU, performance is limited by shader units. The other parts of the GPU do not slow rendering down.
The demo uses a little over 300 MB of video memory, which almost fits into the 256 MB of the GeForce 8600, so memory size shouldn't affect its performance much. There are a lot of batches, which can be explained by the multiplatform origin of the game and the lack of PC optimizations. The average number of draw calls is almost 1500, and over 3000 in extreme cases. The game processes a lot of geometry (vertices and polygons), even though image quality is not outstanding. The ROP load is average. Geometry and input assembler counters show strange results, which are hard to explain. Texture and shader counters provide interesting readings, as usual. Texture units are loaded to one third of their capacity, while shader units are loaded to half of their capacity in the G80 and almost fully in the G84, where the average ALU load exceeds 90% and the peak load reaches 95%.
Clive Barker's Jericho
The last game in our review does not impress with image quality, and it was not intended to be on the cutting edge of technological progress. It's a mid-range game, but it uses a lot of modern technologies: parallax mapping, a lot of post processing (depth of field, motion blur, bloom), filtered shadow maps, average geometry with textures, and a lot of per-pixel processing. Let's see how our graphics cards perform in another multiplatform project...
Our demo version of the game does not offer any benchmark options. We had to walk through the same level several times and then average the results.
This situation differs from the one with the previous multiplatform game. Now the difference between the cards is strictly threefold, as it should be. The GeForce 8800 is on the verge of being slow, and the GeForce 8600 cannot cope with maximum quality settings. Note that the frame rate is not limited by the graphics cards alone: even the G84 was idle for some time. The CPU seems to determine performance at some moments. In other cases it is governed by unified processors, which are busy 70-80% of the time.
Other GPU units are used in the usual manner: ROPs less than 20%, geometry and input assembler less than 10%, texture units 30-40%. This multiplatform game is limited by shader processors, and it's not the first time that performance is limited by these units. The game requires more video memory than usual, over 400 MB. The number of draw calls and the amount of geometry in a frame are also above average: 1000 calls and less than 300000 primitives. So we've got a modern game with average requirements.
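Video memory usage came up in several games above, so it is convenient to put those figures next to the 256 MB of the GeForce 8600. The Python sketch below does just that. The usage values are the approximate figures quoted in this article (the upper bound where a range is given, and rounded down where we wrote "over"), while the helper name is hypothetical:

```python
CARD_MEMORY_MB = 256  # GeForce 8600, as stated in the SEGA Rally Revo section

# Approximate video memory usage quoted in the text, in MB.
usage_mb = {
    "PT Boats: Knights of the Sea": 800,
    "Call of Juarez": 540,
    "Clive Barker's Jericho": 400,
    "SEGA Rally Revo": 300,
}


def overflow_mb(used_mb, card_mb=CARD_MEMORY_MB):
    """Megabytes that do not fit into local video memory (0 if everything fits)."""
    return max(0, used_mb - card_mb)


for game, used in sorted(usage_mb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{game}: about {used} MB used, {overflow_mb(used)} MB over {CARD_MEMORY_MB} MB")
```

The larger the overflow, the more data presumably has to be fetched from outside local video memory, which agrees with the GeForce 8600 suffering most in PT Boats and least in SEGA Rally Revo.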
Let's draw conclusions on all games at once:
Alexei Berillo (firstname.lastname@example.org)
November 14, 2007