Now let's analyze cases where games do not allow recording and playing back user demos. Although demos do not always reflect real performance, they are close to what happens in a game. In some cases you can use scripted scenes rendered by the game engine, but such scenes reflect gaming performance poorly: the results will differ significantly from in-game FPS. They can demonstrate the relative performance of various graphics cards rendering those scripted scenes, but they won't reflect performance in the game itself.
We'll try to recreate a tester's routine when benchmarking games such as Oblivion, Need for Speed, and BioShock, which offer neither demo recording nor scripted scenes that reflect gaming performance. Testers typically walk through a certain part of a game level, for example from a fence to an oak tree. Let's see what happens if we load a selected game level and complete the same route several times. We'll deliberately vary each pass slightly in order to evaluate the possible spread of results caused by the tester errors mentioned above.
We'll start with BioShock, using the first full game level. The graph shows average frame rates for four passes through a certain part of this level.
The graphs show that the time it takes to complete the selected route differs, but the difference is really big in only one case (the red line). We can even discard this pass, as it deviates too much. The other lines have generally matching FPS values; we see only small differences in some FPS peaks and slightly different timing of the elevated frame rate in the middle of the graph.
However, even though the graphs look similar, the average values vary from 39.4 to 43.1 FPS, a difference of 9.4%. In our opinion, that's too high for performance measurements. Of course, one can always discard the 43.1 value to bring the measurement error within acceptable limits. But how do you tell right results from wrong ones? There is no universal method, only multiple runs, thorough analysis, and averaging of results.
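The spread figures quoted throughout this article follow from one simple formula: the difference between the highest and lowest average FPS, taken relative to the lowest value. A minimal Python sketch, using the two BioShock extremes quoted above:

```python
def spread_percent(averages):
    """Relative spread between the highest and lowest average FPS,
    expressed as a percentage of the lowest value."""
    lo, hi = min(averages), max(averages)
    return (hi - lo) / lo * 100

# The BioShock extremes quoted above: 39.4 and 43.1 FPS
print(round(spread_percent([39.4, 43.1]), 1))  # -> 9.4
```

The same function applies to any number of passes, so a tester can feed in all four averages at once and see immediately whether the spread is tolerable.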
The next game is Call of Duty 4. Its single-player mode cannot be used for tests because of bugs in its code. The multiplayer mode can be used, but it's not quite what testers need. We decided to check the difference in average frame rates in the bonus level aboard an airplane:
The general FPS line is similar in all passes; the biggest difference appears at the end of the level. Discarding the short red pass, we get a 9.3% difference between the extreme values, which is still too much for serious tests. Even if we discard another pass by some criterion (the purple one, with the highest average FPS), the measurement error stays above 3%.
And now let's analyze two graphs from DiRT, which offers no tools to benchmark performance at all; you cannot even record and play back demos. We tested two different modes: a two-car race on a looped track, which depends more on the graphics card, and a race with several buggies, which depends more on the CPU. Will the results differ much between the modes? Here is the first test:
The frame rate is generally even, and the graphs look alike over time, except for the red line. If we take it into account, we get a 4.4% difference between the extreme average frame rates: 28.6 and 27.4 FPS. If we discard it, the error drops to about 1%!
In this case using FRAPS is justified. The only requirement is to run many tests to recheck the results and make sure they are valid. Now let's look at the other DiRT mode. Can it be used for tests? The more cars (3D objects) on the screen, the bigger the variation in GPU load, as it's impossible to run identical races.
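Besides its on-screen FPS counter, FRAPS can log per-frame timestamps during a benchmark run. Assuming a frametimes log in a two-column `Frame, Time (ms)` layout with cumulative timestamps in milliseconds (check the format of your own logs, this layout is an assumption here), the average FPS of a pass can be recomputed from the raw data like this:

```python
import csv
import io

def average_fps(frametimes_csv):
    """Average FPS from a FRAPS-style frametimes log: a CSV with
    'Frame, Time (ms)' rows, where Time is a cumulative timestamp
    in milliseconds (layout assumed, verify against your log)."""
    reader = csv.reader(io.StringIO(frametimes_csv))
    next(reader)  # skip the header row
    times = [float(row[1]) for row in reader if row]
    elapsed_ms = times[-1] - times[0]
    return (len(times) - 1) * 1000.0 / elapsed_ms

# Synthetic log: 91 frames spaced 40 ms apart, i.e. a steady 25 FPS
log = "Frame, Time (ms)\n" + "\n".join(
    f"{i}, {i * 40}" for i in range(91))
print(round(average_fps(log)))  # -> 25
```

Recomputing the average from raw frame times, rather than trusting a single on-screen number, also lets a tester trim the warm-up seconds at the start of a pass before averaging.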
But even here the situation is similar: the graphs are less identical, but still alike. If we include the red line, we get an unacceptable 9% performance difference between 20.5 and 18.8 FPS. If we discard it, we get 2.6%, which is close to the acceptable range, although good tests should not show even the spread between 19.3 and 18.8 FPS.
We can conclude that DiRT can be used for benchmarking with FRAPS, but it requires additional runs and thorough analysis of the results. Let's proceed to Need for Speed: ProStreet:
Even in this game, where each race is unique, all frame-rate graphs show a similar picture; only the blue graph is a tad different. All the graphs are identically wavy. But let's look at the differences in average frame rate in the latest Need for Speed.
Taking all four runs into account, we get 24.7 and 22.2 FPS, an 11% difference, which is absolutely unacceptable for serious tests. Even if we discard the maximum of 24.7, we still get almost a 6% difference between the remaining results. Of course, we could also discard 22.2 FPS to get 1.7%, which would be a good result, but that looks like tampering with the data.
The next game is rather old: TES4: Oblivion. Many testers like to use it to measure 3D performance. We've deliberately chosen an indoor scene, as outdoor scenes have even more factors affecting FPS; we already tried this game outdoors in the previous article. Here are four passes through an Oblivion level:
These FPS graphs differ very much, jumping around at random. Of course, we deliberately added some variation to the test: we changed the original trajectory and turned the camera. But the same thing happens in real tests, because no one can repeat the original path with 100% precision several times on different days.
Let's discard the red line, as it clearly falls out of line with the others. We still get a 15% difference between the maximum of 63.8 FPS and the minimum of 55.4 FPS. That's far too much; such "tests" cannot be used to benchmark graphics cards. Let's also discard the highest result, as in the previous cases. We still get a 9% difference between the seemingly close values of 60.4 and 55.4 FPS. Too much again: Oblivion cannot be used for our tasks.
And the last game is Test Drive: Unlimited. It's notable for its even frame rate, so it could potentially be handy for benchmarking graphics cards, even though it does not allow recording and playing back demos. Let's see:
The graphs are very close to each other. The frame rate is flat at about 25 frames per second, dropping only at the start, when the tires are burning and smoking. The average FPS values also suggest that the results may be good.
But that's not so. The difference between the maximum of 24.3 FPS and the minimum of 22.8 FPS is 6.6%, smaller than in all previous cases but still too high for an acceptable measurement error. What if we discard the highest result? The spread is still 5%, too much for proper tests, where the measurement error must stay below 2-3%.
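The pass/fail logic applied throughout this article can be stated explicitly: compute the relative spread of the per-pass averages and compare it against the 2-3% error band. A sketch of that check (the 3% cut-off is our choice within that band, and the 27.7 FPS value for DiRT with the red pass discarded is an illustrative fill-in for the ~1% case; only 22.8, 24.3, and 27.4 are quoted in the text):

```python
def spread_percent(averages):
    """Relative spread between the extreme averages, in percent."""
    lo, hi = min(averages), max(averages)
    return (hi - lo) / lo * 100

def runs_acceptable(averages, threshold=3.0):
    """True when the run-to-run spread stays within the error band
    the article treats as acceptable (2-3%; 3.0 is our cut-off)."""
    return spread_percent(averages) <= threshold

print(runs_acceptable([22.8, 24.3]))  # Test Drive: Unlimited -> False
print(runs_acceptable([27.4, 27.7]))  # DiRT, red pass discarded -> True
```

A check like this only flags an unusable set of runs; it cannot decide which individual pass deserves to be discarded, which is exactly why the article insists on many repeated runs and manual analysis.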