iXBT Labs - Computer Hardware in Detail






Real-Time Ray Tracing Realization.
SSE Optimization


Today we will examine an original graphics engine for computer games that is based on ray tracing and differs completely from the engines used in popular modern 3D games. The engine doesn't (!) use 3D accelerators.

We will discuss its unique capabilities and principles of operation and consider how to optimize applications for SIMD extensions, first of all for SSE, an additional set of processor instructions first implemented in the Pentium III and Celeron II. Finally, we will test the performance of the program on the latest processors.

What for?

Why create one more engine when there are already so many? What more do you need when there are such excellent engines as Quake III and Unreal?

That is fine if you are just going to practice shooting for a couple of hours. But what if you want to change something there? To redesign, expand or break it?

All such features must be supported by the graphics engine. A gamer can do only those things that the engine supports. And what if he wants to blow everything up, or turn off one lamp and turn on another? The engine won't let him do it.

To draw large levels, engines precompute data over a long period of time and then use the recorded information. For example, for a certain room it is calculated in advance which other rooms can be seen from it and how. This considerably reduces the number of triangles drawn per frame. But you can't break a wall of the room: the algorithm won't draw the things behind it because they weren't calculated in advance.

The problem is not only visibility. The shadows of objects, walls, stairs etc. have to be calculated as well. If you break the stairs or close a window, nobody will update the lighting, because recalculating it takes a lot of time with modern algorithms that draw triangles with a z-buffer. It is impossible to move a light source. How can one live in such a house?

Let's approach the problem from another point of view. A level with precalculated information takes megabytes. And what if you need to download levels constantly while playing a multiplayer game on the Internet?


I decided to develop a graphics engine meeting the following requirements: no preliminary scene processing, and the positions and number of light sources and objects can change at any time. This means that a new scene is drawn in every frame. It will make it possible to create new games and new types of games.

Instead of triangles I use spheres as base primitives; i.e. objects will consist not of triangles but of spheres.

It is clear that we should use as few primitives as possible to draw an object, otherwise the performance will be unacceptably low.

Why do I choose spheres as primitives and ray tracing as the basic technique? The reason is that it takes a lot of time to calculate shadows if objects are drawn with triangles. There are different methods, but they require either precomputed data on an object, or rendering it several times (into a texture) and heavy use of a video accelerator. There is one more considerable disadvantage, which may go unnoticed in demo programs: if an object casts a shadow onto another, remote object, the shadow will be blocky. To prevent this, such an object would have to be drawn into a shadow texture of gigantic resolution.

Shadowcast is a demo program from NVidia. It is not new, and therefore, doesn't require GF3.

I think they made the shaded area small on purpose. But it is still clearly visible how rough the shadow becomes as the distance from the object grows.

Besides, there are problems connected with self-shadowing, when an object casts a shadow not on another remote object but on itself. These problems are fundamental; they are inherent in rendering by drawing triangles with a z-buffer.

It is simple to calculate shadows from spherical objects and to draw spheres by tracing rays, hence my choice.
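To show why spheres are convenient here, below is a minimal C++ sketch of the textbook ray-sphere intersection and of a point-light shadow test built on it. It is not taken from the engine (whose tracing part is written in Assembler); the names Vec3, hitSphere and inShadow are hypothetical, and the ray direction is assumed to have unit length.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Nearest intersection of the ray origin + t * dir (dir must have unit length)
// with a sphere. Returns false on a miss; otherwise stores the distance in t.
bool hitSphere(Vec3 origin, Vec3 dir, Vec3 center, double radius, double& t) {
    Vec3 oc = center - origin;
    double tClosest = dot(oc, dir);                     // projection of the center onto the ray
    double distSq = dot(oc, oc) - tClosest * tClosest;  // squared distance from the center to the ray line
    double halfChordSq = radius * radius - distSq;
    if (halfChordSq < 0.0) return false;                // the line passes outside the sphere
    double halfChord = std::sqrt(halfChordSq);
    t = tClosest - halfChord;                           // entry point
    if (t < 0.0) t = tClosest + halfChord;              // the origin is inside the sphere
    return t >= 0.0;                                    // false if the sphere is behind the origin
}

// A point is shadowed from a point light if the occluding sphere is hit
// before the ray from the point reaches the light.
bool inShadow(Vec3 point, Vec3 light, Vec3 occCenter, double occRadius) {
    Vec3 toLight = light - point;
    double dist = std::sqrt(dot(toLight, toLight));
    assert(dist > 0.0);                                 // the point must not coincide with the light
    Vec3 dir = {toLight.x / dist, toLight.y / dist, toLight.z / dist};
    double t = 0.0;
    return hitSphere(point, dir, occCenter, occRadius, t) && t < dist;
}
```

Both tests need only a few dot products and one square root, which is the kind of work the rest of the article tries to map onto SSE.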

Ray tracing speed

In the ray tracing algorithm the speed of operation depends on the resolution. A typical formula for the operating time is c1*n*ln(n)+c2*n*n+c3*ScreenWidth*ScreenHeight, where c1, c2 and c3 are constants and n is the number of objects in the scene. Let's first look at the second-order term, c2*n*n. It is the time spent calculating shadows. Various optimization methods only decrease c2; nothing removes the quadratic dependence. This term limits the number of objects in a scene: beyond a certain point, even an insignificant increase in the number of objects causes a considerable performance drop.

The first term has little effect on the total operating time. So, with the largest feasible number of objects in a scene, the last term, which depends on the screen resolution, becomes the determining one. This value reflects the speed of tracing the rays that correspond to screen pixels. It is a constant, but a gigantic one! For example, 800*600=480000 and 1024*768=786432. Modern accelerators need only 100 clocks to process a ray at a frame rate of 25.
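To get a feeling for these numbers, here is a small C++ sketch (not part of the engine) that evaluates the operating time formula given above and the number of processor clocks available per ray when one ray is traced per pixel. The function names are hypothetical, the constants c1, c2, c3 are left as parameters because the article does not give their values, and the one-ray-per-pixel budget is a simplifying assumption.

```cpp
#include <cassert>
#include <cmath>

// Operating time model from the text: c1*n*ln(n) + c2*n*n + c3*width*height.
// The constants are parameters; their real values depend on the implementation.
double frameCost(double c1, double c2, double c3, int n, int width, int height) {
    assert(n > 0 && width > 0 && height > 0);
    double objects = static_cast<double>(n);
    return c1 * objects * std::log(objects)
         + c2 * objects * objects
         + c3 * static_cast<double>(width) * height;
}

// Processor clocks available per ray if every pixel needs one ray
// and the whole processor is spent on tracing (an idealization).
double cyclesPerRay(double clockHz, int width, int height, double fps) {
    assert(width > 0 && height > 0 && fps > 0.0);
    return clockHz / (static_cast<double>(width) * height * fps);
}
```

For example, cyclesPerRay(800e6, 800, 600, 25) gives about 67 clocks per ray for an 800 MHz processor, so every instruction inside the tracing loop counts.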

It is interesting that c3 doesn't depend on the number of objects. The reason is that the analysis of the scene carried out before the tracing loop starts optimizes the calculations inside the loop. That is why, no matter how many objects there are, the task is to find where a ray crosses a fixed number of objects. Strictly speaking, c3 does depend on n, but this dependence can be neglected.

As a result, it is impossible to use ray tracing in real time on old PCs equipped with a Pentium II or lower. This constant would eat up the processor's power even at a resolution of 400x300. However, modern computers have enough power to use this method at high resolutions.

Now I'd like to turn to the VirtualRay graphics engine that I created. It can draw objects only with spheres, but it can change a scene radically in real time.

VirtualRay engine

The working resolutions of VirtualRay on a Pentium III with SSE support are 640x480 and 800x600 in 32-bit color (i.e. true color).

It is impossible to draw too many randomly placed spheres. Acceptable speed is achieved with several thousand spheres in a scene. The engine is good at drawing objects that can be easily represented with spheres. For example, outer space: planets, stars, spaceships and space stations. Monsters and extraterrestrials and the places they live in. Technical objects. Symbols, abstract objects and surrealistic worlds.

Texture mapping and bilinear filtering are supported, providing a high-quality image.

To make the speed as high as possible we have chosen the simplest lighting model: all light sources are considered point sources.

Spheres can be transparent; transparency factors may change and may depend on a color channel.

Light sources can be of different colors.

The engine can work at any resolution, from 320x240 and lower to 1600x1200 and higher, but the resolution affects the speed.

At www.virtualray.ru you can look at the current demo version of the engine and screenshots. Although it is not an artistic work with stylish textures, you can use it as a construction kit made of spheres.

To view the demo you need a Pentium MMX or higher and a video card supporting true color in a 32-bit format (the demo is not compatible with the Intel 740, which supports true color only in the 24-bit format, but almost all modern cards support the required format). Here is the link: www.virtualray.ru/demo/demo.zip. Note that the color format of a video card can be identified incorrectly, so that the sky, for example, may be yellow instead of blue. In that case you have to pick the right format by trial and error. Since low resolutions look too grainy on big monitors, it is better to reduce the screen area to boost the performance.

Below are several screenshots from the demo version. They do not convey the capabilities of the engine entirely, as its main merit is a dynamic play of light and shade, which a static image can't demonstrate.


The frame rate is what a user pays attention to first of all. Later we will examine the performance of the engine on different systems; for now we are going to study it on a system based on the Pentium III 800EB. As the engine doesn't use video accelerators, its performance depends entirely on the processor's power.

A typical frame rate is 20 at 800x600x32 (800x450x32). At first sight it seems unimpressive. But the FPS of the engine has two useful features. First, the average FPS and the minimal one are close: the average FPS can be 25 and the minimal one 22. And the latter is a more important parameter than the former. In many games the fps is 50 or so when you just walk, but when you start shooting it drops by half or even more.

Another useful feature is the stability of the frame rate. Let's assume we have 50 fps. It means that each frame takes 20 ms to draw. But in reality some frames take much more or much less time, for example, because data has to be cached from time to time. As a result, motion is not smooth. VirtualRay draws frames independently of each other, and the FPS it shows is real.
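The difference between the average and the minimal FPS is easy to see on a small example. The sketch below is an illustration only, not the engine's measurement code; the names FpsStats and fpsStats are hypothetical, and frame times are assumed to be given in milliseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct FpsStats {
    double averageFps;  // frames drawn divided by total time
    double minFps;      // rate corresponding to the slowest frame
};

// frameTimesMs holds the time spent on each frame, in milliseconds.
FpsStats fpsStats(const std::vector<double>& frameTimesMs) {
    assert(!frameTimesMs.empty());
    double total = 0.0;
    double worst = 0.0;
    for (double ms : frameTimesMs) {
        total += ms;
        worst = std::max(worst, ms);
    }
    return {1000.0 * frameTimesMs.size() / total, 1000.0 / worst};
}
```

Frames of 10 ms and 30 ms average out to the same 50 fps as two frames of 20 ms each, but the slowest frame corresponds to only about 33 fps, and that is what the eye notices.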

All in all, the engine is not for hard-core Quake gamers, but it is quite possible to play with it, especially if the game is not a shooter.

It should be noted that not all optimization algorithms have been tried yet; many are being developed or are still to be developed. I hope it will be possible to speed up the engine.

Spherical engine arrangement

The spherical engine is based on algorithms described in detail in publications on ray tracing. I only had to adapt them to spheres and to real-time operation.

Let's examine the scheme the engine is based on. It is divided into two parts: a preliminary analysis of the scene, and a double loop that traces the rays corresponding to screen pixels.

First, the engine gets a scene description: positions and characteristics of spheres, positions and parameters of light sources. These data go to the primary analysis block, which is the first part of the preliminary scene analysis. Here the spheres that do not get into the frame and the light sources whose effective areas are not visible are clipped off. The coordinates of objects are transformed according to the observer's position, and frequently used values (like the distance to each sphere) are calculated.

After that the screen is divided into many rectangular areas, and for each of them an array of potentially visible spheres is calculated (i.e. those spheres which can be crossed by rays corresponding to pixels of the given area).

If we reduce the size of the areas, each of them will contain only a few spheres; this cuts the cost of the gigantic ray tracing loop.
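The idea of per-area candidate lists can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the engine's data structure: the names TileGrid and ScreenRect are hypothetical, and each sphere is assumed to have already been projected to a bounding rectangle in pixel coordinates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bounding rectangle of a projected sphere in pixels, both corners inclusive.
struct ScreenRect { int x0, y0, x1, y1; };

struct TileGrid {
    int width, height, tileSize, tilesX, tilesY;
    std::vector<std::vector<int>> candidates;  // per area: indices of spheres that may be hit

    TileGrid(int w, int h, int tile)
        : width(w), height(h), tileSize(tile),
          tilesX((w + tile - 1) / tile), tilesY((h + tile - 1) / tile),
          candidates(static_cast<std::size_t>(tilesX) * tilesY) {
        assert(w > 0 && h > 0 && tile > 0);
    }

    // Register a sphere in every area overlapped by its bounding rectangle.
    void addSphere(int index, ScreenRect r) {
        if (r.x1 < 0 || r.y1 < 0 || r.x0 >= width || r.y0 >= height) return;  // off screen
        int tx0 = std::max(r.x0, 0) / tileSize;
        int ty0 = std::max(r.y0, 0) / tileSize;
        int tx1 = std::min(r.x1, width - 1) / tileSize;
        int ty1 = std::min(r.y1, height - 1) / tileSize;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                candidates[static_cast<std::size_t>(ty) * tilesX + tx].push_back(index);
    }

    // Candidate spheres for the pixel (px, py); the ray for this pixel
    // has to be tested only against these.
    const std::vector<int>& at(int px, int py) const {
        return candidates[static_cast<std::size_t>(py / tileSize) * tilesX + px / tileSize];
    }
};
```

With 32x32 areas at 800x600, a sphere covering a 10-pixel square lands in at most four lists, so most rays are tested against a handful of spheres regardless of the total number in the scene.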

Next, the degree of illumination is calculated: the engine determines how many light sources illuminate each sphere. If there are many of them, the least contributing ones are excluded (only a few should remain).
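One possible way to keep only the strongest light sources per sphere is sketched below. The article does not say how the contribution of a light is estimated, so the measure used here (intensity divided by the squared distance to the sphere's center) is purely an assumption, as are the names Light, Sphere, contribution and strongestLights.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Light  { double x, y, z, intensity; };
struct Sphere { double x, y, z, radius; };

// Assumed contribution estimate: intensity over squared distance to the
// sphere's center, with the distance clamped to the sphere's radius.
double contribution(const Light& l, const Sphere& s) {
    double dx = l.x - s.x, dy = l.y - s.y, dz = l.z - s.z;
    double distSq = std::max(dx * dx + dy * dy + dz * dz, s.radius * s.radius);
    return l.intensity / distSq;
}

// Indices of at most maxLights lights, strongest first.
std::vector<int> strongestLights(const std::vector<Light>& lights,
                                 const Sphere& s, std::size_t maxLights) {
    std::vector<int> order(lights.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::size_t keep = std::min(maxLights, order.size());
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(keep),
                      order.end(),
                      [&](int a, int b) {
                          return contribution(lights[a], s) > contribution(lights[b], s);
                      });
    order.resize(keep);
    return order;
}
```

Limiting the list to a fixed small number keeps the per-pixel lighting cost bounded no matter how many lamps the scene contains.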

Then, for each sphere, it is necessary to find all the other spheres that cast shadows on it. This is not difficult thanks to the simple shape of a sphere. As a result, during the tracing loop the points of a sphere are tested for shadow only against the spheres found at this stage.
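A simple conservative test for "sphere A may cast a shadow on sphere B" can be sketched in C++ as follows. It is an assumption about how such a search could look, not the engine's algorithm: every segment from the point light to a point of the receiver lies within the receiver's radius of the segment from the light to the receiver's center, so an occluder can matter only if its center is closer to that segment than the sum of the two radii. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };
struct Sphere { Vec3 center; double radius; };

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Squared distance from point p to the segment ab.
double distSqToSegment(Vec3 p, Vec3 a, Vec3 b) {
    Vec3 ab = b - a, ap = p - a;
    double lenSq = dot(ab, ab);
    double t = lenSq > 0.0 ? dot(ap, ab) / lenSq : 0.0;
    t = std::min(1.0, std::max(0.0, t));                // clamp to the segment
    Vec3 closest = {a.x + t * ab.x, a.y + t * ab.y, a.z + t * ab.z};
    Vec3 d = p - closest;
    return dot(d, d);
}

// Conservative test: true whenever the occluder can shadow any point of the
// receiver from the given point light (it may also report some false positives).
bool mayShadow(const Sphere& occluder, const Sphere& receiver, Vec3 light) {
    double reach = occluder.radius + receiver.radius;
    return distSqToSegment(occluder.center, light, receiver.center) <= reach * reach;
}

// Indices of all spheres that may cast a shadow on spheres[receiver].
std::vector<int> shadowCasters(const std::vector<Sphere>& spheres,
                               std::size_t receiver, Vec3 light) {
    assert(receiver < spheres.size());
    std::vector<int> result;
    for (std::size_t i = 0; i < spheres.size(); ++i)
        if (i != receiver && mayShadow(spheres[i], spheres[receiver], light))
            result.push_back(static_cast<int>(i));
    return result;
}
```

This direct search over all pairs is where a quadratic term in the number of spheres comes from; a false positive only costs an extra intersection test later, while a missed caster would lose a shadow.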

Now we can start the tracing loop.

Comments. The first part, which implements complicated and varied algorithms, is written entirely in C++. The second part, which has much less source code, is written in Assembler and exists in three versions for different processors, differing in the instructions used. SSE is preferable, but the engine can also work without it, with MMX support only. On an old Pentium 166 MMX the program will certainly run only as a slideshow, but you can still watch it in a small window.

Instructions          Processors
SSE + Enhanced MMX    Pentium III, Pentium 4, Celeron II, Athlon XP
FPU + Enhanced MMX    Athlon, Duron, (K6-III, K6-2+)
FPU + MMX             Pentium MMX, Pentium II, Celeron, K6-2
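Choosing among the three versions at start-up can be sketched as follows. The feature flags are hypothetical inputs (a real program would fill them in from the CPUID instruction), and the names TracerVersion, CpuFeatures and chooseTracer are not from the engine.

```cpp
#include <cassert>

enum class TracerVersion { SseEmmx, FpuEmmx, FpuMmx, Unsupported };

// Hypothetical feature flags; a real program would obtain them via CPUID.
struct CpuFeatures {
    bool mmx;          // basic MMX
    bool enhancedMmx;  // the extended MMX instructions ("EMMX" in the table)
    bool sse;          // SSE
};

// Pick the fastest version of the tracing block that the processor supports,
// following the rows of the table from top to bottom.
TracerVersion chooseTracer(const CpuFeatures& cpu) {
    if (cpu.sse && cpu.enhancedMmx) return TracerVersion::SseEmmx;
    if (cpu.enhancedMmx)            return TracerVersion::FpuEmmx;
    if (cpu.mmx)                    return TracerVersion::FpuMmx;
    return TracerVersion::Unsupported;
}
```

Because the choice is made once, the tracing loop itself contains no feature checks.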

For floating-point arithmetic we use SSE/FPU; for integer and fixed-point arithmetic we use MMX. The general-purpose registers (eax, ebx, ecx, edx, esi, edi) are used to store and calculate addresses and flags.

The program is compiled with the Intel C++ Compiler 4.5 integrated into Microsoft Visual C++ 6.0.

Spherical engine: areas of application

Where can we use an engine with such pronounced advantages and disadvantages? It has two main areas of application: computer games and everything else.

The obvious use is visualization of molecules and atoms, although shadows look inappropriate there. The description of a sphere is smaller than that of a triangle, and a sphere is a "richer" figure. This makes me think that the engine can be successfully used over the Internet: emblems, logos, animated scenes etc. are easy to draw and transfer. Besides, the engine doesn't depend on the video card, so you don't have to spend time on drivers, test it in millions of configurations, or find out whether the OS supports graphics libraries, whether DirectX is installed and which version of OpenGL is available. Images are identical on all computers, up to differences between monitors.

Now let's proceed to computer games. The spherical engine is a real find for online games. In multiplayer games it is possible to build and destroy a level in 3D as you like, in real time!

A gaming universe can be divided into enclaves represented by independent scenes, which can be connected by some kind of hyper-transitions. When you move between them, a new scene loads instantly.

The VirtualRay engine can also be used for any type of space simulator. Planetary systems, asteroid fields, spacecraft etc. are easy to draw with spheres. You can create some Death Star and then blow it up.

Arcade games are one more area where this engine can be widely adopted.

It is also possible to use this engine in exotic games, for example, where the task is to create living beings.

CPU MultiMedia 3D Test!

Another, and the most obvious, application of the spherical engine is performance measurement. Performance measurement cuts both ways: you can measure how fast a program runs on a given system, or you can use the program to test the performance of a system. The demo version integrates the CPU MultiMedia 3D Test!, a graphics performance test for the processor and memory subsystem. I will tell you about the peculiarities and advantages of this test, and about the efficiency of the program on Pentium III and Athlon MP processors.

The CPU MultiMedia 3D Test! tests a processor from the standpoint of geometrical calculations.

A graphics application, be it a 3D graphics studio or the graphics engine of a 3D shooter, implements a solution of a certain geometrical problem. The process of solving it is divided into simple operations: dot products of vectors, calculation of a vector's norm, addition and multiplication of vectors and matrices.

These operations are frequently used in any graphics application, and most of the calculations are connected with them.

This test is meant for measuring performance of a processor in such tasks.

The test measures the speed of drawing a scene with the VirtualRay engine. As I mentioned before, this engine is based on backward ray tracing and doesn't use video accelerators. In the course of its operation a lot of complicated geometrical calculations are performed.

As this is a software engine, the video subsystem, the speed of pumping textures through the AGP etc. do not affect the test results. So, we get the pure performance of the "processor+memory" system.

Peculiarities of different versions of the ray tracing block (the second engine's component)

The ray tracing loop written in Assembler exists in three versions that differ in the instructions used: SSE+EMMX, FPU+EMMX and FPU+MMX.

Originally the engine was aimed at the SSE technology (Streaming SIMD Extensions), and the SSE+EMMX version uses all the capabilities of SSE: data are organized in a specific way, there are few branches, and data prefetching and storage of frequently used operands in the SSE registers are used extensively. There are no FPU instructions in this part, which is why the use of MMX does not cause problems of incompatibility with the FPU.

In the FPU version SSE is replaced with the FPU. The FPU registers often have to be spilled to memory, as they hold 4 times less data than the SSE ones, and access to them is complicated by their stack organization. Extra operations for saving/restoring MMX registers and for switching from the MMX mode to the FPU mode and back are also required.

In the FPU+MMX version, the FPU+EMMX code is supplemented with emulation of the EMMX instructions missing from MMX, which puts an additional load on the MMX unit and causes additional spilling of MMX registers.

As the first version is specially optimized for SSE and contains no FPU instructions, it is a good test of the quality of an SSE implementation.

MMX usage

MMX is used to process color, i.e. for texturing, bilinear filtering etc. So, if a scene has no textures, the performance doesn't depend on the MMX part. Memory doesn't influence it much in that case either, because it is the fetching of texture elements that may cause delays.

So, we can get a performance index for MMX by displaying a scene with the same geometry with and without textures and then comparing the results.

Measurement method

We measure the average operating time of different engine blocks over a certain period during which a dynamic scene is drawn. Scene motion is synchronized with real time and doesn't depend on fps. This means that the nature of the computational load changes as time goes by.

The resulting performance index is the reciprocal of the average time the corresponding block spends on drawing one frame.

  • FPUC++ Index - the first part of the engine written in C++
  • SSE+EMMX Index - the corresponding version of the second part of the engine.
  • FPU+EMMX Index - the corresponding version of the second part of the engine.
  • FPU+MMX Index - the corresponding version of the second part of the engine.
  • Overall Index - combined operating time of both blocks.

This result can be interpreted as the fps that would be obtained if only the given part operated. Because most of the calculations are performed in the ray tracing loop (the second component), its index is close to the real fps. But it is possible to create a scene where most of the load falls on the first block, the scene analysis.
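In code, the definition of the indices comes down to the following. This is a sketch of the arithmetic only, with hypothetical function names and times assumed to be measured in seconds per frame; it is not the test's actual source.

```cpp
#include <cassert>

// Index of one block: reciprocal of its average time per frame (in seconds),
// i.e. the fps that would be reached if only this block were running.
double performanceIndex(double averageSecondsPerFrame) {
    assert(averageSecondsPerFrame > 0.0);
    return 1.0 / averageSecondsPerFrame;
}

// Overall index: reciprocal of the combined time of both blocks.
double overallIndex(double analysisSeconds, double tracingSeconds) {
    assert(analysisSeconds > 0.0 && tracingSeconds > 0.0);
    return 1.0 / (analysisSeconds + tracingSeconds);
}
```

For example, if the scene analysis takes 10 ms per frame and the tracing loop takes 40 ms, the two indices are 100 and 25, and the overall index is 20. The overall index is therefore always lower than either of the two, and close to the tracing index when tracing dominates.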

The real fps depends on additional factors such as synchronization of page flipping with the monitor refresh rate, the time taken by the scene-changing procedure etc.


The FPUC++ Index reflects the system performance on algorithms that are heavy with calculations and branches and are written in a high-level language without SIMD optimization.

The SSE+EMMX Index reflects the performance of SSE (+EMMX) in 3D tasks, in code specially optimized for SSE, with a minimum of branches and with data organized optimally for SSE.

The FPU+EMMX and FPU+MMX Indices reflect the performance of the FPU (+MMX) in geometrical and graphical applications which frequently use vectors and matrices and which draw images.

The Overall Index is an integrated system performance indicator.

Tests results for Pentium III, Athlon MP, Pentium 4, Athlon XP, etc.

I haven't finished extensive testing of various processors yet. But the second part of the article will come with a lot of figures and graphs.

Comment on accelerators

You may say that modern video accelerators are better anyway, that the gf3 will soon fall to $50, and that no special engine is or will be needed. But I must say that it is unfair to compare software algorithms with hardware-accelerated ones. Maybe, if ray tracing were accelerated in hardware, its performance would be much better.

Let's compare it with software engines based on triangles drawn with a z-buffer: Quake I and Unreal. By the way, Unreal was first developed as a software engine, and it used an advanced software texture anti-aliasing method.

Unlike the Spherical engine, Quake works quite well on old computers, and on a new Pentium III it shows a higher FPS. But it lacks dynamic lighting, shadows, anti-aliased textures and true color. It is impossible to extend its capabilities to those of VirtualRay, and vice versa; i.e. the two are not comparable.

Unreal, on the one hand, is a more advanced engine than Quake I, but its speed is much lower. At high resolutions (640x480x16, 800x600x16) the average FPS is about 20, and the minimal one can be 15. There is nothing new in it. And the image quality has a long way to go.

Obviously, at that time the industry had reached an algorithmic dead end, and video accelerators let it continue along the extensive path. For a long time almost all improvements came down to increasing the number of polygons in a frame.

It is much more interesting to consider joint operation of the Spherical engine and 3D accelerators. An accelerator can be used to draw a part of an image and special effects that enrich a scene. The engine is then no longer so independent, but this can be justified. However, the lack of an interface for reading and writing an accelerator's z-buffer prevents images from being overlapped correctly: objects drawn by different methods won't occlude one another the right way. Nevertheless, a version of the engine able to draw various effects with an accelerator (glow around light sources, particles etc.) is now being developed.

Optimization for SSE

An acceptable performance in high resolutions is achieved mainly due to optimization for SSE instructions. On processors without SSE the program looks pale.

SSE stands for Streaming SIMD Extensions. SIMD stands for Single Instruction Multiple Data.

The engine traces a ray hundreds of thousands of times, and this loop takes most of the time. This relatively small procedure should be optimized first of all, because even the smallest performance gain is multiplied many times over.

The engine uses two optimization methods: special data organization for the SIMD technology and storage of data in the SSE registers.

Speaking about SSE

To clarify the issue I will tell you a bit about SSE (detailed information is available on Intel's site).

So, 8 new registers were added to the standard architecture. They are directly accessible, unlike the FPU registers, which are combined into a stack and thus complicate programming for the FPU in Assembler. In C terms each register is a float[4], where float (single) is a single precision number that takes 4 bytes. The total size of the SSE registers is thus 128 bytes, or 32 floating-point numbers, which is 4 times more than in the FPU. The FPU can work with extended precision numbers, but they are not used in the engine.

The following operations are available on these registers:

  • loading from and storing to memory, and shuffling the elements of one or two registers;
  • arithmetic operations (addition, subtraction, multiplication, division) on a pair of registers treated as vectors;
  • the square root of each element of a register, and approximate values of the reciprocal and of the reciprocal square root;
  • component-wise comparison of two registers and component-wise maximum and minimum;
  • conversion to an integer (the value can be written directly into an MMX register);
  • bitwise logical operations on registers treated as a sequence of bits, i.e. as bool[128];
  • prefetch instructions that load into the cache data which will be required later; they are meant to keep the processor from idling while it waits for data from memory.

Almost all of these operations also have a form that works only on the first elements of the registers. This is provided for convenience.

In general, instructions for arithmetical operations with four-dimensional vectors and auxiliary instructions were added.

The key to optimizing a program for SSE is organizing the code so that performing one arithmetic operation on 4 independent pairs of operands comes naturally.

What obstacles can appear here? Well, 4 operations of the same type, for example 4 additions or 4 multiplications, do not always follow one another. Besides, operations of different types may be interleaved. Subsequent operations often depend on the results of the previous ones. For example, a comparison or a branch may lie between two additions.

The most evident area of application of SSE is optimization of linear algebra. Multiplication and addition of matrices from 4x1 to 4x4 are the favorite operations of SSE. There is no branching, and the operations are all of the same type.

To support developers Intel released a special library called the Small Matrix Library. It implements matrix and vector classes. The operators corresponding to arithmetic operations are implemented as inline functions written with SSE instructions. There is also an FPU version, and the target platform can be selected at compile time. This is quite convenient, as it is possible to get two versions: for processors with SSE and without. You don't have to think about SSE in the course of development; you just use the classes corresponding to matrices and vectors. The Intel C++ Compiler will put in the SSE-optimized code for all operators.

The Small Matrix Library (SML) distribution contains an interesting program which demonstrates the advantages of the SML. It measures the execution time of an operation optimized for SSE and for the FPU.

Here is its log.

Operation                   SSE/FPU   Execution time*
3x3 * 3x1                   FPU       31
3x3 * 3x1                   SSE       29
Transpose(3x3) * 3x1        SSE       23
4x4 * 4x1                   FPU       53
4x4 * 4x1                   SSE       31
Transpose(4x4) * 4x1        SSE       27
3x3 * 3x3                   FPU       79
3x3 * 3x3                   SSE       59
4x4 * 4x4                   FPU       172
4x4 * 4x4                   SSE       90
6x6 * 6x1                   FPU       113
6x6 * 6x1                   SSE       60
6x6 * 6x6                   FPU       652
6x6 * 6x6                   SSE       307
4x4 * 4x4 (general case)    SSE       529
Inverse 4x4                 FPU       392
Inverse 4x4                 SSE       209
Inverse 6x6                 FPU       1118
Inverse 6x6                 SSE       600

*Time is measured in processor cycles. The measurements were carried out on the PIII 800EB.

During the measurements all data were in the cache.


4x4 * 4x4 (general case) - 4x4 matrices are multiplied as mxn matrices, using loops. In all other cases the matrix size is known in advance; thus, we get rid of loops and obtain good optimization.

Let's analyze the results. The first thing that catches the eye is the small difference in speed between the SSE and FPU versions of 3x3 * 3x1 and the unexpected performance gain in the case of matrix transposition. It is caused by the need to shuffle the contents of the registers in one of the versions so that the matrix's lines lie "correctly" for multiplication by a vector.
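Why the storage order matters can be illustrated with plain C++ that mimics 4-wide operations. This is my reading of the effect, not the SML source: when the stored lines are the rows of the matrix, each output element needs a sum across the four lanes of a register (which on SSE requires shuffles), whereas when the stored lines are the columns (i.e. the transpose is stored), the product is a sum of whole registers scaled by one vector element. The names Vec4, Mat4, mulRows and mulColumns are hypothetical.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // four stored lines of four floats each

// Stored lines are the rows of M: every output element is a full dot product,
// so each 4-wide multiply must be followed by a sum across the register lanes.
Vec4 mulRows(const Mat4& rows, const Vec4& v) {
    Vec4 out{};
    for (int i = 0; i < 4; ++i) {
        Vec4 prod{};
        for (int k = 0; k < 4; ++k) prod[k] = rows[i][k] * v[k];  // one 4-wide multiply
        out[i] = (prod[0] + prod[1]) + (prod[2] + prod[3]);       // horizontal sum
    }
    return out;
}

// Stored lines are the columns of M (the transpose is stored):
// out = col0*v[0] + col1*v[1] + col2*v[2] + col3*v[3],
// i.e. only lane-by-lane multiplies and adds, with no sums across lanes.
Vec4 mulColumns(const Mat4& cols, const Vec4& v) {
    Vec4 out{};
    for (int k = 0; k < 4; ++k)
        for (int i = 0; i < 4; ++i) out[i] += cols[k][i] * v[k];  // one 4-wide multiply-add per k
    return out;
}
```

Both functions compute the same product M * v; only the layout of M in memory differs, and the second layout is the one that maps onto four-element registers without reshuffling.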

The difference in the relative performance gain between multiplications of three-dimensional and four-dimensional matrices is quite natural: the SSE registers have four elements.

Nevertheless, the SML is hardly used in the part of the engine written in C++. The reason is that the preliminary scene analysis block has few matrix multiplications and inversions and a lot of branching. It mainly contains operations such as multiplication of a three-dimensional matrix by a vector. Vectors are also mostly three-dimensional. That is why I decided to focus on optimizing the ray tracing procedure for SSE.

Now I'd like to speak about what SSE lacks. It lacks the operation that prevails in the local geometry of the space we live in: the dot product (scalar product) of three-dimensional vectors.

To implement this operation efficiently with SSE, 8 vectors have to be "turned by 90 degrees" (transposed) and placed into the SSE registers so that each register holds one coordinate of four vectors. Then the values of 4 dot products are obtained by multiplying three times and adding twice.

4th element   3rd element   2nd element   1st element   SSE register
VectorB4.x    VectorB3.x    VectorB2.x    VectorB1.x    XMM5
VectorB4.y    VectorB3.y    VectorB2.y    VectorB1.y    XMM4
VectorB4.z    VectorB3.z    VectorB2.z    VectorB1.z    XMM3
VectorA4.x    VectorA3.x    VectorA2.x    VectorA1.x    XMM2
VectorA4.y    VectorA3.y    VectorA2.y    VectorA1.y    XMM1
VectorA4.z    VectorA3.z    VectorA2.z    VectorA1.z    XMM0
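The register layout in the table can be reproduced in C++ with compiler intrinsics instead of Assembler. The sketch below is an illustration, not the engine's code: the struct Vec3x4, the function dot4 and the macro HAVE_SSE are hypothetical names, and a plain scalar loop is used as a fallback when the compiler does not target SSE.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Four 3D vectors stored "turned by 90 degrees": one array per coordinate,
// so that each array fills exactly one SSE register.
struct Vec3x4 { float x[4], y[4], z[4]; };

// out[i] = dot product of the i-th vector of a and the i-th vector of b.
// Three multiplications and two additions produce all four results at once.
void dot4(const Vec3x4& a, const Vec3x4& b, float out[4]) {
#if HAVE_SSE
    __m128 xx = _mm_mul_ps(_mm_loadu_ps(a.x), _mm_loadu_ps(b.x));
    __m128 yy = _mm_mul_ps(_mm_loadu_ps(a.y), _mm_loadu_ps(b.y));
    __m128 zz = _mm_mul_ps(_mm_loadu_ps(a.z), _mm_loadu_ps(b.z));
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(xx, yy), zz));
#else
    // Scalar fallback with the same results, for targets without SSE.
    for (int i = 0; i < 4; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
#endif
}
```

Note that no shuffling is needed inside dot4: all the rearrangement has been moved into the way the data are stored, which is why such data are best prepared in this form in advance.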

That is why only the ray processing procedure is optimized with SSE; but it is written entirely in Assembler, without the FPU, and dot products are calculated in the most efficient way. Data, for example on sphere shaders, are stored in the form best suited for loading into SSE registers, thus providing the best performance.

Comparison of SSE, SSE2, 3DNow!, 3DNow!Pro

Now let me compare different SIMD extensions of processor instructions. 3DNow! is a SIMD extension from AMD which first appeared in K6-2 processors. In new AMD processors 3DNow! got new instructions, but no radical changes were made. There are two main differences from SSE. First, the 3DNow! registers have two elements instead of four; they are of the float[2] type. The set of instructions is similar to that of SSE, though the instructions work with two pairs of operands. The second difference is that 3DNow! and the FPU can't work together, so a special (quite slow) mode-switching instruction has to be executed, and all the data in the 3DNow! registers are lost, as these registers physically coincide with the FPU ones. MMX does not provide separate storage either, as it shares the same 8 registers with 3DNow!.

The register size (2 for 3DNow! and 4 for SSE) affects programming for these extensions rather than efficiency. It is simpler to make full use of two-element registers than of four-element ones: less effort goes into data organization. 3DNow! can add the two elements of the same register. SSE has no easy way to add the four elements of one register, which hampers programming. But once the data are organized, the larger SSE registers require fewer instructions. Besides, the larger volume allows keeping frequently used operands in the SSE registers, which is a problem for 3DNow!

AMD released a special library for operations with vectors, where vector and matrix operations are implemented as inline functions written in 3DNow! AMD's library is better optimized for three-dimensional vectors and matrices than the SML from Intel, and the dot product is also better implemented. It seems that the dot product in the SML is implemented with the FPU.

SSE2 is an extension of SSE for Pentium 4 processors. SSE2 uses the same registers as SSE, but treats them as double[2]. The set of instructions is similar to that of SSE. Above all, SSE2 has instructions for integer operations, in which the registers are treated as arrays of integers of various widths. It is like a doubled MMX (the SSE2 registers are 128 bits wide, twice the 64 bits of the MMX registers).

In this respect SSE2 is similar to 3DNow!, but it works with the double type instead of float.

3DNow! Professional is an extension of 3DNow! in the latest AMD processors, the Athlon XP and Athlon MP. 3DNow! Professional includes SSE, which is why SSE is now supported by all the latest processors. AMD also extended 3DNow! by adding instructions which simplify work with the 3DNow! registers when they are treated as complex numbers. Indeed, it is natural to regard a two-element 3DNow! register as holding the real and imaginary parts of a complex number. Complex numbers are used in Fourier transforms, and the latter are used in audio and video encoding.

Now let's analyze different extensions.

The name 3DNow! suggests that the developer is eager to lure, if not mislead, the user. The gain from 3DNow! doesn't exceed a factor of 1.5. Besides, even a 3D program can't be completely optimized for 3DNow!, which is why the overall performance gain is even smaller.

It should be noted that many programs which show a great performance increase after optimization for SIMD (this concerns both 3DNow! and SSE) simply were not well optimized before. Once I decided to rewrite a small, frequently used function in Assembler. My Assembler code was 1.5 times smaller than the code generated by the Intel C++ Compiler 4.5, but it ran at about the same speed. So I rewrote it using SSE. But only the first two elements of the SSE registers were used, because dot products had to be calculated for two rather than four pairs of vectors. And this version ran at the same speed too. The point is that the C++ code was written well from the very beginning, and the compiler didn't spoil it.

If you remember, when 3DNow! first appeared, AMD processors were weaker than Intel's in floating-point operations because of a poorly pipelined coprocessor. SIMD speed depends less on the quality of the pipeline implementation. That is why 3DNow! was meant to improve the situation with floating-point calculations.

Now to SSE. Here the situation is similar. As you know, SSE was introduced under the title Internet Streaming SIMD Extensions. Intel's marketers sometimes demonstrate at exhibitions how SSE helps when working on the Internet.

It is known that the Pentium 4 pipeline was made less than optimal in order to allow higher clock frequencies. The SIMD extension is again meant to make up for this. Without SIMD the Pentium 4 falls considerably behind many processors running at lower frequencies in some tests, sometimes even behind the Pentium III. Presumably, as the Pentium 4 frequency grows, the performance gap between SIMD and non-SIMD code will widen.


The analysis above reveals some prospects for SIMD. Since SSE is now supported by the latest AMD processors as well, nobody will bother with optimization for 3DNow! anymore. 3DNow! remains relevant only for old Athlons, which will soon be replaced by the new Athlon XP. Optimization for SSE will become mandatory, since otherwise the Pentium 4 will perform poorly.

But if the market is flooded with something like an Athlon XP 3000+ that works excellently without any SSE, the Pentium 4 will die. Such an Athlon XP 3000+ is impossible, however, since an optimal pipeline is incompatible with a high frequency.

SSE support in the latest AMD processors means that AMD is going to follow Intel in raising frequencies, and future AMD processors will also perform poorly on programs without SIMD optimization.

Sadly, processor manufacturers often shift their problems onto the shoulders of developers. The need to optimize a program for a particular processor is a bad trend. Besides, it is quite difficult to make a compiler that would solve this problem automatically, without the programmer. The best achievements in this area so far are the automatic insertion of an SSE-optimized routine for multiplying 4x4 matrices and a vector library from AMD.

They had better create something really great, for example, a better way to speed up geometry. For instance, they could enlarge the coprocessor stack so that it can hold several matrices, and add a few proper 3D instructions like these:

  • fLoadVector mem
  • fLoadMatrix mem
  • fStoreVector mem
  • fStoreMatrix mem
  • fMulVectorScalar r1,r2
  • fMulVectorVector r1,r2
  • fMulVectorMatrix r1,r2
  • fMulMatrixScalar r1,r2
  • fMulMatrixMatrix r1,r2
  • fAddVectorVector r1,r2
  • fAddMatrixMatrix r1,r2
  • fDotProduct r1,r2
  • fNormalizeVector r1,r2
  • fInverseMatrix r1,r2

Then it wouldn't be shameful to name the extension Super Enhanced 3D Extension.

But a golden mean between optimizing hardware for software and software for hardware is impossible to find.
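For comparison, here is a rough sketch of what one of the wished-for operations, fMulVectorMatrix, takes when written with the existing SSE intrinsics. The article does not define the semantics of the proposed instructions, so the interpretation here (a 4-element row vector multiplied by a row-major 4x4 matrix) and the name `mul_vector_matrix` are assumptions made for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiplies a 4-element row vector v by a row-major 4x4 matrix m:
   out[j] = v[0]*m[0][j] + v[1]*m[1][j] + v[2]*m[2][j] + v[3]*m[3][j].
   Each element of v is broadcast to all four lanes and multiplied by
   one matrix row, and the four products are summed. */
void mul_vector_matrix(const float v[4], const float m[4][4], float out[4])
{
    __m128 r = _mm_mul_ps(_mm_set1_ps(v[0]), _mm_loadu_ps(m[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[1]), _mm_loadu_ps(m[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[2]), _mm_loadu_ps(m[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[3]), _mm_loadu_ps(m[3])));
    _mm_storeu_ps(out, r);
}
```

That is four broadcasts, four multiplications and three additions for what the list above imagines as a single instruction.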

Optimization of ray tracing for SSE

Let's consider some aspects of optimizing ray tracing for SIMD instruction set extensions. As a rule, the set of objects that a ray may intersect can be found during preliminary processing of the scene, before the tracing loop starts. This can be done, for example, by grouping the rays that correspond to certain areas of the screen and finding all objects that any ray of such a group may intersect. After that, the information associated with a particular ray can be arranged so that it loads easily into SSE registers and intersections with 4 objects can be tested in parallel.

The same approach can be used to optimize the tracing of reflected rays and of rays directed to light sources. For each object, an array of the objects that a ray reflected from it may intersect is built, and intersections with 4 of these objects at once are calculated with SSE.

This approach also works if SIMD registers hold 8, 16, 32, 64 or 128 elements.
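The four-at-a-time intersection test described above can be sketched as follows. This is not the code of the VirtualRay engine: the choice of spheres as the object type, the `Sphere4` layout and the function name `intersect_ray_sphere4` are all assumptions made for illustration. The candidate objects are stored component by component, so each field of four objects loads directly into one SSE register.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical layout: four spheres stored component-wise
   (structure of arrays), one SSE register per field. */
typedef struct {
    float cx[4], cy[4], cz[4];  /* centres */
    float r2[4];                /* squared radii */
} Sphere4;

/* Tests one ray (origin o, unit direction d) against four spheres at once.
   Stores the distance to the nearer intersection in t[i] and returns a
   4-bit mask: bit i is set if the ray hits sphere i in front of the origin.
   t[i] is meaningful only where bit i is set. A ray starting inside a
   sphere is reported as a miss in this simplified sketch. */
int intersect_ray_sphere4(const float o[3], const float d[3],
                          const Sphere4 *s, float t[4])
{
    /* L = centre - origin, for all four spheres */
    __m128 lx = _mm_sub_ps(_mm_loadu_ps(s->cx), _mm_set1_ps(o[0]));
    __m128 ly = _mm_sub_ps(_mm_loadu_ps(s->cy), _mm_set1_ps(o[1]));
    __m128 lz = _mm_sub_ps(_mm_loadu_ps(s->cz), _mm_set1_ps(o[2]));

    /* tca = L . D: distance along the ray to the point nearest the centre */
    __m128 tca = _mm_add_ps(_mm_add_ps(_mm_mul_ps(lx, _mm_set1_ps(d[0])),
                                       _mm_mul_ps(ly, _mm_set1_ps(d[1]))),
                            _mm_mul_ps(lz, _mm_set1_ps(d[2])));
    /* ll = L . L */
    __m128 ll = _mm_add_ps(_mm_add_ps(_mm_mul_ps(lx, lx), _mm_mul_ps(ly, ly)),
                           _mm_mul_ps(lz, lz));
    /* thc2 = r^2 - (ll - tca^2): squared half-chord, negative on a miss */
    __m128 thc2 = _mm_sub_ps(_mm_loadu_ps(s->r2),
                             _mm_sub_ps(ll, _mm_mul_ps(tca, tca)));

    __m128 zero = _mm_setzero_ps();
    __m128 hit = _mm_cmpge_ps(thc2, zero);
    /* Clamp before the square root so that missed lanes do not yield NaN. */
    __m128 tnear = _mm_sub_ps(tca, _mm_sqrt_ps(_mm_max_ps(thc2, zero)));
    hit = _mm_and_ps(hit, _mm_cmpgt_ps(tnear, zero));

    _mm_storeu_ps(t, tnear);
    return _mm_movemask_ps(hit);
}
```

Note that the exact `_mm_sqrt_ps` is used rather than an approximate instruction. The caller loops over the candidate list in groups of four and keeps the smallest `t[i]` among the lanes whose mask bit is set.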

The SSE instructions include approximate calculation of a reciprocal and of a reciprocal square root. But they are not suitable for geometric calculations because their error is too large.
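The size of this error is easy to measure. Intel documents the maximum relative error of the approximate instructions as 1.5 × 2^-12, that is, only about 11 to 12 correct bits out of the 24 in a float mantissa. The sketch below compares the approximate reciprocal square root with the full-precision sequence of a square root and a division. It also shows one Newton-Raphson step, a commonly used refinement that is not part of the article's engine and costs several extra multiplications. All function names are made up for this illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Fast estimate of 1/sqrt(x) with the approximate instruction rsqrtss. */
float rsqrt_approx(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* Full-precision 1/sqrt(x) with the exact instructions sqrtss and divss. */
float rsqrt_exact(float x)
{
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f),
                                    _mm_sqrt_ss(_mm_set_ss(x))));
}

/* The estimate refined by one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). */
float rsqrt_refined(float x)
{
    float y = rsqrt_approx(x);
    return y * (1.5f - 0.5f * x * y * y);
}

/* Relative difference between an estimate and the full-precision value. */
double rsqrt_rel_error(float x, float estimate)
{
    double exact = rsqrt_exact(x);
    double diff = estimate - exact;
    return (diff < 0.0 ? -diff : diff) / exact;
}
```

Calling `rsqrt_rel_error(x, rsqrt_approx(x))` for a few values of x gives a feel for how far a normalized vector could be from unit length if the approximate instruction were used directly.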


The prospects of these technologies are rather vague. Time will show whether the VirtualRay engine will be in demand. Maybe some of its ideas will be taken up, developed further and applied in other areas.


Copyright © Byrds Research & Publishing, Ltd., 1997–2011. All rights reserved.