Hyper Threading In-Depth

By Viktor Kartunov

Some time ago I criticized Intel's new paradigm, Hyper Threading. Intel noticed my complaints and offered to help clarify the situation around the technology. Well, let's do it. Besides, I have gathered some more information about this innovation since then.

First let's see what we know about the Hyper Threading technology:

This technology is meant to make the processor work more efficiently. According to Intel, only about 30% of the execution units in the processor are busy most of the time (by the way, the figure is questionable, as the details of how it was estimated are unknown). So the idea of loading the other 70% looks logical (the Pentium 4, which is going to incorporate this technology, is hardly burdened with excess performance per megahertz anyway). That is why I must admit the idea is sensible.
The main point of the Hyper Threading technology is that while one thread of a program is being executed, idle execution units can work on another thread of the same program (or a thread of another program). Or, for example, the processor can execute one sequence of instructions while another sequence waits for data from memory.
When executing different threads, the processor must "know" which instructions belong to which thread, so there has to be some mechanism that lets the processor keep track of this.
It is also clear that each thread needs its own set of registers, and the x86 architecture has very few general-purpose registers (8 in all). However, this limitation is sidestepped by register renaming: there are many more physical registers than logical ones. The Pentium III has 40. The Pentium 4 obviously has more, about a hundred I think. I failed to find out the true figure. According to unconfirmed information there are 128; other sources mention other figures. It is all quite vague... By the way, Intel's own position is unclear.
It is also known that when several threads need the same resources, or when one of the threads is waiting for data, the "pause" instruction must be used to avoid a performance drop. Certainly, this requires recompiling the programs.
Sometimes executing several threads can worsen performance. For example, the L2 cache does not grow with the number of threads, so when active threads compete for it, the struggle can result in data being constantly evicted from the L2 cache and reloaded.
Intel states that the gain can reach 30% when programs are optimized for this technology. (Or rather, Intel states that on today's server programs and applications the measured gain is up to 30%.) Well, that is a decent reason to optimize.
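
The main point of the list above (idle execution units picking up work from a second thread) can be illustrated with a toy model. It is only a sketch under strong assumptions: a single instance of each unit, at most one micro-op per thread per cycle, no dependencies and no memory stalls. The function name and the instruction streams are invented for illustration and say nothing about how the real Pentium 4 scheduler works.

```python
def cycles_needed(threads):
    """Toy model: each thread is a list of unit names ("ALU", "FPU", ...).

    Every cycle, each thread may issue its next micro-op if the unit it
    needs has not already been taken in that cycle. Returns total cycles.
    """
    positions = [0] * len(threads)
    cycles = 0
    while any(pos < len(t) for pos, t in zip(positions, threads)):
        busy = set()
        # Alternate which thread gets first pick so neither one starves.
        order = range(len(threads))
        if cycles % 2:
            order = reversed(order)
        for i in order:
            if positions[i] < len(threads[i]):
                unit = threads[i][positions[i]]
                if unit not in busy:
                    busy.add(unit)
                    positions[i] += 1
        cycles += 1
    return cycles


integer_thread = ["ALU"] * 6   # hypothetical integer-only thread
float_thread = ["FPU"] * 6     # hypothetical floating-point-only thread

one_after_another = cycles_needed([integer_thread]) + cycles_needed([float_thread])
together = cycles_needed([integer_thread, float_thread])
clash = cycles_needed([integer_thread, integer_thread])
```

In this toy model the complementary pair finishes in half the cycles, while two threads that need the same unit gain nothing from running side by side.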

Now that we have listed some peculiarities, let's discuss some consequences. First of all, let's look deeper into what the company offers. Is it really free? How is the simultaneous execution of threads accomplished? And what does Intel mean by "thread"?

I think (maybe I'm wrong) that in this case it is a program fragment which a multitasking operating system assigns for execution to one of the processors of a multiprocessor hardware system. Well, that is one of the definitions. I just want to find out what new things Intel has added.

After that, one of the threads is executed. At the same time the instruction decoder (it is entirely asynchronous and is not part of the 20 stages of the NetBurst pipeline) fetches instructions and decodes them (with all their interdependencies) into microinstructions. Here I must explain what I mean by "asynchronous": x86 instructions are split into microinstructions in the decoding unit, and each x86 instruction can be decoded into one, two or more microinstructions. At this stage interdependencies are determined and the necessary data are delivered via the system bus. The speed of this unit therefore depends on how fast data arrive from memory, and in the worst case is defined by it. It would be logical to isolate it from the pipeline where microinstructions are executed, and this was done by placing the decoding unit in front of the trace cache. Thus, if the trace cache already holds microinstructions to be executed, the processor works more efficiently. This unit certainly works at the processor's clock speed, unlike the Rapid Execution Engine. I think that this decoder is something like a pipeline of 10 - 15 stages. So there are apparently 30 - 35 stages from fetching data from the cache to getting the final result (including the NetBurst pipeline; see Microdesign Resources, Microprocessor Report, August 2000, Volume 14, Archive 8, page 12).

The resulting microinstructions, together with their interdependencies, accumulate in the trace cache, which holds approximately 12,000 microoperations. That means the trace cache must be 96-120 KBytes! This estimate is based on the structure of the P6 microinstruction, whose length is unlikely to have changed much (a microinstruction together with its service fields is about 100 bits). Next to that, the 8 KB data cache looks asymmetrical :-)... and pale. Of course, as the size increases, access delays grow too (for example, at 32 KBytes the delay is 4 clocks instead of two). But is the access time really so important that doubling the delay makes such an increase in size unprofitable? Or is it just that they don't want to increase the die size? But then, with the move to the 0.13-micron process they should enlarge this cache first of all (not the L2 one). Just remember the transition from the Pentium to the Pentium MMX, when a twice larger L1 cache gave all programs a 10 - 15% performance boost. So what could we expect from a 4 times larger cache (especially considering that processors have reached 2 GHz and the multiplier has grown from 2.5 to 20)? Well, the situation is vague. However, there may be some obstacles: for example, requirements on the geometry of the layout and a lack of space near the pipeline (it is clear that the data cache must be as close to the ALU as possible).
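
The size estimate above is plain arithmetic, so it is easy to check how sensitive it is to the assumed width of a microinstruction. The sketch below reads "approximately 12,000" as 12 x 1024 entries, and the widths it tries are assumptions, not Intel figures. The 96-120 KByte range corresponds to 64-80 bits per entry; taking the "about 100 bits" figure literally would give about 150 KBytes.

```python
UOPS = 12 * 1024  # "12K" micro-ops, read here as 12,288 entries (assumption)


def trace_cache_kbytes(bits_per_uop, uops=UOPS):
    """Storage in KBytes if every micro-op occupied bits_per_uop bits."""
    return uops * bits_per_uop / 8 / 1024


for bits in (64, 80, 100):  # assumed widths, not published figures
    print(bits, "bits per micro-op ->", trace_cache_kbytes(bits), "KBytes")
```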

Let's examine the process further. Suppose the current instructions keep the ALU busy, while the FPU, SSE, SSE2 and other units stand idle. This is where Hyper Threading comes into play. Noticing that there are microinstructions, together with data, for a new thread, the register alias unit assigns a portion of the physical registers to the new thread. By the way, there are two possible variants: a pool of physical registers common to all threads, or a separate one for each thread. Since Intel doesn't mention the register alias unit among the units that had to be changed, the first variant must be the one chosen. Is that good or bad? On the one hand, it saves transistors. From the programmer's standpoint, it's unclear. If there are 128 registers, a shortage of registers shouldn't occur with a sensible number of threads. After that the microinstructions are delivered to the scheduler, which sends them to an execution unit (if it's not busy) or queues them if the required unit is not currently available. Thus, execution units are used more effectively. At this point the OS sees the processor as two logical processors. But is everything really so cloudless? Part of the hardware (caches, Rapid Execution Engine, branch prediction unit) is shared by both logical processors. By the way, the precision of branch prediction will probably suffer from this, especially if the simultaneously executed threads are unrelated. And some units (for example, the MIS [Microcode Instruction Sequencer], which is similar to a ROM containing a set of preprogrammed sequences of common operations, and the RAT [Register Alias Table]) must distinguish the threads processed by the different logical CPUs. At the same time, if two threads are "greedy" for the cache (i.e. a larger cache helps them a lot), Hyper Threading can even decrease the speed. That is because today the cache is contended for competitively: the active thread "drives out" the inactive one. The caching mechanism may change, though.
It is also clear that the speed (at least for the time being) will fall in those applications in which it drops on "normal" SMP systems. For example, SPEC ViewPerf demonstrates higher results on uniprocessor systems, so its results with Hyper Threading will be poorer than without it. The results of Hyper Threading tests can be found here.
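
The effect of two cache-"greedy" threads driving each other out can be reproduced with a toy simulation. This is a minimal sketch, assuming a fully associative cache with LRU replacement and two invented threads that loop over their own data; the real Pentium 4 caches are organized differently, and all names and sizes here are made up for illustration.

```python
from collections import OrderedDict
from itertools import chain


def hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / len(accesses)


def looping_thread(name, working_set, passes):
    """A thread that sweeps over its own working set again and again."""
    return [(name, i) for _ in range(passes) for i in range(working_set)]


CAPACITY = 8  # cache lines; tiny on purpose
a = looping_thread("A", 6, 50)
b = looping_thread("B", 6, 50)

alone = hit_rate(a, CAPACITY)
# Interleave the two threads access by access, as if they ran simultaneously.
shared = hit_rate(list(chain.from_iterable(zip(a, b))), CAPACITY)
```

Each thread alone fits into the cache and hits almost every time. Interleaved, the combined working set of 12 lines no longer fits into 8, and with this access pattern every access misses.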

By the way, the Pentium 4 has 16-bit ALUs. Note that this has nothing to do with the bit capacity of the processor. In one tick the ALU (which runs at twice the core frequency) calculates only 16 bits; the other 16 bits are calculated at the next tick. That is why it was necessary to make the ALU twice as fast, so that 32 bits are calculated in one full clock. It seems that two more clocks are needed to split and join the halves, but this is not clear. Why did they choose such an ALU model? It seems that Intel kills several birds with one stone:

It is easier to speed up a 16-bit pipeline than a 32-bit one - just because of crosstalk noise and Ko
Probably Intel considers integer calculations to be quite frequent, which is why it speeds up the ALU specifically (and not, for example, the FPU). Apparently, the results of integer operations are calculated using either tables or carry accumulation schemes. Just note that one 32-bit table means 2^32 addresses, i.e. 4 GBytes, while two 16-bit tables are 2 x 64 KBytes, or 128 KBytes! Besides, carry accumulation goes faster in two 16-bit portions than in one 32-bit one.
It's possible to save on transistors and... heat. You know well how all those architectural features heat up. Apparently it was a very grave problem - just take the Thermal Monitor technology! It would seem that such a technology is not really needed, and ordinary shutdown protection would be enough to make the system reliable. But if such a complicated technology was developed, then operation with such on-the-fly frequency changes must have been considered a regular operating mode. Maybe even the basic mode? Rumor had it that the Pentium 4 was going to have a much greater number of execution units. In that case the heat problem must have been the most important one: the generated heat was estimated at 150 W. It was then quite logical to take measures so that the processor works at its full capacity only in systems where proper cooling is provided.
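
The two-tick addition and the table sizes mentioned in the list above can be sketched in a few lines. This is only an arithmetic illustration of the idea (a 32-bit addition done as two 16-bit additions linked by a carry), not a description of the actual Pentium 4 circuitry. The function name is invented, and the byte figures in the text correspond to the assumption of one byte per table entry.

```python
MASK16 = 0xFFFF


def add32_in_two_halves(a, b):
    """Add two unsigned 32-bit values as two 16-bit additions linked by a carry."""
    low = (a & MASK16) + (b & MASK16)      # first fast tick: low halves
    carry = low >> 16                      # carry out of bit 15
    high = (a >> 16) + (b >> 16) + carry   # second fast tick: high halves
    return ((high & MASK16) << 16) | (low & MASK16)


# Lookup-table sizes, counted in entries (one entry per possible input value).
one_32bit_table = 2 ** 32        # 4 G entries
two_16bit_tables = 2 * 2 ** 16   # 128 K entries
```

The result wraps around at 32 bits, just like an ordinary 32-bit register addition.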

But are there processors which support this technology today? Yes, there are: the Xeon (Prestonia) and the XeonMP. It is interesting that the XeonMP, unlike the Xeon, supports up to 4 processors (IBM Summit-like chipsets support up to 16 processors; the technology is similar to that of the ProFusion chipset) and has an L3 cache of 512 KBytes or 1 MByte integrated into the core. By the way, why did they integrate an L3 cache in particular? Why not increase the size of the L1 cache? Why not the L2 one? Maybe the problem is that the Advanced Transfer Cache needs relatively small delays, and in larger caches the delays are longer. It's also interesting that the bus which links the L3 cache and the core is 128 bytes wide!!! That is 1024 bits of data + 128 bits of ECC. The PIII Xeon with 2 MBytes had a 256-bit + 32-bit ECC bus. Well, they have done everything to deliver data into the core faster (and unload the memory bus).
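
The bus figures above are easy to cross-check. The helper below is just a hypothetical convenience for the arithmetic; it uses only the widths quoted in the text.

```python
def bus_summary(data_bits, ecc_bits):
    """Return (data width in bytes, ECC bits per data byte)."""
    data_bytes = data_bits // 8
    return data_bytes, ecc_bits / data_bytes


xeon_mp = bus_summary(1024, 128)  # L3-to-core bus as quoted above
p3_xeon = bus_summary(256, 32)    # PIII Xeon bus as quoted above
```

Both buses carry one ECC bit per data byte; the XeonMP bus is simply four times wider.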

So, are there no bottlenecks at all? One processor, and the OS sees 2. Two processors, and the OS sees 4! Excellent! Stop! What operating systems can work with 4 processors? Microsoft OSs which can work with more than 2 processors cost an arm and a leg. 2000 Professional, XP Professional and NT 4.0 support only 2 CPUs. This technology is meant for the workstation (and server) market and is supported only in the respective processors. Today we can use this technology only on a dual-processor board fitted with a single processor. Otherwise, if you want to use this technology you must buy the Server or Advanced Server versions of the OSs. Well, this "free" processor turns out to be quite expensive... At present Intel is actively negotiating with Microsoft, trying to tie the licensing policy to the physical processor. At least, according to the document, new Microsoft operating systems will be licensed per physical processor.

Of course, one can turn to operating systems from other vendors, but that is not the best way out... That is why I can understand why Intel hesitated for so long to use this technology.
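
The licensing arithmetic above fits in one line. The sketch assumes that the OS counts every logical processor against its CPU limit, which is exactly the situation described; the function name is invented, and the limit of 4 in the last line is only an illustrative parameter.

```python
def processors_used(physical, logical_per_physical, os_cpu_limit):
    """Logical processors the OS can actually use under a per-CPU limit.

    Assumes the OS counts each logical processor against its limit.
    """
    return min(physical * logical_per_physical, os_cpu_limit)


single_on_2cpu_os = processors_used(1, 2, 2)  # one chip: both logical CPUs used
dual_on_2cpu_os = processors_used(2, 2, 2)    # two chips: only 2 of the 4 used
dual_on_4cpu_os = processors_used(2, 2, 4)    # hypothetical OS with a limit of 4
```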

The conclusion is that the Hyper Threading technology can bring both a performance gain and a performance drop. We have already discussed the performance decrease, and now let's see what's necessary to get the performance boost:

BIOS of the mainboard
Operating system (!!!)
Application

The BIOS is not a problem, the OS was discussed earlier, and as for applications, the pause instruction has to be included in those threads which wait for data from memory so as not to slow the processor down: if data are missing, such a thread can block some execution units. And to insert this instruction, applications have to be recompiled, which is not so simple. That is why the worst drawback of the Hyper Threading technology is the need to recompile applications. But there is one advantage in it: such recompilation will also improve performance on normal dual-processor systems. By the way, these experiments show that in most cases programs optimized for SMP get a performance boost of 15% to 18% with Hyper Threading.

And now let's imagine what can be improved as this idea develops further. It is obvious that the development of this technology will be directly connected with the development of the Pentium 4 core. So, what do we have here? The 0.09-micron, or 90nm, technology... I think there are several directions in which this family will develop:

The processor's frequency will be higher thanks to the finer fab process.
I hope the data cache will get larger. At least 32 KBytes.
ALUs will be 32 bits wide. This should improve performance.
A faster system bus (this will be done very soon).
Dual-channel DDR memory (again, very soon).
Maybe a technology similar to x86-64 will be developed, if AMD manages to promote it. And I hope such a technology will be compatible with x86-64. In the interview Jerry Sanders said that last year AMD and Intel agreed on cross-licensing of everything except the Pentium4 system bus. Does it mean that Intel will integrate x86-64 in the next Pentium4 core (Prescott) and AMD will integrate Hyper Threading in its processors?...
The number of execution units will be increased. However, this requires a complete redesign of the core.

Will the idea of Hyper Threading be developed further? On the one hand, it is clear that two physical processors are better than three logical ones. And it won't be easy to position it... Hyper Threading can be useful when two (or more) processors are integrated on one die. But this technology will allow most users to work on "dual-processor" desktop computers. This is very good.

Conclusion

I must admit that I changed my attitude towards the Hyper Threading technology several times while I worked on this article. And now I have the following to say:

There are only two ways to improve performance: increasing the frequency and increasing performance per clock. While the Pentium4's architecture follows the former way, Hyper Threading takes the latter, and this is its advantage. Besides, Hyper Threading has some interesting consequences: a change of the programming paradigm, popularization of multiprocessing, and improved processor performance. But there are some obstacles on the way: the lack of proper support in operating systems, and the need to recompile applications (and sometimes to change the algorithm) so that they can take full advantage of Hyper Threading. Besides, Hyper Threading can allow an operating system and applications to run in parallel, not one after another as now. Of course, there must be enough execution units for that.
