Some time ago I criticized Intel's new paradigm, Hyper-Threading. Intel noticed my complaints and offered to help clarify the situation around the Hyper-Threading technology. Well, let's do it. Besides, since then I have obtained some more information about this innovation.
First, let's see what we know about the Hyper-Threading technology:
Now that we have listed some peculiarities, let's discuss some consequences. First of all, let's look deeper into what the company offers. Is it really so free? How is the simultaneous execution of threads accomplished? And what does Intel mean by "thread"?
I think (maybe I'm wrong) that in this case it is a program fragment that a multitasking operating system assigns for execution to one of the processors of a multiprocessor hardware system. Well, that is one of the definitions. I just want to find out what new Intel has added.
After that, one of the threads is executed. At the same time, the instruction decoder (which is entirely asynchronous and not included in the 20 stages of NetBurst) fetches instructions and decodes them (with all their interdependencies) into microinstructions. Here I must explain what I mean by "asynchronous": x86 instructions are split into microinstructions in the decoding unit, and each x86 instruction can be decoded into one, two, or more microinstructions. At the processing stage, interdependencies are determined and the necessary data are delivered via the system bus. The speed of this unit therefore depends on the speed at which data arrive from memory, and in the worst case it is defined by it. It would be logical to decouple it from the pipeline where microinstructions are executed, and this was done by placing the decoding unit in front of the trace cache. Thus, if there are microinstructions ready to execute in the trace cache, the processor works more efficiently. This unit certainly runs at the processor speed, unlike the Rapid Engine. I think that this decoder is something like a pipeline of 10-15 stages, so there are evidently 30-35 stages from fetching data from the cache to getting the final result (including the NetBurst pipeline; see Microdesign Resources, August 2000, Microprocessor Report, Volume 14, Archive 8, page 12).
The resulting microinstructions, together with their interdependencies, accumulate in the trace cache, which holds approximately 12,000 microoperations. The trace cache must therefore be 96-120 KB! The source of this estimate is the structure of the P6 microinstruction; the length of the instructions is hardly likely to have changed much (a microinstruction together with its service fields is about 100 bits). Next to that, the 8 KB data cache looks asymmetrical :-)... and pale. Of course, as the size increases, access delays also grow (for example, at 32 KB the delay is 4 clocks instead of two). But is access time really so important that a twice greater delay makes such a size increase unprofitable? Or do they simply not want to enlarge the die? But then, with the 0.13-micron process they should increase this cache first of all (not the L2 one). Just remember the transition from the Pentium to the Pentium MMX, when, thanks to the twice larger L1 cache, all programs got a 10-15% performance boost. So what can we expect from a cache four times larger (especially considering that the processors have reached 2 GHz and the multiplier has grown from 2.5 to 20)? Well, the situation is vague. However, there can be obstacles: for example, requirements for the layout geometry of the units, or lack of space near the pipeline (it is clear that the data cache must be as close to the ALU as possible).
Let's examine the process further. Suppose the current instructions occupy the ALU while the FPU, SSE, SSE2, and other units stand idle. This is where Hyper-Threading starts working. Noticing that there are microinstructions together with data for a new thread, the register alias unit assigns a portion of the physical registers to the new thread. By the way, there are two variants: a file of physical registers common to all threads, or a separate one for each thread. Considering that Intel doesn't mention the register alias unit among the units that were changed, the first variant is the one chosen. Is that good or bad? On the one hand, it saves transistors; from the programmer's standpoint, it's unclear. With 128 registers, a register shortage shouldn't occur as long as the number of threads is sensible. After that, the microinstructions are delivered to the scheduler, which sends them to an execution unit (if it's not busy) or enqueues them if the given unit is currently unavailable. Thus, the execution units are used more effectively, and at this moment the OS sees the processor as two logical processors. But is everything really so cloudless? Part of the hardware (the caches, the Rapid Engine, the branch prediction unit) is shared by both logical processors. By the way, branch prediction accuracy will probably suffer from this, especially if the simultaneously executed threads are unrelated. And some units (for example, the MIS [Microcode Instruction Sequencer], which resembles a ROM containing a set of pre-programmed sequences of common operations, and the RAT [Register Alias Table]) must distinguish between the threads being processed by the different logical CPUs. At the same time, if two threads are "greedy" for cache (i.e., enlarging the cache would help a lot), Hyper-Threading can even decrease the speed, because today there is a competitive mechanism in the struggle for cache: the active thread "drives out" the inactive one. Though the caching mechanism may change.
It is also clear that the speed (at least for the moment) will drop in those applications in which it drops under "normal" SMP. For example, SPEC ViewPerf demonstrates higher results on uniprocessor systems, so the outcome with Hyper-Threading will be poorer than without it. The results of Hyper-Threading testing can be found here.
By the way, the Pentium 4 has 16-bit ALUs. Note that this doesn't concern the bit capacity of the processor. In one tick, the ALU (running at twice the core frequency) calculates only 16 bits; the other 16 bits are calculated on the next tick. That is why it was necessary to make the ALU twice as fast: 32 bits are then computed in one full clock. However, it seems that two more clocks are needed to split and rejoin the halves, though this is not clear. Why did they choose such an ALU model? It seems that Intel kills several birds with one stone:
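The staggered scheme described above can be sketched in a few lines. This is my own illustration of the arithmetic, not Intel's circuit: the low 16-bit halves are added on the first half-clock, and the carry out of bit 15 feeds the high-half addition on the next half-clock.

```cpp
#include <cstdint>

// Sketch of a 32-bit add performed as two staggered 16-bit additions,
// the way the double-pumped Pentium 4 ALU is described in the text.
uint32_t staggered_add(uint32_t a, uint32_t b) {
    uint32_t lo    = (a & 0xFFFFu) + (b & 0xFFFFu);  // first tick: low halves
    uint32_t carry = lo >> 16;                       // carry out of bit 15
    uint32_t hi    = (a >> 16) + (b >> 16) + carry;  // second tick: high halves
    return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);  // rejoin the halves
}
```

The payoff is that a dependent low-half result is available after half a clock, which is one of the "birds" listed below.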
But are there processors that support this technology today? Yes, there are: the Xeon (Prestonia) and the XeonMP. It is interesting that the XeonMP, unlike the Xeon, supports up to 4 processors (IBM Summit-like chipsets support up to 16 processors, using a technology similar to that of the ProFusion chipset) and has an L3 cache of 512 KB or 1 MB integrated into the core. By the way, why did they integrate an L3 cache in particular? Why not increase the L1 cache? Why not the L2? Perhaps the reason is that the Advanced Transfer Cache needs relatively small delays, and in larger caches the delays are longer. It's also interesting that the bus linking the L3 cache and the core is 128 bytes wide!!! That is, 1024 bits plus 128 bits of ECC. (The 2 MB PIII Xeon had a 256-bit + 32-bit ECC bus.) Well, they have done everything to deliver data into the core faster (and unload the memory bus).
So, are there no bottlenecks at all? One processor, and the OS sees 2. Two processors, and the OS sees 4! Excellent! Stop! Which operating systems can work with 4 processors? Microsoft OSs that can handle more than 2 processors cost an arm and a leg: 2000 Professional, XP Professional, and NT 4.0 support only 2 CPUs. This technology is meant for the workstation (and server) market and is supported only in the respective processors; today we can use processors with this technology only in a dual-processor board, even when it is equipped with only one processor. That is, if you want to use this technology, you must buy the Server or Advanced Server versions of the OS. Well, this "free" processor turns out to be quite expensive... At present Intel is actively negotiating with Microsoft, trying to tie the licensing policy to the physical processor. At least, according to the document, new Microsoft operating systems will be licensed per physical processor.
Of course, one can turn to operating systems from other manufacturers, but that is not the best way out... So I can understand why Intel hesitated for a long time before deploying this technology.
The conclusion is that Hyper-Threading can bring both a performance gain and a performance drop. We have already discussed the performance decrease; now let's see what's necessary to get the performance boost:
The BIOS is not a problem, and the OS was discussed earlier. As for applications, it's necessary to include the pause instruction in those threads that wait for data from memory, so as not to slow down the processor's operation; if data are lacking, such a thread can otherwise block some execution units. And to insert this instruction, applications must be recompiled, which is not so simple. That is why the worst drawback of the Hyper-Threading technology is the need to recompile applications. But there is one advantage: such recompilation will also improve performance on normal dual-processor systems. By the way, these experiments show that in most cases programs optimized for SMP get a performance boost of 15% to 18% from Hyper-Threading.
And now let's imagine what can be improved as this idea develops further. It is obvious that the development of this technology will be directly connected with the development of the Pentium 4 core. So, what do we have? The 0.09-micron, or 90 nm, process... I think there are several directions in which this family will develop:
Will the idea of Hyper-Threading be developed further? On the one hand, it is clear that two physical processors are better than three logical ones, and it won't be easy to position... Hyper-Threading can be useful as a step toward integrating two (or more) processors on a die. Meanwhile, this technology will allow most users to work on "dual-processor" desktop computers, and this is very good.
I must admit that I changed my attitude towards the Hyper-Threading technology several times while working on this article. And now I have the following to say:
There are only two ways to improve performance: increasing the frequency and increasing the performance per clock. While the Pentium 4 architecture follows the former way, Hyper-Threading takes the latter, and this is its advantage. Besides, Hyper-Threading has some interesting consequences: a change in the programming paradigm, the popularization of multiprocessing, and improved processor performance. But there are obstacles on the way: lack of normal support in operating systems, and the need to recompile applications (and sometimes to change their algorithms) so that they can take full advantage of Hyper-Threading. Also, Hyper-Threading can allow an operating system and applications to run in parallel, rather than one after another as now. Of course, there must be enough execution units for that.