It has been a long time since we last analyzed "thermal" technologies in modern processors. The last article on this issue, Testing Thermal Throttling in Pentium 4 CPUs with Northwood and Prescott Cores, is already over 8 months old. Meanwhile, power saving processor technologies have recently been gaining popularity, no less than the performance race. Power consumption control functions of processors clearly fall into two categories: protection from overheating in emergency cases and reduction of power dissipation in standard mode when idle. It's no secret that processors spend most of their time idle, both in modern personal computers and in servers. Intel confirmed this fact by presenting a new technology, Enhanced SpeedStep, for Xeon server processors. This article reviews the most important performance and power consumption control technologies implemented in Intel Pentium 4 series processors and briefly compares them with other similar technologies.
In this article you will often come across the following notions: Actual CPU Clock, CPU Load, and Throttled CPU Clock. Though some of them may seem quite evident, we should still dwell on the ideological principles and measuring methods for every above mentioned parameter.
So, Actual CPU Clock is the number of cycles that the clock generator sends to the processor core via its internal bus per unit time (to get the standard frequency expression in Hertz, use 1 second as the unit time). In fact, Actual CPU Clock is just the frequency of the clock signal, which is reflected in the Time Stamp Counter (TSC). You can read it at any time using the RDTSC instruction from the IA-32 (x86) set. So, the difference between TSC readings for a given time period divided by its duration gives us Actual CPU Clock. This standard method is used for real-time CPU clock measurement in most utilities that provide CPU information (CPU-Z, WCPUID, as well as the RightMark utilities: RMMA, RMSpy, and RMClock).
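The calculation itself is a single division. Here is a minimal Python sketch of it, assuming the two TSC readings have already been obtained by some other means (Python cannot execute RDTSC directly). The function name and the sample readings are ours and purely illustrative; they are not taken from any of the utilities mentioned.

```python
def actual_cpu_clock_hz(tsc_start: int, tsc_end: int, interval_s: float) -> float:
    """Estimate the actual CPU clock from two Time Stamp Counter readings."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Number of clock cycles counted, divided by the elapsed time in seconds
    return (tsc_end - tsc_start) / interval_s


# Hypothetical readings taken 0.5 s apart
clock_hz = actual_cpu_clock_hz(1_000_000_000, 2_700_000_000, 0.5)
print(f"{clock_hz / 1e6:.0f} MHz")
```

The accuracy of such an estimate depends entirely on how precisely the interval between the two readings is known.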
Naturally, CPU Load is the number of duty CPU cycles divided by the total number of cycles per unit time. The definition is trivial, but measuring CPU load is far from a trivial task. First of all, note that CPU Load is not a differential but an integral quantity. To put it simply, it makes no sense to measure it over an infinitesimal period of time (in fact, it can be done, even with a precision of a single cycle, but the result will have no profound meaning, because it will be either zero for an "idle" cycle or one (100%) for an "effective" cycle). That's why it makes sense to measure CPU Load over a relatively large period, for example 100 ms (or even 1 second). The longer the period, the higher the measurement accuracy, but the slower the response. That is, if the integration period is too long, the curves will be too smooth and you won't be able to make out abrupt changes in CPU Load. Of course, the necessary measurement accuracy is dictated by the given task; in practice, 1 second is quite enough for precise measurements.
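The trade-off between accuracy and response can be illustrated with a toy model. The sketch below is only an illustration under our own assumptions: the per-cycle busy/idle trace is invented, and the function names are hypothetical.

```python
from typing import List, Sequence


def cpu_load(busy_flags: Sequence[bool]) -> float:
    """CPU Load = duty cycles / total cycles over the whole sample."""
    return sum(busy_flags) / len(busy_flags)


def windowed_load(busy_flags: Sequence[bool], window: int) -> List[float]:
    """CPU Load averaged over consecutive, non-overlapping windows."""
    return [
        cpu_load(busy_flags[i:i + window])
        for i in range(0, len(busy_flags) - window + 1, window)
    ]


# Invented trace: 4 idle cycles followed by 4 duty cycles
trace = [False] * 4 + [True] * 4

print(windowed_load(trace, 1))  # per-cycle values are only ever 0.0 or 1.0
print(windowed_load(trace, 4))  # a short window still shows the abrupt change
print(windowed_load(trace, 8))  # a long window smooths the change away
```

With a window of one cycle the result carries no more information than the trace itself, while the longest window hides the step completely.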
We should say a few words about the CPU Load measurement itself as well. Of course, you can rely on the readings taken by the operating system, as most system application developers do. A significant disadvantage of this method is its relatively low accuracy: the resolution of the system counter is no better than 15 ms, so it cannot get more than 6-7 counts in a 100 ms period, and the measurement error will be at least 15-16%. Besides, the quantity obtained is rather conditional, first of all relative to the real device, the CPU (because only the developer knows what the operating system measures and how). In today's research we'll use a new utility, RMClock, which is the first to use a fundamentally different method based on reading the counters of the CPU itself. This lends the measurements quite a tangible, or physical (as scientists put it), meaning.
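The arithmetic behind these figures is easy to reproduce. A small Python sketch, assuming that the error of a single measurement is about one counter tick per measurement period (the function name is ours):

```python
from typing import Tuple


def tick_quantization(period_ms: float, tick_ms: float) -> Tuple[float, float]:
    """Return (counts per period, relative error of one tick)."""
    counts = period_ms / tick_ms
    relative_error = tick_ms / period_ms
    return counts, relative_error


for period_ms in (100, 1000):
    counts, error = tick_quantization(period_ms, 15)
    print(f"{period_ms} ms: {counts:.1f} counts, error about {error:.1%}")
```

Under the same assumption, a 1 second period reduces the error by a factor of ten.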
Going back to the beginning, duty cycles are those spent by the physical CPU on executing the code of user applications (user mode) and of the operating system (kernel and user mode). We emphasize "physical" because Pentium 4 processors are now equipped with Hyper-Threading technology, which presents one physical processor as two logical ones, and we are interested in readings from the real device as a whole. All other cycles are considered "ineffective" or "idle": during them the operating system puts the CPU into sleep mode (it executes HLT instructions). The total of duty and idle cycles must obviously equal the total number of cycles, so CPU Load can take values from 0% to 100% inclusive.
Now we are to dwell on the last and the most profound quantity, Throttled CPU Clock. It's an exclusive feature of our new utility RMClock: as far as we know, no other SysInfo software can measure this quantity so far. In order to understand it, we should run a few steps forward and briefly mention what CPU throttling actually is. Its principle turns out to be quite simple: it consists in modulating the CPU clock. What does that mean? Just that a part of the total number of CPU cycles is forced "idle". We have put this notion in third place on purpose, because it is closely related to the first two notions reviewed.
CPU Clock Modulation. Sample Modulation with 25% Duty Cycle. (Source: IA-32 Intel(R) Architecture Software Developer’s Manual, Volume 3: System Programming Guide)
Clock modulation has one important effect – when this procedure starts, CPU Load may become... less than 100% even at full load. A part of cycles in throttling mode are forced "idle", and thus a processor cannot spend 100% of its cycles on executing user code resulting in its lower real load. Certainly, the drop of real CPU Load with the increase of Clock Modulation will not be detected by any operating system or utility, based on OS methods. That's why we mentioned above the conditional character of CPU Load readings in OS.
Considering the above, RMClock determines Throttled CPU Clock in the following way: it loads the CPU 100% for a relatively short period of time (to minimize the influence on the total CPU Load) and counts the total number of executed cycles and the number of duty cycles. The ratio of duty cycles to total cycles is the throttling order (or level). Multiplied by the actual CPU clock, it gives the required quantity, Throttled CPU Clock. The latter can obviously range from 0 (only theoretically) to the actual CPU clock, which it cannot exceed.
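In code, this calculation might look as follows. This is only a sketch of the arithmetic described above, not RMClock's actual implementation; the function name and the sample cycle counts are hypothetical.

```python
def throttled_cpu_clock_mhz(actual_clock_mhz: float,
                            duty_cycles: int,
                            total_cycles: int) -> float:
    """Throttled CPU Clock = actual clock * (duty cycles / total cycles)."""
    if total_cycles <= 0 or not 0 <= duty_cycles <= total_cycles:
        raise ValueError("need 0 <= duty_cycles <= total_cycles and total_cycles > 0")
    return actual_clock_mhz * duty_cycles / total_cycles


# Illustrative numbers: a 3400 MHz processor with 46 duty cycles out of every 100
print(throttled_cpu_clock_mhz(3400.0, 46, 100))

# Without throttling, the result equals the actual CPU clock
print(throttled_cpu_clock_mhz(3400.0, 100, 100))
```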
Performance and power consumption control technologies
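This detailed explanation concludes the methodological chapter. Let's proceed to more interesting issues: the analysis of performance and power consumption control technologies in Intel Pentium 4 and Intel Xeon series processors. Intel processors currently use the following technologies of this kind: the emergency overheating detector, automatic thermal monitoring mechanism #1 (TM1), automatic thermal monitoring mechanism #2 (TM2), Enhanced SpeedStep technology, and on-demand clock modulation (ODCM).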
Let's examine each of these functions.
Emergency overheating detector
It's the simplest, completely automated mechanism (it cannot be controlled, and its presence in a processor cannot even be detected). It first appeared in P6 series processors and is also implemented in Pentium 4, Xeon and Pentium M processors. Its idea is quite simple: on reaching a certain thermal threshold (specified at the production stage) the processor just suspends execution until RESET# (as claimed in the manufacturer's documentation). In practice (at about 100°C) Pentium 4 and Xeon based systems are actually powered off (until the PWRGOOD signal, that is, until the next power on).
Automatic thermal monitoring mechanism #1
Thermal Monitor 1 (TM1) is the very mechanism that became widely known as "Throttling", "Thermal Throttling", or "Thermal Trip". The name is probably known to everybody familiar with Pentium 4 processors on the Prescott core, and it often bears quite a negative connotation.
Automatic thermal monitoring mechanism #1 is implemented in Pentium 4, Xeon and Pentium M processors. It combines a second thermal sensor (the first one is used in the catastrophic shutdown mechanism), which is also calibrated at the CPU production stage for the recommended thermal operating range, with the CPU clock modulation (duty cycle) mechanism already mentioned above. Unlike the catastrophic shutdown detector, TM1 can be both detected (using the CPUID instruction) and controlled (using the model-specific IA32_MISC_ENABLE CPU register). According to Intel's recommendations, TM1 must be enabled by BIOS at processor initialization and shouldn't be modified later on (by the operating system).
So, the idea of TM1 is quite obvious – in case of emergency (the most obvious case is cooler failure) the task of this mechanism is to maintain the CPU temperature at maximum safe level (as far as possible, of course) by decreasing its performance, or, in our terms, by reducing CPU Load.
It'll be very curious to try it out in practice – let's take a Pentium 4 processor with Prescott core (Testbed #1) and run RMClock.
This processor supports TM1 and On Demand Clock Modulation (ODCM). Note that the former was already enabled at program startup, the latter was disabled. CPU Load is minimal, its thermal conditions are within the normal range, in this connection the Throttled CPU Clock coincides with the Actual CPU Clock.
Now let's apply 100% load (for this purpose we'll use a simple two-threaded application executing FPU instructions). The CPU zone temperature (according to SpeedFan) reaches 66°C once the system settles into a steady state. Let's stop the CPU cooler using SpeedFan and watch the CPU reaction.
CPU temperature reaches 70°C already in 15 seconds and TM1 snaps into action (we have already seen the same throttling threshold before). Throttling reaches its maximum intensity in one minute – mysterious 46% of duty cycles. CPU Load drop (as an average quantity measured at 1 sec interval) is quite smooth, while the throttled CPU clock curve demonstrates noticeable surges. The latter fact (considering that the throttled CPU clock is estimated by a relatively short interval, the reason for this having been described in the methodological chapter) means that clock modulation in throttling mode (that is when TM1 is active) operates rather chaotically. It stops at the same mark ~46% of the Actual CPU clock (in our case – 1555 MHz). CPU temperature at this point reaches about 80°C, however there is certainly no direct relation between throttling intensity and temperature. The reason has been already explained above: TM1 serves to maintain CPU temperature within the admissible range (ideally – below 70°C) "at any cost". In our experiment the CPU was driven to full throttling by heating it to 80°C. But a different experiment (for example, to slow down CPU cooler instead of stopping it) may achieve full throttling at a completely different temperature (but not lower than 70°C).
OK, enough on CPU tortures – let's set the cooler to full swing.
Complete restoration is approximately in 2.5 minutes – the temperature drops to 70°C, CPU load is restored to standard 100%, and the throttled CPU clock gradually reaches the actual CPU clock (again with typical surges on the corresponding curve).
Now let's see how Intel Xeon Nocona will operate in the same situation (Testbed #2).
Start RMClock in standard CPU mode. To all appearances, our processor is an engineering sample, because the program wrongly detected it as a Pentium 4 (Prescott) due to the lack of the appropriate Brand ID value in CPUID data (actually, there is no other direct way to differentiate between Nocona and Prescott). But the set of thermal features it supports is much more interesting: it has not only TM1 and ODCM (which are available in Prescott as well), but also TM2 and the new Enhanced SpeedStep technology (a server modification which, as you'll see later, differs from the one in mobile Pentium M processors). When the program starts, TM1 and Enhanced SpeedStep are already enabled (the latter is enabled by RMClock itself; its functionality will be described below).
Let's apply 100% load and then (after some time) disconnect power from cooler (SpeedFan wouldn't start on this platform, so temperature monitoring and rpm control of the cooler fan were impossible). Throttling starts only in 1 minute 40 seconds of operating without active cooling as smooth decrease of actual CPU load. It reaches its maximum in another 2.5 minutes, its maximum being the same as in Prescott – the same 46% (unfortunately, CPU Load jumped up when I took the screenshot). The throttled CPU clock curve again demonstrates surges. It means that TM1 is implemented in Nocona in the same way as in Prescott (which is quite natural, considering that the former is the server modification of the latter with minimum changes).
Resume active cooling of the system. Complete performance restoration is quite fast – just 40 seconds (which has absolutely nothing to do with a CPU, it depends solely on the cooling system properties).
Automatic thermal monitoring mechanism #2
Thermal Monitor 2 (TM2) is an advanced mechanism of CPU overheating protection, implemented in Pentium M processors and, according to Intel, in new Pentium 4 models (nevertheless, you can currently see it only in the server modification, Xeon Nocona). The considerable difference of the new mechanism is that TM2 (as the manufacturer claims) can control the CPU frequency (to be more exact, FID, the FSB frequency multiplier) and the CPU voltage (VID), while TM1 modulates the CPU clock. Due to the reduced voltage, TM2 retains better processor performance in case of overheating for the same reduction in power consumption. TM2 can be detected using the CPUID instruction, and it can be controlled via MSRs (IA32_MISC_ENABLE and MSR_THERM2_CTL). Note that TM2 is implemented differently in Pentium M and in Pentium 4 (Xeon) processors. The important TM2 parameters are the target FID and VID values, to which a processor should switch in case of overheating (the latter is indicated by the same sensor as used in TM1).
Responsibility for TM2 usage is laid on BIOS. TM2 is recommended for 2.8 GHz processors and higher (166 MHz bus) and 3.6 GHz and higher (200 MHz bus), while TM1 is recommended for junior models. Enabling both TM1 and TM2 at the same time, or disabling both, is a non-standard CPU mode and is not recommended by the manufacturer. Target FID and VID values should be set by BIOS during CPU initialization. A typical target FID is the minimum possible (14x) and the recommended VID is 1.275 V.
It's high time to test TM2 on a Xeon Nocona processor (no other options available so far). As BIOS does not allow to choose thermal monitoring modes at will, we'll start CPUMSR (later on we shall add specific CPU settings into RMClock), disable TM1 and enable TM2. We don't change the target FID and VID settings (considering that the table of VID values in CPUMSR is just wrong). After that we run RMClock.
The changes are applied: the processor uses TM2 instead of TM1. As for the target frequency and voltage settings, you can easily see (Minimal FID/VID indicators) that BIOS set them to 14x and 1.388V (the standard CPU voltage), which means that the processor will not change its voltage when switched to TM2 mode. This is irrelevant for our purposes, as we cannot estimate the actual power consumption of the processor anyway. So, 100% load is already applied to the processor, we only have to... stop the cooler fan.
And what do we see? Actual CPU clock (TSC clock frequency) remains... unchanged! But the drop of CPU load and throttled CPU clock is obvious, with the same typical surges on the curve. Throttling starts after 1 minute 30 seconds and settles by the second minute at 82% load and a throttled CPU clock of exactly 2800 MHz. That is, the final state, marked by the 14x "multiplier" (so to speak), can be considered reached. But there is one important reservation: it was not the actual CPU clock that was reduced, but only an effective clock, which we call the "throttled CPU clock". What does it mean? Perhaps the processor is really capable of reducing its clock frequency (according to Intel's documentation), but it does it so intricately that we just cannot see it. Or maybe it cannot do that. God knows, we don't :). As the CPU manufacturer does not provide a "valid" official method to measure CPU clocks in Pentium 4 / Xeon processors that would take into account their performance and power consumption management technologies, we have nothing to do but assume that TM2 is the very same... CPU clock modulation. But with one vital difference from TM1: in TM2 mode a processor can reduce its voltage together with modulating its clock.
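The 82% and 2800 MHz figures agree with the 14x target "multiplier". A short Python check, assuming the nominal 17x multiplier of this processor (mentioned later in the article) and a 200 MHz bus clock, which we infer from 2800 MHz / 14 rather than measure:

```python
def effective_clock_mhz(fid: int, bus_clock_mhz: float) -> float:
    """Effective clock corresponding to a given multiplier (FID)."""
    return fid * bus_clock_mhz


def relative_load(target_fid: int, nominal_fid: int) -> float:
    """Fraction of nominal performance left at the target multiplier."""
    return target_fid / nominal_fid


NOMINAL_FID = 17       # assumption: nominal multiplier of the tested Xeon
TARGET_FID = 14        # TM2 target multiplier set by BIOS
BUS_CLOCK_MHZ = 200.0  # assumption: inferred from 2800 MHz / 14

print(effective_clock_mhz(TARGET_FID, BUS_CLOCK_MHZ))    # throttled clock, MHz
print(f"{relative_load(TARGET_FID, NOMINAL_FID):.0%}")   # share of the actual clock
```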
Power on the cooler. The picture is evident and requires no explanations.
Enhanced SpeedStep technology
And now let's proceed to the "dessert", the hit of new Intel Xeon processors with Nocona core – Enhanced SpeedStep technology for servers (code name – DBS, Demand-Based Switching), which was demonstrated so well on Intel Developer Forum 2004 Russia.
Enhanced SpeedStep initially appeared in Pentium M processors as a replacement for Intel SpeedStep (used in mobile Pentium III and Pentium 4 processors) for more effective power consumption management. This is done by dynamically switching between P-states (each P-state is set by a combination of FID and VID values). The improved nature of SpeedStep consists in the centralized control mechanism (its integration into the CPU core) and the program interface (in the form of model-specific registers). Enhanced SpeedStep is indicated by the corresponding CPUID flag. Processor MSRs are used to control this technology (IA32_MISC_ENABLE, MSR_PERF_STATUS and MSR_PERF_CTL).
On the quality level (that is from the point of view of a description), Enhanced SpeedStep in Xeon processors for servers looks the same, while its implementation may be significantly different. Well, why not examine the behaviour of this new technology? We restore the standard CPU mode (TM1 should be enabled just for the record), run RMClock, set the CPU performance/power consumption level to minimum (P-state profile: Minimal). To be more illustrative, let's apply 100% load to the processor. Here are the results...
Did you expect anything different? We didn't. No need in commenting, but we'll do it anyway: the actual CPU clock remains unchanged, CPU load and throttled CPU clock go down to 82% and 2800 MHz correspondingly. It means that Enhanced SpeedStep for servers (in our opinion) is just another modification of the CPU clock modulation mechanism. But in this case it's much closer to TM2 in its implementation, because it can also modify CPU voltage. The difference is that it snaps into action "on demand" instead of being triggered by overheating.
To be thorough, we tried to set other CPU "multiplier" values, steadily increasing it from 14x to the nominal 17x. The result is quite obvious: the throttled CPU clock changes accordingly both at full CPU load and in idle mode.
Let's dwell on changing CPU voltage as a more important component of Enhanced SpeedStep (as well as TM2). It's not so easy to enable – if BIOS does not properly initialize a processor (the way it's recommended by the manufacturer), you will have to sweat it out. Let's use CPUMSR again. So, first of all, enable TM2 instead of TM1. Then, change FID/VID values for TM2 target. Minimum voltage has to be set "blindly" (as we have already said, the table of VID values in CPUMSR is wrong).
You can see the result: we have turned the trick. On the second screenshot you can see that we can swap TM2 back for TM1, and the processor still acts correctly. In conclusion it should be noted that we had no opportunity to measure the real changes in CPU voltage (because we had no system monitoring software for this chipset/motherboard). That's why we resorted to a crude method: we set the minimum voltage to 1.000 V. The system instantly rebooted, which means the core voltage had really changed (to an inadmissibly low level).
On demand clock modulation (ODCM)
We deliberately left this technology for the end – as we have demonstrated above, this very technology (at least its basic principle) is used as a basis for all the other performance/power consumption management technologies in Pentium 4 and Xeon processors. Software controlled on-demand clock modulation is indicated by the ACPI flag in CPUID, it can be controlled using processor MSR IA32_CLOCK_MODULATION.
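All four controllable technologies discussed in this article are reported through CPUID feature flags. Below is a hedged Python sketch that decodes them from raw register values. It assumes the ECX and EDX values of CPUID function 1 have been obtained elsewhere, and it uses the bit positions as we recall them from Intel's CPUID documentation (please verify them against the manual). The function name and the sample register values are hypothetical.

```python
from typing import Dict

# CPUID function 1 feature bits (assumed positions, to be checked against Intel's manual)
CPUID1_EDX_ACPI = 1 << 22  # software controlled clock modulation (ODCM)
CPUID1_EDX_TM = 1 << 29    # Thermal Monitor 1
CPUID1_ECX_EIST = 1 << 7   # Enhanced SpeedStep
CPUID1_ECX_TM2 = 1 << 8    # Thermal Monitor 2


def thermal_features(ecx: int, edx: int) -> Dict[str, bool]:
    """Decode the thermal/power feature flags from CPUID function 1 output."""
    return {
        "TM1": bool(edx & CPUID1_EDX_TM),
        "TM2": bool(ecx & CPUID1_ECX_TM2),
        "Enhanced SpeedStep": bool(ecx & CPUID1_ECX_EIST),
        "ODCM": bool(edx & CPUID1_EDX_ACPI),
    }


# Invented register values for a processor that reports only TM1 and ODCM
sample_edx = CPUID1_EDX_TM | CPUID1_EDX_ACPI
sample_ecx = 0
for name, present in thermal_features(sample_ecx, sample_edx).items():
    print(f"{name}: {'yes' if present else 'no'}")
```

A flag only says that the feature exists; whether it is enabled is a matter of the corresponding MSR settings.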
The main control parameters of this technology are the enable/disable switch and the specified minimum level of the CPU duty cycle (from 12.5% to 87.5% in 12.5% steps; there is also a reserved value).
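These two parameters fit into a few bits of the register. The sketch below only builds and parses the register value and never touches the hardware. It assumes the layout as we recall it from Intel's System Programming Guide: bit 4 enables modulation and bits 3:1 hold the duty cycle in 12.5% units, with zero being the reserved value. The function names are ours.

```python
from typing import Optional, Tuple

ODCM_ENABLE_BIT = 1 << 4    # assumed: on-demand clock modulation enable
ODCM_DUTY_SHIFT = 1         # assumed: duty cycle field occupies bits 3:1
ODCM_DUTY_MASK = 0b111
DUTY_STEP_PERCENT = 12.5


def encode_clock_modulation(enabled: bool, duty_percent: float) -> int:
    """Build an IA32_CLOCK_MODULATION value for a 12.5%..87.5% duty cycle."""
    steps = duty_percent / DUTY_STEP_PERCENT
    if steps != int(steps) or not 1 <= steps <= 7:
        raise ValueError("duty cycle must be 12.5..87.5 in 12.5 steps")
    value = int(steps) << ODCM_DUTY_SHIFT
    if enabled:
        value |= ODCM_ENABLE_BIT
    return value


def decode_clock_modulation(value: int) -> Tuple[bool, Optional[float]]:
    """Return (enabled, duty cycle in percent); None marks the reserved code."""
    enabled = bool(value & ODCM_ENABLE_BIT)
    steps = (value >> ODCM_DUTY_SHIFT) & ODCM_DUTY_MASK
    return enabled, (steps * DUTY_STEP_PERCENT if steps else None)


print(hex(encode_clock_modulation(True, 12.5)))  # maximum modulation
print(decode_clock_modulation(0x18))             # enabled, 50% duty cycle
```

Actually writing such a value requires a privileged tool like CPUMSR, which is outside the scope of this sketch.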
We'll use the CPUMSR utility again to examine the software controlled modulation. At first, let's take a Xeon Nocona processor (its results are more illustrative) and set the maximum modulation level – from 12.5% of duty cycles.
The result is very interesting – in almost idle mode the throttled CPU clock drops to 785 MHz (approximately 23% of the actual CPU clock, which is naturally not changed). It's instantly restored to maximum when a CPU gets loaded – even the minimum load, you can see it well in the right screenshot. Here you can also see that ODCM can be successfully used together with Enhanced SpeedStep. In this case throttled CPU clock varies within 647–2800 MHz (at the specified 14x "multiplier").
We see a similar picture with Prescott, although this processor acts a bit differently. On the left screenshot you can see a steady growth of the duty cycle from 12.5% to 50%, which results in no changes in the throttled CPU clock and means that the minimum possible duty cycle of this processor is 50%. The picture on the right shows a further increase of the duty cycle to 100% (that is, ODCM completely disabled), which is accompanied by an increased throttled CPU clock. It's not quite clear why Intel calls this technology "software controlled". Yes, its settings (enable/disable) are indeed software controlled (as are the settings of all the other technologies except for the catastrophic shutdown detector), but its behaviour is completely automated (on the hardware level), and you can easily see that from our tests.
It's time we draw the conclusion of our today's research. Let's dwell on each of the reviewed and tested technologies and briefly enumerate and comment on its key features.
Automatic thermal monitoring mechanism #1 (TM1)
TM1 is an interesting and useful technology that prevents CPU overheating in emergency cases (for example, when the fan on the CPU cooler fails). Of course, nothing stops this technology from working in standard situations as well, for example with a low-quality cooling system. To our mind, this technology has one serious drawback in this connection: the TM1 effect is absolutely transparent to the operating system as well as to inexperienced users and typical sysinfo software. Why? Because the operating system as well as popular utilities like CPU-Z or WCPUID will tell users that the CPU clock in their systems is still 3.4 GHz (it's an example) and the CPU load is 100% (at full load). But in fact the CPU may be actively "throttling", that is, operating at its minimum 46%, and dismay the user with its performance.
Automatic thermal monitoring mechanism #2 (TM2)
TM2 is a very similar (because it is actually based on the same principle – clock modulation), but still improved version of the TM1 technology. The key improvement consists in reducing voltage when TM2 snaps into action (that is when a CPU is overheated). It's useful both from the point of view of extending the CPU service life as well as its performance (which drops to a lesser extent in TM2 than in TM1). An heir to TM1, TM2 is not without the same drawback – its complete transparency to users.
Enhanced SpeedStep (DBS) technology modification for servers
Paradoxical, but true: Enhanced SpeedStep technology for servers differs little from TM2. The only difference consists in how they snap into action: TM2 operates only on CPU overheating, while DBS can be enabled or disabled "on demand". It also allows automated performance management depending on CPU load (this function is an integral part of RMClock). From the point of view of the idea (and the technology name), Enhanced SpeedStep for servers is a complete counterpart of the mobile Enhanced SpeedStep as well as AMD PowerNow! (mobile version implemented in Mobile Athlon XP) and AMD Cool'n'Quiet (desktop/mobile version implemented in Athlon 64 series processors). But from the implementation point of view it isn't. Enhanced SpeedStep for servers modulates the CPU clock (which is certainly easier to implement, taking into account that the clock modulation mechanism has been available in Pentium 4 for a long time), while the mobile Enhanced SpeedStep and the proprietary AMD technologies honestly change the CPU multiplier "on the run". A nice fact: both technologies can effectively change CPU voltage, which has a much greater effect on CPU power consumption (to the simplest approximation, CPU power has a linear relation to its clock and a quadratic relation to its voltage). But what's so wrong with Enhanced SpeedStep for servers? Nothing's wrong actually, but it has the same drawback: CPU multiplier changes (changes of its effective clock, actually) are completely invisible to users.
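To see why voltage matters more than clock, the simplest approximation mentioned above can be put into numbers. The following Python sketch is only a first-order estimate, not a measurement: it takes the 14x and 17x multipliers and the 1.275 V and 1.388 V voltages quoted earlier in the article as sample inputs and ignores everything else (leakage, idle states and so on). The function name is ours.

```python
def relative_power(clock_ratio: float, voltage_ratio: float) -> float:
    """First-order estimate: power is proportional to clock * voltage squared."""
    return clock_ratio * voltage_ratio ** 2


clock_ratio = 14 / 17          # effective clock at the 14x "multiplier"
voltage_ratio = 1.275 / 1.388  # recommended target VID vs standard voltage

print(f"clock only:        {relative_power(clock_ratio, 1.0):.2f}")
print(f"clock and voltage: {relative_power(clock_ratio, voltage_ratio):.2f}")
```

Under these assumptions, lowering the voltage as well brings the estimate from roughly 0.82 of nominal power down to roughly 0.69.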
On demand clock modulation (ODCM)
It's an interesting technology, which seems to allow a considerable decrease of the CPU clock in idle mode and thus a considerable reduction in power consumption. A question arises in this connection: why is this technology not enabled by default? The answer came rather suddenly. We decided to compare the CPU temperature in idle mode "as is" (at 3.4 GHz) and at 50% modulation (at the throttled CPU clock of about 1.9 GHz). And we got unexpected results: the CPU zone had absolutely the same temperature in both cases (about 46°C)! Come to think of it, it's quite logical: when a CPU is halted (HLT), it's quite natural to expect it to go into sleep (power saving) mode. In other words, it doesn't matter whether the clock is modulated or not. Wait... does it mean that the effect of the CPU multiplier reduction in DBS will also be that insignificant? Unfortunately, we had no opportunity to check this assumption in practice, but it may really be so...
Instead of the conclusion
When the article was still in progress, there appeared a new utility ThrottleWatch from Panopsys on the freeware web sites. It's used for detection and quantity evaluation of throttling in Pentium 4 series processors. As it's obviously appropriate to this article, we decided to review its key features and compare them with the features of new (soon to be released) RMClock 1.3.
We've performed a standard step: we took Pentium 4 processor (Prescott), ran RMClock and ThrottleWatch, applied the 100% load, and then stopped a fan on the CPU cooler.
A very interesting result! RMClock can detect precisely the drop of "effective" CPU clock, but ThrottleWatch still assures users that the CPU operates at full capacity, without throttling...
Summing it all up, ThrottleWatch is certainly a useful utility, considering that it's the first specialized utility that can detect such a subtle and user-transparent Intel technology as CPU throttling (among the non-specialized utilities we should first of all mention CPU Stability Test from the CPU RightMark benchmark). Nevertheless, the throttling measurement methods in ThrottleWatch obviously leave much to be desired. In this connection, we can only recommend that our readers wait for the announcement of the new version of RMClock, which can detect any form of CPU throttling.
Dmitri Besedin (firstname.lastname@example.org)
February 9, 2005