When examining AMD Athlon XP and Intel Pentium 4 processors we left aside a wide range of issues related to their thermal conditions, temperature monitoring and thermal control. These questions are no less important than performance or architectural peculiarities, because only correctly operating temperature monitoring together with carefully developed thermal control mechanisms can provide optimal thermal conditions and reliable operation of these two flagships.
When the Athlon XP and Pentium 4 first appeared we carefully examined over a hundred technical and normative documents, carried out many tests and experiments, analyzed the thermal conditions of the two hottest processors and came to some interesting, even provocative, conclusions.
Symptomatology of disease and possible complications
As you know, the die of an integrated circuit (IC) warms up. Current passing through a conductor (semiconductor) is accompanied by thermal power dissipation, and since the conductor (semiconductor) has finite thermal conductivity, its temperature rises above the ambient one. The chip's package and the various internal protective/insulating layers, whose thermal conductivity is usually lower, aggravate the situation by hampering heat removal from the die and raising its temperature.
Reliable operation of IC transistors and of their interconnect structure depends heavily on thermal conditions. As a result, the temperature range for an average IC is rather narrow - from -40 to 125°C. The lower limit is caused by the different thermal expansion coefficients of the silicon substrate, isolating/protective layers, metallization layer etc. (low temperatures cause internal mechanical stress which affects the electrophysical properties of the IC and can even break the die). The upper limit is explained by degradation of the frequency and electrical properties of transistors (weaker current, lower threshold voltage, soft breakdowns of the gate oxide etc.) and by possible hard breakdowns of reverse-biased p-n junctions. Modern processors, with a finer microstructure and more complex packages than those of an average CMOS IC, have an even narrower temperature range - from 0 to 100°C.
Well, if a processor can operate well at 100°C, then what's the problem? Its temperature hardly rises above 90-95°C even with a rather weak cooler... However, proper operation at high temperatures is illusory, because a great many electrochemical processes also take place inside the metal-and-silicon heart of the computer, and their rate depends strongly on temperature. With time they can start hampering normal operation of the processor or even damage it, even though the working temperatures stay within the range that is safe from the electrical standpoint. Although some of these effects can even improve the electrical and frequency properties of transistors, most of them bring nothing good.
Two groups of processes are the most harmful. The first is electrochemical destruction of the metallization (electromigration). An electric field combined with a high temperature can knock metal atoms out of their lattice sites. With time a conductor can become much thinner (its resistance grows considerably), so that even at a small current localized overheating can rupture part of a track and then damage a group of transistors, a functional unit or the whole IC. Although the 0.18-micron technology of the Pentium 4 and Athlon XP provides quite good immunity against electromigration and creates favorable conditions for back diffusion, the balance is lost at 75-85°C and above.
The second group is gate oxide degradation. The film of silicon dioxide used as a dielectric under a transistor gate always contains some impurities (usually of n-type) which concentrate near the film's inner surface (between the dielectric and the silicon). The impurity ions create inversion or accumulation layers (parasitic channels) near the semiconductor surface under the dielectric, which affect the reverse current of p-n junctions and the breakdown voltage. The field (in 0.18-micron transistors the field strength can reach 10^6 V/cm) and temperature gradients cause drift and diffusion of ions in the dielectric. This alters its properties and considerably changes the conductance and length of the parasitic channels in the semiconductor (and thus disturbs the operation of the transistor through significant current fluctuations); at worst it can cause hard breakdown of the dielectric or even of the p-n junctions, even at low temperatures. Additional ions which migrate into the oxide from other parts of the transistor aggravate the situation, and this again happens at high temperature.
All this shows that high temperature is the number one enemy of a processor. This can be shown not only in theory but also in practice.
According to different sources the average service life of a relatively primitive IC is 50-75 years at 60°C and only 1000-1500 hours at 125°C. We did not conduct any large-scale tests of complex ICs (processors), but some semi-experimental estimates of their average service life are much more pessimistic than for ordinary ICs - not more than 1000-1500 hours at 85-90°C.
Clearly, two things are needed: correct temperature monitoring support that estimates the die temperature with high precision, and thermal control mechanisms that keep the temperature of modern processors within the acceptable range.
Many believe that temperature monitoring and thermal control are optional for an ordinary user, since brand-name cooling systems protect against any overheating, and if overheating does occur, it is the fault of the user or of a service technician.
However, no cooling system is flawless, and nobody can guarantee that a system is free of defects and operates correctly. The most dangerous thing is that some failures or defects appear unexpectedly, and a user with no diagnostic means may know nothing about them.
The most trivial failure is a fan failure or a malfunction of its thermal control circuit in air coolers, or damage to the pump or a drop in its performance in liquid-cooling systems. Within several seconds the thermal resistance may grow several times, raising the CPU temperature. Taking into account an average thermal power of Athlon XP processors of 60-70 W, a temperature inside the system case of 30-40°C and a thermal resistance of a system with natural convection of the heat carrier of 1.5°C/W or higher, it is easy to calculate the resulting temperature... It's terrifying!
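The calculation hinted at above can be sketched as follows. This is a minimal illustration that assumes the usual steady-state model (die temperature = ambient temperature + thermal power x thermal resistance); the function name is ours, and the input figures are the ones quoted in the paragraph above.

```python
def steady_state_temperature(ambient_c, power_w, r_theta_c_per_w):
    """Steady-state die temperature for a simple one-resistor thermal model.

    Assumption: T_die = T_ambient + P * R_theta, with R_theta in degrees C per watt.
    """
    return ambient_c + power_w * r_theta_c_per_w


if __name__ == "__main__":
    # Figures from the text: 60-70 W, 30-40 C inside the case, 1.5 C/W with a stopped fan.
    best_case = steady_state_temperature(30.0, 60.0, 1.5)
    worst_case = steady_state_temperature(40.0, 70.0, 1.5)
    print(f"Die temperature with natural convection only: {best_case:.0f}-{worst_case:.0f} C")
```

With these inputs the model gives 120-145°C, well outside the 0-100°C range mentioned earlier, and the thermal resistance may in fact be higher than 1.5°C/W.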
If a fan (or a pump) becomes very noisy you may notice it and shut down the system in time. But, as you know, fans make noise in different ways and users have different ears.
There are, of course, fan speed monitoring programs, but they can contain errors, usually do not support a wide range of equipment and, above all, are useless if the operating system fails. Besides, monitoring chips (or rather, the ADC channels which work in the timer/counter mode and carry out tachometer control) do not always operate flawlessly.
Another problem is that the consistency of thermal grease changes with time, or that its physicochemical properties do not meet the requirements in the first place. You can hardly tell by eye whether a given grease is up to standard. As a result, you may give your processor a good thermal insulator with a thermal resistance of 1.5-2°C/W, and the resulting die temperature will be terribly high.
Apart from adverse thermal conditions there are some other factors which can damage a processor. For example, increased core voltage is a powerful catalyst of oxide degradation and electromigration. But again, too high a voltage increases the operating current, which in turn raises the thermal power and therefore the temperature. Quite often the user himself is to blame, having overclocked his system carelessly.
The VRM circuits of mainboards or power supply units can also fail because of considerable voltage surges in the AC network. However, modern power supply units and mainboards have many protective means and emergency shutdown mechanisms, so a processor is unlikely to fail because of the VRM circuits.
Therefore, the only really pressing problem today is bad thermal conditions of the processor.
Pentium 4 - diagnostics, treatment and self-treatment
Top models of the Pentium 4 based on the 0.18-micron core (Willamette) have a theoretical thermal power of 90-95 W. In practice, however, heat dissipation can be pushed only to 80-85% of that figure. But even 70-75 W is too much.
The developers of the Pentium 4 took two steps to solve the problem: they improved the processor's package on the macro level (replacing the FC-PGA with the FC-PGA2) and added Thermal Monitor support on the micro level.
The FC-PGA2 package has an integrated heat spreader mounted on the core's surface (a copper plate, 2 mm thick, covered with a thin layer of nickel).
Although the heat spreader introduces additional thermal resistance (0.3°C/W) on the way from the processor's die to the heatsink, it makes it possible to reduce the effective heat-flux density 2-3 times and to considerably weaken the spreading resistance effect, which arises in the FC-PGA package when the surface areas of the heat source and of the heatsink base differ greatly. As a result, the real thermal resistance of the processor-cooler system decreases considerably. One more advantage of the FC-PGA2 package is better mechanical protection of the processor's die against longitudinal loads. This allows a higher hold-down pressure for the fastening mechanisms of coolers/heat exchangers and therefore lowers the losses in thermal interfaces (thermal grease layer, phase change material etc.).
Thanks to the FC-PGA2 package, which is optimized from the thermal standpoint, the design of the cooling system becomes simpler and its cost goes down.
Such thermal power remains dangerous for the processor if the cooling system fails. The problem can be solved by building temperature monitoring into the processor.
A thermal diode has long been used for temperature control of thermally stressed ICs: the voltage drop across a p-n junction depends on its temperature. Later the thermal diode was incorporated into the Intel Pentium II/Celeron, and now it is built into the AMD Athlon MP/XP.
The thermal diode's readings do not depend on the environment and reflect the real temperature of the die and its changes. No method of external measurement makes it possible to estimate the die temperature objectively, because the die is inaccessible and the temperature of an IC package depends heavily on its design and on the environment. Besides, the stability of an external temperature sensor's characteristics also depends on the environment, and these characteristics are, as a rule, nonlinear.
However, the temperature measured with a thermal diode is not always objective either. In the course of normal operation of the Pentium 4 (and Athlon XP) the temperature can change at a rate of 30-50°C/s. Digital monitoring ICs can't process the diode's readings more often than 8 times per second (the conversion time is 125 ms or more). This means that the measured temperature lags behind the real temperature of the die. The steeper the jumps, the greater the error: when a cooling system fails the temperature can rise at 60°C/s or more, and the lag will be at least 7°C. That is why classical thermal monitoring schemes do not suit the Pentium 4.
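The lag figure above can be reproduced with a one-line estimate. This is a rough sketch that assumes the temperature changes linearly during one conversion, so the reading is stale by the rate of change multiplied by the conversion time; the function name is ours.

```python
def measurement_lag(rate_c_per_s, conversion_time_s):
    """Worst-case staleness of a reading, assuming a linear temperature ramp.

    The monitoring IC reports a value that is one conversion period old,
    so the error is roughly rate * conversion time.
    """
    return rate_c_per_s * conversion_time_s


if __name__ == "__main__":
    conversion_time_s = 0.125  # 8 conversions per second, as quoted in the text
    for rate in (30.0, 50.0, 60.0):
        lag = measurement_lag(rate, conversion_time_s)
        print(f"{rate:.0f} C/s -> reading lags by about {lag:.2f} C")
```

At 60°C/s and a 125 ms conversion time the estimate is 7.5°C, in line with the "at least 7°C" quoted above; slower monitoring chips only make it worse.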
All Intel processors starting from the Pentium Pro have a temperature sensor and an analog comparison circuit meant for detecting catastrophic overheating. This sensor, like a thermal diode, is a diode-connected transistor, but here it uses a reverse-biased p-n junction and the dependence of the junction's reverse current on temperature. The diode's current is fed to a comparator together with the current of a reference source, which is adjusted so that the comparator trips at a definite temperature. The response time of such a circuit is just several hundred nanoseconds, so it detects very quickly that the temperature has exceeded the limit. As a result, if the temperature of an Intel CPU rises above 125-135°C, this comparison circuit stops the clock to all processor units, the THERMTRIP# signal reports catastrophic overheating and the processor VRM turns off.
The engineers working on the Pentium 4 decided to make this circuit more flexible and developed the Thermal Monitor technology. The thermal sensor moved to the most thermally stressed region of the Pentium 4, that of the rapid integer ALUs, and received an additional comparison circuit and the necessary logic. The result was one more die temperature threshold (85-90°C depending on the processor model), a Thermal Control Circuit and several new MSR registers.
When the die temperature exceeds the threshold value the processor is not turned off; instead, the clock signal is stopped periodically, i.e. the clock is modulated with a certain duty cycle.
The Thermal Monitor has two modes: Automatic and On-Demand. The automatic mode can be activated in the BIOS; in case of overheating the clock modulation unit slows the processor down by 50% (i.e. the clock is stopped for the same amount of time as it runs). The on-demand mode can be enabled at any time, regardless of thermal conditions, and activates the clock modulation unit immediately. In this mode the duty cycle can be varied from 12.5% to 87.5%. When activated, the unit asserts the PROCHOT# signal. It is also possible to generate a processor interrupt on the rising or falling edge of PROCHOT# (this signal is also accessible to internal processor units), which BIOS or software developers can easily use.
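To make the on-demand range more concrete, here is a hedged sketch of how a duty cycle could be encoded for the clock modulation control register. The layout is our assumption, not something stated in the text above: it follows what Intel documents for on-demand clock modulation (an enable flag in bit 4 and a 3-bit duty-cycle field in bits 3:1, in steps of 12.5%, with the zero field value reserved). The code only computes the register value and does not touch any hardware; all names are illustrative.

```python
DUTY_STEP_PERCENT = 12.5   # assumed granularity of the duty-cycle field
ENABLE_BIT = 1 << 4        # assumed position of the on-demand enable flag
DUTY_FIELD_SHIFT = 1       # assumed position of the 3-bit duty-cycle field


def encode_on_demand(duty_percent):
    """Return an illustrative register value for the requested duty cycle.

    Accepts 12.5, 25.0, ..., 87.5; anything else is rejected, because the
    3-bit field has only seven valid non-zero codes under our assumption.
    """
    steps = duty_percent / DUTY_STEP_PERCENT
    if steps != int(steps) or not 1 <= steps <= 7:
        raise ValueError("duty cycle must be a multiple of 12.5% between 12.5% and 87.5%")
    return ENABLE_BIT | (int(steps) << DUTY_FIELD_SHIFT)


if __name__ == "__main__":
    for duty in (12.5, 50.0, 87.5):
        print(f"{duty:5.1f}% -> 0x{encode_on_demand(duty):02X}")
```

Seven codes of 12.5% each give exactly the 12.5-87.5% range quoted above; the exact register address and field layout should be checked against the processor documentation before use.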
As you can see, the Thermal Monitor is an accurate, prompt (the delay in the Pentium 4 is only several tens of nanoseconds) and efficient means of temperature control. Of course, the Thermal Monitor is not completely independent of the rest of the system, as the clock modulation unit must be activated by the BIOS or system software. But even if this feature is not enabled, there is still the catastrophic overheating protection and the good old thermal diode.
Athlon XP - diagnostics and treatment
The Athlon XP has no catastrophic overheating protection at all. Why didn't AMD provide such protection, when it is certainly within its power? The Athlon XP can easily be damaged if the cooling system fails (the thermal power of top Athlon XP models is 60-70 W, almost equal to that of the Pentium 4).
Besides, the thermal diode of the Athlon XP is of little use, because most recent Socket A mainboards do not support it, although the leading chipset and mainboard manufacturers had samples of Palomino-based processors as early as March-April last year.
There were even some funny stories connected with this thermal diode. Last summer ASUS released the A7V266 board, which had a THEMCPU jumper for thermal diode support. However, with an Athlon XP installed the board showed 50-52°C irrespective of the processor load. Nothing changed in the A7V266-E board either.
However, not all mainboard makers are so careless. For example, the D1289 board from Fujitsu Siemens Computers "understands" the diode perfectly.
Because the D1289 has high-quality support of the thermal diode, we chose it as the base of our testbed for Socket A coolers.
Thermal control on this board follows the classical scheme known as CPU Throttling. The system monitoring chip is loaded with a temperature threshold value. The diode's readings are regularly compared with the threshold, and if the temperature exceeds the limit, the THRM# signal applied to the south bridge (we had the VIA VT8233) makes it generate a duty cycle and assert the STPCLK# signal to the Athlon XP processor. The duty cycle depends on the value of the corresponding south bridge register and can reach 100%.
As we mentioned before, the worst drawback of this classical scheme is its great latency (150-200 ms for the monitoring chip of the D1289). The temperature processed by the chip thus lags behind the real one, which makes it impossible to react promptly to sharp temperature jumps. If the cooling system fails, the thermal resistance goes up quickly (2-3 times in several seconds) and the temperature can start rising at 70-100°C/s. In such extreme conditions the classical scheme is no longer adequate.
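The effect of this latency can be illustrated with a toy model. It is only a sketch under simplifying assumptions of our own: the die temperature rises linearly after the failure, the monitoring chip samples the diode at the start of each conversion and reports the value at its end, and the starting temperature (50°C) and the threshold (85°C) are hypothetical values chosen for illustration. The ramp rate and conversion time are taken from the paragraph above.

```python
def first_alarm(start_c, rate_c_per_s, conversion_ms, threshold_c):
    """Find when a polling monitor first reports an over-threshold reading.

    Returns (report_time_ms, reported_c, actual_c): the time the reading becomes
    available, the (stale) value reported, and the die temperature at that time.
    Assumes a linear ramp that starts at t = 0.
    """
    if rate_c_per_s <= 0 or conversion_ms <= 0:
        raise ValueError("rate and conversion time must be positive")
    k = 0
    while True:
        sample_ms = k * conversion_ms
        sampled_c = start_c + rate_c_per_s * sample_ms / 1000
        if sampled_c > threshold_c:
            report_ms = sample_ms + conversion_ms
            actual_c = start_c + rate_c_per_s * report_ms / 1000
            return report_ms, sampled_c, actual_c
        k += 1


if __name__ == "__main__":
    report_ms, reported_c, actual_c = first_alarm(50, 100, 150, 85)
    print(f"alarm after {report_ms} ms: chip reports {reported_c:.0f} C, die is at {actual_c:.0f} C")
```

In this model, with a 100°C/s ramp and a 150 ms conversion time, the first over-threshold reading arrives 600 ms after the failure, when the die is already 25°C above the hypothetical threshold.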
Mainboards without thermal diode support, where a thermistor in the socket serves as the thermal sensor, build their so-called thermal control in the same way. But the thermistor doesn't reflect the true temperature of the core: the difference between its readings and the true temperature can reach 20-40°C.
The Athlon XP supports the energy-saving Halt and Stop Grant modes, which are usually associated with the ACPI C1 and C2 states. In the Stop Grant mode the CPU dissipates 2-3 times less thermal power than in normal operation, and the temperature goes down proportionally. But the system logic of most Socket A mainboards never enters this mode. As a result, the Athlon XP consumes almost maximum power all the time, and the core temperature is therefore very high.
Considering that typical office computers stand idle for 99% of the processor time, it is obviously necessary to enable the Stop Grant mode during idle periods.
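A back-of-the-envelope estimate shows what is at stake. The sketch below assumes, for illustration only, that the processor dissipates its full thermal power whenever it is not in Stop Grant and 2-3 times less while in Stop Grant; the 60 W figure and the 99% idle share come from the text, while the function name and the time-weighted averaging are ours.

```python
def average_power(full_power_w, idle_fraction, reduction_factor):
    """Time-weighted average thermal power with Stop Grant used during idle time.

    idle_fraction    -- share of time spent idle (0.0 to 1.0)
    reduction_factor -- how many times less power is dissipated in Stop Grant
    """
    if not 0.0 <= idle_fraction <= 1.0 or reduction_factor <= 0:
        raise ValueError("idle_fraction must be in [0, 1] and reduction_factor positive")
    busy_fraction = 1.0 - idle_fraction
    return full_power_w * (busy_fraction + idle_fraction / reduction_factor)


if __name__ == "__main__":
    for factor in (2.0, 3.0):
        avg = average_power(60.0, 0.99, factor)
        print(f"Stop Grant at {factor:.0f}x lower power: about {avg:.1f} W on average")
```

Under these assumptions the average thermal power of an office machine falls to roughly a half or a third of the full figure, which is why leaving Stop Grant disabled is so wasteful.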
A well-assembled Athlon XP system reduces the probability of catastrophic overheating. Still, it can happen. If you use an excellent heatsink and good thermal grease, the thermal resistance of the processor-cooler system doesn't exceed 1.5°C/W. Catastrophic overheating does not occur immediately (the temperature never reaches 150-160°C), but the temperature can still reach 100-110°C and falls back to an acceptable 80-90°C only after 5-10 minutes. Such a high temperature is obviously harmful to the processor's microstructure, and it would be wrong to say that such thermal torture passes without a trace. The processor may well fail later, even at normal temperature, simply because of substantial gate oxide degradation, electromigration and other harmful effects. As I mentioned in the Symptomatology of disease and possible complications section, these effects are directly connected with overheating of the processor and can shorten its service life considerably.
Well, the summary is really disappointing: despite the excellent performance of Athlon XP systems, most of them have no reliable and effective means of thermal control, do not provide optimal thermal conditions on the system level and are hardly protected against failure of the cooling system. By the way, this was proven by one of the hardware physicians - Tom Pabst.
Athlon XP - vivisection
I guess many of you saw the report and the video clip on the experiments with the Pentium 4, Pentium III, Athlon and Athlon MP carried out by a team headed by Tom Pabst. They showed the terrifying "faces" of the Athlon and Athlon MP, in effect cremated after a short period of running without any cooling system.
After the experimental results were published Tom was accused of being corrupt, and it was also said that the D1289 (the very board Tom used) had no thermal diode support and didn't suit the Athlon MP. All those views and debates are obviously groundless. It is also obvious that Tom managed to get at the root of the problem. That is why we will now turn to the technical side of the issue.
From the technical point of view, however, the experiments held by Tom and his team make no sense and have nothing in common with practice. First of all, Tom says it is very probable that the socket's fasteners can break or the cooler can come loose after assembly of system units, delivery of computers by mail etc. I'd say he is naive, because the socket's fasteners can stand a load of 5-10 kg and several hundred cooler installation/removal cycles.
Secondly, if a cooler does break off, it can damage the video card mechanically (and electrically), which can cause failure not only of the video card but also of the mainboard. The irony is that the processor can remain safe and sound, because competently designed processor VRM circuits will immediately turn off the core voltage in case of such a catastrophic failure.
Thirdly, neither Intel nor AMD processors can be used without cooling systems, which, moreover, must meet certain requirements.
Finally, the result of such an experiment on Athlon or Athlon MP/XP processors is predetermined - they will not survive! Why?
Without a cooler the thermal resistance of the processor is 8-9°C/W, and within 100-150 ms after the heatsink with its fan is "turned off" the most thermally stressed units overheat. 100-200 ms are enough for hard breakdown to begin. This means that the thermal control system must do its best to bring the processor's temperature back into the acceptable range within 150-200 ms at most. As you remember, it takes at least 150 ms for the monitoring chip on the D1289 board to "digest" the thermal diode's readings. It may seem that the chip has enough time to bring the processor back to life, but this is not so: after 150 ms it will receive a temperature value measured when there was still no problem, or when the temperature had not yet risen considerably. The next cycle will take another 150-200 ms, and by then it is too late.
The second video clip, posted by AMD, proves the opposite, as you remember: the Athlon MP remained safe without any cooler or fan. However, you should take the testbed configuration into account: the guys from AMD used a completely different thermal control technique. They took a MAX6512 chip and connected it directly to the processor socket pins linked with the thermal diode of the Athlon MP/XP. This chip is a purely analog device, so the effective temperature measurement time is only about 70 ms (and sharp temperature jumps are easier to register). As a result, the MAX6512 has time to send an emergency signal and cut the power supply of the mainboard and of the processor before the temperature of the latter reaches critical values. As you can see, the conditions of Tom's experiment and of the experiment by AMD's guys are different, and it is not correct to compare them.
Well, I hope AMD will take some steps to improve its product. All it needs is to add a small catastrophic overheating sensor and an ordinary thermal control system, and to alter the processor package.
The Intel Pentium 4 has a great reliability reserve thanks to its catastrophic overheating protection and Thermal Monitor support. But an optimal reliability/performance ratio for Pentium 4 based systems requires the most effective cooling means.
The AMD Athlon XP lacks any integrated overheating protection, and most systems based on it have no proper thermal control mechanisms. At present Athlon XP based systems do have thermal problems and are not protected against serious failures of the cooling system. However, systems with thermal diode support still have minimal thermal protection.