iXBT Labs - Intel 6th Series Chipset Defect - Page 1: The problem and Intel's solution


As we all know, people find a certain pleasure in seeing each other in trouble. If a celebrity or a major company is in focus, emotions go beyond mere delight to gloating or similar feelings. People also like to exaggerate, to make mountains out of molehills. This is something one can hardly do alone, but the development of telecommunications helps a lot, at times elevating emotions to panic.

This is what comes to mind when you look at the hype caused by Intel's announcement of problems with its newest chipset and by the company's intention to withdraw and replace all chips shipped so far. Of course, only a relatively small portion of the Internet is caught up in the hype, but, hey, we're a part of it too.

Being reasonable and willing to help you, our readers, we searched for available information on the problem. Since many would like to hear from Intel itself, we contacted Mikhail Rybakov, Intel PR Manager Russia/C.I.S., over the phone and asked him a few questions. Here's what we've managed to find out.

The heart of the problem

As you may know, modern chips consist of hundreds of thousands to millions of transistors. Central processing units are more complex and feature up to a billion transistors each. Chipsets are simpler but comparable to processors produced a few years ago. Dies haven't grown in size for a long time and have even been shrinking to cut costs. Higher levels of integration are achieved by moving to finer process technologies: modern electronics has long since become nanotechnology.

This path brought problems that didn't bother manufacturers 15 years ago but did 7 years ago. When the first 90nm Prescott-based Pentium 4 processors were rolled out, the masses learned about leakage current. As a rough approximation, this is when electrons misbehave, ignoring electric current decorum and breaking the rules of how semiconductor devices such as transistors are supposed to function. For example, instead of moving along carefully laid conductors, electrons 'puncture' the dielectric layer and leave elsewhere. Or, vice versa, arrive from elsewhere. As a result, transistors switch spontaneously, making machine logic sort of 'fuzzy' (pun intended). Engineers managed to deal with this unpleasant issue, but it took time, effort and sacrifice. In particular, the power consumption of Prescott (the fixed one) turned out to be higher than planned, and the expected clock rates were not achieved. Eventually, Intel decided that the entire NetBurst architecture was a dead end and gave up on it.

What do the P67 and H67 chipsets have to offer? Like their predecessors, they are made using the long-polished 65nm process technology (CPUs are already made on the 32nm one). But in terms of functionality and 'performance' (to the extent this word applies to chipsets), these chipsets differ a lot from the 5th series. No wonder: many criticized the latter for its mediocre PCI Express performance of just 250MB/s per lane, which corresponded to the first edition of the standard. Besides, SATA-600 is becoming more and more popular these days. In other words, the new chipsets are actually new, unlike the previous ones (the P55 closely resembled the older ICH10R). But it's always hard to do so much work without errors, and the 6th series is no exception.
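To see where figures like 250MB/s per lane and the SATA-300/SATA-600 names come from, here is a minimal sketch in Python. It assumes the standard line rates of these interfaces (2.5 and 5 GT/s for PCI Express 1.x and 2.0, 3 and 6 Gbit/s for SATA) and their 8b/10b encoding; these numbers come from the interface standards, not from Intel's announcement, and the function name is ours.

```python
def payload_mb_per_s(line_rate_gbit: float, data_bits: int = 8, line_bits: int = 10) -> float:
    """Usable bandwidth in MB/s of a serial link with 8b/10b-style encoding.

    Only data_bits out of every line_bits transmitted carry payload.
    """
    payload_gbit = line_rate_gbit * data_bits / line_bits
    return payload_gbit * 1000 / 8  # Gbit/s -> MB/s (decimal megabytes)


# Assumed raw line rates in Gbit/s (per lane for PCI Express).
LINE_RATES = {
    "PCI Express 1.x lane": 2.5,
    "PCI Express 2.0 lane": 5.0,
    "SATA-300": 3.0,
    "SATA-600": 6.0,
}

if __name__ == "__main__":
    for name, rate in LINE_RATES.items():
        print(f"{name}: {payload_mb_per_s(rate):.0f} MB/s")
```

Under these assumptions a first-generation PCI Express lane works out to the 250MB/s mentioned above, and the second generation doubles it.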

So what's the problem? The leakage current of one of the transistors turned out to be higher than planned. This happened because the dielectric layer was too thin for the chosen voltage, or the voltage was too high for that chip design. It's not clear how the error was made. Anyway, such things happen much more often than we hear about them. But in this case Intel is unlucky, because the problematic transistor is in the clock generator circuit responsible for the SATA-300 ports (of which there are 4). Under certain conditions this may result in controller synchronization errors, which, in turn, will lead to read and write errors. At best, this reduces drive performance, as data will be read or written several times until the transfer is confirmed. Under the least favorable conditions, data may be corrupted. This is not a certainty, but a possibility.
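The "best case" above, where transfers are simply repeated until they succeed, can be illustrated with a rough sketch. It assumes each transfer fails independently with the same probability, so the number of attempts follows a geometric distribution. The error probabilities below are purely hypothetical, since Intel has published no per-transfer error rates, and the function name is ours.

```python
def effective_throughput(nominal_mb_s: float, error_prob: float) -> float:
    """Average throughput when every failed transfer is retried until it succeeds.

    Assumes independent failures with probability error_prob per attempt,
    so the expected number of attempts per transfer is 1 / (1 - error_prob).
    """
    if not 0.0 <= error_prob < 1.0:
        raise ValueError("error_prob must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - error_prob)
    return nominal_mb_s / expected_attempts


if __name__ == "__main__":
    # Hypothetical error probabilities, for illustration only.
    for p in (0.0, 0.01, 0.1, 0.5):
        print(f"error probability {p}: {effective_throughput(300.0, p):.1f} MB/s")
```

The sketch covers only the slowdown. It says nothing about the worse case, where an error goes undetected and data is corrupted.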

This is not a logical error in die topology (like a faulty interconnection or something), but a potential problem that may show up over time as a result of wear. Serious errors are detected as soon as the first wafer is made, because chips are run through a number of logic tests. How does one find a less serious error? All manufacturers use more or less similar methods of accelerated aging. A batch of chips is exposed to high temperatures in a heat chamber, as well as to high voltages, to simulate prolonged wear. There are rather strict mathematical models which allow engineers to predict mean time between failures (MTBF) based on the failure statistics obtained in these wear tests. That's exactly what we're dealing with today: a prediction from Intel (we'll discuss the exact odds and time frames later). One has to understand that it's a statistical estimate, not a fact. There are simply no 3-year-old machines based on the new chipsets at the moment to speak of actual defects.
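To give a feel for how a heat chamber "compresses" years into weeks, here is a minimal sketch based on the Arrhenius model, a textbook way of relating temperature to the rate of thermally driven wear. We don't know which models Intel actually uses. The activation energy and temperatures below are illustrative assumptions, the function names are ours, and the voltage part of the stress is left out entirely.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(activation_energy_ev: float,
                        use_temp_c: float,
                        stress_temp_c: float) -> float:
    """Arrhenius acceleration factor between stress and normal-use temperatures."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


def equivalent_field_hours(stress_hours: float, factor: float) -> float:
    """Hours of normal use represented by stress_hours in the heat chamber."""
    return stress_hours * factor


if __name__ == "__main__":
    # Illustrative assumptions: 0.7 eV activation energy,
    # 55 C in normal use, 125 C in the heat chamber.
    af = acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
    hours = equivalent_field_hours(1000.0, af)
    print(f"acceleration factor: {af:.1f}")
    print(f"1000 stress hours correspond to about {hours / 8760:.1f} years of use")
```

The result depends heavily on the assumed activation energy, which is one more reason to treat such predictions as statistical estimates rather than facts.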

Since data stored on computers often costs much more than the computers themselves (unless it's a gaming rig), Intel made a tough decision not to wait for actual trouble. As Murphy's law states, "Anything that can go wrong, will go wrong," so they had to look for a solution.

Intel's solution

Firstly, Intel acknowledged and announced the error officially. This was a serious, painful, but necessary step. It blotted the company's copybook and gave some people a reason to gloat. But what could Intel do? The problem is real and can't be ignored, so users have been warned. By the way, not all companies do the same (we won't name names). Some act like everything's fine and acknowledge problems only through force of circumstances (and angry buyers), when it's no longer possible to put a brave face on a sorry business.

But, of course, just acknowledging the problem is not enough. Intel urgently developed a fixed chipset revision that will begin shipping to motherboard makers this month. All chipsets in stock have already been recalled, and those already used in motherboards will most likely be replaced in advance, in March. For example, imagine a company has purchased a million chipsets from Intel to date. As of January 31, 500K were in stock, and the other 500K had been used in motherboards shipped to retailers. Under its replacement program, Intel will exchange the entire million chips before the distributed 500K are withdrawn. This will let the motherboard maker resume shipments (based on the fixed chipsets) and accumulate enough products to exchange for the previously shipped defective ones.

As for motherboards made by Intel itself, the company will completely cover the replacement expenses. In other words, a defective chipset revision should be enough to trigger a warranty-based refund, whether you have problems with the motherboard or not. Unfortunately, it's not clear what other motherboard makers will do. Will they recall all sold products or replace them only after users report defects? It's also not clear what retail stores will do. But that obviously doesn't depend on Intel. The company has done its part (or rather has started doing it), and the replacement program doesn't have a time limit.

Things are expected to be set straight in April, although our contact at the company carefully hinted that it may happen sooner. According to Intel's estimate, the losses will be up to $700 million, although some analysts predict higher numbers. They believe that direct damages (chip redesign, restarting manufacturing, chipset replacement) and indirect damages (stained reputation, delays in the spread of the new platform) may add up to at least a billion dollars, a painful loss even for Intel. However, last year the company's monthly net income was about that same billion (and growing). So Intel has merely revised its income expectations downward, and not too drastically.
