Intel Core 2 Extreme QX9650 Processor

In this review we'll examine the latest significant addition to Intel's processor lineup: the Intel Core 2 Extreme QX9650, based on the Yorkfield core. Yorkfield belongs to a new family of cores for mobile, desktop and server processors, collectively codenamed Penryn. Frankly speaking, most users have been looking forward above all to the updated dual-core processors on the Wolfdale core, since quad-core products interest a considerably smaller share of potential buyers. But this time Intel did it "AMD style", launching the new series with a product for enthusiasts and servers. So after the Core 2 Extreme QX9650 we should see a trio of Yorkfield-based Core 2 Quad CPUs:

Processor	Core clock, GHz	Bus clock (QP, MHz)	L2 cache, MB	TDP, W
Q9550	2.83	1333	12 (6x2)	95
Q9450	2.66	1333	12 (6x2)	95
Q9300	2.50	1333	6 (3x2)	95

And a quadruplet of Wolfdale-based Core 2 Duo processors:

Processor	Core clock, GHz	Bus clock (QP, MHz)	L2 cache, MB	TDP, W
E8500	3.16	1333	6	65
E8400	3.0	1333	6	65
E8300	2.83	1333	6	65
E8200	2.66	1333	6	65

But all this is scheduled for 2008. For now, we'll examine the only processor of the new series that is already available... available to our test lab, at least. :)

Intel Core architecture: a mild upgrade

The transition to a new process technology always stimulates developers' imagination: it provides more room for transistors and naturally reduces power consumption and heat output, with no original ideas required. So it's no wonder that they've decided to modify the good old Conroe as well, in a number of rather ordinary ways at that.

Newer cache

The maximum size of the shared L2 cache has grown to 6 MB (from 4 MB), and its associativity has been increased accordingly, from 16 to 24 ways (6/4 = 24/16). The cache has also become "smarter" thanks to the new Enhanced Cache Line Split Load technology, which speeds up reads of data that straddle two different cache lines. In theory, this can speed up applications that actively scan large amounts of memory, codecs and archivers included.
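To see why associativity grew in step with capacity, here is a minimal sketch of the standard set-count arithmetic for a set-associative cache, plus a check for whether a load straddles two cache lines (the case Enhanced Cache Line Split Load is meant to speed up). We assume 64-byte cache lines, which is what Core 2 processors use; the function names are our own.

```python
CACHE_LINE = 64  # bytes per cache line (assumed; Core 2 uses 64-byte lines)

def cache_sets(size_bytes, ways, line=CACHE_LINE):
    """Number of sets in a set-associative cache: size / (line size * ways)."""
    return size_bytes // (line * ways)

def is_split_load(address, size, line=CACHE_LINE):
    """True if an access of `size` bytes starting at `address` touches two
    cache lines, i.e. its first and last bytes fall into different lines."""
    return address // line != (address + size - 1) // line

MB = 1024 * 1024
# Conroe: 4 MB, 16-way; Penryn: 6 MB, 24-way.
print(cache_sets(4 * MB, 16))     # 4096
print(cache_sets(6 * MB, 24))     # 4096
print(is_split_load(0x1038, 16))  # True: bytes 0x1038..0x1047 cross 0x1040
print(is_split_load(0x1040, 16))  # False: line-aligned 16-byte load
```

Both configurations come out at 4096 sets, which is consistent with the 6/4 = 24/16 ratio: the extra capacity goes into additional ways rather than additional sets.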

Newer instruction set

Intel has long been the trendsetter for x86 instruction set extensions that don't require fundamental redesigns: MMX, SSE, SSE2, SSE3... There was a time when AMD tried to compete by creating the 3DNow! set, but that was also its last such effort, and now AMD prefers simply to license instruction sets from Intel.

The new extension is named SSE4.1, which suggests a certain incompleteness and implies at least an announcement of SSE4.2. SSE4.1 comprises 47 new instructions aimed at speeding up streaming data processing, video encoding and scientific calculations. We won't dwell on it further in this review, since SSE4.1 deserves a dedicated article of its own. We'll just note that among popular software, at least DivX 6.7 already supports SSE4.1.
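Software such as DivX has to check at run time whether the CPU supports SSE4.1 before choosing a code path. On x86 this is reported by the CPUID instruction, leaf 1: SSE4.1 is bit 19 of ECX and SSE4.2 is bit 20. Python cannot execute CPUID directly, so the sketch below only decodes register values obtained elsewhere; the helper names and the sample register values are hypothetical, not real dumps from our test processors.

```python
# CPUID leaf 1 feature flags: bit position -> extension name.
EDX_FLAGS = {23: "MMX", 25: "SSE", 26: "SSE2"}
ECX_FLAGS = {0: "SSE3", 9: "SSSE3", 19: "SSE4.1", 20: "SSE4.2"}

def simd_extensions(edx, ecx):
    """Return the SIMD extensions advertised by CPUID leaf 1 EDX/ECX values."""
    found = [name for bit, name in sorted(EDX_FLAGS.items()) if (edx >> bit) & 1]
    found += [name for bit, name in sorted(ECX_FLAGS.items()) if (ecx >> bit) & 1]
    return found

def bits(*positions):
    """Build a register value with the given bits set (for illustration only)."""
    value = 0
    for p in positions:
        value |= 1 << p
    return value

# Hypothetical register values: a Conroe-class CPU (up to SSSE3)
# versus a Penryn-class CPU that also reports SSE4.1.
edx = bits(23, 25, 26)
print(simd_extensions(edx, bits(0, 9)))      # ['MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3']
print(simd_extensions(edx, bits(0, 9, 19)))  # ['MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE4.1']
```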

Newer functional units

Most of the changes concern the division and shuffle units, which have been replaced by the new Fast Radix-16 Divider and Super Shuffle Engine. Conroe's Radix-4 divider processed 2 bits per iteration, while the new Fast Radix-16 divider processes 4 bits per iteration. The new Super Shuffle Engine, in turn, can perform a full-width shuffle of a 128-bit register in a single clock. According to Intel, this should considerably speed up both the new SSE4.1 and the "older" SSE3 instructions. Besides, we are promised the routine virtualization improvements.
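The difference between the dividers comes down to how many quotient bits each iteration retires: a radix-2^k divider produces k bits per step, so a radix-16 unit needs half as many steps as a radix-4 one. The sketch below is a simple digit-recurrence divider with a configurable radix and exact digit selection; real hardware uses faster SRT-style selection, which we don't model. For the shuffle side we add an emulation of PSHUFB, the SSSE3 byte shuffle, as an example of the kind of full-width 128-bit shuffle the Super Shuffle Engine is said to execute in one clock. All function names are ours.

```python
def radix_divide(dividend, divisor, width, bits_per_step):
    """Unsigned digit-recurrence division producing `bits_per_step` quotient
    bits per iteration (radix 2**bits_per_step).
    Returns (quotient, remainder, iterations)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step or dividend >> width:
        raise ValueError("width must be a multiple of bits_per_step and hold the dividend")
    radix = 1 << bits_per_step
    quotient = remainder = steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & (radix - 1))
        # Digit selection: largest digit d in [0, radix) with d * divisor <= remainder.
        digit = radix - 1
        while digit * divisor > remainder:
            digit -= 1
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps

def pshufb(src, control):
    """Emulate PSHUFB on 16-byte values: output byte i is src[control[i] & 15],
    or 0 if the top bit of control[i] is set."""
    return bytes(0 if c & 0x80 else src[c & 0x0F] for c in control)

print(radix_divide(1000, 7, 16, 2))  # (142, 6, 8): radix-4, 8 iterations
print(radix_divide(1000, 7, 16, 4))  # (142, 6, 4): radix-16, 4 iterations
print(pshufb(bytes(range(16)), bytes(range(15, -1, -1))) == bytes(range(15, -1, -1)))  # True
```

For a 64-bit quotient the same arithmetic gives 32 iterations at radix 4 versus 16 at radix 16.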

In general, this is suspiciously similar to the Prescott, don't you think? :) But we still hope the resemblance is purely formal.

Hardware and software

Testbed configurations

Hardware:

DDR2 memory: Corsair CM2X1024-6400C4, 2 x 1 GB, DDR2-800, 4-4-4-12.
DDR memory: Corsair CMX1024-3500LLPRO, 2 x 1 GB, DDR-400, 2-3-2-6.
LGA775 board: ASUS P5B Deluxe, Intel P965.
Socket AM2 board: ASUS M2N32-SLI Deluxe, NVIDIA nForce 590 SLI.
Socket 939 board: ECS RD480-A939, ATI CrossFire Xpress 1600.
HDD: Samsung HD401LJ (SATA-II).
Socket AM2 cooler: boxed.
Core 2 Duo / Celeron cooler: boxed.
Core 2 Quad / Extreme cooler: Zalman CNPS9700 NT.
PSU: Cooler Master RS-A00-EMBA.
Graphics card: Reference NVIDIA GeForce 8800 GTX, 768MB GDDR3, PCI-E x16.

Processors tested:

Processor	Core 2 Extreme QX6700	Core 2 Extreme X6800	Core 2 Extreme QX6850	Core 2 Extreme QX9650	Athlon 64 X2 6000+
Process technology, nm	65	65	65	45	90
Core clock, GHz	2.66	2.93	3.0	3.0	3.0
# of cores	4	2	4	4	2
L2 cache*, MB	8	4	8	12	2x1
Bus clock**, MHz	1066 (QP)	1066 (QP)	1333 (QP)	1333 (QP)	2x800 (DDR2)
Multiplier	10	11	9	9	15
Socket	LGA775	LGA775	LGA775	LGA775	AM2
TDP***, W	130	130	130	130	125
AMD64/EM64T	+	+	+	+	+
Virtualization Technology	+	+	+	+	+

* - "2 x ..." means per core;
** - for AMD processors this is memory controller bus clock rate;
*** - measured differently for Intel and AMD processors; impossible to compare directly.
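The Core clock row follows from the bus and multiplier rows. For the Intel CPUs, the quad-pumped (QP) bus figure counts four transfers per base clock, and the marketed "1066" and "1333" are rounded from 1066 2/3 and 1333 1/3 MT/s. The Athlon 64 X2's multiplier applies to its 200 MHz reference clock, not to the memory bus figure shown in the table. A quick check of the arithmetic, with a helper name of our own:

```python
from fractions import Fraction

def core_clock_ghz(bus_rate_mts, multiplier, transfers_per_clock=4):
    """Core clock = base bus clock * multiplier.
    For a quad-pumped FSB the base clock is the transfer rate divided by 4."""
    base_mhz = Fraction(bus_rate_mts) / transfers_per_clock
    return float(base_mhz * multiplier / 1000)

FSB_1066 = Fraction(3200, 3)  # 1066 2/3 MT/s -> 266 2/3 MHz base clock
FSB_1333 = Fraction(4000, 3)  # 1333 1/3 MT/s -> 333 1/3 MHz base clock

cpus = {
    "Core 2 Extreme QX6700": core_clock_ghz(FSB_1066, 10),
    "Core 2 Extreme X6800": core_clock_ghz(FSB_1066, 11),
    "Core 2 Extreme QX6850": core_clock_ghz(FSB_1333, 9),
    "Core 2 Extreme QX9650": core_clock_ghz(FSB_1333, 9),
    # Athlon 64: 200 MHz reference clock, not multi-pumped.
    "Athlon 64 X2 6000+": core_clock_ghz(200, 15, transfers_per_clock=1),
}
for name, ghz in cpus.items():
    print(f"{name}: {ghz:.3f} GHz")
```

This prints 2.667, 2.933, 3.000, 3.000 and 3.000 GHz; the first two are the clocks marketed as 2.66 and 2.93 GHz.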

Software

Windows XP Professional x64 edition SP1
3ds max 9 x64 edition
Maya 8.5 x64 edition
Lightwave 3D 9 x64 edition
MATLAB R2006a (7.2.0.32) x64 edition
Pro/ENGINEER Wildfire 2.0
SolidWorks 2005
Photoshop CS2 (9.0)
Visual Studio 2005 Professional
Apache HTTP Server 2.2.4
CPU RightMark 2005 Lite (1.3) x64 edition
WinRAR 3.62
7-Zip 4.42 x64 edition
FineReader 8.0 Professional
LAME 3.97
Monkey Audio 4.01
OGG Encoder 2.83
Windows Media Encoder 9 x64 edition
Canopus ProCoder 2.01.30
DivX 6.4
Windows Media Video VCM 9
x264 v.604
XviD 1.1.2
F.E.A.R. 1.08
Half-Life 2 1.0
Quake 4 1.3
Call of Duty 2 1.2
Serious Sam 2 2.07
Supreme Commander 1.0.3220

Test results

Essential foreword to charts

Our test method has two peculiarities of data representation: (1) all results are reduced to a single type, an integer relative score (the performance of a given processor relative to that of the Intel Core 2 Duo E4300, whose performance is taken as 100 points), and (2) detailed results are published in a separate Microsoft Excel table, while the article contains only summary charts for each benchmark class. We will nevertheless draw your attention to the detailed results where necessary.
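As a rough illustration of how results of different types can be reduced to one scale, here is a sketch of a relative score against the Core 2 Duo E4300 baseline (100 points). The exact formula isn't spelled out above, so the details are our assumption: throughput-style results (higher is better) are used as a direct ratio, time-based results (lower is better) are inverted, and the score is rounded to an integer. Percentage advantages between processors, such as the QX9650's 6.5% lead in 3D modelling, are then simple ratios of scores.

```python
def relative_score(result, baseline, lower_is_better=False):
    """Score relative to a baseline CPU fixed at 100 points.
    Assumed formula: ratio of results, inverted for time-based metrics."""
    ratio = baseline / result if lower_is_better else result / baseline
    return round(100 * ratio)

def advantage_percent(score_a, score_b):
    """How much faster A is than B, in percent."""
    return (score_a / score_b - 1) * 100

# Hypothetical numbers, not taken from our results table:
print(relative_score(120.0, 60.0, lower_is_better=True))  # 50: takes twice as long
print(relative_score(45.0, 30.0))                         # 150: 1.5x the frame rate
print(round(advantage_percent(213, 200), 1))              # 6.5
```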

3D modelling suites

The QX9650 made its intentions clear from the very first test, outperforming its closest competitor by 6.5%. The margin itself is modest, but let's not forget that the QX9650 and QX6850 have identical clock rates, so the difference comes entirely from other factors.

CAD/CAE suites

The suites reacted to the new processor in very different ways. MATLAB actually preferred the older QX6850, which slightly outperformed the QX9650. SolidWorks remained nearly indifferent to the new CPU, showing a mere 3% performance gain. Pro/ENGINEER, however, welcomed the QX9650 wholeheartedly: here it outperformed the QX6850 by nearly 6%.

Digital photo processing

Even the detailed results show no standout subtests. The QX9650 was always a bit faster than the QX6850, which made it slightly faster overall. So it's hard to say what exactly made the new CPU win: the larger cache or the improved calculation units.

Compiling

Since compilers like large caches, the result is predictable. On the one hand, that's good. On the other, the QX9650's other advantages remain hidden behind its obvious advantage in L2 cache size.

Web server

The issue we've already described in a separate article negates all the architectural advantages of Intel's modern quad-core processors, and especially their L2 cache advantage.

Synthetics

A simply stunning result. The detailed results show that the key factor was the QX9650's nearly twofold lead in Solver, a non-parallelized physical model calculation. We'll probably ask the CPU RightMark team to examine this in detail, but the main conclusion is so obvious that it is most likely true: what we see here is the result of improved calculation units, since CPU RM is rather indifferent to cache size (as a large number of previous tests have proven).

Archiving

A rather modest result, considering the 1.5-fold increase in L2 cache. We suppose the testbed memory has become a bottleneck here.

OCR

Here, judging by the top three results, we've hit a bottleneck other than the processor. The memory subsystem, perhaps? If so, it should show up in future tests with DDR3-1333 modules.
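To put the memory-bottleneck hypothesis into numbers, here are the theoretical peak bandwidths of the buses involved: transfer rate × bus width × number of channels, with a 64-bit (8-byte) FSB and 64-bit memory channels. We assume the two DDR2-800 modules run in dual-channel mode. On LGA775, all CPU memory traffic also passes through the FSB, so both figures matter. These are theoretical peaks only; real throughput and latency, which matter at least as much, depend on the chipset.

```python
def peak_gb_per_s(mega_transfers, bus_bytes=8, channels=1):
    """Theoretical peak bandwidth in GB/s (decimal) from MT/s, bus width and channel count."""
    return mega_transfers * bus_bytes * channels / 1000

buses = {
    "FSB 1333 (QP)": peak_gb_per_s(4000 / 3),
    "DDR2-800, dual channel": peak_gb_per_s(800, channels=2),
    "DDR3-1333, dual channel": peak_gb_per_s(4000 / 3, channels=2),
}
for name, gbps in buses.items():
    print(f"{name}: {gbps:.1f} GB/s")  # 10.7, 12.8 and 21.3 GB/s respectively
```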

Audio encoding

An old group of tests that has nearly lost its importance because the results are so predictable. No comments.

Video encoding

The QX9650 showed no particular advantages here, although after the CPU RightMark tests we had pinned some hopes on its improved calculation units. Not content with our standard tests, we decided to explore the delights of DivX 6.7 with SSE4 support. This "custom" test was conducted with the following settings:

As you can see, DivX 6.7 introduced a new option, Experimental SSE4 full search. Frankly speaking, the word "experimental" tells experienced people a lot. Translated from geekspeak into plain English, it usually means: "Impressed with the new features, we've programmed something, but we don't take it seriously."

The results we've got look rather strange. See for yourself:

Enabling this option actually slows encoding down. However, while the performance drop is significant with SSE2, it's nearly imperceptible when SSE4 is used.
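We don't know how DivX implements its "full search", but in video encoding the term usually refers to exhaustive block-matching motion estimation: every candidate offset within a search window is tried, and the one with the lowest sum of absolute differences (SAD) wins. With a search radius R that means (2R + 1)^2 SAD evaluations per block, which explains why such an option is expensive. SSE4.1 includes MPSADBW, an instruction that computes eight 4-byte SADs at once, so a speedup on the SSE4 path is plausible; whether DivX actually uses it is our assumption. A scalar sketch of the search itself, with names of our own:

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )

def full_search(cur, ref, bx, by, n, radius):
    """Exhaustive block matching over a (2 * radius + 1)^2 window.
    Returns ((best_sad, dx, dy), number_of_candidates_evaluated)."""
    height, width = len(ref), len(ref[0])
    best, evaluated = None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates that would fall outside the reference frame.
            if not (0 <= by + dy <= height - n and 0 <= bx + dx <= width - n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evaluated += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, evaluated

# Toy 16x16 frames: every pixel of `ref` is unique, and `cur` is built from a
# shifted copy of `ref`, so the block at (6, 6) matches `ref` at offset (-1, 2).
ref = [[y * 16 + x for x in range(16)] for y in range(16)]
cur = [[ref[(y + 2) % 16][(x - 1) % 16] for x in range(16)] for y in range(16)]
print(full_search(cur, ref, 6, 6, 4, 3))  # ((0, -1, 2), 49)
```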

Therefore, based on our results and Occam's razor, we can suppose that the DivX developers had a kind of "dream feature" that ran too slowly to be worth implementing before SSE4.1. And then, suddenly, an opportunity arose.

We just hope that the "dream feature" actually improves picture quality or the compression ratio, since otherwise it's not clear why they implemented it in the first place. :)

Games

Despite the QX9650's impressive victory, we'd still like to draw your attention to the detailed results. Clearly, the new processor showed its main advantage in the Low and sometimes Medium Quality modes. This promises very nice prospects (meaning the novelty has more than enough power for games as well). But it also means that the QX9650 is most likely overkill for today's gaming rig, since at the highest resolutions and quality settings performance is bottlenecked by the graphics card anyway, which simply negates the difference between the QX9650 and the QX6850.
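The argument about graphics-bound settings can be illustrated with a deliberately simple model, which is our assumption rather than something we measured: if the CPU and GPU work on consecutive frames in parallel, the frame rate is set by whichever takes longer per frame. At low quality the CPU time dominates and a faster processor shows; at maximum settings the GPU time dominates and the CPU difference disappears. The per-frame times below are invented for illustration.

```python
def frame_rate(cpu_ms, gpu_ms):
    """Toy pipelined model: throughput is limited by the slower stage."""
    return 1000 / max(cpu_ms, gpu_ms)

# Hypothetical per-frame times in milliseconds.
fast_cpu, slow_cpu = 5.0, 6.0
low_quality_gpu, high_quality_gpu = 4.0, 20.0

print(frame_rate(fast_cpu, low_quality_gpu), frame_rate(slow_cpu, low_quality_gpu))    # 200.0 166.66666666666666
print(frame_rate(fast_cpu, high_quality_gpu), frame_rate(slow_cpu, high_quality_gpu))  # 50.0 50.0
```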

Total score

The charts are rather clear. We'll just mention that the QX9650 showed a somewhat greater performance gain in "professional" applications than in "home" software. In general, this is a purely positive trend: let the performance monsters excel where they're really needed.

Supposed power consumption

Despite the claimed 130 W TDP (the same as the QX6850's), the QX9650's estimated actual power consumption at 100% load is just 76 W. That is even lower than that of the lower-clocked 65 nm QX6700. And even in absolute terms, 76 W is really low for the top CPU in a series: it's been a long time since we saw a high-end processor stay below the symbolic 100-watt mark.

Conclusions

Fortunately, they haven't made a Prescott out of Penryn. At the same core clock, the new processor was 8% faster than the old one on average. Moreover, some details indicate that this was achieved not only through the extensive approach (e.g., a larger L2 cache) but also through genuinely faster calculation units. This doesn't make it a new architecture, of course, but the modification is still significant. Provided the lower-end products based on this core reach the market without problems, Intel's main rival has reason to worry. While we're still waiting for a real AMD K10 on the desktop, Intel's architecture has once again conquered the next performance ridge. If this continues, someone may find their new processors slower than their competitor's old ones...

Testbed memory modules provided by
Corsair Memory Russia

Stanislav Garmayuk (nawhi@ixbt.com)
November 26, 2007
