Intel Core i7 - 965/920 - architecture and performance
Category : Reviews
Published by Jean-Luc Hadey on 03.11.08
With the Nehalem architecture Intel presents the next step in the evolution of microprocessors. Officially, this 731-million-transistor design is now called Core i7.
One big step was to integrate the memory controller into the CPU. Nehalem also brings back Hyper-Threading, which we still know from the Pentium 4, and comes with heavily reworked power management which can now even shut down single cores to save energy. These are only two of the new features Nehalem offers.

First of all we thank Intel for providing us with the test samples.



On the following pages we will discuss the features of the Nehalem architecture as well as its performance.


Discuss this article in the forum.
[pagebreak]
Specifications

Specification | Intel Core i7-9xx | Intel Core 2 Duo | Intel Core 2 Quad
Model | i7-965, i7-940, i7-920 | E8600, E8500, E8400, E8300, E8200, E8190, E7300, E7200 | Q9650, Q9550, Q9450, Q9400, Q9300, Q8200
Codename | Nehalem | Wolfdale | Yorkfield
Socket | LGA 1366 | LGA 775 | LGA 775
Process | 45 nm | 45 nm | 45 nm
Clock | 2.66 - 3.20 GHz | 2.53 - 3.33 GHz | 2.33 - 3.00 GHz
Cores | 4 | 2 | 4
Hyper-Threading | Yes | No | No
Turbo Mode | Yes | No | No
Bus speed | 133 MHz | 266 & 333 MHz, quad-pumped | 333 MHz, quad-pumped
Memory | DDR3 | DDR2 / DDR3 | DDR2 / DDR3
Memory controller | Internal, triple-channel | External, dual-channel | External, dual-channel
QPI (QuickPath Interconnect) | Yes | No | No
Transistors | 731 million | 167 or 291 million | 291 or 410 million
Die size | 263 mm² | 107 mm² | 2 x 107 mm²
L1 instruction cache | 32 KByte | 32 KByte | 32 KByte
L1 data cache | 32 KByte | 32 KByte | 32 KByte
L2 cache | 256 KByte per core | 2 - 6 MB | 4 - 12 MB
L3 cache | 8 MB shared | n/a | n/a
TDP | 130 Watt | 65 Watt | 95 Watt
C1E technology | Yes | Yes | Yes
Enhanced Intel SpeedStep (EIST) | Yes | Yes | Yes
Virtualisation | Vanderpool | Vanderpool | Vanderpool
Instruction sets | MMX, SSE, SSE2, SSE3, SSE4.2, EM64T | MMX, SSE, SSE2, SSE3, SSE4.1, EM64T | MMX, SSE, SSE2, SSE3, SSE4.1, EM64T


[pagebreak]
Microarchitecture design I

Design



The front-side-bus architecture on which Intel's Core 2 and older CPUs are based may be sufficient for notebooks, but high-performance desktops, workstations and servers demand higher bandwidth. Notebooks typically have one or two cores, are cost- and energy-sensitive, and need low latency for a single task. In principle, desktop systems are similar to notebooks, but they must be able to handle extreme bandwidth when discrete graphics solutions are used. Servers can contain a nearly unlimited number of CPUs and have to handle many different tasks at once; they therefore require both low latency and massive bandwidth.

With Nehalem, Intel designed a flexible, modular and highly scalable CPU which meets all these demands.


[pagebreak]
Microarchitecture design II



Nehalem is Intel's first monolithic quad-core CPU, and it comes with an integrated last-level cache. A central queue connects the cores with each other and with the uncore region, where the L3 cache, the integrated memory controller and the QPI links are located.

Because the memory controller is integrated into the CPU, a front-side bus is no longer needed. Its place is taken by QPI (QuickPath Interconnect) links. In short, QPI is a packet-based point-to-point connection which provides high bandwidth and low latency. In the best case up to 6.4 GT/s are possible. Every link is implemented as a 20-bit-wide interface. QPI packets are 80 bits long and are transmitted in 4 to 16 cycles. 16 of the 80 bits are reserved for flow control and CRC; the other 64 bits carry data. Every link can transfer 12.8 GB/s per direction, and because QPI links are bidirectional, 25.6 GB/s can be transferred in total. Nehalem will scale with the number of QPI links, which is determined by the targeted market segment.
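The QPI figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the numbers from the text (6.4 GT/s, 20-bit links, 64 payload bits per 80-bit packet); the function and constant names are our own and purely illustrative.

```python
# All figures come from the text above; the names are illustrative.
TRANSFERS_PER_SECOND = 6.4e9  # 6.4 GT/s in the best case
LINK_WIDTH_BITS = 20          # every link is 20 bits wide
PACKET_BITS = 80              # total packet length
PAYLOAD_BITS = 64             # 80 bits minus 16 bits for flow control and CRC


def qpi_payload_gb_per_s(transfers_per_second=TRANSFERS_PER_SECOND,
                         link_width_bits=LINK_WIDTH_BITS,
                         payload_bits=PAYLOAD_BITS,
                         packet_bits=PACKET_BITS):
    """Usable bandwidth of one QPI link in one direction, in GB/s."""
    raw_bits_per_second = transfers_per_second * link_width_bits
    payload_bits_per_second = raw_bits_per_second * payload_bits / packet_bits
    return payload_bits_per_second / 8 / 1e9


per_direction = qpi_payload_gb_per_s()
print(f"per direction:   {per_direction:.1f} GB/s")      # 12.8 GB/s
print(f"both directions: {2 * per_direction:.1f} GB/s")  # 25.6 GB/s
```

The 80 percent payload share (64 of 80 bits) is what turns the raw 16 GB/s of a 20-bit link at 6.4 GT/s into the 12.8 GB/s quoted per direction.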


[pagebreak]
Memory controller, SMT and SSE4.2

Memory controller

Nehalem's integrated memory controller supports triple-channel DDR3 running at 1.33 GT/s. The maximum bandwidth is therefore 32 GB/s. Every channel controller can act independently, which has a positive impact on overall performance. To profit from the four times higher bandwidth, every core now supports up to ten outstanding data cache misses and 16 outstanding misses in total. A Core 2, by comparison, supports eight data cache misses and 14 total misses in flight.

The integrated memory controller also improves memory latency substantially. For example, Nehalem provides a latency of about 60 ns compared to 100 ns on Harpertown.
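The 32 GB/s peak figure follows directly from the channel count and the transfer rate. The sketch below assumes that every DDR3 channel is 64 bits (8 bytes) wide, which the article does not state explicitly, and treats "1.33 GT/s" as the nominal 1333 MT/s. As a second data point it computes the theoretical peak of the DDR3-1067 kit used in our test system.

```python
BYTES_PER_TRANSFER = 8  # assumption: each DDR3 channel is 64 bits wide


def peak_bandwidth_gb_per_s(channels, mega_transfers_per_second):
    """Theoretical peak memory bandwidth in GB/s."""
    bytes_per_second = (channels * BYTES_PER_TRANSFER
                        * mega_transfers_per_second * 1e6)
    return bytes_per_second / 1e9


# Triple-channel DDR3 at the maximum rate quoted in the text.
print(f"DDR3-1333, 3 channels: {peak_bandwidth_gb_per_s(3, 1333):.1f} GB/s")
# Triple-channel DDR3-1067, as used in the test system.
print(f"DDR3-1067, 3 channels: {peak_bandwidth_gb_per_s(3, 1067):.1f} GB/s")
```

These are theoretical upper limits; measured bandwidth in a benchmark will be lower.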


SSE4.2 and SMT

The instruction set has been extended as well. SSE4.2 adds string comparison instructions, a CRC instruction and a population count (popcount). Demonstrations under optimal conditions showed a six- to eighteen-fold performance increase, but in everyday applications the gain will be much smaller.
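To make concrete what two of these instructions compute, here is a plain Python reference for a population count and for the CRC-32C checksum (Castagnoli polynomial) that the SSE4.2 CRC32 instruction accumulates. This is a slow software sketch for illustration only, not how the hardware works; the function names are our own, and the initial value and final inversion are the usual software convention around the accumulate step.

```python
def popcount(value: int) -> int:
    """Count the set bits of a non-negative integer."""
    count = 0
    while value:
        value &= value - 1  # clears the lowest set bit
        count += 1
    return count


# Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes) -> int:
    """Bit-by-bit CRC-32C with the conventional init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


print(popcount(0b1011))               # 3
print(hex(crc32c(b"123456789")))      # 0xe3069283, the standard check value
```

The inner loop runs eight iterations per byte in software, whereas the hardware instruction processes up to eight bytes at once; this gap is where the large speed-ups in the demonstrations come from.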

Much more important is the return of simultaneous multithreading (SMT), which Intel first implemented on the 130 nm Pentium 4. SMT demands a lot of bandwidth to reach its potential because it generates substantially more outstanding misses. Validation is also very complex and requires a design which is conceived for SMT from the ground up. On the other hand, SMT is a very elegant way to get more performance out of one core because it is very energy efficient.

[pagebreak]
TLBs and virtualisation

TLB

Nehalem also features changes in the TLB hierarchy which go hand in hand with the changes in the cache hierarchy.




Nehalem now has a true two-level TLB hierarchy which can be allocated dynamically between threads. The first-level TLB serves all memory accesses and contains 64 entries for 4 KB pages as well as 32 entries for 2 MB / 4 MB pages; it remains four-way associative. In addition, Nehalem contains a unified second-level TLB with 512 entries for small pages, which is again four-way associative.

In total every core has 576 entries for small pages, which adds up to 2304 entries for the whole chip. This number of TLB entries makes the translation of 9216 KB possible, which is more than enough to cover the 8 MB L3 cache Nehalem comes with.
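The TLB reach quoted above is simple arithmetic on the entry counts. A short sketch, using only the numbers from the text (the constant and function names are our own):

```python
L1_SMALL_PAGE_ENTRIES = 64   # first-level TLB entries for 4 KB pages
L2_SMALL_PAGE_ENTRIES = 512  # unified second-level TLB entries
CORES = 4
PAGE_SIZE_KB = 4
L3_CACHE_KB = 8 * 1024       # 8 MB shared L3 cache


def small_page_entries_per_core():
    return L1_SMALL_PAGE_ENTRIES + L2_SMALL_PAGE_ENTRIES


def tlb_reach_kb(cores=CORES):
    """Memory covered by all small-page TLB entries, in KB."""
    return small_page_entries_per_core() * cores * PAGE_SIZE_KB


print(small_page_entries_per_core())          # 576 entries per core
print(small_page_entries_per_core() * CORES)  # 2304 entries per chip
print(tlb_reach_kb())                         # 9216 KB
print(tlb_reach_kb() >= L3_CACHE_KB)          # True: covers the 8 MB L3
```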


Virtualisation

Nehalem's TLB also supports VPIDs (Virtual Processor IDs). Every TLB entry caches the translation of a virtual address to a physical address, and this translation is specific to a given process and virtual machine. Earlier Intel CPUs needed to flush the TLB on every switch between a virtualized guest and the host. With VPIDs the entries are tagged with the virtual machine they belong to, so this flush is no longer necessary. Intel estimates that the latency of a VM round trip is about 40 percent of that on Merom (65 nm Core 2).
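The effect of tagging TLB entries can be illustrated with a toy model. The sketch below is a strongly simplified simulation, not a description of the real hardware: the class `ToyTLB`, the page tables and the miss counts are all hypothetical and only show why a tagged TLB avoids refilling its entries after every guest/host switch.

```python
class ToyTLB:
    """Hypothetical TLB model; entries are optionally tagged with a VPID."""

    def __init__(self, use_vpid: bool):
        self.use_vpid = use_vpid
        self.entries = {}       # (tag, virtual_page) -> physical_page
        self.current_vpid = 0
        self.misses = 0

    def switch_to(self, vpid: int) -> None:
        if not self.use_vpid:
            self.entries.clear()  # untagged entries must be flushed
        self.current_vpid = vpid

    def translate(self, virtual_page: int, page_table: dict) -> int:
        tag = self.current_vpid if self.use_vpid else 0
        key = (tag, virtual_page)
        if key not in self.entries:
            self.misses += 1      # simulate a page walk on a miss
            self.entries[key] = page_table[virtual_page]
        return self.entries[key]


def count_misses(use_vpid: bool, round_trips: int = 3) -> int:
    guest_table = {0: 100, 1: 101}  # made-up mappings
    host_table = {0: 200, 1: 201}
    tlb = ToyTLB(use_vpid)
    for _ in range(round_trips):
        tlb.switch_to(1)            # enter the guest
        for page in guest_table:
            tlb.translate(page, guest_table)
        tlb.switch_to(0)            # return to the host
        for page in host_table:
            tlb.translate(page, host_table)
    return tlb.misses


print(count_misses(use_vpid=False))  # 12: every switch flushes the TLB
print(count_misses(use_vpid=True))   # 4: entries survive the switches
```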

A further improvement for virtualisation is the introduction of extended page tables (EPT). They eliminate many VM transitions altogether, rather than only reducing their latency as VPIDs do. On earlier Intel designs the hypervisor had to handle guest page faults itself. Now the hardware can resolve guest and host page tables directly, which saves many unnecessary VM exits.


[pagebreak]
Power Management

Overview

Unlike its predecessors, the Core i7 comes with a power control unit (PCU) which is able to shut down individual cores. The advantage is obvious: leakage in the idle state is reduced, and with it energy consumption. Furthermore the PCU (which, by the way, contains as many transistors as a whole 486 CPU) can also deactivate the uncore regions, which reduces power consumption even more.


Minimizing the power consumption in idle-state

In principle, three factors are responsible for the power dissipation of a CPU: transistor leakage, which is high in manufacturing processes tuned for high frequencies; the global clock distribution, which has to be high-performance in a high-frequency design; and the local clock distribution together with the logic itself.

Power consumption can be minimized when the operating system tells the CPU that no task is ready for execution. The CPU then enters an idle mode, a so-called C-state, where it waits for further instructions.

Before Nehalem the following C-states were available:
  • C0: CPU is active
  • C1, C2: stop the pipeline and most clocks
  • C3: stops the remaining clocks
  • C4, C5, C6: lower the core voltage/frequency and reduce leakage


Integrated Power Gate

To lower the voltage of each core individually, a so-called power gate is needed. In other words, this is a switch between the voltage regulator (VR) output and the power supply of the core. Its advantages are the following:

  • very low on-resistance
  • very high off-resistance
  • current can be provided much faster
  • makes a per-core C6 state (current regulation) possible
  • power dissipation near zero


The integrated power gate makes it possible to reduce the power consumption of a core in the idle state to nearly zero watts, and this can happen for every core independently.

The memory subsystem draws less power as well, because the memory clocks can be reduced when the load is low. Even the QPI links draw less power when they are less active, and the PCI Express links behave similarly.

In summary, the power consumption of the whole platform is going to be lower.


[pagebreak]
Turbo mode

Let us start with an example: when the operating system only needs the computing power of one core because a single-threaded application has to be executed as fast as possible, the power gate can shut down three of the four cores, and the remaining core can be overvolted and overclocked within the same thermal envelope. Intel calls this feature Turbo Mode.

Thanks to the power gate this feature is very flexible: one, two, three or all four cores can be overvolted and overclocked simultaneously.

In other words, Turbo Mode extends Intel's well-known SpeedStep technology by additional P-states (performance states) which the operating system can request when it needs more computing power.



[pagebreak]
Test environment

Hardware
 
Mainboard: Intel DX58SO Extreme, Intel X58 Express chipset, LGA 1366 socket
CPU: Intel Core i7-965 Extreme Edition, 3.20 GHz
     Intel Core i7-920, 2.66 GHz
     Intel Core 2 Quad Q9650, 3.0 GHz
     Intel Core 2 Duo E8600, 3.30 GHz
     Intel Core 2 Duo E7200, 2.53 GHz
Memory: Qimonda 3 x 1 GB DDR3 1067 MHz CL7 (triple-channel kit)
Graphics card: ATI Radeon HD 4870 512 MB GDDR5
Hard disk: Samsung 620GB F1
PSU: Corsair HX 1000 Watt


Software
 
OS: Windows Vista Home Premium 64-bit SP1
Graphics driver: ATI Catalyst 8.9, CCC
Benchmarks: PCMark Vantage
            3DMark Vantage
            3DMark 06
            SiSoftware Sandra Lite 2009
            7-Zip
            WinRAR
            POV-Ray 3.7
            Cinebench R10
            LAME 3.97 (MP3 audio encoding)
            VirtualDub (video encoding)
            Sony Vegas Pro 8.0 (video encoding)
            F.E.A.R.
            Call of Juarez
            Crysis




[pagebreak]
Synthetic benchmarks

PCMark Vantage, results in points
Intel Core i7-965 Extreme Edition, 3.2 GHz: 7'457
Intel Core i7-920, 2.66 GHz: 6'655
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 6'004
Intel Core 2 Duo E8600, 3.3 GHz: 5'230
Intel Core 2 Duo E7200, 2.53 GHz: 4'297

more is better



3DMark Vantage, Performance preset (CPU score), results in points
Intel Core i7-965 Extreme Edition, 3.2 GHz: 19'837
Intel Core i7-920, 2.66 GHz: 17'033
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 12'627
Intel Core 2 Duo E8600, 3.3 GHz: 6'883
Intel Core 2 Duo E7200, 2.53 GHz: 5'095

more is better


3DMark Vantage profits not only from the reworked architecture but also from the fact that it is optimized to run in an environment with eight cores. Going by the figures above, the Core i7 965 leads the Q9650 by 24 percent in PCMark Vantage and by about 57 percent in the 3DMark Vantage CPU score.


SuperPI 1.5 - 1M, results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 12.106 sec
Intel Core 2 Duo E8600, 3.3 GHz: 13.843 sec
Intel Core i7-920, 2.66 GHz: 14.562 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 14.390 sec
Intel Core 2 Duo E7200, 2.53 GHz: 19.905 sec

less is better



SuperPI 1.5 - 32M, results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 680.875 sec
Intel Core i7-920, 2.66 GHz: 798.552 sec
Intel Core 2 Duo E8600, 3.3 GHz: 855.047 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 912.469 sec
Intel Core 2 Duo E7200, 2.53 GHz, DDR3: 1105.000 sec

less is better


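In SuperPI 1M the Core i7 965 is about 14 percent faster than the E8600, the quickest Core 2 in this test, and about 19 percent faster than the Q9650. When Pi is calculated to 32 million digits, the lead grows to about 26 percent over the E8600 and 34 percent over the Q9650.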



[pagebreak]
Synthetic benchmarks - SiSoftware Sandra

SiSoftware Sandra XIIc CPU Arithmetic, Dhrystone benchmark, results in MIPS (million instructions per second)
Intel Core i7-965 Extreme Edition, 3.2 GHz: 79'280
Intel Core i7-920, 2.66 GHz: 66'460
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 55'781
Intel Core 2 Duo E8600, 3.3 GHz: 30'983
Intel Core 2 Duo E7200, 2.53 GHz: 23'544

more is better



SiSoftware Sandra XIIc CPU Arithmetic, Whetstone benchmark, results in MFLOPS (million floating point operations per second)
Intel Core i7-965 Extreme Edition, 3.2 GHz: 68'850
Intel Core i7-920, 2.66 GHz: 58'170
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 39'859
Intel Core 2 Duo E8600, 3.3 GHz: 22'544
Intel Core 2 Duo E7200, 2.53 GHz: 17'183

more is better


In the SiSoftware Sandra arithmetic tests the Core i7 965 is up to 72 percent faster than the Q9650.


SiSoftware Sandra XIIc Memory Bandwidth, integer, results in MB/s
Intel Core i7-965 Extreme Edition, 3.2 GHz: 18'150
Intel Core i7-920, 2.66 GHz: 17'010
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 7'412
Intel Core 2 Duo E8600, 3.3 GHz: 7'332
Intel Core 2 Duo E7200, 2.53 GHz: 5'990

more is better



SiSoftware Sandra XIIc Memory Bandwidth, floating point, results in MB/s
Intel Core i7-965 Extreme Edition, 3.2 GHz: 18'260
Intel Core i7-920, 2.66 GHz: 17'090
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 7'354
Intel Core 2 Duo E8600, 3.3 GHz: 7'305
Intel Core 2 Duo E7200, 2.53 GHz: 6'023

more is better


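Even more drastic is the difference in memory bandwidth: the Core i7 965 reaches about 248 percent of the Q9650's result, which is roughly 2.5 times the bandwidth or an increase of almost 150 percent.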


SiSoftware Sandra XIIc CPU Multimedia, floating point benchmark, results in MPixel/s
Intel Core i7-965 Extreme Edition, 3.2 GHz: 130.02
Intel Core i7-920, 2.66 GHz: 109.22
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 95.20
Intel Core 2 Duo E8600, 3.3 GHz: 59.98
Intel Core 2 Duo E7200, 2.53 GHz: 6.57

more is better



SiSoftware Sandra XIIc CPU Multimedia, integer benchmark, results in MPixel/s
Intel Core i7-965 Extreme Edition, 3.2 GHz: 168.11
Intel Core i7-920, 2.66 GHz: 141.08
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 135.17
Intel Core 2 Duo E8600, 3.3 GHz: 75.25
Intel Core 2 Duo E7200, 2.53 GHz: 6.61

more is better


In the multimedia benchmarks, too, Nehalem is up to 37 percent faster than the Q9650.



[pagebreak]
Data compression and rendering

Data compression

WinRAR, 4 x 70 MB 48-bit TIFF pictures to a 297 MB archive file, results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 77 sec
Intel Core i7-920, 2.66 GHz: 89 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 118 sec
Intel Core 2 Duo E8600, 3.3 GHz: 172 sec
Intel Core 2 Duo E7200, 2.53 GHz: 185 sec

less is better



7-Zip compression, 64 MB, results in MIPS (million instructions per second)
Intel Core i7-965 Extreme Edition, 3.2 GHz: 16030
Intel Core i7-920, 2.66 GHz: 11726
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 10952
Intel Core 2 Duo E8600, 3.3 GHz: 6342
Intel Core 2 Duo E7200, 2.53 GHz: 4769

more is better


Data compression runs up to 53 percent faster on a Nehalem system.


Rendering

POV-Ray 3.7, All CPU Benchmark, results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 62 sec
Intel Core i7-920, 2.66 GHz: 74 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 103 sec
Intel Core 2 Duo E8600, 3.3 GHz: 187 sec
Intel Core 2 Duo E7200, 2.53 GHz: 247 sec

less is better



POV-Ray 3.7, All CPU Benchmark, results in pixels per second (PPS)
Intel Core i7-965 Extreme Edition, 3.2 GHz: 4'164.52
Intel Core i7-920, 2.66 GHz: 3'532.03
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 2'542.23
Intel Core 2 Duo E8600, 3.3 GHz: 1'398.71
Intel Core 2 Duo E7200, 2.53 GHz: 1'057.86

more is better



Cinebench R10 Multi CPU Rendering, Cinebench score
Intel Core i7-965 Extreme Edition, 3.2 GHz: 19'046
Intel Core i7-920, 2.66 GHz: 15'920
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 12'943
Intel Core 2 Duo E8600, 3.3 GHz: 7'963
Intel Core 2 Duo E7200, 2.53 GHz: 5'981

more is better


In both POV-Ray benchmarks the Core i7 965 is at least 60 percent faster than the Q9650, and in Cinebench it is still 47 percent ahead.



[pagebreak]
Video and audio encoding

Video encoding

Sony Vegas, 320 MB 1440x1080 HD video to a 1920x1080 60i Blu-ray Disc ISO image, results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 171 sec
Intel Core i7-920, 2.66 GHz: 210 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 276 sec
Intel Core 2 Duo E8600, 3.3 GHz: 379 sec
Intel Core 2 Duo E7200, 2.53 GHz: 493 sec

less is better



VirtualDub 1.8.1, 640 MB MPEG2 file to AVI (DivX 6.8, SSE4), results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 224 sec
Intel Core i7-920, 2.66 GHz: 265 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 318 sec
Intel Core 2 Duo E8600, 3.3 GHz: 364 sec
Intel Core 2 Duo E7200, 2.53 GHz: 399 sec

less is better


In video encoding with Sony Vegas the Core i7 965 is 61 percent faster than the Q9650, and in VirtualDub the Nehalem derivative is still 42 percent faster.


Audio encoding

MP3 compression with LAME 3.97 (656 MB music CD), results in seconds
Intel Core i7-965 Extreme Edition, 3.2 GHz: 38 sec
Intel Core i7-920, 2.66 GHz: 41 sec
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 54 sec
Intel Core 2 Duo E8600, 3.3 GHz: 95 sec
Intel Core 2 Duo E7200, 2.53 GHz: 121 sec

less is better


Encoding a music CD with LAME 3.97 is 42 percent faster on the Core i7 965 than on the Q9650.



[pagebreak]
Games

Crysis, 800x600, no AA, no AF, details: medium, results in frames per second
Intel Core i7-965 Extreme Edition, 3.2 GHz: 70.23
Intel Core i7-920, 2.66 GHz: 59.92
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 59.44
Intel Core 2 Duo E8600, 3.3 GHz: 51.49
Intel Core 2 Duo E7200, 2.53 GHz: 46.11

more is better



F.E.A.R., 800x600, no AA, no AF, details: medium, results in frames per second
Intel Core i7-965 Extreme Edition, 3.2 GHz: 451
Intel Core i7-920, 2.66 GHz: 441
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 398
Intel Core 2 Duo E8600, 3.3 GHz: 368
Intel Core 2 Duo E7200, 2.53 GHz: 274

more is better



Call of Juarez, 1024x768, no AA, no AF, details: medium, results in frames per second
Intel Core i7-965 Extreme Edition, 3.2 GHz: 81.2
Intel Core i7-920, 2.66 GHz: 78.0
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 62.6
Intel Core 2 Duo E8600, 3.3 GHz: 61.4
Intel Core 2 Duo E7200, 2.53 GHz: 57.9

more is better



World in Conflict, 800x600, no AA, no AF, details: medium, results in frames per second
Intel Core i7-965 Extreme Edition, 3.2 GHz: 125
Intel Core i7-920, 2.66 GHz: 81
Intel Core 2 Quad Extreme Q9650, 3.0 GHz: 76
Intel Core 2 Duo E8600, 3.3 GHz: 78
Intel Core 2 Duo E7200, 2.53 GHz: 67

more is better


In games, too, the Core i7 965 performs better than the Q9650: in Crysis it is 18 percent faster, in F.E.A.R. 13 percent, in Call of Juarez nearly 30 percent, and in World in Conflict it is even 64 percent ahead.
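The percentages quoted in this review can be reproduced from the result tables. The following sketch shows the calculation for scores where more is better and for run times where less is better; the helper name `speedup_percent` is our own, and the sample values are taken from the tables above (Core i7 965 versus Q9650).

```python
def speedup_percent(new, old, higher_is_better=True):
    """Advantage of `new` over `old` in percent.

    For scores (fps, points) the ratio is new/old; for run times
    the ratio is inverted, because a shorter time is the better result.
    """
    ratio = new / old if higher_is_better else old / new
    return (ratio - 1.0) * 100.0


# (benchmark, Core i7 965, Q9650, higher_is_better)
results = [
    ("Crysis (fps)", 70.23, 59.44, True),
    ("World in Conflict (fps)", 125, 76, True),
    ("Cinebench R10 (points)", 19046, 12943, True),
    ("WinRAR (seconds)", 77, 118, False),
]

for name, i7_965, q9650, higher in results:
    advantage = speedup_percent(i7_965, q9650, higher)
    print(f"{name}: {advantage:.0f} percent faster")
```

Note that for run times this is a speed ratio: 77 instead of 118 seconds means the work is done 53 percent faster, although the run time itself shrinks by only about 35 percent.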


[pagebreak]
Overclocking


Benchmark | Intel Core i7-965 @ 4.2 GHz (27 x 156 MHz, 1.45 V VCore) | Intel Core i7-965, 3.2 GHz (24 x 133 MHz, 1.20 V VCore)
3DMark Vantage, CPU score | 22673 | 19837
Crysis, frames per second | 82.64 | 70.23
Sandra memory bandwidth, integer, MB/s | 23021 | 18150
Sandra memory bandwidth, float, MB/s | 23002 | 18260
SuperPI 1.5 32M | 9m 14.063s | 11m 20.875s
VirtualDub 1.8.1, results in seconds | 163 | 224
POV-Ray 3.7, results in seconds | 57.7 | 62
POV-Ray 3.7, results in pixels per second (PPS) | 4540 | 4164
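The core clock is simply the multiplier times the base clock, so the two configurations in the table can be checked quickly. The sketch below also compares the clock gain with the gain measured in two of the benchmarks; the function names are our own.

```python
def core_clock_mhz(multiplier, base_clock_mhz):
    """Core clock = multiplier x base clock."""
    return multiplier * base_clock_mhz


def gain_percent(overclocked, stock, higher_is_better=True):
    ratio = overclocked / stock if higher_is_better else stock / overclocked
    return (ratio - 1.0) * 100.0


overclocked = core_clock_mhz(27, 156)  # 4212 MHz, i.e. about 4.2 GHz
# 3192 MHz with the rounded 133 MHz base clock; the nominal base
# clock is 133.33 MHz, which gives the advertised 3.2 GHz.
stock = core_clock_mhz(24, 133)

print(overclocked, stock)
print(f"clock:          +{gain_percent(overclocked, stock):.0f} percent")
print(f"3DMark Vantage: +{gain_percent(22673, 19837):.0f} percent")
print(f"VirtualDub:     +{gain_percent(163, 224, higher_is_better=False):.0f} percent")
```

With these figures the clock rises by about 32 percent, while the measured gains range from about 14 percent in 3DMark Vantage to about 37 percent in VirtualDub.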






[pagebreak]
Conclusion

With Nehalem, Intel presents a once-in-a-decade overhaul of its CPU architecture. Compared to its competitors, Intel is a little late in integrating the memory controller and on-die routing into the CPU. It seems, however, that Intel was able to learn from the mistakes of AMD and Alpha, which took this step earlier, and the result Intel now presents is convincing.

Nehalem is also the first major overhaul of the Core 2 microarchitecture, which was introduced in June 2006. By creating a scalable and flexible basis for the next ten years, Intel's development teams in Hillsboro seem to have done a remarkable job.

Overall, Intel seems to have improved, accelerated or extended nearly every unit in the Core i7 except the functional units. Most of the changes can be found in the memory pipeline. The biggest modification in the cores is the implementation of SMT, from which nearly every application will profit in future.

If we concentrate on the performance figures, we can state that the Core i7 965 is not slower than its predecessor in any of the tests. The gain ranges from 13 percent in F.E.A.R. to almost 150 percent (about 2.5 times the bandwidth) in SiSoftware Sandra's memory bandwidth test.
Bandwidth-intensive benchmarks clearly profit most from the now integrated memory interface. As a further example, POV-Ray runs more than 50 percent faster than on an already fast Core 2.

An interesting feature is the power gate. It not only makes it possible to deactivate every core independently, it also enables overclocking: if a single-threaded application, or one which uses only two or three cores, needs more performance, the active cores are clocked higher so that it executes more quickly.

At this point we are only waiting for the final prices, which we will present as soon as possible.


