Nvidia Maxwell Architecture Analysis – Delivering Double the Performance Per Watt on 28 NM

OK, the NDA has finally lifted and the official architecture documentation is up. It's finally time to take an in-depth look at Maxwell. It manages to double the performance per watt while staying on the same 28nm process, which, there is no other way to put it, is nothing short of a miracle. Let's see exactly how it manages that.

Maxwell 28nm Miracle  – How the Architecture makes Doubling Performance Per Watt Possible

Firstly, as I am sure most of you are aware, Maxwell does not work on SMX units. It works on SMMs, short for Maxwell Streaming Multiprocessors. Each SMM houses 128 CUDA cores, as opposed to the 192 housed by an SMX. Unlike the Kepler architecture, where an SMX's CUDA cores sit together in a single undivided block, Maxwell houses its CUDA cores in 4 subsets within each SMM: almost 4 separate “cores” within the SMM. Let's call these “major cores” (as opposed to CUDA cores) to avoid confusion. Do realize that this only refers to the 1st generation of Maxwell, and the division by four could change in the next generation. The major cores tactic allows Nvidia to achieve much higher efficiency and, by Nvidia's figures, to reach about 135% of Kepler's performance per core, i.e. roughly a 35% gain. Take a look at this diagram of Maxwell SMMs.

Maxwell Architecture Block Diagram SMMs

Since we already know that SMMs have 128 CUDA cores, simple maths tells you this block diagram is of the GTX 750 Ti (128 × 5 = 640, the 750 Ti's CUDA core count). But the thing we are interested in is the division. Notice how each SMM is divided into 4 dedicated “major cores”. This is one of the biggest architectural changes since Kepler, whose SMX consisted of just one big sheet. Let's zoom in, straight into a single Maxwell Streaming Multiprocessor.
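
The core-count arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the per-multiprocessor core counts are simply the figures quoted in this article (128 per SMM, 192 per SMX).

```python
# Core counts per multiprocessor, as quoted in the article.
CORES_PER_SMM = 128  # Maxwell
CORES_PER_SMX = 192  # Kepler


def total_cuda_cores(num_multiprocessors: int, cores_per_multiprocessor: int) -> int:
    """Total CUDA cores on a chip built from identical multiprocessors."""
    return num_multiprocessors * cores_per_multiprocessor


def multiprocessor_count(total_cores: int, cores_per_multiprocessor: int) -> int:
    """Work backwards from a total core count to the number of multiprocessors."""
    if total_cores % cores_per_multiprocessor != 0:
        raise ValueError("core count is not a whole number of multiprocessors")
    return total_cores // cores_per_multiprocessor


# Five SMMs in the block diagram give the GTX 750 Ti's core count.
print(total_cuda_cores(5, CORES_PER_SMM))        # 640
# Going the other way: 640 cores at 128 per SMM means 5 SMMs.
print(multiprocessor_count(640, CORES_PER_SMM))  # 5
```

Running the check in reverse is how you can tell which card a block diagram belongs to from its core count alone.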

Maxwell Architecture Maxwell Streaming Multiprocessor SMM

If you were to count the CUDA cores, you would count exactly 128. It is also very interesting how they appear to have divided up the memory interface width (bus) between the major cores, giving 32-bit to each. The memory interface width of course adds up to 128-bit. Now here's the interesting part. There are two L1 caches, and each is shared by two major cores along with 4 texture units. The 64KB of shared memory is shared between all 4 major cores, i.e. the entire SMM.
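
To keep the numbers straight, here is a small Python sketch that models one SMM as described above. The class and field names are hypothetical, invented for this illustration, and it assumes the "4 texture units" figure applies to each L1 cache rather than to the whole SMM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMMLayout:
    """Rough model of one Maxwell SMM, using the figures from the article."""
    cuda_cores: int = 128
    major_cores: int = 4           # the four partitions inside the SMM
    l1_caches: int = 2             # each L1 is shared by a pair of major cores
    texture_units_per_l1: int = 4  # assumption: 4 texture units per L1 cache
    shared_memory_kb: int = 64     # shared by the whole SMM

    @property
    def cores_per_major_core(self) -> int:
        return self.cuda_cores // self.major_cores

    @property
    def major_cores_per_l1(self) -> int:
        return self.major_cores // self.l1_caches

    @property
    def texture_units(self) -> int:
        return self.l1_caches * self.texture_units_per_l1


smm = SMMLayout()
print(smm.cores_per_major_core)  # 32 CUDA cores in each major core
print(smm.major_cores_per_l1)    # 2 major cores share each L1 cache
print(smm.texture_units)         # 8 texture units under the stated assumption
```

The point of writing it this way is that every per-partition figure falls out of the SMM totals by simple division, which is exactly the "division" theme of the design.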

Here is the Kepler SMX in comparison:


By this point the major revolution of the Maxwell architecture should be becoming clear: division, division and more division. You might also have noticed that, unlike in the Kepler SMX, each warp scheduler has control over only its own ‘major core’. Nothing is being shared between the 4 major cores except the FP64 and texture units (by the warp schedulers). Taking power in numbers to an art form, it raises an interesting question: would applying the same division tactics to other architectures yield the same benefits? It also makes you wonder: if we were somehow able to split the 128 CUDA cores into not 4 but 128 major cores, with 1-bit each, would we have the perfectly efficient architecture?

I would also like to mention, in conclusion, that there is something in the Maxwell architecture that Nvidia is not telling us: the ‘secret sauce’ approach, if you will. Though it's childish, no one can argue with its effectiveness.
