93% of a GP100 at least…
Big Pascal finally embraces FP64 performance!
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, to which we have added what we know about a full GP100:
|  | Tesla K40 | Tesla M40 | Tesla P100 | Full GP100 |
|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 | 60 |
| TPCs | 15 | 24 | 28 | (30?) |
| FP32 CUDA Cores / SM | 192 | 128 | 64 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 | 3840 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1920 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz | TBD |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | TBD |
| FP64 GFLOPS | 1680 | 213 | 5304 | TBD |
| Texture Units | 240 | 192 | 224 | 240 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | TBD |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | TBD |
| Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 15360 KB |
| TDP | 235 W | 250 W | 300 W | TBD |
| Transistors | 7.1 billion | 8 billion | 15.3 billion | 15.3 billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 |
| Manufacturing Process | 28 nm | 28 nm | 16 nm | 16 nm |
This table is designed for developers who are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16 nm allows for about twice the number of transistors, 15.3 billion versus 8 billion with GM200, in roughly the same die area: 610 mm2, up from 601 mm2.
A full GP100 processor will have 60 SMs (streaming multiprocessors), compared to GM200's 24, although each Pascal SM contains half as many FP32 shaders (64 versus Maxwell's 128). The GP100 part that is listed in the table above is actually partially disabled, cutting off four of the sixty total. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores -- but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell, 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting, because even though 10.6 TeraFLOPs (at the 1480 MHz boost clock) is amazing, it's only about 20-25% more than what GM200 could pull off with an overclock.
Pascal's advantage is that these shaders are significantly more complex. First, double-precision performance is finally at a 1:2 ratio with single-precision, which is the best ratio possible when both are first-class citizens. (With enough parallelism in your calculations, you can compute two 32-bit values for every 64-bit one.) This yields a double-precision performance of 5.3 TeraFLOPs for GP100 at stock clocks, even with just 56 operational SMs. Compare this to GK110's 1.7 TeraFLOPs, or Maxwell's 0.2 (yes, 0.2) TeraFLOPs, and you'll see what a huge upgrade this is for calculations that need extra precision (or range).
Second, NVIDIA has added FP16 values as a first-class citizen as well, yielding a 2:1 performance ratio over FP32. This means that, in situations where 16-bit values are sufficient, you can get a full 2x speed-up by dropping to 16-bit. GP100, with 56 SMs enabled, will have a peak FP16 performance of 21.2 TeraFLOPs.
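If you want to check the math, all of these peak figures fall out of a simple cores x 2 FLOPs (one fused multiply-add) x clock calculation. Here is a minimal host-side sketch, assuming the core counts and boost clocks from the table plus the ~1390 MHz overclocked GM200 mentioned above; the comparison percentage is our arithmetic, not an NVIDIA figure.

```cuda
// Back-of-the-envelope peak-throughput check (host code only, no GPU needed).
// Core counts and boost clocks come from the comparison table above; the
// overclocked GM200 clock is the ~1390 MHz figure mentioned in the text.
#include <cstdio>

int main() {
    const double boost_p100     = 1.480e9;  // Tesla P100 boost clock, Hz
    const double boost_gm200_oc = 1.390e9;  // overclocked Titan X / 980 Ti, Hz

    const double fp32_cores_p100  = 3584;   // 56 SMs x 64
    const double fp64_cores_p100  = 1792;   // 56 SMs x 32 (1:2 ratio)
    const double fp32_cores_gm200 = 3072;   // 24 SMs x 128

    // Each core retires one fused multiply-add (2 FLOPs) per clock.
    double fp32 = fp32_cores_p100 * 2 * boost_p100 / 1e12;          // ~10.6 TFLOPs
    double fp64 = fp64_cores_p100 * 2 * boost_p100 / 1e12;          // ~5.3 TFLOPs
    double fp16 = fp32 * 2;                                         // packed FP16, ~21.2 TFLOPs
    double gm200_oc = fp32_cores_gm200 * 2 * boost_gm200_oc / 1e12; // ~8.5 TFLOPs

    printf("P100  FP32: %.1f TFLOPs\n", fp32);
    printf("P100  FP64: %.1f TFLOPs\n", fp64);
    printf("P100  FP16: %.1f TFLOPs\n", fp16);
    printf("GM200 OC FP32: %.1f TFLOPs (P100 is ~%.0f%% faster)\n",
           gm200_oc, (fp32 / gm200_oc - 1) * 100);
    return 0;
}
```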
You can multiply by 60/56 to see what the full GP100 processor could be capable of, but we're not going to do that here. The reason why: FLOP rating is also dependent upon the clock rate. If GP100's 1328 MHz (1480 MHz boost) is conservative, as we found on GM200, then this rate could get much higher. Alternatively, if NVIDIA is cherry-picking the heck out of GP100 for Tesla P100, the full chip might be slower. That said, enterprise components are usually clocked lower than gaming ones, for consistency in performance and heat management, so I'd guess that the number might actually go up.
Third (yes, this list is continuing), there is a whole lot more memory performance. GP100 increases the L2 cache from 3 MB on GM200 to 4 MB. Since Maxwell, NVIDIA has been able to disable L2 cache blocks (remember the 970?), so we're not sure if this is its final amount, but I expect that it will be. 4 MB is a nice, round number, and I doubt they would mess with the memory access patterns of a professional GPU aimed at scientific applications.
They also introduced this little thing called "HBM2" that seems to be making waves. While it will not achieve the 1TB/s bandwidth that was rumored, at least not in the 16GB variant announced today, 720 GB/s is nothing to sneer at. This is a little more than double what the Titan X can do, and it should be lower latency as well. While NVIDIA hasn't mentioned this, lower latency means that a global memory access should take fewer cycles to complete, reducing the stall in large tasks, like drawing complex 3D materials. That said, GPUs already have clever ways of overcoming this issue, such as parking shaders mid-execution when they hit a global memory access, letting another shader do its thing, then returning to the original task when the needed data is available. HBM2 also supports ECC natively, which allows error correction to be enabled without losing capacity or bandwidth. It's unclear whether consumer products will have ECC, too.
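As a quick sanity check on those numbers: with a 4096-bit interface, 720 GB/s works out to roughly 1.4 Gb/s per pin, comfortably below the 2.0 Gb/s per pin that the HBM2 specification allows (which is where the 1 TB/s figure comes from). A minimal sketch of that arithmetic follows; the per-pin rate is our derivation, not something NVIDIA has stated.

```cuda
// Sanity check on the quoted 720 GB/s aggregate bandwidth (host code only).
#include <cstdio>

int main() {
    const double bus_width_bits  = 4096;                  // four HBM2 stacks x 1024 bits
    const double bandwidth_gbs   = 720;                   // quoted aggregate bandwidth, GB/s
    const double bus_width_bytes = bus_width_bits / 8.0;  // 512 bytes per transfer

    double pin_rate_gbps = bandwidth_gbs / bus_width_bytes; // implied per-pin rate, Gb/s
    double spec_max_gbs  = bus_width_bytes * 2.0;           // at the 2.0 Gb/s/pin spec ceiling

    printf("Implied per-pin rate: %.2f Gb/s\n", pin_rate_gbps); // ~1.41
    printf("HBM2 spec ceiling:    %.0f GB/s\n", spec_max_gbs);  // 1024 (the ~1 TB/s rumor)
    return 0;
}
```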
Pascal also introduces two new features: NVLink and Unified Memory. NVLink is useful for multiple GPUs in an HPC cluster, allowing them to communicate at a much higher bandwidth. NVIDIA claims that Tesla P100 will support four "Links", yielding 160 GB/s in both directions. For comparison, that is about half of the bandwidth of Titan X's GDDR5, which is right there on the card beside it. This also ties in with Unified Memory, which allows the CPU to share a memory space with the GPU. Developers can write serial code whose data, without an explicit copy, can be handed to the GPU for a burst of highly parallel acceleration.
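To give a sense of what that looks like from the developer's side, here is a minimal CUDA sketch of the Unified Memory model: a single managed allocation that both the CPU and the GPU touch, with no explicit copy. The kernel, sizes, and launch parameters are illustrative placeholders, not anything NVIDIA has shown.

```cuda
// Minimal Unified Memory sketch: one allocation visible to CPU and GPU,
// no cudaMemcpy anywhere. Kernel and sizes are illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // GPU writes through the same pointer the CPU filled
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One managed allocation, shared address space for CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // ordinary serial CPU code

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // burst of parallel work
    cudaDeviceSynchronize();                         // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);               // prints 2.0, with no copies issued
    cudaFree(data);
    return 0;
}
```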
Where can you find this GPU? Well, let's hear what Josh has to say about it on the next page.
I am not sure how they are fitting that on an interposer. I thought that HBM2 die stacks are specified to be 92 mm2 compared to 49 mm2 for HBM1. Four of them should be close to 400 mm2. With a 610 mm2 GPU die, that would require a 1000 mm2 interposer. Is that a possibility? I guess they could be using only two 8 GB stacks, but they would need to increase the clock speed significantly to go from 512 (spec) to 720 GB/s.
As I understand it, silicon interposers aren’t restricted by the reticle size. The features are large enough that there isn’t an alignment problem; they can take up the entire wafer if need be.
That is probably a lot more expensive if they have to resort to making them larger than the reticle size.
I'm not entirely sure either. Yes, you can have multiple exposures to make the interposer larger, but I think that how they are getting around it is that the 4 GB HBM2 dies are much smaller than the 8 GB HBM2 units.
What about asynchronous compute for Pascal? Has there been any improvement in Pascal over Maxwell, with Nvidia's new micro-architecture better able to schedule processor threads and utilize the GPU's core execution resources? Is there still a need to wait until the end of a draw call to schedule graphics or compute threads on Nvidia's new Pascal-based GPUs, or have they improved their thread-scheduling granularity to the point that few execution resources are left idle while there is work backed up in the execution queues?
These issues aren't going to be discussed at a CUDA/OpenCL summit. No idea.
And yet, at The Register, a non-gaming website!!!
“Software running on the P100 can be preempted on instruction boundaries, rather than at the end of a draw call. This means a thread can immediately give way to a higher priority thread, rather than waiting to the end of a potentially lengthy draw operation. This extra latency – the waiting for a call to end – can really mess up very time-sensitive applications, such as virtual reality headsets. A 5ms delay could lead to a missed Vsync and a visible glitch in the real-time rendering, which drives some people nuts.
By getting down to the instruction level, this latency penalty should evaporate, which is good news for VR gamers. Per-instruction preemption means programmers can also single step through GPU code to iron out bugs.”(1)
(1)
http://www.theregister.co.uk/2016/04/06/nvidia_gtc_2016/
It looks like the websites that mostly cover professional server news have a better handle on the technical aspects of new GPU hardware than the gaming websites do! I cannot wait for the Zen server SKUs to be reviewed at The Register for some more information on that front! It looks like Nvidia fixed that on their server/HPC GPU variants, so hopefully the same can be said for the consumer variants!
Having unallocated compute resources while the queue is backed up is very bad for processor utilization, so it's good that Nvidia fixed that! The HPC/workstation market is not going to tolerate any hardware gimping from Nvidia in this matter, so keep up the improvements, Nvidia; asynchronous compute is very important!
The fact that they think a single draw call can stall the pipeline for 5 ms and cause a missed VR sync pretty much means they are only going off of one line of information and trying to sound much smarter than they are. 50 ns is a far cry from 5 ms.
Looking at the Anandtech article (again):
http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification
The slide titled “Mechanical Outline: Molded KGSD” indicates that the 40 mm2 and 92 mm2 figures are the package sizes in the specification. The chips would need to be this size for the micro-bumps to line up with the pads or micro-bumps on the interposer; the micro-bump array is part of the JEDEC specification. Unless they are making non-standard HBM2, it doesn't look like anything changes if they are going with 4 GB x 4 packages. It says 16 GB and a 4096-bit interface, which does imply a 4x4 configuration. It looks like they will be ~92 mm2 each and the interposer will need to be around 1000 mm2. Is the picture in the article supposed to be an actual device or just a mockup? We already know these things will not be cheap, so they may just be making interposers larger than the reticle. Don't expect such a device in the consumer market anytime soon, if ever, though.
It's definitely 4x4GB.
It is actually about 40 mm2 for HBM1, not 49.
It is surprising that they are making such a large die at 16 nm. I suspect that when, or if, any of these make it to the consumer market, they will be significantly more expensive than the Titan X was. There is no way that they are going to be able to make a 600+ mm2 die on 16 nm with yields anywhere close to what was possible with 28 nm. They may have a huge number of defective parts that can be salvaged, though, so perhaps they can do a very cut-down consumer version.
I will not be surprised if this ends up as another Fermi – late, initially low-yielding and brute forcing its way to performance dominance. If so then it will probably really shine in its second iteration when they bring down the defect rate and/or with the smaller die variants.
Funnily enough it looks like AMD are taking the same approach they did back then, with a smaller and more area-efficient product.
It is interesting to see that NV might be taking that route again. But at least with this implementation, there is no other competition at this extremely high end. It does not matter how hot it runs or how it yields; if they sell each card for $15K, then they are more than covering their expenses getting these parts out.
If they are clocking their GPUs higher, then maybe the 16 nm process is engineered to have a larger pitch (distance between circuits/gates) than would normally be used for a GPU's layout. CPUs have less dense layouts to run at higher clocks, but that does not negate the circuit gains from having 14 nm/16 nm gate sizes; it just means that the circuits are spaced further apart for better heat dissipation. Circuit pitch (spacing) plays an important part in a processor's thermal ability to handle higher clock speeds, at the cost of space savings, even at 14 nm/16 nm gate sizes with their inherent advantages. So maybe the larger die size is a trade-off for higher clock speeds, at the cost of some space savings, for these high-end server/pro SKUs.
Yeah, it's interesting that NVIDIA's enterprise clock is around 400 MHz higher than it used to be. I'm curious to see how high consumer will go.
Seeing Fury X single precision at 8.6 Gigaflops, Pascal's 10.6 Gigaflops doesn't sound like such a major improvement.
Will admit that double precision is where Pascal shines, exceeding 5 Gigaflops. Nicely done.
I am highly confused about the half precision. It feels like such a marketing play… after all, 20 Gigaflops of half precision is a big number. My question is: what would this be used for?
Correction. Teraflops, not Gigaflops on all numbers …
Usually raw teraflops is not a very good measure of how fast you can compute in practice; memory bandwidth tends to be the bottleneck. In this case I would expect the performance of the P100 to exceed the Fury X by a larger margin than 20%. For the deep learning or the CFD that we use GPUs for, the P100 could easily run twice as fast as the Fury.
Is FP16 or half-precision used for deep learning?
With regards to the 20% margin: for me it is just about their claim. I agree that actual performance might be bigger; HBM2 will in itself play a big role.
So I'm not even going to guess at the actual performance.
Will admit I am really, really happy that they are back to 1/2-speed double precision, compared to 1/32(?) in the previous ones.
Yeah, there are a lot of workloads where FP16 is more than adequate for their needs. Deep learning is one of them.
Deep learning. SP might be underwhelming, but then again you have to look at how well Nvidia's GK110/GK210 defend their position against the much superior AMD Hawaii FirePro.
That is done to compete with the Xeon Phi, and also to meet the requirements of the server/HPC market for which this SKU is the intended target. The better DP-to-SP FP ratio delivers the double-precision compute performance that the server/HPC market requires, so this SKU is ahead of the Xeon Phi by a larger margin.
“AMD’s Fury may be the first GPU to feature an interposer and HBM1 memory, but the P100 will quickly outclass this product. If high end enthusiasts can even get a hold of it…”
And therein lies the rub. AMD was able to do it and get it into the hands of high-end gamers. NVIDIA can’t even do that.
Well, it hardly matters if the memory isn't bottlenecking, and considering how well the Ti line performs compared to Fury, it has to be said that in gaming it really doesn't, at least not yet. AMD failed by relying too much on HBM technology too soon, because it seems evident that they can't bring it to consumer products in any meaningful (monetizing) way before Nvidia does so too (with HBM2). I doubt Fury brought them enough market share and profits compared to the costs of making the product in the first place.
Agreed, but it does mean they’re on their second-gen product with the tech. They have already worked out the complex inter-company relationships needed to get these products running and out the door.
Whether that gives them any benefit /in practice/ remains to be seen.
Really, AMD developed the HBM technology/standard with SK Hynix, and AMD will also be using HBM2, so I do not see how that can be a fail for AMD when AMD is already demonstrating consumer-based Polaris SKUs. AMD is the co-creator of HBM; I do not see Nvidia being on the front lines of developing any open/JEDEC standards for all to use like AMD did with HBM. Hopefully AMD will make some inroads into the HPC workstation market with their server/HPC APUs on an interposer; AMD needs that business more than it needs only the consumer side of things.
Google will be using Power9-based servers, so AMD maybe needs to get a Power9 license from OpenPOWER and do x86, ARM (K12), and some Power-based GPU acceleration products. Nvidia has the Power GPU acceleration market all to itself currently.(1) PCIe 3.0 is not fast enough for the HPC/exascale market, so both AMD and Nvidia will have to compete there, with Nvidia currently leading in the server/HPC market. x86-based SKUs are no longer the only game in town across all markets except the PC/laptop market, and that will change too if some Power8/Power licensee builds a PC variant using Power ISA based CPUs.
(1)
http://www.theregister.co.uk/2016/04/06/google_power9/
P.S. Another article on Google and Rackspace using Power9s:
http://www.theregister.co.uk/2016/04/06/google_rackspace_power9/
Look out, makers of x86-only based products (Intel), as at least AMD has its K12 custom ARM cores (Jim Keller designed). OpenPOWER is licensing Power8 and newer designs, and Nvidia has a lead supplying GPU accelerators for Power8/9-based systems! Better look into that market too, AMD: get a Power8/9 license and integrate your GCN-based Polaris/Vega IP into that Power ISA based marketplace.
Well, maybe AMD will be working with Intel on a server-based option, since they have been getting so chummy lately.
so… when are the consumer enthusiast cards coming out????
My guess would be 2017?
so i should go ahead and buy the 980 ti then… btw josh, you are hilarious. my type of humor
Thanks for watching. Don't invest in a 980 Ti yet… give it a couple of weeks for more rumors and leaks to come out before spending $600 on a card that could be overshadowed in 3 months.
Wrong.
July 2016.
http://techreport.com/news/29961/rumor-nvidia-to-launch-gtx-1080-and-gtx-1070-at-computex
Nvidia will unveil a consumer Pascal chip to the public at Computex 2016, in the form of its GeForce GTX 1080 and 1070 cards. Digitimes says that card makers will fire up mass production of Pascal-based GeForces during July. Asus, Gigabyte, and MSI are among the players expected to show cards at Computex.
The 1070/1080 are considered ‘high end’ (GP104); the 1080 Ti/Titan are ‘enthusiast’ (GP100 core). The 1080 will likely be faster than a 980 Ti, but later a 1080 Ti / Titan will come out that smacks that down pretty strongly. That card is likely either to be a ‘holiday shopping special’ or a 2017 card, based on the timing of everything here.
I bought a 980 Ti a month ago to replace my 970 SLI and am very happy, although I bought it because I wanted to play Elite Dangerous at a reasonable setting on my Oculus. If you can wait, the new 1070 or 1080 should be a bargain compared to 980 Ti pricing and offer equal or better performance. Not to mention AMD has a whole new generation coming soon too.
shut up and take my money 🙂
Didn’t read the article. Does Pascal fully support Async Compute? Thanks.
They didn't go into that level of granularity or address that exact question.
I get how async compute works in gaming. But their keynote was about server, development, and professional technologies and applications. Would any of the technologies presented benefit from async compute in any significant way?
Not sure, but I doubt it affects OpenCL or CUDA. It's designed to independently load the 3D and compute engines, but the former isn't used there.
If what you say is true, then that explains why it wasn't brought up in their keynote.
Now we just wait and see when they announce about the consumer desktop GPU side of things.
If Pascal performs like crap in DirectX 12, I will buy AMD Polaris instead. Plus, I don't want to spend a ridiculous amount of money on a G-Sync monitor, since there's less benefit with a high-refresh-rate monitor. Moreover, not all games work with G-Sync, and I don't want to cope with extra input latency, which is why the majority of PC gamers strongly prefer Vertical Sync off.
I don't know about NVIDIA, but it seems like NVIDIA is stepping into a monopoly position based on their business approach: things like GameWorks, G-Sync, PhysX, etc. But in the meantime, I'd wait until there are real benchmarks comparing Polaris and Pascal GPUs. Hopefully the next-gen GPUs will be announced at Computex 2016 in Taipei, Taiwan, at either the end of May or in June.
Generally speaking, it makes sense to choose the best results for your budget. If Pascal under-performs, then it makes sense to use Polaris. Likewise, vice-versa. We'll see.
Like the guy said, if it's the same he does not want a G-Sync monitor, and Nvidia won't support FreeSync like Intel does.
>> Moreover, not all games would work with G-Sync,
I've played a lot of games on FreeSync / G-Sync and the only real requirement appears to be full screen. G-Sync also has a mode that works on desktop / windowed games, but not as well as the full-screen experience.
>> I don't want to cope with extra input latency which majority PC gamers strongly prefer Vertical-Sync Off.
VRR displays continue to scan out at their 'max' speed even when running at varying frame rates, which means that the speed of the scan at 40 FPS is just as fast as it is at 144/165 Hz. The end result is that the advantage of running VSYNC off is nearly negligible.
No TDP improvement. One day we'll have a jillion-watt GPU and it will be the norm.
no improvement?
look at the clock frequencies of the GPU and then we’ll talk
And again, we get a nice little 3D rendering of “Pascal.” That's pretty dandy.
But where the hell is a physical Pascal chip? Seriously, something must have gone WAY wrong if they haven't even shown a chip publicly, let alone in a demo.
Meanwhile RTG has been setting up demos for what, three months now?
Something has definitely gone wrong.
Nice write up! For those who didn’t see it, Nvidia did indeed have a P100 demo at GTC.
http://a.disquscdn.com/uploads/mediaembed/images/3460/3367/original.jpg
It appears to be eight GP100 GPUs running in parallel.
Am I reading this right? “GPU Die Size 551 mm2 601 mm2 610 mm2 610mm2”
61 cm² dies? That's almost as big as my entire case… did someone get their metric system wrong?
6.1cm^2 dies
100mm2 = 1cm2
1cm2 = 1cm * 1cm = 10mm * 10mm = 100mm2
How many inches is that?
😉