
After nearly six decades of getting smaller, faster, cooler, and cheaper, transistors are getting more expensive with every generation, and one could argue that this, more than any other factor, is going to drive system architecture decisions for the foreseeable future.
Either the reticle limit of fab equipment is going to constrain us, or the interconnects between components, whether they are on a single socket in 2D, 2.5D, or 3D configurations, are going to constrain us. We find chiplet architectures perhaps unavoidable as well as interesting, and we concede that chiplet approaches have the potential to increase individual component yields and therefore reduce semiconductor costs. But using chiplets also increases package manufacturing costs, and there is a price – potentially a very large price in computational efficiency and thermals – for not having monolithic compute elements very close to their cache and main memories.
Perhaps we should have invested a little bit more in 450 millimeter wafer technology? Maybe not. The silicon ingots that wafers are sliced from are 3X heavier and take 2X to 4X the time to cool, and all of the machinery in a modern fab that automatically handles the wafers during the manufacturing process has to be changed along with the etching equipment.
Some days, it seems that 3D stacking of compute and memory is the only way out of this conundrum, and even that has huge engineering and economic challenges.
It is with this in mind that we read a new paper published in the online journal of the Society for Industrial and Applied Mathematics, written by Satoshi Matsuoka, director of the RIKEN supercomputing lab in Japan and a longtime professor at the Tokyo Institute of Technology, and Jens Domke, leader of the supercomputing performance research team at RIKEN, which talked theoretically about supercomputing design in the wake of the “Fugaku” system delivered last year and as the end of Moore’s Law approaches.
We think Matsuoka and Domke are being generous, in that it sure looks like Moore’s Law is kaput. Finito. No mas. Fini. Joined the choir eternal. Gone the way of all flesh. Yes, transistor density is still increasing and will continue to increase, but that was never the point that Intel co-founder Gordon Moore was making in his seminal papers in 1965 and 1975. The point was that ever-cheapening transistors would drive the computing industry forward, at an exponential rate, which is exactly what happened.
Until now. Now, everything is harder. And hotter. And more expensive. And until we can reach down into the microcode of the BIOS of the physical Universe and change some fundamental laws, that is just the way it is with CMOS semiconductor technology etched on silicon wafers.
To recap: Amdahl’s Law has many phrasings and was coined by Gene Amdahl, the legendary architect of the System/360 mainframe at IBM. The one we were taught was that a system is only as fast as its slowest component. The idea was presented by Amdahl at the 1967 spring conference of the American Federation of Information Processing Societies as such: “The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used.” The more parallel the application, the better the speedup, which is what is commonly called strong scaling in the HPC arena.
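For those who like their laws in code rather than prose, here is a minimal sketch – our own, not anything from the paper – of the strong scaling formula, with p as the parallelizable fraction and n as the number of workers:

```python
# Amdahl's Law: speedup of a fixed-size problem when a fraction p of the
# work can be parallelized (or accelerated) across n workers.

def amdahl_speedup(p: float, n: int) -> float:
    """Speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95 percent of the work parallelized, the serial 5 percent caps
# the speedup at 20X, no matter how many nodes are thrown at the job.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} nodes: {amdahl_speedup(0.95, n):6.2f}X")
```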
Like many great ideas, it seems obvious once stated, but Amdahl’s Law has huge implications for high performance computing of all kinds, not just simulation and modeling.
So does Gustafson’s Law, which was presented in a 1988 article at the Association for Computing Machinery called Reevaluating Amdahl’s Law, by HPC legend and applied mathematician John Gustafson and Edwin Barsis, who was the director of computer sciences and mathematics at Sandia National Laboratories when this paper came out and Gustafson was working at Sandia.
Gustafson’s Law is akin to Special Relativity, where Amdahl’s Law is more like General Relativity, if a metaphor is needed. Amdahl’s Law was about how a fixed problem scales on changing hardware, but the Sandia team focused on how a changing problem scaled on changing hardware and could provide better resolution of simulation over time – and tried to formulate a way to gauge the efficiency of all of that. One of its assumptions is that the serial portion of workloads does not grow even as the parallel portions do.
There is a fascinating writeup in The New York Times about the parallel computing algorithm breakthrough at Sandia, which is one of the few references to Barsis on the Web. And, quoting Barsis, it gives a very nice description of the weak scaling principle of Gustafson’s Law: “We don’t keep breaking up the parallel part smaller and smaller and smaller. We keep making the total problem bigger and bigger and bigger.”
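Gustafson’s scaled speedup can be put just as compactly. Another minimal sketch of ours, with s as the serial fraction measured on the scaled-up problem and n the node count:

```python
# Gustafson's Law (weak scaling): hold the work per node constant and let
# the total problem grow with the machine. With serial fraction s measured
# on the scaled problem, the scaled speedup is s + (1 - s) * n.

def gustafson_speedup(s: float, n: int) -> float:
    """Scaled speedup = s + (1 - s) * n."""
    return s + (1.0 - s) * n

# The same 5 percent serial fraction that capped strong scaling at 20X
# still yields a roughly 950X scaled speedup across 1,000 nodes.
print(f"{gustafson_speedup(0.05, 1_000):.1f}X")
```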
What a poetic way to describe the past three and a half decades of HPC and to capture the spirit of Gustafson’s Law. Which is all about cheating Amdahl’s Law as much as possible through clever hardware and software engineering.
Which brings us all the way back to RIKEN Lab and the post-Fugaku world, the paper at SIAM, and a presentation by Matsuoka at the recent Modsim Workshop hosted by Brookhaven National Laboratory.
Here is a chart Matsuoka pulled from a lecture series by Peter Bermel at Purdue University that shows the interplay of these two laws in 2D:
And here is a lovely 3D chart put together by Matsuoka and Domke for the SIAM article:
“The supercomputing community typically regards Amdahl’s Law as the strong-scaling law under which the use of more compute nodes accelerates a given parallelizable fraction of the workload and reduces the time to solution,” Matsuoka and Domke write in the SIAM paper. “But this law also applies to accelerators, and the potential speedup is bound by the ratio of accelerated and non-accelerable fractions of the algorithm. Furthermore, a second fundamental observation called Gustafson’s Law also governs modern HPC by limiting the achievable speedup for a problem based on how well the parallelizable or accelerable fraction can be weak scaled onto many nodes; one accomplishes this by increasing the overall workload and maintaining a constant amount of work per node. Weak scaling overcomes a problem’s bottlenecks – slowdowns due to communication issues with the interconnection network or inherent imbalances in the workload distribution – that become evident when one strong-scales to the same number of compute nodes.”
The gist of that second chart above, say the authors, is that a perfect accelerator can yield “a significant speedup,” which is on the order of 10,000X on the chart above, but that any Amdahl’s Law inefficiencies within the accelerator and any Gustafson’s Law inefficiencies across distributed collections of accelerators and data transfers between compute nodes all hold back scalability. And you can quantify this before you design a next-generation supercomputer. Which is what Matsuoka’s long and detailed presentation at Modsim 2022 was all about. (We have not been able to secure a recording of the session, just the presentation, which is a fascinating read in itself.)
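To make that compounding concrete, here is a toy model – our own simplification, not the authors’ formulation – that chains the two effects: Amdahl’s Law inside an accelerated node, then weak scaling across nodes with a parallel efficiency term standing in for interconnect and load-imbalance losses:

```python
# Toy model (ours, not the authors'): a fraction f of the work is sped up
# by a factor A inside each node (Amdahl), and the job is then weak-scaled
# across n nodes at a parallel efficiency e that lumps together network
# and load-imbalance losses (the Gustafson regime).

def node_speedup(f: float, A: float) -> float:
    # Amdahl's Law applied to acceleration within a single node.
    return 1.0 / ((1.0 - f) + f / A)

def system_speedup(f: float, A: float, n: int, e: float) -> float:
    # Weak-scaled speedup across n nodes, discounted by efficiency e.
    return node_speedup(f, A) * n * e

# A "perfect" 10,000X accelerator applied to 99 percent of the work still
# delivers only about 99X per node -- the unaccelerated 1 percent dominates
# long before the hardware peak is reached.
print(f"per node: {node_speedup(0.99, 10_000):.1f}X")
print(f"system:   {system_speedup(0.99, 10_000, 1_000, 0.8):,.0f}X")
```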
And this brings us all the way back to the FugakuNext strawman, a hypothetical proposal for a next-generation supercomputer for RIKEN expected between 2028 and 2030, which we covered back in April, when RIKEN and a bunch of other university researchers around the globe worked together on a paper benchmarking HPC performance on the AMD Milan-X Epyc 7773X big cache processors. As it turns out, there are really two FugakuNext strawmen in the field: one that is an accelerated CPU (like the A64FX) with a bunch of stacked L2 cache, and another, which Matsuoka showed in his Modsim 2022 presentation, that is a hybrid CPU/accelerator device with lots of 3D stacked memory and cache on the devices to provide strong scaling.
Those initial AMD Milan-X tests, using the MiniFE finite element analysis application, showed that with a dataset that fit inside the L3 cache, MiniFE routines ran 3X faster. The big cache reduces a big Amdahl’s Law bottleneck – main memory. In other words, last level cache – either L2 cache or L3 cache, depending on the architecture – is the new main memory. Suddenly, we are having flashbacks to the servers of the late 1990s. . . .
Anyway, RIKEN then extrapolated what a future A64FX processor with a ton of stacked L2 cache might look like and how it might perform. This A64FX large cache (LARC) processor was simulated with eight L2 caches stacked atop the A64FXNext processor, with 384 MB of L2 cache at 1.5 GB/sec of bandwidth, and was modeled to yield an average 10X improvement in the performance of a FugakuNext socket over a current Fugaku socket.
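A bit of back-of-envelope arithmetic – ours, not RIKEN’s model – shows why cache is such an Amdahl lever. Inverting the law, an observed speedup S from eliminating main memory stalls implies that at least 1 - 1/S of the original runtime was spent waiting on memory:

```python
# Back-of-envelope inversion of Amdahl's Law (our arithmetic, not RIKEN's
# model): if fitting the working set in cache effectively eliminates main
# memory stalls and yields an overall speedup S, then at least 1 - 1/S of
# the original runtime must have been spent stalled on memory.

def min_memory_bound_fraction(speedup: float) -> float:
    return 1.0 - 1.0 / speedup

print(f"3X speedup  -> at least {min_memory_bound_fraction(3.0):.0%} memory-bound")
print(f"10X speedup -> at least {min_memory_bound_fraction(10.0):.0%} memory-bound")
```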
Pretty, isn’t it? And it is not a CPU with beefy vector engines like the A64FX. Not that RIKEN has decided one way or the other on that just yet. These two strawman systems are just thought experiments for now. But they will inform proposals and design decisions, for sure.
This potential hybrid FugakuNext compute engine has a general purpose CPU – no doubt based on the Arm architecture – and a coarse-grained reconfigurable array (CGRA) accelerator. The latter could be, according to Matsuoka, a GPU with clock-level synchronization, an FPGA fabric like those from Xilinx or Intel, or the Intel dataflow engine called the Configurable Spatial Accelerator, or CSA, which we caught wind of in patent filings way back in 2018.
You will also notice that there are 2D SRAM caches stacked on top of both the CPU and the accelerator, and that the CPU has DRAM stacked on top of the SRAM. The interposer also has twelve ports of 1 Tb/sec silicon photonics networking coming right off the package. RIKEN reckons that this chippery will be etched in 1.5 nanometer processes.
This potential FugakuNext socket would have more than 1 petaflops of performance per node at FP16 precision, which probably means more than 500 teraflops at FP32 single precision and 250 teraflops at FP64 double precision, and better than 20 TB/sec of SRAM memory bandwidth. This potential FugakuNext system would have around 80,000 nodes with somewhere between 2 EB/sec and 3 EB/sec of aggregate memory bandwidth, around 100 exaflops of mixed precision performance, and would burn around 30 megawatts of juice.
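Those figures hang together arithmetically, as this quick sanity check shows; the 25 TB/sec per-node SRAM bandwidth is our own assumption, pegged just above the quoted 20 TB/sec floor:

```python
# Sanity check on the strawman numbers quoted above (a sketch that rounds
# liberally; the 25 TB/sec per-node SRAM bandwidth is our assumption, just
# above the quoted "better than 20 TB/sec" floor).

NODES = 80_000
FP16_PER_NODE_PFLOPS = 1.0     # "more than 1 petaflops" FP16 per node
SRAM_BW_PER_NODE_TBS = 25.0    # assumed per-node SRAM bandwidth
SYSTEM_POWER_MW = 30.0         # "around 30 megawatts"

peak_fp16_eflops = NODES * FP16_PER_NODE_PFLOPS / 1_000   # petaflops -> exaflops
aggregate_bw_ebs = NODES * SRAM_BW_PER_NODE_TBS / 1e6     # TB/sec -> EB/sec
watts_per_node = SYSTEM_POWER_MW * 1e6 / NODES

print(f"{peak_fp16_eflops:.0f} exaflops FP16")    # 80 EF, before any per-node headroom
print(f"{aggregate_bw_ebs:.1f} EB/sec aggregate") # 2.0 EB/sec, low end of the 2-3 range
print(f"{watts_per_node:.0f} watts per node")     # 375 W per node in a 30 MW budget
```

Nudge the per-node numbers up a bit – 1.25 petaflops and 37.5 TB/sec – and you land on the 100 exaflops and 3 EB/sec ends of those ranges.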
This all sounds pretty reasonable as a wish. The question is: Can it be made, and can anyone afford to make it?