The release of the 47th TOP500 list of the world’s top supercomputers on June 20 at the International Supercomputing Conference in Germany showed a new No. 1 system: the Sunway TaihuLight machine in China, which is nearly three times as fast and three times as efficient as the system it displaces in the top spot.
The Sunway TaihuLight, at the National Supercomputer Center in Wuxi, has a theoretical peak performance of 125 petaflops and achieved a performance of 93 petaflops running the LINPACK benchmark. To get some perspective on the system and what it means for HPC, the SC Blog talked with Jack Dongarra, one of the four editors of the TOP500 list, a longtime member of the SC conference planning committee and one of the developers of LINPACK.
Dongarra, who is a professor of computer science at the University of Tennessee in Knoxville with a joint appointment at Oak Ridge National Laboratory, has a report on Sunway TaihuLight at http://bit.ly/sunway-2016.
SC16: So, what happened here?
China has built a very powerful machine that is 2.75 times as powerful as the former No. 1 system, Tianhe-2, which is also in China. The Sunway TaihuLight has 10.6 million cores and a theoretical peak performance of 125 petaflops. Running LINPACK at 93 petaflops means it performed at 74 percent of the theoretical peak. Tianhe-2 achieves 62 percent of its theoretical peak with LINPACK, while Titan at Oak Ridge National Lab achieves 65 percent.
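As a quick check on the LINPACK figure, the percentage of peak is just the achieved rate divided by the theoretical peak. The short Python sketch below uses only the two TaihuLight numbers quoted above; the function name `percent_of_peak` is a hypothetical helper chosen for illustration.

```python
def percent_of_peak(achieved_pflops, peak_pflops):
    """Return achieved performance as a percentage of theoretical peak."""
    return 100.0 * achieved_pflops / peak_pflops


# Figures quoted in the interview: 93 petaflops on LINPACK, 125 petaflops peak.
taihulight_linpack = percent_of_peak(93.0, 125.0)
print(f"TaihuLight LINPACK efficiency: {taihulight_linpack:.1f}% of peak")
```

The result rounds to the 74 percent cited in the interview.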
It also has the best power efficiency, performing 6 gigaflops per watt, while the other top systems are around 2 gigaflops per watt. Tianhe-2 is at 1.9 and Titan at Oak Ridge National Lab is at 2.1. So, this new machine has three times the efficiency of the next two most powerful supercomputers on the list.
SC16: Was this system’s showing a surprise?
There has been a rumor for more than a year that China was building two systems on the order of 100 petaflops: this one and an upgrade of Tianhe-2 called Tianhe-2A, which isn’t ready yet. This system is bigger than expected, with more performance. The fact that they got LINPACK to run on the system and achieved this efficiency is very impressive.
SC16: What other insights can you provide about the system’s performance?
Even though LINPACK runs fast, the machine does have slow memory. With HPCG, the High Performance Conjugate Gradients benchmark, Sunway TaihuLight achieved 0.371 petaflops, or 0.3 percent of the theoretical peak. Compare this with Titan, which achieves 1.2 percent of the theoretical peak on HPCG, and Tianhe-2, which posted 1.1 percent. So, TaihuLight is a lot slower for applications that involve a lot of memory traffic (data movement).
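To make the gap between the two benchmarks concrete, the following minimal Python sketch recomputes the HPCG fraction of peak and compares it with the LINPACK result. It uses only figures quoted in the interview (0.371 petaflops on HPCG, 93 petaflops on LINPACK, 125 petaflops peak); the variable names are illustrative.

```python
# Figures quoted in the interview, all in petaflops.
PEAK_PFLOPS = 125.0
LINPACK_PFLOPS = 93.0
HPCG_PFLOPS = 0.371

# HPCG result as a percentage of theoretical peak.
hpcg_percent_of_peak = 100.0 * HPCG_PFLOPS / PEAK_PFLOPS

# How many times faster the machine runs LINPACK than HPCG.
linpack_to_hpcg_ratio = LINPACK_PFLOPS / HPCG_PFLOPS

print(f"HPCG: {hpcg_percent_of_peak:.2f}% of peak")
print(f"LINPACK is about {linpack_to_hpcg_ratio:.0f}x the HPCG rate")
```

On these numbers, the memory-bound benchmark runs roughly 250 times slower than the compute-bound one.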
There are currently four key application domains for the Sunway TaihuLight system:
- Advanced manufacturing: CFD, CAE applications
- Earth system modeling and weather forecasting
- Life science
- Big data analytics.
There were three finalist submissions for the Gordon Bell Prize at SC16 that are based on the new Sunway TaihuLight system. These three applications are: (1) a fully-implicit nonhydrostatic dynamic solver for cloud-resolving atmospheric simulation; (2) a highly effective global surface wave numerical simulation with ultra-high resolution; (3) large scale phase-field simulation for coarsening dynamics based on Cahn-Hilliard equation with degenerated mobility.
All three of these applications have scaled to around 8 million cores (close to the full system scale). The applications that use an explicit method (such as wave simulation and phase-field simulation) have achieved a sustained performance of 30 to 40 petaflops. In contrast, the implicit solver achieves a sustained performance of around 1.5 petaflops, with a good convergence rate for large-scale problems.
The system has a heck of a lot of memory – 1.3 petabytes, compared to 0.7 petabytes for Titan.
It’s also a big step up in power efficiency. It consumes 15.3 megawatts, which is impressive given the number of cores and the rate of execution.
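The efficiency figure of about 6 gigaflops per watt can be reproduced from the LINPACK rate and the power draw. The Python sketch below does the unit conversion explicitly; it assumes the 15.3 megawatts applies to the 93-petaflop LINPACK run, and the helper name `gflops_per_watt` is hypothetical.

```python
def gflops_per_watt(pflops, megawatts):
    """Convert petaflops and megawatts into gigaflops per watt."""
    gflops = pflops * 1e6   # 1 petaflop = 1,000,000 gigaflops
    watts = megawatts * 1e6  # 1 megawatt = 1,000,000 watts
    return gflops / watts


# Assumption: the 15.3 MW figure corresponds to the 93-petaflop LINPACK run.
taihulight_eff = gflops_per_watt(93.0, 15.3)

# Efficiencies quoted in the interview for the next two systems.
tianhe2_eff = 1.9
titan_eff = 2.1

print(f"TaihuLight: {taihulight_eff:.2f} gigaflops per watt")
print(f"vs. Tianhe-2: {taihulight_eff / tianhe2_eff:.1f}x")
print(f"vs. Titan: {taihulight_eff / titan_eff:.1f}x")
```

Both ratios come out close to three, in line with the "three times the efficiency" statement.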
SC16: What does this mean for China?
Tianhe-2A was supposed to be upgraded with Intel’s Knights Landing processors, but last year the U.S. Department of Commerce blocked the export of Intel technology to some parts of China.
When the Commerce Department blocked the exports, China invested heavily in HPC research and development, and it is replacing Intel processors with its own designs. This system is based on a Chinese processor with 260 cores. For comparison, Intel’s Knights Landing has 72 cores. Both processors have about the same clock rate: 1.45 gigahertz for the Chinese processor and 1.40 GHz for Knights Landing. It means that China has continued to leapfrog the U.S. by a considerable amount.
For comparison, Tianhe-2, which was the top machine on the last six TOP500 lists, was twice as fast as the Oak Ridge Titan, or the equivalent of Titan combined with Sequoia at Lawrence Livermore. This new system is nearly three times as powerful as Tianhe-2.
SC16: What does this mean for the U.S.?
China now has a very big machine running and producing real results. We are going to have three machines with similar power in 2017, going into production probably in 2018. China has plans to deploy an exascale system in 2020, and the U.S. target date for exascale is 2023.
There is every indication that if they put four times the processors in each node, they can build half of an exascale machine. If they turn the crank a couple of times, they can get to an exaflop. China is clearly ahead of where we are in petascale deployment. If they are truly competitive for the ACM Gordon Bell Prize at SC16, we’ll see the real impact of this system.
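The "half of an exascale machine" estimate follows from simple scaling of the peak figure. The Python sketch below is a back-of-the-envelope check only: it assumes peak performance grows linearly with the number of processors per node, which ignores memory, interconnect and power limits, and its variable names are illustrative.

```python
# Theoretical peak quoted in the interview, in petaflops.
CURRENT_PEAK_PFLOPS = 125.0
PFLOPS_PER_EXAFLOP = 1000.0

# Assumption: peak scales linearly with processors per node.
processor_multiplier = 4
scaled_peak_pflops = CURRENT_PEAK_PFLOPS * processor_multiplier
fraction_of_exaflop = scaled_peak_pflops / PFLOPS_PER_EXAFLOP

# One further doubling of the scaled design would reach an exaflop of peak.
doubled_peak_pflops = 2 * scaled_peak_pflops

print(f"4x processors per node: {scaled_peak_pflops:.0f} petaflops peak")
print(f"That is {fraction_of_exaflop:.1f} of an exaflop")
print(f"One more doubling: {doubled_peak_pflops:.0f} petaflops peak")
```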