MLCommons released the inaugural MLPerf HPC results last year, measuring how quickly different systems could train a neural network. The initial benchmark suite has been used to measure systems that generally use somewhere between 500 and 4,000 processors or accelerators, quite a bit smaller than the leading supercomputers. But while the initial version worked well for many scientifically oriented workloads, it didn’t really scale up to full supercomputing capabilities. For instance, at scale, the interconnect begins to matter a lot more. “It’s important to keep in mind that small systems and large systems behave very differently,” David Kanter, the head of MLPerf, said in a briefing with reporters.

At supercomputer scale, most systems run multiple jobs, such as training ML models, in parallel. So, in addition to the time-to-train metric, MLCommons added a throughput metric. It measures how many models per minute a system can train, which is “a very good proxy for the aggregate machine learning capabilities of a supercomputer,” Kanter said. The metric also captures the impact on shared resources, such as the storage system and interconnects. Submitters can choose the size and number of instances they test, allowing them to exhibit different supercomputing capabilities. For this release, submitters also had to report their “strong-scaling” results, meaning the “time to train” metric.

For this benchmark round, MLCommons received submissions from eight supercomputing organizations, including Argonne National Laboratory, the Swiss National Supercomputing Centre, Fujitsu and Japan’s Institute of Physical and Chemical Research (RIKEN), Helmholtz AI (a collaboration of the Jülich Supercomputing Centre at Forschungszentrum Jülich and the Steinbuch Centre for Computing at the Karlsruhe Institute of Technology), Lawrence Berkeley National Laboratory, the National Center for Supercomputing Applications, NVIDIA, and the Texas Advanced Computing Center.
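As a rough illustration of the throughput (weak-scaling) metric described above, the Python sketch below counts how many model instances finished training and divides by the wall-clock minutes until the last one converged. The function name, the input format and the sample timings are all hypothetical, and the official MLPerf HPC rules may define the measurement window differently.

```python
def models_per_minute(finish_times_s):
    """Return models trained per minute of wall-clock time.

    finish_times_s: seconds from a shared start until each concurrently
    trained model instance reached its quality target (assumed input format).
    """
    if not finish_times_s:
        raise ValueError("need at least one model instance")
    # Instances run in parallel, so the slowest one sets the wall-clock time.
    wall_clock_min = max(finish_times_s) / 60.0
    return len(finish_times_s) / wall_clock_min


# Hypothetical run: four instances trained side by side on one machine.
finish_times = [1500.0, 1620.0, 1800.0, 1740.0]
print(f"{models_per_minute(finish_times):.3f} models/min")  # 0.133 models/min
```

Because the slowest instance sets the clock, contention for shared storage or interconnect bandwidth shows up directly as lower throughput.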
More than 30 results were released, including eight using the new throughput (weak-scaling) metric. Fujitsu and RIKEN had the best throughput for CosmoFlow training, while NVIDIA had the best throughput for DeepCAM training.

Along with the new metric, MLCommons introduced a new graph neural network benchmark for molecular modeling. The OpenCatalyst benchmark predicts the quantum mechanical properties of catalyst systems to discover and evaluate new catalyst materials for energy storage applications. The new benchmark uses the OC20 dataset from the Open Catalyst Project, the largest and most diverse publicly available dataset of its kind, with the task of predicting energy and per-atom forces. The reference model for OpenCatalyst is DimeNet++, a graph neural network (GNN) designed for atomic systems that can model the interactions between pairs of atoms as well as angular relations between triplets of atoms.

Given the extensive calculations involved, simulations of atomic systems and molecular modeling are some of the dominant HPC workloads. These systems are best represented in graph form, but MLPerf previously didn’t include any graph neural network benchmarks. Adding the benchmark was important, MLCommons said, given that graph neural networks have computational characteristics that are fairly different from those of models like a convolutional neural network or recommender system.
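To make the “pairs of atoms” and “triplets of atoms” idea more concrete, the Python sketch below builds a small graph from atomic positions and computes the two kinds of geometric quantities mentioned above: distances between neighboring pairs and angles within triplets. It is not DimeNet++ itself, only an illustration of the inputs such a model works with; the function name, the cutoff value and the toy coordinates are assumptions.

```python
import itertools
import math


def pairs_and_angles(positions, cutoff):
    """Return (distances, angles) for a toy atomic graph.

    distances: {(i, j): distance} for atom pairs within `cutoff`
    angles:    {(i, j, k): degrees} for triplets with j as the central atom
    """
    neighbors = {i: [] for i in range(len(positions))}
    distances = {}
    # Edges: every pair of atoms closer than the cutoff radius.
    for i, j in itertools.combinations(range(len(positions)), 2):
        d = math.dist(positions[i], positions[j])
        if d <= cutoff:
            distances[(i, j)] = d
            neighbors[i].append(j)
            neighbors[j].append(i)

    angles = {}
    # Triplets: two neighbors i and k that share the central atom j.
    for j, nbrs in neighbors.items():
        for i, k in itertools.combinations(nbrs, 2):
            v1 = [a - b for a, b in zip(positions[i], positions[j])]
            v2 = [a - b for a, b in zip(positions[k], positions[j])]
            cos = sum(x * y for x, y in zip(v1, v2)) / (
                math.hypot(*v1) * math.hypot(*v2)
            )
            # Clamp to [-1, 1] to guard against floating-point rounding.
            angles[(i, j, k)] = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    return distances, angles


# Toy system: three atoms forming a right angle at atom 0 (made-up coordinates).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
distances, angles = pairs_and_angles(atoms, cutoff=1.2)
print(distances)
print(angles)
```

The number of triplets grows much faster than the number of atoms, which is one reason this kind of workload behaves differently from convolutional or recommender models.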