Sustainable Metal Cloud publishes world-first MLPerf Training power consumption results, establishes new benchmark

13 June 2024

Singapore-based Firmus Technologies, through its innovative AI GPU cloud platform, Sustainable Metal Cloud (SMC), has announced the release of its first MLPerf Training benchmark results. As one of the newest members of MLCommons, SMC utilizes its proprietary single-phase immersion platform, a “Sustainable AI Factory,” to deliver world-class AI training performance with a commitment to energy efficiency.

SMC’s operations are primarily based in Asia, with a globally expanding network of scaled GPU clusters and infrastructure, including NVIDIA H100 SXM accelerators. This milestone marks SMC’s debut in the MLPerf Training benchmark, joining industry leaders such as NVIDIA, Dell Technologies, Hewlett Packard Enterprise, Supermicro, Oracle, and Google. For the first time, submitters could publish power data for their equipment during testing, and SMC stands out as the only submitter to publish a comprehensive suite of power results alongside performance data, spanning training runs from a single node up to clusters of 64 nodes (512 GPUs).

SMC’s results underscore its leadership in energy efficiency, demonstrating a significant reduction in power consumption. Training the GPT-3 175B benchmark on 512 NVIDIA H100 Tensor Core GPUs connected with NVIDIA Quantum-2 InfiniBand networking consumed only 468 kWh of total energy, a substantial saving compared to conventional air-cooled infrastructure. Deployed within its partner ST Telemedia Global Data Centres’ Singapore facility, the SMC platform has been shown to save up to 50% of total energy whilst maintaining industry-leading benchmark training performance.

Ted Pretty, Chairman of SMC, stated, “We are thrilled to be a part of MLCommons and contribute to advancements in energy-efficient AI infrastructure. These results, verified by MLCommons members, validate the transformative power of our Sustainable AI Factories in reducing the environmental impact of large-scale AI. As the demand for AI grows, addressing resource consumption is critical. MLCommons has provided us with the platform to validate our technology, which now stands as a viable solution for using less power, water, and space.”

“MLPerf’s benchmarks exist to make machine learning better. It’s fantastic to see Sustainable Metal Cloud (SMC), one of our newest members, submit to MLPerf Training with our first-ever power measurements. SMC’s release establishes a baseline for best-practice power consumption,” said David Kanter, MLCommons Executive Director.

The significant reduction in energy consumption is primarily due to the unique design of the infrastructure running the AI model: SMC’s Sustainable AI Factories. SMC’s proprietary single-phase immersion cooling technology plays a crucial role in achieving these breakthrough results. The SMC design removes heat directly from the processors by submerging servers in a liquid-filled tank, eliminating the energy-intensive fans and chilled-water air-conditioning systems found in traditional data centres. This yields a 30% energy saving at the server level, with a further 20% saving from retrofitting the immersion platform directly into traditional air-cooled data centre floor space.

Viewed at the level of a single NVIDIA H100 HGX server, SMC’s training runs consumed 6.58 kW on average under test conditions; factoring in the 1.10 integrated-facility PUE of SMC’s Singapore data centre, this grosses up to 7.23 kW. A typical air-cooled H100 HGX server not running in an SMC Sustainable AI Factory consumes an average of 9-11 kW at the server level and, at a benchmark PUE of 1.5, requires as much as 15 kW of total facility power to operate. Considered this way, SMC’s achievements represent a paradigm shift in the energy consumption profile of AI workloads.
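To make the comparison concrete, the gross-up described above is simply server-level power multiplied by facility PUE. The short sketch below reproduces that arithmetic using the figures quoted in this release; the function and variable names are illustrative only, not part of any SMC tooling:

```python
# Sketch of the PUE gross-up arithmetic quoted above.
# All figures come from this release; names are illustrative.

def facility_power_kw(server_power_kw: float, pue: float) -> float:
    """Total facility power = server (IT) power x PUE."""
    return server_power_kw * pue

# SMC immersion-cooled H100 HGX server under MLPerf test conditions
smc_total = facility_power_kw(6.58, 1.10)   # ~7.23 kW

# Typical air-cooled H100 HGX server, midpoint of the quoted 9-11 kW range
air_total = facility_power_kw(10.0, 1.5)    # 15 kW

saving = 1 - smc_total / air_total
print(f"SMC: {smc_total:.2f} kW vs air-cooled: {air_total:.2f} kW "
      f"({saving:.0%} lower total power)")   # roughly 50% lower
```

On these figures, the immersion-cooled system draws roughly half the total power of the air-cooled baseline, consistent with the up-to-50% saving cited earlier.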

Post-submission, Sustainable Metal Cloud adopted an enhanced software stack, further reducing energy consumption to 451 kWh and improving performance by 7%, positioning its customer cloud training environment just 6% below NVIDIA’s flagship Eos AI supercomputer¹.

As a part of MLCommons, SMC aims to showcase progressive technologies, set benchmarks for best practices, and advocate for long-term energy-saving initiatives.

David Kanter further added, “The MLPerf benchmarks help buyers understand how systems perform on relevant workloads. The addition of the new power consumption benchmark gives our members, buyers, and the entire AI community a new way to rate energy efficiency and environmental impact. As a community, it’s important that we can measure things, so we can improve them. I hope SMC’s initial results will help drive transparency around the power consumption of AI training. This is an example of why MLCommons exists – to bring the best of industry together and have new, scaling tech platforms benchmarked against the world’s largest infrastructure providers.”

Sustainable Metal Cloud is committed to advancing energy-efficient AI training, with its results verified by MLCommons. The company’s energy transparency and technology aim to influence industry narratives and establish new benchmarks for sustainable AI practices. 

View more of our results at https://smc.co/mlperfv40

Media Contact
Lauren Crystal, Head of Communications

lauren.crystal@smc.co 

¹ Result not verified by MLCommons Association