Google has revealed new details about the supercomputers it uses to train its artificial intelligence models, stating that they are faster and more power-efficient than comparable systems from Nvidia. Google's custom chip, the Tensor Processing Unit (TPU), is used for over 90% of the company's work on AI training. The TPU is now in its fourth generation, and Google has published a scientific paper outlining how it has connected over 4,000 of these chips using its own optical switches to create a supercomputer.
Improving the connections between chips has become a key point of competition among companies building AI supercomputers. The large language models that power technologies like Google's Bard or OpenAI's ChatGPT have exploded in size, far too large to store on a single chip. Instead, a model must be split across thousands of chips, which then work together for weeks or more to complete training.
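To make that splitting concrete, here is a minimal sketch in JAX, the framework Google builds for TPU workloads. It is not the paper's actual training setup, and the matrix sizes and axis names are illustrative assumptions: a single weight matrix is sharded across whatever chips are available, and the compiler inserts the chip-to-chip communication automatically.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Arrange the available accelerator chips into a 1-D mesh. On a TPU pod
# this spans many chips; on a laptop it degenerates to a single device.
mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

# A weight matrix too large for any one chip is split column-wise, so
# each device stores only its own shard. (Real training code would
# initialize the shards in place rather than building the full array.)
weights = jnp.zeros((8192, 8192))
weights = jax.device_put(weights, NamedSharding(mesh, PartitionSpec(None, "model")))

# Under jit, the XLA compiler adds the cross-chip communication needed
# to compute with the distributed matrix.
@jax.jit
def forward(x, w):
    return x @ w

activations = forward(jnp.ones((4, 8192)), weights)
print(activations.shape)  # (4, 8192), computed collectively across the mesh
```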
"Google’s TPU-based supercomputer, called TPU v4, is “1.2x–1.7x faster and uses 1.3x–1.9x less power than the Nvidia A100,” the Google researchers wrote. |
Google's supercomputer makes it easy to reconfigure the connections between chips on the fly, which helps the company route around failures and tune the system for performance gains. For comparably sized systems, Google said its chips are up to 1.7 times faster and 1.9 times more power-efficient than a system built around Nvidia's A100, the chip that was on the market at the same time as the fourth-generation TPU.
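The optical switches at the heart of this design behave like programmable patch panels. The toy model below is my own illustration, not Google's implementation: the switch holds a one-to-one mapping of ports, so re-wiring the chip topology means installing a new permutation in software rather than re-cabling the data center.

```python
class OpticalCircuitSwitch:
    """Toy model of an optical circuit switch: a programmable
    one-to-one mapping from input ports to output ports."""

    def __init__(self, num_ports: int):
        # Start with the identity wiring: port i connects to port i.
        self.mapping = {p: p for p in range(num_ports)}

    def reconfigure(self, new_mapping: dict[int, int]) -> None:
        # A valid circuit configuration is a permutation: every port
        # appears exactly once on each side.
        assert sorted(new_mapping) == sorted(new_mapping.values())
        self.mapping = dict(new_mapping)

switch = OpticalCircuitSwitch(num_ports=4)

# Wire four chips into a ring: 0 -> 1 -> 2 -> 3 -> 0.
switch.reconfigure({0: 1, 1: 2, 2: 3, 3: 0})

# Chips 0 and 2 turn out to exchange the most traffic, so re-wire the
# topology into direct pairs without touching a single cable.
switch.reconfigure({0: 2, 2: 0, 1: 3, 3: 1})
print(switch.mapping)  # {0: 2, 2: 0, 1: 3, 3: 1}
```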
While Google is only now releasing details about its supercomputer, the system has been online inside the company since 2020, in a data center in Oklahoma. Google said the startup Midjourney used it to train its model, which generates fresh images from a few words of text. Google has also hinted that it may be working on a new TPU to compete with Nvidia's H100, but it offered no details.