According to Meta’s new research report, the cluster of 16,384 NVIDIA H100 GPUs used to train the 405-billion-parameter Llama 3 model turned out to be a real headache. It failed unexpectedly 419 times in 54 days, which works out to one failure roughly every three hours on average.
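The three-hour figure follows directly from the numbers in the report. A quick check in Python, using only the values quoted above (the variable names are illustrative):

```python
# Figures quoted from the report: 419 unexpected failures in a 54-day window.
failures = 419
days = 54

hours = days * 24                      # total hours in the training window
hours_per_failure = hours / failures   # mean time between failures

print(f"{hours_per_failure:.2f} hours between failures on average")
```

The result is slightly over three hours, which matches the rounded figure in the article.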
Meta Llama 3 language model crashes every three hours
The Llama 3 training job is so large and so tightly synchronized that if even a single GPU fails, the entire training process stops and has to be restarted. According to the Meta team’s report, 148 (30.1%) of these 419 failures were caused by various GPU issues, and 72 (17.2%) were caused by the GPUs’ high-bandwidth memory (HBM3). Surprisingly, there were only two CPU failures in the 54 days. Another 41.3 percent of the unexpected outages were caused by software errors, network cables, and adapter problems.
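To see why scale matters so much here, it helps to turn the cluster-wide GPU failure count into a rough per-GPU figure. The sketch below is a back-of-the-envelope estimate, not a number from the report: it assumes all 16,384 GPUs ran for the full 54 days and that the 148 GPU failures were spread evenly across them.

```python
# Quoted figures: 16,384 GPUs, 54 days, 148 failures attributed to GPU issues.
gpus = 16384
hours = 54 * 24
gpu_failures = 148

# Assumption: every GPU was in service for the whole window.
gpu_hours = gpus * hours

# Rough mean time between failures for a single GPU, in hours and years.
per_gpu_mtbf_hours = gpu_hours / gpu_failures
per_gpu_mtbf_years = per_gpu_mtbf_hours / (24 * 365)

print(f"about {per_gpu_mtbf_hours:,.0f} hours per GPU between failures")
print(f"roughly {per_gpu_mtbf_years:.1f} years")
```

Under these assumptions each individual GPU looks very reliable; it is the requirement that all of them work at the same time that makes interruptions so frequent.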
The Meta team developed a set of tools and strategies to cope with this chaos. They reduced job startup and checkpointing times, diagnosed performance issues using PyTorch’s NCCL flight recorder, and identified lagging GPUs. They also looked at environmental factors, such as the impact of midday temperature fluctuations on GPU performance and the strain that large numbers of GPUs running simultaneously place on the data center’s power grid.
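The idea of spotting lagging GPUs can be illustrated with a small sketch. This is not Meta's tooling and does not use the NCCL flight recorder; the function `find_stragglers`, the `tolerance` parameter, and the sample timings are all hypothetical. It simply flags any rank whose step time is well above the median of the group.

```python
from statistics import median

def find_stragglers(step_times, tolerance=1.2):
    """Return ranks whose step time exceeds tolerance x the median step time.

    step_times maps a GPU rank to its measured time for one training step.
    Both the function and the 1.2 threshold are illustrative assumptions.
    """
    typical = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > tolerance * typical)

# Made-up per-rank step times in seconds, for illustration only.
sample = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.65, 4: 1.01}
print(find_stragglers(sample))
```

In a synchronized job every step waits for the slowest participant, so a single slow rank like the one flagged here would hold back all the others.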
As AI models grow in parameter count, as with the 405-billion-parameter Llama 3, training clusters this large will become more common. For example, the 100,000-GPU H100 cluster that xAI is planning suggests that future AI training may run into even more of these challenges. That is why Meta’s efforts to solve these problems now matter for larger-scale projects in the future.
Despite the failures, Meta was able to achieve over 90 percent effective training time. It would have been more efficient still without these malfunctions, and the experience will help Meta build more robust systems for its future projects.
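The 90 percent figure also puts a ceiling on how long each interruption could have cost on average. The estimate below is a rough sketch under stated assumptions: it treats 90 percent as the floor, counts only the 419 unexpected failures, and ignores any other source of lost time.

```python
# Quoted figures: 54 days, 419 unexpected failures, over 90% effective time.
hours = 54 * 24
failures = 419
effective_fraction = 0.90   # assumption: use the quoted floor as the value

# Hours that could have been lost while still reaching the effective fraction.
lost_hours_budget = hours * (1 - effective_fraction)

# Upper bound on the average cost of one interruption, in minutes.
minutes_per_failure = lost_hours_budget * 60 / failures

print(f"at most {lost_hours_budget:.1f} hours lost in total")
print(f"about {minutes_per_failure:.1f} minutes per failure on average")
```

Under these assumptions the average interruption could not have cost much more than a quarter of an hour, which is consistent with the emphasis on shorter job startup and checkpoint times.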
Source link: https://shiftdelete.net/gpular-isyanda-meta-llama-3-dil-modeli-ariza