Announcement_4
Excited to share our work on understanding how silent data corruption errors affect LLM training! We study silent errors occurring on real-world unhealthy hardware swept out by production fleet management, characterize the magnitude of these errors, and analyze how they impact both tensors during training and model quality.