ByteDance Researchers Publish High-Performance AI Training Method

TikTok parent ByteDance, a major AI player in China, releases open technique for training LLMs that it says outperforms DeepSeek

The headquarters of TikTok parent ByteDance. Image credit: ByteDance

Researchers from ByteDance, Tsinghua University, and the University of Hong Kong have released an open-source system for AI reinforcement learning that they say outperforms a reasoning system from DeepSeek.

The DAPO (Decoupled Clip and Dynamic Sampling Policy Optimisation) system is designed to give other researchers reusable reinforcement-learning (RL) techniques for large language models (LLMs).

AI companies often release only partial details of their RL methods, the researchers said, making the techniques difficult to reproduce.

DeepSeek, Copilot, ChatGPT, Character.AI, Perplexity and Gemini AI apps displayed on a smartphone screen. Image credit: Unsplash

Open approach

In a new research paper, the researchers said they tried to reproduce DeepSeek’s GRPO (group relative policy optimisation) method, but their results trailed DeepSeek’s by 17 points on the AIME benchmark, “suggesting that critical training details may have been omitted in the R1 paper”.
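For readers unfamiliar with the term, the “group relative” idea in GRPO can be illustrated with a minimal sketch: several answers are sampled for the same prompt, and each answer's reward is scored against the rest of its group. The function name, the use of the population standard deviation and the `eps` term below are illustrative assumptions, not code from the DAPO or DeepSeek releases.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to its own group.

    `rewards` holds one reward per answer sampled for a single prompt.
    Hypothetical helper for illustration; `eps` only guards against
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev, an assumption here
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one maths problem: two correct, two wrong.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Every answer correct: all advantages collapse to zero.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

In this sketch, correct answers receive a positive score and incorrect ones a negative score, while a group in which every answer earns the same reward yields no signal at all.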

R1 is DeepSeek’s latest “reasoning” AI model.

Reasoning models deliberately “think” longer before delivering an answer, double-checking their responses and reducing the potential for errors.

In the interests of transparency and reproducibility, the DAPO team released the algorithmic details, training procedures and datasets used in their research.

The project includes training code and a prepared dataset called DAPO-Math-17K for mathematical reasoning tasks.

The team said DAPO delivered significant performance improvements over DeepSeek’s GRPO on the American Invitational Mathematics Examination (AIME) 2024 benchmark, scoring 50 points with Alibaba’s open-source Qwen2.5-32B base model, compared with 47 points for GRPO.

Efficiency

DAPO achieved the score with half the training steps of GRPO, underscoring its efficiency, the team said.

The project is led by Yu Qiying, a ByteDance intern and doctoral student at Tsinghua. Other participants include a Tsinghua undergraduate and a University of Hong Kong doctoral student, as the company seeks to work with top-level AI researchers before they graduate.

The TikTok parent has invested heavily in AI, and its Doubao chatbot has become China’s most popular chatbot since its launch last May, ranking as the world’s second most popular after OpenAI’s ChatGPT.
