
Did DeepSeek Use Gemini to Train Its New AI?

DeepSeek, a Chinese AI company, recently launched an updated version of its R1 reasoning model, called R1-0528. This new model shows strong performance in math and coding tasks. However, some developers and researchers now believe that it may have been trained using data from Google’s Gemini AI models.

Melbourne-based developer Sam Paech pointed out that R1-0528 often uses words and expressions similar to those favored by Gemini 2.5 Pro. In a post on X, he shared examples that hint at this overlap. The developer behind the SpeechMap AI tool also noted that the reasoning traces DeepSeek's model produces while working through problems read almost identically to those generated by Gemini models.

This is not the first time DeepSeek has been linked to data from other AI systems. Back in December, its older model, DeepSeek V3, sometimes introduced itself as ChatGPT, raising concerns that it had been trained on ChatGPT conversation logs and, by extension, OpenAI's data.

Earlier this year, OpenAI said it had found evidence linking DeepSeek to distillation, a method in which a smaller model is trained on the outputs of a larger, more advanced one. Distillation is not illegal, but it violates OpenAI's terms of service, which prohibit using its models to build rival AI systems.
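For readers unfamiliar with the technique, the sketch below shows the basic idea under simplified assumptions: a small "student" network is trained to mimic the softened output distribution of a frozen "teacher." The toy networks, layer sizes, and temperature value are purely illustrative; in the scenario alleged here, the teacher's outputs would come from another company's hosted model API rather than a local network.

```python
# Minimal knowledge-distillation sketch (illustrative only):
# a small "student" model learns to match the output distribution
# of a larger, frozen "teacher" model.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins: in practice the teacher would be a large model
# (or its API outputs) and the student a smaller model being trained.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution so it is easier to imitate

for step in range(100):
    x = torch.randn(32, 16)  # stand-in for real training inputs (e.g. prompts)

    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)

    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)

    # KL divergence pushes the student's predictions toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same idea applies when the teacher is only reachable through an API: the trainer collects the teacher's responses to a large set of prompts and fine-tunes the student on those prompt-response pairs instead of matching logits directly.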

Microsoft, an OpenAI partner, reportedly discovered large amounts of data leaving OpenAI accounts in late 2024. The company believes these accounts were tied to DeepSeek. This raised further suspicion that DeepSeek may have accessed OpenAI or Gemini outputs to train its own tools.

The challenge here is that today’s internet is flooded with AI-generated content. This makes it very hard for companies to filter out model outputs from training data. AI-generated content appears on blogs, forums, and social media, which are often used as training sources by AI companies. Because of this, different AI models may start to sound the same.

Still, some experts believe DeepSeek might have used outputs from top-performing models like Gemini. Researcher Nathan Lambert shared his thoughts on X, saying that if DeepSeek is short on GPUs but well funded, generating synthetic data from the best available API model would be an efficient way to spend its money.

To curb misuse of their models, AI companies are tightening security. OpenAI now requires users to verify their identity with a government ID before accessing its most advanced tools. Verification is only available in countries OpenAI supports, and China is not on that list.

Google has also added protection by summarizing the reasoning traces its models produce on its developer platform, making it harder for rivals to train competing models on that output. Anthropic, another AI company, is doing the same to protect its technology.

As of now, Google hasn’t made any public comments on the issue.

