According to TechCrunch, several voices on X, including Hugging Face CEO Clement Delangue, believe the secret lies in the training data. "Reasoning models like o1 are trained on datasets containing a lot of Chinese characters," Delangue wrote. But Ted Xiao, a researcher at Google DeepMind, took the theory further, suggesting that reliance on third-party Chinese data labeling services is at the heart of the linguistic quirk.
"Labs like OpenAI and Anthropic use third-party data labeling services for PhD-level reasoning data for science, math, and coding," Xiao pointed out on X. "[F]or expert labor availability and cost reasons, many of these data providers are based in China."
According to Xiao, o1's Chinese tendencies are an example of "Chinese linguistic influence on reasoning."
But not everyone buys into this theory. A contingent of experts argues that o1's language gymnastics are not limited to Chinese.
"The model doesn't know what language is, or that languages are different," says Matthew Guzdial, an AI researcher at the University of Alberta. "It's all just text to it."
Tiezhen Wang, a software engineer at AI startup Hugging Face, agreed, suggesting that these linguistic inconsistencies are tied to the model's training. "By embracing every linguistic nuance, we expand the model's worldview and allow it to learn from the full spectrum of human knowledge," Wang explained on X. "For example, I prefer doing math in Chinese because each digit is just one syllable, which makes calculations crisp and efficient."
Luca Soldaini, a research scientist at the nonprofit Allen Institute for AI, cautioned that pinpointing the exact cause is a tall order.

"This type of observation on a deployed AI system is impossible to back up due to how opaque these models are," they told TechCrunch. If nothing else, it's an argument for greater transparency in how AI systems are built.