The study, which has not been peer-reviewed, found that human participants correctly identified other humans in only 63 per cent of the interactions -- and that a 1960s computer program surpassed the AI model that powers the free version of ChatGPT.
Even with limitations and caveats, which we'll cover below, the paper presents a thought-provoking comparison between AI model approaches and raises further questions about using the Turing test to evaluate AI model performance.
In the recent study, listed on arXiv at the end of October, UC San Diego researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the university's Department of Cognitive Science) set up a website called turingtest.live, where they hosted a two-player implementation of the Turing test over the Internet to see how well GPT-4, when prompted different ways, could convince people it was human.
Through the site, human interrogators interacted with various "AI witnesses" representing other humans or AI models, including GPT-4, GPT-3.5, and ELIZA, a rules-based conversational program from the 1960s.
"The two participants in human matches were randomly assigned to the interrogator and witness roles," the researchers said.
"Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators."
The experiment involved 652 participants who completed a total of 1,810 sessions, of which 1,405 games were analysed after excluding certain scenarios like repeated AI games (leading to the expectation of AI model interactions when other humans weren't online) or personal acquaintance between participants and witnesses, who were sometimes sitting in the same room.
ELIZA, developed in the mid-1960s by computer scientist Joseph Weizenbaum at MIT, scored relatively well during the study, achieving a success rate of 27 per cent. GPT-3.5, depending on the prompt, scored a 14 per cent success rate, below ELIZA. GPT-4 achieved a success rate of 41 per cent, second only to actual humans.
The study's authors concluded that GPT-4 does not meet the success criteria of the Turing test, reaching neither a 50 per cent success rate nor surpassing the success rate of human participants.
The researchers speculate that GPT-4 or similar models might eventually pass the Turing test with the right prompt design. However, the challenge lies in crafting a prompt that mimics the subtlety of human conversation styles. And like GPT-3.5, GPT-4 has also been conditioned not to present itself as human."