Rise of the clones? Playing around with AI audio

Wouldn’t it be great if we could make podcasts and videos in one language and then, by running them through an AI tool, get them translated into countless others?

Imagine the potential: access to massive new audiences, without the cost and hassle of re-recording, producing, hiring talent, booking new guests and so on…

Of course, if you doomscroll any kind of social media, there’s a lot of that kind of thing happening already, but it often looks quite shonky (I’ve seen too many short videos unconvincingly dubbed by Generic Robotic-sounding American Man).

But I work in communications at the UN, where we have to be extra careful about the way we use AI. “Move fast and break things” doesn’t really cut it when a mistranslation could have serious diplomatic consequences.

Nevertheless, there’s been a lot of excitement about the potential these tools have to reach more people during a period of belt-tightening, so I decided to experiment with some of my colleagues, to find out exactly what is possible.

Wondering about Wondercraft

I started by getting in touch with Wondercraft, an AI audio company whose tools are used by a wide range of companies and organisations, including the Institute for Economic Development (part of the World Bank Group). The Institute worked with Wondercraft to make “Ideas for Impact,” described as WBG’s first “AI-generated, expert-led podcast series,” in which research papers are turned into audio content presented by cloned voices.

The CEO of Wondercraft, Oskar Serrander, kindly offered to let me trial his software for the experiment. He told me that his company “allows anyone to go from a simple script to a fully produced podcast or voiceover in minutes, without mics or studios or really any technical skills.”

Some of the initial results were extremely impressive (or worrying, if you’re an audio professional concerned about job security). For example, UN Secretary-General António Guterres seemed very comfortable speaking Hindi, and UN Humanitarian chief Tom Fletcher, a Brit, appeared to have faultless French. The prospect of creating news bulletins and podcasts in dozens of languages, without engaging a huge team, seemed within reach.

But it’s not that simple. Not yet anyway. At the UN, trust is paramount. Our producers are schooled in journalistic ethics and take great pains to ensure that our content is accurate, well sourced and verified. We found that, to maintain those standards, a significant amount of human intervention is still needed.

The experiment

To really push the Wondercraft software to its limits, we created a short podcast in which presenters and interviewees recorded themselves speaking Urdu, Kiswahili, Punjabi, English and Swedish. The idea was to find out if all the voices could be rendered into English in a way that accurately reflected their original speeches in terms of language, accent and tone.

From this English version, we then created, relatively easily, a credible French version.

The Urdu language, however, proved more difficult, particularly when it came to accents.

The software works by “reading” an audio file, transcribing it and cloning the speaker’s voice. That transcript needs to be checked (mistakes can and do occur, particularly if the speaker has an accent). Once the text has been corrected, it can be translated (into around 70 languages). The translated script then needs to be checked and corrected by someone who is fluent, or nearly fluent, in the language before the new recording can be generated.

We found that correcting the new transcript (idiom, in particular, can be challenging) could throw off the tone and intonation of the cloned voice. You then have to write prompts (e.g. speak faster, speak slower, more engaged, more emotion), tweaking the audio until the desired result is achieved.
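For the technically curious, here is a rough sketch, in Python, of what that human-in-the-loop workflow looks like. To be clear, none of this uses Wondercraft’s actual API: every function below (transcribe_and_clone, translate_script, generate_audio and so on) is a made-up stand-in, included only to show where the human checks sit in the chain.

```python
# Hypothetical sketch of the human-in-the-loop workflow described above.
# None of these functions correspond to Wondercraft's real API; they are
# placeholders written purely to illustrate where human review fits.

def transcribe_and_clone(audio_file: str) -> tuple[str, str]:
    """Stand-in for the tool's transcription + voice-cloning step."""
    return f"transcript of {audio_file}", f"cloned voice from {audio_file}"

def translate_script(transcript: str, language: str) -> str:
    """Stand-in for machine translation into one of ~70 languages."""
    return f"[{language}] {transcript}"

def generate_audio(script: str, voice: str, prompts: list[str]) -> str:
    """Stand-in for text-to-audio generation with the cloned voice."""
    return f"audio of '{script}' in {voice}, tweaked with prompts {prompts}"

def human_review(text: str, reviewer: str) -> str:
    """A fluent or near-fluent colleague checks and corrects the text by hand."""
    print(f"[{reviewer}] checking: {text}")
    return text  # in a real workflow, corrections would be applied here

def produce_translated_episode(audio_file: str, target_language: str) -> str:
    # 1. "Read" the audio: transcribe it and clone the speaker's voice.
    transcript, voice = transcribe_and_clone(audio_file)

    # 2. Check the transcript: mistakes occur, especially with accented speech.
    transcript = human_review(transcript, reviewer="source-language producer")

    # 3. Translate the corrected script, then check it again: idiom in
    #    particular can trip the translation up.
    translated = translate_script(transcript, target_language)
    translated = human_review(translated, reviewer="target-language producer")

    # 4. Generate the recording, then tweak with prompts ("speak faster",
    #    "more emotion"...) until the tone and intonation sound right.
    prompts: list[str] = []
    audio = generate_audio(translated, voice, prompts)
    while input("Happy with the result? (y/n) ").lower() != "y":
        prompts.append(input("Adjustment prompt (e.g. 'speak slower'): "))
        audio = generate_audio(translated, voice, prompts)
    return audio

if __name__ == "__main__":
    produce_translated_episode("interview_urdu.wav", "French")
```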

Some languages posed more problems than others. Both our Hindi and Urdu producers found that, when they cloned their voices (to test the software’s text-to-audio capabilities), it struggled to accurately reproduce their accents in the audio version of the script. However, given the constant evolution of the technology, I’m sure its capabilities will rapidly improve.

Our main conclusion was that, with a human staff member available to thoroughly check the translated script and recording, and to adjust where necessary, it could be possible to create good-quality podcasts in a wide range of languages, bringing UN audio content to parts of the world we rarely reach.

Pros, cons and bad actors

This experiment was just one part of a much wider conversation taking place within the UN around the opportunities, challenges and ethical considerations of using AI tools (both internally and in the wider world).

In anticipation of this debate, the UN created the Office of Digital and Emerging Technologies (ODET), headed up by Special Envoy Amandeep Singh Gill. Mr. Gill told me that AI has become extremely popular because it is making people’s lives easier, helping them to connect and to access knowledge. Audio tools have specific advantages: “when political leaders give an interview or deliver a speech, people can listen directly in another language. This is a huge benefit.”

However, the misuse of AI to mislead and misinform is one of his main concerns. Bad actors, he says, “use AI to erode public trust in our institutions, whether political or international. You can spread confusion and rumours among people and encourage unrest and violence. So, it’s essential that whatever is used to alter voices should be in the hands of responsible actors and those we can trust.”

Crossing the line: Secret robot presenters

I first came across Wondercraft and the world of voice cloning thanks to James Cridland, a noted commentator on the evolution of audio broadcasting (and podcast Hall of Famer). I heard an audio file of him speaking French on his Podnews site. His voice, his intonation, but in French. And he doesn’t speak French.

Wondercraft made it happen, and it sounded incredibly good, so I wanted to know what he thought about the use of cloned voices and AI-assisted translated audio. Like Amandeep Gill, he put the emphasis firmly on trust.

“I use a cloned voice in a show that I do because I want a different announcer each week. But that’s just an announcer, not a host, and I think a line is crossed when you have an AI voice hosting a show and you’re not making that clear to the audience. AI tools can help with things like researching guests or taking a first pass at questions, and lots of successful podcasts are using AI in that way. But if we are going to use AI as a presenter, then we need to be very clear about it. It needs to be disclosed to make sure the audience is happy with that.”

AI, like any technology with the potential to make life easier or more convenient, will play an ever-greater role in our personal and working lives. It is already working in the background, powering more and more of the online tools that help content producers (including those at UN News) streamline some of their more mundane tasks. As AI guidelines are developed and the technology improves, it seems likely that AI, with a “human in the machine,” will help the UN to widen its audio audience. Exactly how it does that, however, remains an open question.