Hi there, and happy new year!
As I'm developing my game, Flight of the Swallow, one thing has constantly been a concern of mine. Character voices.
If you don't want to read, here's the "tl;dr" version:
I am progressively replacing the voices I created with xVASynth with other, clearer voices that don't belong to video game characters, so I won't have any moral or legal issues. I've been itching to do that for a while, but technical reasons (lip sync in particular) were holding me back until now.
Since pretty much the beginning of the development of this game, i.e. the end of 2022, I wanted it to tell a story, and to have it told by the characters themselves. And what better way for a character to tell a story than to tell it in their own voice?
Problem is, hiring voice actors costs quite a bit of cash, and I want my game to be profitable, at least a little. It probably won't pay the bills, but I'd like voice acting not to take all of its budget if I can help it. Not to mention that it requires finding the actors, convincing them to voice characters in an adult game, booking a recording studio, and other headaches... Ugh. No thanks. I already have a million other things to do. And last but not least, I'm writing the story and the dialogues as I go, so I want the voices now, not in two years. And I want to be able to tweak a voice later if I'm not satisfied with it. You can't do that if all you have is one recording session and no time machine.
Plus, there is lip synchronization. When a character speaks, you want to see their lips (or whatever) move in sync, right? I know most games don't do it properly (I'm looking at you, Elden Ring) or even at all, but some are pretty good at it. It makes the characters look so much more natural. Until recently, like, last week, I had no idea how to create such data from a recording of an actual human being saying a line that I wrote for them. A few commercial tools probably exist, but they're meant for big studios, not for a solo dev like me.
So, with all those requirements, what was the obvious solution? xVASynth, made by Dan Ruta. Its purpose is to let a modder add to the story of a game by making its characters speak almost as naturally as if the original voice actors had been saying those lines. For example, if you want to make a mod for Skyrim where Esbern teaches Delphine how to cook apple cabbage stew, and you want it voiced, no problem, xVASynth has got you covered.
It's free, it has many voices, you have full control over how a voice pronounces the line so it sounds exactly the way you want, and what's more, along with the audio file, you get a text file that contains all the relative durations of the letters (for v1 and v2 voices) or phonemes (for v3 voices). In my case, after programming the necessary stuff to turn that file into actual visemes (the facial expressions that make the character look like they are actually pronouncing a phoneme), all I had to do was paste the dialogue line in xVASynth, click "Generate", tweak the pitches and durations a bit until I was satisfied, then when I was done, just move the two files to the target folder in the Voice folder. That was it. Easy peasy.
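To give an idea of what "turning that file into visemes" can look like, here is a minimal Python sketch. It is not the game's actual code, and it does not parse xVASynth's real file format: it assumes the file has already been read into a list of (symbol, relative duration) pairs. The phoneme-to-viseme table, the viseme names and the function name are all made up for the illustration.

```python
from dataclasses import dataclass

# Hypothetical phoneme-to-viseme table (ARPAbet-like symbols).
# A real table would cover every phoneme the voice model can emit.
PHONEME_TO_VISEME = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open",
    "IY": "wide", "EH": "wide",
    "UW": "round", "OW": "round",
}
DEFAULT_VISEME = "neutral"  # used for any symbol missing from the table


@dataclass
class VisemeKey:
    viseme: str
    start: float  # seconds from the start of the clip
    end: float


def durations_to_visemes(units, clip_seconds):
    """Scale relative durations to the clip length and map symbols to visemes.

    `units` is a list of (symbol, relative_duration) pairs (assumed input shape).
    Consecutive units that share a viseme are merged into a single key.
    """
    total = sum(duration for _, duration in units)
    if total <= 0:
        return []
    keys = []
    elapsed = 0.0
    for symbol, duration in units:
        start = clip_seconds * elapsed / total
        elapsed += duration
        end = clip_seconds * elapsed / total
        viseme = PHONEME_TO_VISEME.get(symbol.upper(), DEFAULT_VISEME)
        if keys and keys[-1].viseme == viseme:
            keys[-1].end = end  # same mouth shape: extend the previous key
        else:
            keys.append(VisemeKey(viseme, start, end))
    return keys
```

Because the durations are relative, the only extra piece of information needed is the length of the audio clip. At playback time, the engine just has to find the key whose time range contains the current playback position and blend toward that mouth shape.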
The only requirement is that the generated voice files cannot be sold, which is why xVASynth is primarily used for free mods for Skyrim, Fallout, Cyberpunk etc. I made it so my game would use the voices the same way, like a mod, to be downloaded separately and for free. Making them optional clears that part since the voices are not a part of the price of the game, but let's be honest, every player expects the game to be voiced, it doesn't matter where the voices come from, and it would just have been one click away anyway.
If you're curious, here are the voices I used in the demo until now:
Player: Female V (Cyberpunk 2077)
Sandy: Panam (Cyberpunk 2077)
DAL: Delamain (Cyberpunk 2077)
Mother Superior: EDI (Mass Effect)
Virginia: Judy (Cyberpunk 2077)
Dave: Danse (Fallout 4)
Frank: Jacob (Mass Effect)
Robert: Illusive Man (Mass Effect)
Ignatus: Codsworth (Fallout 4)
The unidentified voice: Miranda (Mass Effect)
To me, that solution held water six months ago. Not so much nowadays, after all the talk about Midjourney and other generative AIs, and how they are often trained on copyrighted material.
xVASynth is basically an AI that has been trained to speak like certain video game characters, whose voices belong to the actors who created them. As for the voice files used as training material, they belong to their respective publishers (for example CD Projekt Red for Cyberpunk's voices, the ones I use the most), so it pretty much falls into the "trained on copyrighted material" department.
So six months ago, I was okay with the idea of providing the xVASynth voices separately, for free, and letting the player decide whether to download them and add them to the game or not, because it wasn't really an issue for anyone at the time. Now, it is becoming one.
It's not really the problem I have with this, though. The real problem is ethics.
I'll say it right here, people are on both sides of the fence when it comes to using the voices of video game characters in my game. Some find it cool (I'm one of them, obviously, I love those characters, especially V, Panam and Judy, and that's one reason why I wanted them), others find it weird or shady. I've heard both.
xVASynth's little brother, xVATrainer (which was created to train the AI in the first place), has a "do not train" list. I wasn't aware of it until recently because I don't use xVATrainer, and simply because xVATrainer was released only a few months ago. And it just so happens that one Cherami Leigh's name is on this list. Mrs Leigh voiced Female V (i.e. the female version of the player character) in Cyberpunk 2077, and I used her voice for the player character in Flight of the Swallow. At the time of this writing, she is the only one on that list whose voice I used. I really love her voice, and I absolutely wanted it in my game. Initially, I wanted it to be Sandy's voice, but V speaks like an ice-cold killer (or an adorable murderpuppy if you prefer), while Sandy is one hot-blooded, loud troublemaker. And who's better suited to voice such a character than the hotheaded motor-girl known as Panam Palmer? The player character in my game, however, is Sandy's conscience/super-ego/guardian angel (pick your favorite), so a cool, soothing voice is perfect for her role.
A few months ago, there was turmoil in the voice acting world about adult mods using character voices. I don't remember whether it was about Fallout 4, Skyrim or Cyberpunk, but some voice actors clearly stated that they did not want any AI to be trained on their voices. Mrs Leigh was one of them, hence her name being on the "do not train" list. This explains why the "Female V" voice has been taken off xVASynth's list of voices (unlike, for example, Male V, voiced by Gavin Drea). So, officially, Female V is no longer a voice anyone is allowed to use.
Mrs Leigh does not want her voice in any game she hasn't voiced herself, and that's perfectly understandable. So I need to remove it from this game. But replacing the player's voice in my game with another whose actor is not on the list would not be enough. Voice actors in general are not happy when their voices are reused without their consent, even when it is made clear (like I did) that we are merely imitating their voices, not impersonating the actors themselves. I'm trying to keep this post technical and neutral-toned, but I actually do care about that. I wouldn't be happy if someone trained an AI on my own voice to dub characters in adult games either (nobody will do that, I'm no voice actor, it's just for the sake of the example). To me, replacing only one voice is not enough. They all have to be replaced. After all, Emily Woo Zeller (who voiced Panam) might want her name added to that list tomorrow, or next month. Just because she is not on that list today doesn't mean she is okay with anyone using her voice to say anything she didn't personally approve, at any time in the future.
Long story short, I wasn't aware that the voice actors had a real problem with this, and now I am, after I stumbled on xVATrainer's "do not train" list. I could say I wish that list had existed in 2022, but I don't. Why? Because I'm glad I could develop my game with those voices in the first place. Doing so taught me a lot about how to voice a character, how to write the dialogues in a way that they sound good when voiced, and how important the tone can be when you want to convey a message. Not to mention that I had to code ways to play the voices properly (with the correct AudioMixers, sound occlusion, reverb, default voices in case of a missing file, delays and pitch in various time scales, etc.), with lip sync on top of it. Without the voices, I would have had no material to develop all that stuff.
There's also a more general issue. I read two words that made me think seriously about this, and I think those words are what triggered this post, and made me want to redo all the voices. Those words are "exposure burn".
In plain English, the more a voice is heard, the less value it has. Players and investors don't really want to always hear the same voice (real or AI-generated), so if suddenly all indie games used, for example, Panam's voice for their main protagonist, Mrs Zeller would quickly find herself out of a job because players would become fed up with hearing the same voice over and over again. It wouldn't be an issue at my small level, but if everybody did that, then it would become one.
To Mrs Leigh, Mrs Zeller, and the few other voice actors whose voices I used, if you ever read this, I will not use your voices after all. The last thing I want is to hurt your business. But let me point out that I'm a fan. I love your voices, that's the prime reason why I wanted them in my game in the first place.
Now, I'm going to use new voices that don't have copyright or moral issues. Which means I have to stop using xVASynth to create them. Problem is, that also means saying goodbye to the lip sync.
Or does it? I figured out a solution to make lip sync files automatically and quickly from any voiced speech. It transcribes the speech to a timestamped text file, which plays pretty much the same role as xVASynth's text file. If you're curious, I'm running a local speech-to-text library in batch over all the files. In fact, my engine can use both kinds of lip sync files, so even without this solution, I could still align the lip movement with the voices manually with xVASynth and Audacity, but that takes several minutes per file. And at the time of this writing, there are 4200 of them (a little less than 1400 in the demo), so I much prefer the automatic version.
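Here is a rough Python sketch of the idea of going from a timestamped transcript to lip sync cues. It is only an illustration under assumptions: it takes for granted that the speech-to-text step returns (word, start, end) tuples in seconds, it names no particular library, and the even split of each word across its letters, the "_" rest symbol and the function name are all invented for the example.

```python
def words_to_letter_cues(words, min_gap=0.05):
    """Turn word-level timestamps into per-letter lip sync cues.

    `words` is a list of (text, start, end) tuples in seconds (assumed shape
    of the speech-to-text output). Each word's time span is split evenly
    between its letters, and any silence of at least `min_gap` seconds
    before a word becomes a "_" cue (mouth at rest).
    """
    cues = []
    cursor = 0.0  # end time of the last word processed
    for text, start, end in words:
        letters = [c for c in text.lower() if c.isalpha()]
        if not letters or end <= start:
            continue  # skip punctuation-only or zero-length entries
        if start - cursor >= min_gap:
            cues.append(("_", cursor, start))
        step = (end - start) / len(letters)
        for i, letter in enumerate(letters):
            cues.append((letter, start + i * step, start + (i + 1) * step))
        cursor = end
    return cues


if __name__ == "__main__":
    # Hypothetical transcript of a short line
    for cue in words_to_letter_cues([("Hi", 0.5, 1.0), ("you", 1.0, 1.75)]):
        print(cue)
```

An even split per letter is cruder than real per-phoneme durations, but the word boundaries come from the actual audio, so the mouth at least opens and closes at the right moments. Running something like this in a loop over a folder of audio files is what makes thousands of lines manageable compared with several minutes of manual alignment each.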
So now I'm in the process of replacing all the xVASynth voices with other, clearer and more natural ones. And boy, do I see the difference. When I play Panam's voice in a wav file then the same line with Sandy's new voice immediately after, the difference in quality is obvious. I didn't realize how bad the sound quality was before.
This is a big undertaking. There are a lot of files to redo, more than a year of work, and the creation process is not free. It will take months to do. But this puts my mind at ease for the future. Like Sandy, I too have a conscience that just won't leave me alone.
Also, well... xVASynth gives you a lot of control over the pitch of a voice, down to the letter or phoneme, and quite often the first generation is pretty bad so one has to change the pitches and durations manually until it sounds good enough. This is a lot of micromanagement, and since my accent is probably different from yours, what sounds good to me might not sound good to you. Sometimes, it may even sound completely off. Not to mention that xVASynth voices struggle on some words (it's impossible to make Panam say "Machine" properly). The new voices sound much more natural.
See for yourself. Here is the old version (same video as in the teasing blog post from last month):
And here is the new version, same scene, with the new voice:
Look at this lip sync. Mmmm.
Playing both videos at the same time, pausing one after Sandy says one line and listening to the same line in the other video, helps a lot with comparing both voices.
Sandy, when she was voiced by Panam, used to have an American accent. Now, she has a clear British accent. Fun fact, I initially wrote Sandy as a British army girl from the Royal Air Force. After all, her last name, "Curnow", derives from "Kernow", which means "Cornwall" in Cornish. And I wanted her to have some kind of British sense of humor. So I had to change her story a little to make her American, and that changed her personality in a way that I was not completely happy with. And now, I have to retcon her to be British again, but still a soldier in the US Air Force. Her background in the journal had to be amended (again) to that effect.
Her new voice sounds a lot like the voice of Claudia Black (Aeryn Sun in Farscape), or Dominique Tipper (Naomi Nagata in The Expanse), don't you think? But it's neither of them. Also, the player's voice and Sandy's are now the same (the former is a little higher-pitched, after all she is "Sandy's little voice"), unlike before.
Don't be fooled by Sandy's new smooth voice, though. The video above shows her when she's calm. She's... not always calm. She is just as capable of yelling at everyone (even at the player) as before.
I hope you'll like the new voices!
Marine