Illustrations by Rozalina Burkova
Animation by Alex Kuzoian
Mike Pappas, as CEO, and Carter Huffman, as CTO, are the co-founders of Modulate. Pappas has a B.S. in physics and applied mathematics from MIT and was previously a software architect at Lola. Huffman has a B.S. in physics from MIT and was previously a technologist at NASA’s Jet Propulsion Lab.
Kevin Delaney is the editor in chief and co-CEO of Quartz, which he co-founded in 2012. Previously, Delaney was managing editor of WSJ.com, a reporter for SmartMoney magazine, and a TV producer in Montreal. He is a member of the Council on Foreign Relations and has a degree in history from Yale University. Follow him at @delaney
You’re not constrained by your own personal biology. I’m completely free to express myself as I want.
MIKE PAPPAS: When people think of voice changers, they’ll think of the Darth Vader toy that makes you sound like him – but not something as nuanced as the technology that we’re building.
CARTER HUFFMAN: Sometimes you’re vibrating your vocal folds, sometimes you are not. Sometimes your nasal cavity is getting involved in this, and sometimes it’s not.
They’re not constrained by their own personal biology.
PAPPAS: I’m completely free to express myself how I want.
HUFFMAN: You can actually say, “Hey, what’s halfway between Ariana Grande and Barack Obama?” And it tries its best to synthesize that. And so is this really the person it’s supposed to sound like – or is it not?
CATERINA FAKE: That was Mike Pappas and Carter Huffman. Their startup is creating “voice skins,” technology that lets you change how your voice is heard, in real time.
We usually consider our voice a permanent fixture of who we are – something we’re born with and then stuck with for life, like our fingerprints or our DNA. But what if our voice was more like our hair? Something we could cut, color, shape, and style any way we wanted, whenever we wanted. Something we can edit in real time, and then revert to its original.
Carter and Mike want to create a world where this could happen. But the implications of hiding and changing identity go deep.
FAKE: I’m Caterina Fake, and we know that the boldest new technologies can help us flourish as human beings – or destroy the very thing that makes us human. It’s our choice.
A bit about me: I co-founded Flickr, and helped build companies like Etsy, Kickstarter, and Public Goods from when they first started. I’m now an investor at Yes VC, and your host.
On today’s show: Mike Pappas and Carter Huffman, the founders of Modulate. Mike and Carter are using machine learning to build voice skins that allow you to change your voice in real time.
You’re probably already familiar with skinning: you can “skin” your Gmail, for example, with themes ranging from cute Japanese anime to nature scenes. You’re only changing the outside and how it appears; you’re not changing what’s essential. That stays the same.
Modulate does this to your voice. It lets you sound like somebody who is nicer, or more charming, or more authoritative, or more sexy, or more aggressive. And so, you can change your voice to be a kid’s voice, adult’s voice, or monster’s voice. You can sound like a chipmunk, you can sound like an elderly man – you can sound like anybody.
Mike and Carter are working with game developers to integrate their voice skin technology. It’s in private beta with a few companies now. And in the coming year, they hope Modulate becomes a staple of gaming everywhere. But their long-term vision extends far beyond this use case. They envision a world where everyone will ultimately have the ability to modify their voice in real-time. I wonder though: What would it mean to live in that world?
On one level, voice skins are kind of like putting on a costume or a wig, dressing up like a wizard. And I think in that sense, it’s fairly innocuous, right? And on a deeper level, this could also give us a powerful freedom of choice. Imagine how freely we could express ourselves if we weren’t confined to biology. For trans people, and others who don’t feel represented by their bodies or their voices, it could be transformative.
But here’s the crushing thing that happens when you introduce any kind of technology like this – whether it’s something that changes how you appear, like a photo filter, or how you sound, like a voice skin – everybody suddenly goes for the most socially elevated version of themselves. They aim to move higher in the hierarchy of being. So they choose a voice that gives them status – and everything is lost: their idiosyncrasies and their beauty and their quirks. Everything that makes somebody themselves is lost. And I hate that.
It makes me think of Ukiyo-e, an idea which comes from 17th-century Edo Japan. It’s when this false world of celebrity and beauty and envy appeared – and all of these perfect people of great wealth paraded up and down the streets. Artists started selling wood block prints of all of these people. And they called it Ukiyo-e: the floating world, the fake world. It’s the image world, not the real world. And I think about this all the time in reference to the internet – because the internet can become the floating world, where you assume a different persona. People do this all the time, this kind of social peacocking. They’ve lost something essential about themselves in favor of a false self.
What will the world look like when nobody looks or sounds like themselves? Which voices will become valued? Which will disappear? The creators of Modulate imagine this world to be one of delightful immersion and super creativity – and I love that. But I also wonder if it’s the opposite: A virtual hall of mirrors, where we all look and sound like each other, and lose what it means to be ourselves.
To understand how Carter and Mike came to launch Modulate, you need to envision their friendship. It began during orientation week at MIT. Mike was wandering down a hall.
PAPPAS: I noticed, hey, there’s this guy over here at one of the whiteboards that’s just on the hallway of the dorm, working on what looks like a really fascinating physics problem. And so I broke off from the group and went over to look at what he was working on. It turns out this was Carter, and he was working on a mechanics problem, and we talked through it together and I ended up sort of helping him solve it on the whiteboard.
FAKE: That is such a nerdy way to meet.
HUFFMAN: I want to clarify, in specific, that I think you came over behind me. I was working on this whiteboard, and you just sort of stood there, kind of in silence for a minute, and you pointed to the third line on the whiteboard and said, “I think you’re missing a minus sign.” Lo and behold, you were correct. I was missing a minus sign, and that solved the problem. That I think is exactly how it went down.
FAKE: This is probably how people meet at MIT all the time.
FAKE: In 2015, when Carter was working on machine learning for autonomous boats at NASA, he had an aha moment.
HUFFMAN: I was part of the team that was working on object detection: Is this a buoy or a wave or what have you? About that time, there was a lot of cool stuff going on in deep learning. I don’t know if you’re familiar with the Prisma app, where you take a picture of yourself and you get a Van Gogh painting? A lot of this research was going on in computer vision, so processing images and video. And not a lot of it was being applied to audio and I thought, “Well, audio’s kind of a similar sort of data. Why can’t you do the same for audio? You ought to be able to do this, you ought to be able to give people total freedom to realistically manipulate audio in ways that they haven’t been able to before.”
FAKE: Carter’s idea struck an immediate chord with Mike, who’s an avid gamer.
PAPPAS: My excitement about video games is these enormous spaces where I just want to go off and have an adventure and to design a new character for that world and really embody that character as I’m playing.
HUFFMAN: I remember very prominently, from playing Halo and Call of Duty especially, is that a lot of really good players would use voice chat to coordinate with the other players on their team. And I would never do that because I was a 12-year-old kid. I didn’t want people to know that about me because I felt self-conscious. I thought that would be super cool but I didn’t feel comfortable participating.
FAKE: In the coming months, Modulate will be integrated into a number of games so that users can create customizable voice skins, allowing them to talk through skins, as opposed to in their own voice. This application was something Carter had dreamed up as a kid. But Mike and Carter’s vision for Modulate expands beyond gaming. They hope that voice skins will encourage limitless creativity and potential all over the virtual world.
And they are onto something, since we’re already on this path. For example, on Snapchat we do it all the time when we choose filters to apply to our photos. Online, we conform to standards of beauty that eliminate our individuality and quirks. We’ll get more into this later during our workshop, but first let’s talk about how Modulate works.
It’s important to note that Modulate is not the world’s first voice changer. In fact, online voice changers have been around for years.
HUFFMAN: A lot of these previous voice changers gave you the ability to change your voice in a couple of basic ways. They all do things like “sound like Darth Vader,” which is, say, cut off the high frequencies in your voice and just keep the low frequencies.
Or “sound like a robot,” you shift some pitches around and then you get a new frequency distribution, but it sounds like a robot not a human because you’ve lost a lot of that nuance and structure that is induced by a normal human vocal tract.
FAKE: Modulate recognizes patterns of speech, inflection, and tone from the hundreds of voices that Carter and Mike feed it. Then Modulate generates a brand new voice according to what a user wants.
There are three stages to Modulate’s technology. Let’s start with the first stage, which I’ll call the extraction stage.
HUFFMAN: If you want to change your voice, we have to listen to what you’re saying and extract all of the things that need to be carried over into the new voice. So if you’re excited, we need to capture that you’re excited. If you’re saying specific words, we need to preserve that content.
FAKE: Imagine a voice actor listening to what you’re saying, who immediately repeats it in a different voice but still captures all the emotion. It’s like that.
HUFFMAN: We have a network that learns a bunch of different audio filters and captures different frequencies in your voice as you’re talking. And it distills that down into some high level set of features.
FAKE: Think about how your phone’s speech-to-text function works. But beyond your phone transcribing your voice into words, it could also pick up on and translate the emotional content of how you’re speaking.
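The extraction stage described above – filters capturing different frequencies, distilled down into high-level features – can be made concrete with a toy band-energy analyzer. This is a hand-rolled stand-in for the learned filters, not Modulate’s actual pipeline; the band edges and frame size here are illustrative assumptions.

```python
import numpy as np

def band_energies(frame, sample_rate, band_edges):
    """Energy of one audio frame in each frequency band.

    A crude, fixed-filter stand-in for the learned filter bank a
    network would apply; real systems learn these filters from data.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    features = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        features.append(spectrum[mask].sum())
    return np.array(features)

# A 440 Hz tone should put almost all its energy in the 300-1000 Hz band.
sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 440 * t)
feats = band_energies(tone, sr, band_edges=[0, 300, 1000, 4000, 8000])
print(feats.argmax())  # band index 1, i.e. the 300-1000 Hz band
```

A learned system would replace the fixed band edges with filters tuned from data, and would layer features for pitch, timing, and emotion on top of raw band energies.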
This emotional content is crucial for the second step in Modulate’s process, which translates your voice into the target voice, the voice of who you want to sound like.
HUFFMAN: In our data set, we have say a hundred different people speaking and they each have different voices. So we have a hundred different labels of voices that we can use to specify, “I want to sound like person A or person B,” for example. And it takes that information and synthesizes it into a new waveform that represents the content of what you said, but in this target speaker’s voice.
FAKE: But how does Modulate know how to transform one voice into another? Carter explains.
HUFFMAN: If we’re training a “Mike” voice skin, for example, we give it the “Mike” label and though maybe it’s taking in my speech, it’s outputting its attempt at Mike’s speech.
The listener listens to that and says, “Hey, I can tell that that’s not actually Mike’s speech because you got the pitch slightly wrong, or he’s a little more nasally than that.”
Those features that it uses to tell the difference flow back into the synthesis network and it updates itself to not make that mistake again. And so you run this sort of back and forth training game for a long time and eventually you reach a point where all of the mistakes that the listener could use to tell the difference have been updated away – and it no longer makes any of these mistakes.
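The “back and forth training game” described here is what machine-learning practitioners call adversarial training: a synthesizer and a listener push each other to improve. Below is a toy numeric sketch in which a single number stands in for a whole voice; every value is illustrative, not anything from Modulate’s actual system.

```python
# Toy adversarial loop: the synthesizer tries to match a target "pitch";
# the listener's job is to report how detectably wrong the attempt is,
# and that error flows back to update the synthesizer.
target_pitch = 220.0   # the voice we want to imitate (made-up value)
synth_pitch = 100.0    # the synthesizer's initial guess
learning_rate = 0.1

def listener(candidate, real):
    """Returns how detectably wrong the candidate is (0 = fooled)."""
    return candidate - real

for step in range(200):
    error = listener(synth_pitch, target_pitch)
    synth_pitch -= learning_rate * error  # update so the mistake shrinks

print(round(synth_pitch, 3))  # has converged to 220.0
```

In the real system, both the synthesizer and the listener are neural networks, and the “error” covers thousands of subtle features – pitch, nasality, timing – rather than one number, but the training loop has this same shape.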
FAKE: We asked Mike and Carter to demonstrate what Modulate sounds like. Here they are using what they call the “Katie” skin.
VOICE: Here’s Carter with the Katie voice skin.
VOICE: And here’s Mike with the Katie voice skin.
FAKE: Could you differentiate them? Tell who was who? If not, that’s the point. While Mike and Carter have distinct individual voices, when they apply the Katie filter, they sound almost identical. And here’s what my voice sounds like with a voice skin – or a series of voice skins that Mike and Carter applied to it.
VOICE: I’m Caterina Fake, and your host of Should This Exist?
VOICE: I’m Caterina Fake, and your host of Should This Exist?
VOICE: I’m Caterina Fake, and your host of Should This Exist?
FAKE: This process is quick: Modulate takes about 60 minutes of recorded audio from the target voice – the person you want to sound like – and adds it to its voice library.
This is part of what makes Modulate’s technology so much more impressive than your standard Darth Vader voice changers of the past. Not only can you sound like Mike or me, you could design an entirely new voice that doesn’t belong to anyone.
HUFFMAN: It not only learns just these specific individual voices that we’ve trained it on but it also learns, in general, how human voices sound and it sort of maps them out. So like, maybe deep male voices are over in one corner and high female voices are over in another corner and you can actually walk around that map and say, “Hey, what’s halfway between Ariana Grande and Barack Obama in this map?” You know, who knows?
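The “map” of voices described above is what machine-learning practitioners call an embedding space: each voice becomes a vector, and “halfway between” two voices is literally the midpoint of their vectors. A minimal sketch, assuming made-up four-dimensional embeddings (a real system would learn hundreds of dimensions from data):

```python
import numpy as np

# Hypothetical voice embeddings; the numbers are invented for illustration.
voice_a = np.array([0.9, -0.2, 0.4, 0.1])   # e.g. a deep male voice
voice_b = np.array([-0.5, 0.8, -0.1, 0.6])  # e.g. a high female voice

def interpolate(a, b, alpha):
    """Walk the line between two voices: alpha=0 is a, alpha=1 is b."""
    return (1 - alpha) * a + alpha * b

# "What's halfway between voice A and voice B?"
halfway = interpolate(voice_a, voice_b, 0.5)
print(halfway)  # the midpoint: [0.2, 0.3, 0.15, 0.35]
```

Feeding that midpoint vector to the synthesizer is what lets the system produce a voice that belongs to no one in the training set.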
FAKE: As I listened to Mike and Carter play out their vision for Modulate, I could see how fun this could be, how gratifying in games, how you could fully absorb your character. But also, if people could choose how they sound and so easily slip in and out of different voices, would local accents disappear? Could we unleash an era of vocal fraud, where we think we know who we’re talking to but then we realize we could be talking to anyone, using any voice skin? It’s like photoshopped images – but for sound.
So much of our identity is caught up in how people perceive us based on the sound of our voice – and we make assumptions about others based on the sound of theirs. Technology like Modulate puts this tendency on display.
For this episode’s workshop, I spoke to four people with diverse viewpoints, each of whom has a unique relationship to the concept of the human voice. We discussed how Modulate could change the meaning and the relevance of voice, in both good and bad ways.
First we have Serena Daniari, a reporter who focuses on gender issues and gender diversity.
SERENA DANIARI: As a trans person I’ve always had a unique conflict with my own voice. I just want a voice that just makes me feel more complete, not necessarily to live up to other people’s standards of how I should sound.
FAKE: Would Serena have used Modulate if it already existed? I asked her.
DANIARI: The younger me would have definitely used that. My escape was online and I would pretend I was a girl because I wasn’t in real life, at the time.
But I do think it poses some dilemmas and predicaments. You’re hearing someone’s voice without seeing them. I think it creates this duality of identities, one that is sort of virtual and constructed but you still have who you actually are.
So I think conceptually it’s a really cool idea. But in terms of how it’s actually used in practice, the coolness of the idea stops once you exit wherever you’re using the software and enter the real world.
FAKE: This point struck home for Carter and Mike.
PAPPAS: Yeah, I mean it’s obviously a huge question that goes right to the heart of what we’re building. The promise of the internet, in some respects, is sort of escapism or a new kind of self-expression. And we’ve seen that in not just people changing the way that they sound but also designing avatars in different games or in different forums, even just sort of choosing a profile picture that they feel better expresses themselves than just a picture of their face. A big part of the advantage that the internet conveys is that you can decide how you want to express yourself and make all of those choices intentionally.
FAKE: It’s funny, because in the very early days of the internet, there was that New Yorker cartoon of two dogs having a conversation in front of their computer, with the caption: “On the internet, nobody knows you’re a dog.”
PAPPAS: Yes, yes.
FAKE: It became very famous, right, and for a reason, because suddenly you could be whoever you were, or felt yourself to be, or who you wanted to be.
HUFFMAN: Yeah. I’m really interested in the way that you asked that question in particular, differentiating between your “online self” and your “real self”, because I think your real self is sort of more of a question that’s left up to you than is constrained by the physical world in some sense.
We actually had several people reach out to us just via our contact link, some of them trans, streamers, for example, who have mentioned that they really like being able to stream their games, but don’t like the sound of their biological voice and would love to use our technology to create a voice that sounds like who they feel they really are.
And I think that’s part of one of the big promises of technology is removing some of the constraints of the physical world.
FAKE: Anita Sarkeesian, one of the hosts of the podcast “Feminist Frequency Radio,” has spent years tracking the representation of women in pop culture, and especially in the gaming world. She was the target of online trolling and violent threats after publishing her popular video series “Tropes Versus Women in Video Games,” which exposed sexist depictions of women in games. She warns that Modulate could contribute to a world where women are silenced in gaming, rather than supported.
ANITA SARKEESIAN: In online gaming, when people are playing and speaking to each other, and it’s revealed that a woman – or someone they perceive to be a woman – is in the game, there is a very high probability that she’ll get harassed for that. And this is based on her voice, most likely, because it’s not visual.
There is an interesting opportunity to be able to choose different voices when you are gaming online. Now part of me thinks, “Oh, this could be really interesting in some ways.” And the other part of me that’s really cynical thinks it’s really sad that women might choose to mask their voice in order to participate fully in an online game without the risk of harassment. That we need to develop technology to pretend to be men in order to not be harassed – like, how is that the future?
FAKE: When does assuming a different persona perpetuate the bias, especially if you’re using the voice skin as a shield to protect yourself? By enabling people to mask their voices, technology like Modulate runs the risk of silencing them: silencing trans people, silencing women. It could contribute to the hegemony of men by reinforcing their authority and power while taking it away from women. I dug into this question with Mike and Carter.
PAPPAS: I think we one hundred percent agree with that. Yes, it is extremely sad that some people might feel that they have to do this in order to feel safe playing games. But, first of all, I think it’s better for them to have that option as opposed to not, but we still absolutely have to find better ways to reduce this problem of harassment in general so that fewer and fewer people feel the need to make a choice like that.
HUFFMAN: Furthermore, I think that part of the benefit of this technology living in an environment such as gaming specifically is that we have the user profile, the gamer profile, of the person playing the game. So if you have a user who routinely adopts a voice skin for, say, another gender and then harasses people using that voice skin, we could recognize that behavior and ban them from the platform. So I think that’s a really powerful solution to this problem.
FAKE: I hear what he’s saying – that you could use voice skinning to change the power dynamic by giving female voices to powerful characters. It would be a small but important step toward giving women more authority.
DANIARI: If you’re adopting a new voice online because it makes you feel more like yourself then I think that idea is really noble and really beautiful, to be honest.
FAKE: That’s Serena, the reporter from earlier, who believes that there are beautiful possibilities, but also thinks people could use it in offensive, racist ways, like a kind of verbal blackface.
DANIARI: But I think there’s also this other idea of appropriating voices or appropriating aspects of someone’s culture or identity that don’t belong to you. So once again, I think that’s a little issue in terms of misuse. If someone adopts a voice that is specific to a race or they’re presenting it as specific to a race and they’re trying to make a joke out of it or trying to use it to troll or to be racist, it just seems like the ramifications could be upsetting.
There should be a high level of moderation in terms of who can use what voice and how they’re used, and regulate levels of abuse and ignorance.
PAPPAS: I think that’s a fantastic point and I’m really glad that appropriate levels of moderation were brought up as well.
It seems very apparent to me that there probably aren’t intrinsically good or intrinsically bad voices that people can adopt. These are all physically, biologically plausible voices that we’re allowing people to adopt. And so it’s not so much what voice you choose in particular, but how you use it.
So I move a little bit away from the idea of whether certain people should be allowed to use certain voices, and more towards the idea of: in what contexts, in what communities, and with what degree of moderation can these voices be used?
FAKE: Also remember the scandal that came up with the Apu voice on “The Simpsons” being done by a white man. His accent was meant to be comic, and there was a huge outcry from the Indian community. There was even a movie about it, called “The Problem with Apu”. You can use voices in a racist way. That is a fear around what Modulate does.
PAPPAS: Yeah, I think it’s a fair concern for sure. First of all, again, we are only changing your vocal cords. And that means that if you decide to speak in a particularly offensive way, you still could do that. But I think it’s important to be clear that we’re very specifically not designing things that are meant to come with those stereotypes.
I think further though, over five or ten years, there’s kind of a vision that these stereotypes might be able to die over time because of technology like this. If every time you’re interacting with someone in a digital space their voice is in fact completely uncorrelated with their physical self, we stop making the same kinds of assumptions about what a voice means and we stop sort of drawing that back to those physical things in the first place.
Interviewing is a great example where we’ve talked to companies that are really interested in the idea of: What if you could interview someone and they had a randomly sampled voice so that you were really forced not to be able to bias yourself based on things that you think you’re learning about their physical self, but are forced to really just pay attention to the content of what’s being said?
FAKE: Oh, yeah, this happens all the time with, for example, resumes. Dozens of experiments have been done with this, that if your name is Jamal or your name is Margaret, that you’re less likely to get the interview than if your name is Richard or Michael.
FAKE: For example, with voices, you may have heard this in the ABC News podcast “The Dropout” about Elizabeth Holmes. She was the founder of Theranos, a company now infamous for falsely claiming that they could conduct hundreds of blood tests using a single drop of blood.
She is also known for her remarkably low baritone voice, but as reported on “The Dropout”, her coworkers claim she switches to a higher more feminine voice in unguarded moments. So it sounds like she’s assuming the voice of authority, and that the voice of authority is male. I talked to Mike and Carter about this.
FAKE: Do you think that you could with Modulate change our expectations of what kind of voice expresses power? For example, James Earl Jones, who has this amazing bass voice, was the voice for Darth Vader. What if you gave Darth Vader, who is frightening, powerful, scary – what if you gave Darth Vader a feminine voice? Do you think that over time we would adapt to thinking of women as more powerful?
PAPPAS: I don’t want to pretend that I have deeper expertise than I do. This is, of course, only my own sort of opinions and speculation here, but, honestly, I feel like the right answer is that why on earth would a voice have anything to do with whether you’re powerful or competent or anything like that? It’s silly, in a sense, to think that your voice should indicate anything about that.
I feel like the right answer that we ought to try and be getting to is a place where you don’t draw conclusions based on a voice, because all the voice tells you is that’s what this person wanted to sound like in this particular context. But it’s not meant to relate to any of these other things. It’s purely a sort of choice for creative expression in a similar way to how people can choose what clothes they want to wear as they go out, they go to work, or they go and spend time with their family. You can choose how you want to express yourself in different circumstances, and there’s nothing deeper than that choice to read into it.
FAKE: If Modulate were as common as Photoshop, everything digital we hear – not just what we see – could be faked.
This shaky reality has been a long time in the making for people like Ed Primeau, an expert in forensic audio, who has been preparing for this moment for years.
ED PRIMEAU: I’ve been practicing for 35 years, and my job involves evidence recovery: audio and video and image forensic enhancement, as well as authentication and analysis testing. I’ve testified in many courts around the world; federal, state and local courts here in the United States.
FAKE: Ed is ready to navigate a post-Modulate world, and sees a potential dystopia in which Modulate becomes part of a criminal’s playbook.
PRIMEAU: I understand its application in the gaming community, but I could certainly see how fraud could come into play with being able to change somebody’s voice to sound like a relative, for instance.
A few weeks ago my sister received a phone call from somebody who sounded exactly like her grandson, demanding money be sent to this attorney. They’re stuck in jail. They’ve been in a horrible accident. They’ve got hospital bills that need to be paid – and that’s very disturbing for somebody who’s not familiar with the scam. It’s kind of like when Photoshop came out. We’ve got to have our radar up and just be cautious: if something’s too good to be true, it probably is.
FAKE: And technology like Modulate could potentially help criminals.
PRIMEAU: So if the quality continues to improve and the samples become more and more realistic, I guess is a good word, as opposed to synthesized, it would certainly be a great excuse for criminals to say, “Hey, that’s not my voice in that recording. Somebody must have created it using a voice skin.”
I have a case in house currently – actually have a couple cases in house currently that I can’t really talk about – that both involve criminals who are saying, “That’s not me in that recording.” I even have people call here saying, “I didn’t say that in the interview,” even though that’s my voice and I can hear my voice. You know, that’s another thing to consider with this software is that it can be an excuse for people who have committed a crime and the crime has been recorded and they want to be able to say, “Hey, there’s no way, that’s not my voice.”
FAKE: So I asked Mike and Carter: If all digital media is thrown into suspicion, does voice skinning technology give bad actors, criminals, and liars a way of crying “fake news” at every turn?
PAPPAS: This is something that we’ve definitely spent a lot of time thinking about, and I think there’s a couple pieces of this. This isn’t something where I think it’s going to be viable for us to say, “Everyone stop doing this.” There’s going to be a point where we reach audio that does sound completely realistic, and so we need to build in solutions from the ground up that are going to prevent it from being misused in this way.
First of all, we add a watermark to our own synthetic audio so that you can actually confirm, “Oh, hey, this audio was synthesized by Modulate so we know that that’s not legitimate” if someone’s, say, trying to impersonate someone using that tool. But then this flip side that you’re bringing up is, well, what if someone claims, “Hey, that was synthesized by someone who’s not using a watermark…” or something. And for that, that’s not something that we can solve entirely on our own, but it’s something…
FAKE: That is so interesting. How do you do a watermark on an audio file?
HUFFMAN: Yeah. One of the really cool things about audio is how much goes in to the content of what you’re saying and the identity of your voice. There’s the frequencies that are induced by your vocal cords. There’s the cadence, the emotion, the intonation that you put in. Even the words you use, right? All of these things go into making you who you are when you’re talking.
For watermarking, there’s a lot of information in audio that people don’t need in order to understand what you’re saying. Tiny micro-timing differences, or slightly different frequencies being used – all of the stuff that you don’t need to understand what’s being said – we can modify to produce audio that you can enjoy but that nevertheless encodes information in a pattern we can detect later.
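One classic way to make this idea concrete is least-significant-bit steganography: nudging each audio sample by at most one step out of 65,536, which is inaudible but machine-detectable. To be clear, this is a textbook illustration, not Modulate’s actual (unpublished) watermarking scheme, and a production watermark would need to survive compression and re-recording, which this toy does not.

```python
def embed_watermark(samples, bits):
    """Hide one watermark bit in the least-significant bit of each
    16-bit sample; a change of +/-1 out of 65,536 is inaudible."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit
    return marked

def extract_watermark(samples, n_bits):
    """Read the hidden bits back out of the low-order bits."""
    return [s & 1 for s in samples[:n_bits]]

audio = [1200, -853, 440, 67, -2, 31000]  # fake 16-bit PCM samples
mark = [1, 0, 1, 1]                       # the pattern to hide
tagged = embed_watermark(audio, mark)
print(extract_watermark(tagged, 4))  # recovers [1, 0, 1, 1]
```

No sample changes by more than one quantization step, so the tagged audio is perceptually identical to the original while still carrying a detectable signature.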
FAKE: Modulate could actually have this watermark that would show that this technology had been used.
FAKE: I like that Mike and Carter built their technology with not just the utopias in mind, but also the dystopias. In speaking to Kevin Delaney, the editor in chief of Quartz, he immediately identified how the watermark could be a tool used in newsrooms.
KEVIN DELANEY: I think that journalists are always looking for clues about the origin of things, and the watermark is in the tradition of technology that gives journalists a shortcut for verification. Can a watermark be counterfeited? Probably, but it’s definitely a helpful way to start thinking about this technology.
The idea that you could assemble fictional characters who are so compelling to people that they assume real world significance, that’s super troubling to me. As we add ways in which people can simulate reality, it’s possible that the fiction overcomes our reality in a way. So technology like Modulate makes it possible for one to imagine that a 2020 presidential candidate could actually just be a fictional creation that has a scientifically designed voice that would appeal to people across voter demographics.
PAPPAS: I’d actually like to invert that point a little bit. If the claim is that someone with the right set of vocal cords gets an unfair advantage on the political stage – well, it’s possible today that someone is simply born into that set of vocal cords, so they get that advantage and have a lock on it. Modulate’s technology, if that’s true, democratizes access to it, forcing people to actually battle it out in terms of ideas instead of just getting an advantage from having a particularly smooth voice.
FAKE: Oh, that’s super interesting. I really like that framing, that basically the suspicion of people’s voices and people being aware of the ability to change your voice would force people to focus more on the message and what the people are saying.
PAPPAS: Absolutely. I don’t think I would be happy in a world where all of the political candidates just chose the same scientifically-backed voice – there’s clearly going to be some problems if that were to happen – but even in that world, at least you sort of can rest assured that, all right, they’re all at a level playing field with respect to the voice that they’ve chosen.
HUFFMAN: Oh, yeah. I think in particular our work on watermarking our outputs is particularly important in this way, right? We’re not in the business of deceiving people. We’re giving people creativity over their voice, but we don’t want to put out a product where nobody could ever tell that this is a synthetic voice, and so including that watermark I think is a big step towards that.
We’re not going to go post this on the internet for anyone to download from GitHub – that is a terrible idea. But we are conscious of the fact that other people are going to be developing this kind of technology as well.
PAPPAS: Photoshop came out, and we were able to find the right ways to navigate that kind of a world, and I think it will be a similar thing here.
FAKE: Voice skinning technology brings up a lot of ideas about what voices say to us, and what assumptions we make about the people using them. It can be used for fun, like putting on a costume, and becoming a troll, a fairy, or a monster. You can be a man if you’re a woman, or a woman if you’re a man.
But voice skinning brings up other questions too: Who is believed? Who has a right to speak? Who has authority, and who deserves to be heard? When is voice skinning a refuge from harassment, or a door into a fuller expression of who we are? And can we use voices to silence not only other people, but ourselves?
When thinking about this technology I thought of two stories: the Cassandra myth, and the Boy Who Cried Wolf. In the Cassandra myth, Cassandra is given the gift of prophecy by the god Apollo, who is in love with her. But when she rejects his advances, he curses her, so no one will believe her. She sees the future, and tells the truth – the Greeks are invading Troy, hidden in a wooden horse! – but she is not believed, like Bill Cosby’s 60 accusers were not seen or heard until comedian Hannibal Buress called Cosby a rapist on stage.
In Aesop’s fable The Boy Who Cried Wolf, the boy is believed and believed and believed – until finally his lies are uncovered, and no one believes him any more. But if you juxtapose these two stories you see who has the privilege of being heard, and believed, from the start.
We live in a world where only some voices are listened to, and others are not. The internet’s promise was giving a voice – literally and figuratively – to all. So we can use technology to make the world more fair and more just – or we can use it to perpetuate injustice. It’s our choice.