With questionable copyright claim, Jay-Z orders deepfake parodies off YouTube

On Friday, I linked to several videos by Vocal Synthesis, a new YouTube channel dedicated to audio deepfakes — AI-generated speech that mimics human voices, synthesized from text by training a state-of-the-art neural network on a large corpus of audio.

The videos are remarkable, pairing famous voices with unlikely dialogue: Bob Dylan singing Britney Spears, Ayn Rand and Slavoj Žižek performing a Sonny and Cher duet, Tucker Carlson reading the Unabomber Manifesto, Bill Clinton reciting “Baby Got Back,” and JFK touting the intellectual merits of Rick and Morty.

Many of the videos have been remixed by fans, adding music to create hilarious and surreal musical mashups. Six U.S. presidents from FDR to Obama rap N.W.A.’s “Fuck Tha Police,” George W. Bush covers 50 Cent’s “In Da Club,” Obama covers Notorious B.I.G.’s “Juicy,” and, my personal favorite, Sinatra slurs his way through the Navy Seal copypasta, a decade-old 4chan meme.

Videos Taken Offline

Over the weekend, for the first time, the anonymous creator of Vocal Synthesis received copyright claims on YouTube, taking two of his videos offline: deepfaked audio of Jay-Z reciting the “To Be or Not To Be” soliloquy from Hamlet and the lyrics to Billy Joel’s “We Didn’t Start the Fire.”

According to the creator, the copyright claims were filed by Roc Nation LLC with an unusual reason for removal: “This content unlawfully uses an AI to impersonate our client’s voice.”

Both videos were immediately removed by YouTube, but can still be viewed on LBRY, a decentralized and open-source publishing platform. Two synthetic Jay-Z videos remained online, in which he raps the Book of Genesis and the Navy Seal copypasta.

The video’s creator announced the takedown in a creative way: using the voices of Barack Obama, Donald Trump, Ronald Reagan, JFK, and FDR.

Here’s an excerpt transcript from the video:

“Over the past few months, the creator of the channel has trained dozens of speech synthesis models based on the speech patterns of various celebrities or other prominent figures, and has used these models to generate more than one hundred videos for this channel. These videos typically feature a synthetic celebrity voice narrating some short text or a speech. Often, the particular text was selected in order to provide a funny or entertaining contrast with the celebrity’s real-life persona.

“For example, some of my favorites are George W. Bush performing a spoken-word version of “In Da Club” by 50 Cent, or Franklin Roosevelt’s powerful rendition of the Navy Seals Copypasta.

“The channel was created by an individual hobbyist with a huge amount of free time on his hands, as well as an interest in machine learning and artificial intelligence technologies. He would like to emphasize that all of the videos on this channel were intended as entertainment, and there was no malicious purpose for any of them.

“Every video, including this one, is clearly labeled as speech synthesis in both the title and description. Which brings us to the reason why we’re delivering this message.

“Over the past two days, several videos were posted to the channel featuring a synthetic Jay-Z rapping various texts, including the Navy Seals Copypasta, the Book of Genesis, the song “We Didn’t Start the Fire” by Billy Joel, and the “To Be Or Not To Be” soliloquy from Hamlet.

“Unfortunately, for the first time since the channel began, YouTube took down two of these videos yesterday as a result of a copyright strike. The strike was requested by Roc Nation LLC, with the stated reason being that it, quote, “unlawfully uses an AI to impersonate our client’s voice.”

“Obviously, Donald and I are both disappointed that Jay-Z and Roc Nation have decided to bully a small YouTuber in this way. It’s also disappointing that YouTube would choose once again to stifle creativity by reflexively siding with powerful companies over small content creators. Specifically, it’s a little ironic that YouTube would accept “AI impersonation” as a reason for a copyright strike, when Google itself has successfully argued in the case of “Authors Guild v. Google” that machine learning models trained on copyrighted material should be protected under fair use.”

No Intent to Deceive

At its core, the controversy over deepfakes is about deception and disinformation. Earlier this year, Facebook and Twitter banned deepfakes that could mislead or cause harm, largely motivated by their potential impact on the 2020 elections.

Though it’s worth noting that the use of deepfakes for fake news is still largely theoretical, as Samantha Cole covered for VICE; the vast majority are created for porn. (And, no, Joe Biden sticking out his tongue is not a deepfake.)

In this case, there’s no deception involved. As he wrote in his statement, every Vocal Synthesis video is clearly labeled as speech synthesis in the title and description, putting it outside the scope of YouTube’s guidelines against deceptive manipulated media.

Copyright and Fair Use

With these takedowns, Roc Nation is making two claims:

  1. These videos are an infringing use of Jay-Z’s copyright.
  2. The videos unlawfully use “an AI to impersonate our client’s voice.”

But is either of these claims true? With a technology this new, we’re in untested legal waters.

The Vocal Synthesis audio clips were created by training a model on a large corpus of audio samples paired with text transcriptions. In this case, he fed Jay-Z songs and lyrics into Tacotron 2, a sequence-to-sequence speech synthesis architecture developed by Google.
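For a rough sense of what that training corpus looks like: Tacotron 2 implementations typically consume pairs of short audio clips and their transcripts, often listed in an LJSpeech-style metadata.csv file (one `filename|transcript` pair per line). Here’s a minimal sketch of assembling one; the clip names and transcripts are hypothetical placeholders, not the channel’s actual data.

```python
# Sketch: write an LJSpeech-style metadata.csv pairing audio clip names
# with their transcripts, the format common Tacotron 2 training scripts read.
from pathlib import Path


def write_metadata(pairs, out_path):
    """pairs: iterable of (wav_stem, transcript) tuples."""
    with open(out_path, "w", encoding="utf-8") as f:
        for stem, text in pairs:
            # Light normalization: training works best on clean,
            # consistently punctuated text.
            f.write(f"{stem}|{text.strip()}\n")


# Hypothetical clips; in practice these map to .wav files on disk.
pairs = [
    ("clip_0001", "To be, or not to be, that is the question."),
    ("clip_0002", "Whether 'tis nobler in the mind to suffer."),
]
write_metadata(pairs, "metadata.csv")
print(Path("metadata.csv").read_text(encoding="utf-8"))
```

The tedious part the creator describes below is producing those transcript pairs in the first place: segmenting hours of audio into clips and aligning each one with accurate text.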

It seems reasonable to assume that a model and audio generated from copyrighted audio recordings would be considered derivative works.

But is it copyright infringement? Like virtually everything in the world of copyright, it depends—on how it was used, and for what purpose.

It’s easy to imagine a court finding that many uses of this technology would infringe copyright or, in many states, publicity rights. For example, if a record producer made Jay-Z guest on a new single without his knowledge or permission, or if a startup made him endorse their new product in a commercial, they would have a clear legal recourse.

But, as the Vocal Synthesis creator pointed out, there’s a strong case to be made this derivative work should be protected as a “fair use.” Fair use can get very complicated, with different courts reaching different outcomes for very similar cases. But there are four factors judges use when weighing a fair use defense in federal court:

  1. The purpose and character of the use.
  2. The nature of the copyrighted work.
  3. The amount and substantiality of the portion taken.
  4. The effect of the use upon the potential market.

There’s a strong case for transformation with the Vocal Synthesis videos. None of the original work is used in any recognizable form. It isn’t sampled in the traditional sense; instead, a set of vocal samples, stripped of their instrumentals and context, is used to train a model that generates an amalgam of the speaker’s voice.

And in most cases, it’s clearly designed as parody with an intent to entertain, not deceive. Making politicians rap, philosophers sing pop songs, or rappers recite Shakespeare pokes fun at those public personas in specific ways.

Vocal Synthesis is an anonymous, non-commercial project: the channel isn’t monetized with advertising, there’s no clear financial benefit to the creator, and the impact on the market value of Jay-Z’s discography is nonexistent.

There are open questions about the amount and substantiality of the borrowed work. But even if the model were trained on everything Jay-Z ever recorded, that wouldn’t necessarily rule out a fair use defense for parody.

Ultimately, there are two clear truths I’ve learned about fair use from my own experiences: only a court can determine fair use, and while it might be a successful defense, fair use won’t protect you from getting sued and the costs of litigating are high.

Interviewing the Creator

As far as I know, this is the most prominent example of a celebrity claiming copyright over their own deepfakes, the first example of a musician issuing a takedown of synthesized vocals, and, according to the creator, the first time YouTube has removed a video for impersonating a voice with AI. (Previously, Condé Nast took down a Kim Kardashian deepfake by claiming copyright over the source video, and Jordan Peterson ordered a voice simulator offline.)

I reached out to the anonymous creator of Vocal Synthesis to learn more about how he makes these videos, his reaction to the takedown order, and his concern over the future of speech synthesis. (Unfortunately, Roc Nation didn’t respond to a request for comment.)


How do you feel about the takedown order? Were you surprised to receive it?
I was pretty surprised to receive the takedown order. As far as I’m aware, this was the first time YouTube has removed a video for impersonating a voice using AI. I’ve been posting these kinds of videos for months and haven’t had any other videos removed for this reason. There are also several other channels making speech synthesis videos similar to mine, and I’m not aware of any of them having videos removed for this reason.

I’m not a lawyer and have not studied intellectual property law, but logically I don’t really understand why mimicking a celebrity’s voice using an AI model should be treated differently than someone naturally doing an (extremely accurate) impression of that celebrity’s voice. Especially since all of my videos are clearly labeled as speech synthesis in both the title and description, so there was no attempt to deceive anyone into thinking that these were real recordings of Jay-Z.

Can you talk a little about the effort that goes into generating a new model? For example, how long does it typically take to gather and train a new model until it sounds good enough to publish?
Constructing the training set for a new voice is the most time-consuming (and by far the most tedious) part of the process. I’ve written some code to help streamline it, though, so it now usually takes me just a few hours of work (it depends on the quality of the audio and the transcript), and then there’s an additional 12 hours (approximately) needed to actually train the model.

Are you using Tacotron 2 for synthesis?
Yeah, I’m using fine-tuned versions of Tacotron 2.

I saw you’ve struggled getting enough dialogue to fully develop some models, like with Mr. Rogers. Have there been other voices you’ve wanted to synthesize, but it’s just too challenging to find a corpus to work from?
Yeah, several. Recently I tried to make one for Theodore Roosevelt, but there’s only about 30 minutes of audio that exists for him (and it’s pretty poor quality), so the model didn’t really come out well.

The Crocodile Hunter (Steve Irwin) is another one I really want to do, and I can find enough audio, but I haven’t been able to find any accurate transcripts or subtitles yet (it’s very tedious for me to transcribe the audio myself).

How do you decide the voices and dialogue to pair together?
I try to consistently have all my voices read the Navy Seals Copypasta and the first few lines of the Book of Genesis, since it’s easier to hear the nuances of each voice when I can compare them to other voices reading the same text. Other than that, there’s no real method to it. If I have an idea for a voice/text combination that I think would be funny or interesting enough to be worth the effort of making the video, then I’ll do it.

What do these videos mean to you? Is it more of a technical demonstration or a form of creative expression?
I wouldn’t really consider my videos to be a technical demonstration, since I’m definitely not the first to make realistic speech synthesis impersonations of well-known voices, and also the models I’m using aren’t state-of-the-art anymore.

Mainly, I’m just making these videos for entertainment. Sometimes I just have an idea for a video that I really want to exist, and I know that if I don’t make it myself, no one else will.

On the more serious side, the other reason I made the channel was because I wanted to show that synthetic media doesn’t have to be exclusively made for malicious/evil purposes, and I think there’s currently massive amounts of untapped potential in terms of fun/entertaining uses of the technology. I think the scariness of deepfakes and synthetic media is being overblown by the media, and I’m not at all convinced that the net impact will be negative, so I hoped that my channel could be a counterexample to that narrative.

Are you worried about the legal future for creative uses of this technology?
Sure. I expect that this technology will improve even more over the next few years, both in terms of accuracy and ease of use/accessibility. Right now it seems to be legally uncharted waters in some ways, but I think these issues will need to be settled fairly soon. Hopefully the technology won’t be stifled by overly restrictive legal interpretations.

It seems inevitable that, at some point, an artist’s voice will be used against their will: guesting on a track without permission, promoting products they aren’t paid for, or maybe just saying things they don’t believe. What would you say to artists or other public figures who are worried that this technology will damage their rights and image?
There are always trade-offs whenever a new technology is developed. There are no technologies that can be used exclusively for good; in the hands of bad people, anything can be used maliciously. I believe that there are a lot of potential positive uses of this technology, especially as it gets more advanced. It’s possible I’m wrong, but for now at least I’m not convinced that the potential negative uses will outweigh that.


Thanks to the anonymous creator of Vocal Synthesis for their time. You can subscribe to the YouTube channel (for now) for new videos, follow updates and remixes in the /r/VocalSynthesis subreddit, and find the video mirror on LBRY.
