Merchant: The AI industry's plan to nuke its copyright problem

ChatGPT replicates New York Times articles nearly verbatim, a lawsuit by the newspaper claims. But OpenAI and other AI companies say what they do is legally fair use.

(Richard Drew / Associated Press)

By Brian Merchant, Technology Columnist

Jan. 12, 2024 3 AM PT

This time in 2023, the world was in thrall to the rise of OpenAI’s dazzling chatbot. ChatGPT was metastasizing like a fungal infection, amassing tens of millions of users a month. Multibillion-dollar partnerships materialized, and investments poured in. Big Tech joined the party. AI image generators like Midjourney took flight.

Just a year later, the mood has darkened. The surprise sacking and rapid reinstatement of OpenAI Chief Executive Sam Altman gave the company an embarrassing emperor-has-no-clothes moment. Profits are scarce across the sector, and computing costs are sky high. But one issue looms large above all and threatens to bring the fledgling industry back to earth: copyright.

The legal complaints that cropped up throughout last year have grown into a thundering chorus, and the tech companies say they now present an existential threat to generative AI (the kind that can produce writing, pictures, music and so on). If 2023 was the year the world marveled at AI content generators, 2024 may be the year that the humans who created the raw materials that made that content possible get their revenge — and maybe even claw back some of the value built on their work.

In the last days of December, the New York Times filed a bombshell lawsuit against Microsoft and OpenAI, alleging that “millions of its articles were used to train automated chatbots that now compete with the news outlet as a source of reliable information.” The Times’ lawsuit joins a host of others (class-action lawsuits filed by illustrators, by the photo service Getty Images, by George R.R. Martin and the Authors Guild, and by anonymous social media users, to name a few), all alleging that companies that stand to profit from generative AI used the work of writers, reporters, artists and others without consent or compensation, infringing on their copyrights in the process.

Each of these lawsuits has its merits, but the Gray Lady’s entrance into the arena changes the game. For one thing, the Times is influential in shaping national narratives. For another, the Times lawsuit is uniquely damning; it’s loaded with example after example of how ChatGPT replicates news articles nearly verbatim and offers the responses to its paying customers, free of attribution.

It’s not just the lawsuits: The heat is getting turned up by Congress, researchers and AI experts too. On Wednesday, a congressional hearing saw senators and media industry representatives agree that AI companies should pay licensing fees for the material they use to train their models. “It’s not only morally right,” said Sen. Richard Blumenthal (D-Conn.), who chairs the subcommittee that held the hearing, according to Wired. “It’s legally required.”

Meanwhile, a fiery study recently published in IEEE Spectrum, co-written by the cognitive scientist and AI expert Gary Marcus and the film industry veteran Reid Southen, shows that Midjourney and Dall-E, two of the leading AI image generators, were trained on copyrighted material and can regurgitate that material at will, often without even being prompted to.

“Our experiments make it all but certain that these systems are in fact training on copyrighted material,” Marcus told me, something that the companies have been coy about copping to explicitly. “The companies have been far from straightforward in what they’re using, so it was important to establish that they are using copyrighted materials.” Also important: that the copyright-infringing works come spilling out of the systems with little prodding. “You don’t need to prompt it, to say ‘make C3P0’ — you can just say ‘draw golden droid.’ Or ‘Italian plumber’ — it will just draw Mario.”

This has serious implications for anyone using the systems in a commercial capacity. “The companies whose properties are infringed — Mattel, Nintendo — are going to take an interest in this,” Marcus says. “But the user is left vulnerable too — There’s nothing in the output that says what the sources are. In fact the software isn’t capable of doing that in a reliable way. So the users are on the hook and have no clue as to whether it’s infringing or not.”

There’s also a sense of momentum that’s beginning to build behind the simple notion that creators should be compensated for work that’s being used by AI companies valued at billions or tens of billions — or hundreds of billions of dollars, as Google and Microsoft are. The notion that generative AI systems are at root “plagiarism machines” has become increasingly widespread among their critics, and social media is teeming with opprobrium against AI.

But those AI companies aren’t likely to relent. We saw a foreshadowing of how the AI companies would respond to copyright concerns at large last year, when famed venture capitalist and AI evangelist Marc Andreessen’s firm argued that AI companies would go broke if they had to pay copyright royalties or licensing fees. Just this week, British media outlets reported that OpenAI has made the same case, seeking an exemption from copyright rules in England, claiming that the company simply couldn’t operate without ingesting copyrighted materials.

“Because copyright today covers virtually every sort of human expression — including blogposts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI argued in its submission to the House of Lords. Note that both Andreessen and OpenAI’s statements underscore the value of copyrighted work in arguing that AI companies shouldn’t have to pay for it.

What can the AI companies do about it?

First, they’re pleading poverty. There’s just too much material out there to compensate everyone who contributed to making their system work and to making their valuation go through the roof. “Poor little rich company that’s valued at $100 billion can’t afford it,” Marcus says. “I don’t know how well that’s going to wash, but that’s what they’re arguing.”

The AI companies also argue what they’re doing falls under the legal doctrine of fair use — probably the strongest argument they’ve got — because it’s transformative. This argument helped Google win in court against the big book publishers when it was copying books into its massive Google Books database, and defeat claims that YouTube was profiting by allowing users to host and promulgate unlicensed material.

Next, the AI companies argue that copyright-violating outputs like those uncovered by Marcus, Southen and the New York Times are rare or are bugs that are going to be patched.

“They say, ‘Well this doesn’t happen very much. You need to do special prompting.’ But the things we asked it were pretty neutral — and we still got” copyrighted material, Marcus says. “This is not a minor side issue — this is how the systems are built. It is existential for these companies to be able to use this amount of data.”

Finally, aside from just making arguments in court and in statements, the AI companies are going to use their ample resources to lobby behind the scenes and throw their power around to help make their case.

Again, the generative AI industry isn’t making much money yet — last year was essentially one massive product demo to hype up the technology. And it worked: The investment dollars did pour in. But that doesn’t mean the AI companies have figured out ways to build a sustainable business model. They’re already operating under the assumption that they will not pay for things such as training materials, licenses or artists’ labor.

Of course, it is in no way true that the likes of Google, Microsoft, or even OpenAI cannot afford to pay to use copyrighted works — but Silicon Valley is at this point used to cutting labor and the cost of creative works out of the equation, and has little reason to think it would not be able to do so again. From Uber to Spotify, the business models of many of this century’s biggest tech companies have been built on the assumption that labor costs could be cut out or minimized. And when creative industries argued that YouTube allowed pirated and unlicensed materials to proliferate at the workers’ expense, and backed the Stop Online Piracy Act (SOPA) to fight it, Google was instrumental in stopping the bill, organizing rallies and online campaigns, and lobbying lawmakers to jump ship.

William Fitzgerald, a partner at the Worker Agency and former member of the public policy team at Google, tells me he sees a similar pressure campaign taking shape to fight the copyright cases, one modeled on the playbook Google has used successfully in the past: marshaling third-party groups and organs such as the Chamber of Progress to push the idea that using copyrighted works for generative AI is not just fair use, but something that’s being embraced by artists themselves, not all of whom are so hung up on things like wanting to be paid for their work. He points to a pro-generative AI open letter signed by AI artists that was, according to one of the artists involved, organized by Derek Slater, a former Google policy director whose firm does tech policy campaign work on AI, and the same person who took credit for organizing the anti-SOPA efforts. Fitzgerald also sees Google’s fingerprints on Creative Commons’ embrace of the argument that AI art is fair use, as Google is a major funder of the organization.

“It’s worrisome to see Google deploy the same lobbying tactics they’ve developed over the years to ensure workers don’t get paid fairly for their labor,” Fitzgerald said. And OpenAI is close behind. It is not only taking an approach similar to Google’s in heading off copyright complaints, but it is also hiring the same people: It hired Fred von Lohmann, Google’s former director of copyright policy, as its top copyright lawyer.

“It appears OpenAI is replicating Google’s lobbying playbook,” Fitzgerald says. “They’ve hired former Google advocates to affect the same playbook that’s been so successful for Google for decades now.”

Things are different this time, however. There was real grassroots animosity against SOPA, which was seen at the time as engineered by Hollywood and the music industry; Silicon Valley was still widely beloved as a benevolent inventor of the future, and many didn’t see how having an artist’s work uploaded to a video platform owned by the good guys on the internet might be detrimental to their economic interests. (Though many did!)

Now, however, workers in the digital world are better prepared. Everyone from Hollywood screenwriters to freelance illustrators to part-time copywriters to full-time coders can recognize the potential material effect of a generative AI system that can ingest their work, replicate it, and offer it to users for a monthly fee — paid to a Silicon Valley corporation, not them.

“It’s asking for an enormous giveaway,” Marcus says. “It’s the equivalent of a major land grab.”

Now, there are many in Silicon Valley who are of course genuinely excited about the potential of AI, and many others who are genuinely oblivious to matters of political economy, who want to see the gains made as quickly as possible and do not realize how these work-automating systems will be used in practice. Others may simply not care. But for those who do, Marcus says there’s a simple way forward.

“There’s an obvious alternative here — OpenAI’s saying that we need all this or we can’t build AI — but they could pay for it!” We want a world with artists and with writers, after all, he adds, one that rewards artistic work — not one where all the money goes to the top because a handful of tech companies won a digital land grab.

“It’s up to workers everywhere to see this for what it is, get organized, educate lawmakers and fight to get paid fairly for their labor,” Fitzgerald says. “Because if they don’t, Google and OpenAI will continue to profit from other people’s labor and content for a long time to come.”
