Column: These apps and websites use your data to train AI. You’re probably using one right now.
When, earlier this month, Zoom users realized that the company had updated its terms of service to allow it to use data collected from video calls to train its artificial intelligence systems, the backlash was swift. Celebrities, politicians and academics threatened to quit the service. Zoom quickly backtracked.
These are tense times. Many are worried, and quite rightfully so, that AI companies are threatening their livelihoods — that AI services like OpenAI’s ChatGPT, Google’s Bard and Midjourney have ingested work that artists, writers, photographers and content creators have put online, and can now emulate and reproduce it on the cheap.
Other anxieties are more diffuse. We’re not yet entirely certain what these AI companies are capable of, exactly, or to what ends their products will be used. We worry that AI can be used to mimic our digital profiles, our voices, our identities. We worry about scams and exploitation.
Which is why the outrage against Zoom’s policy makes perfect sense — videoconferencing is one of the most intimate, personal and data-rich services we use. Whenever we Zoom — or FaceTime or Google Meet — we are transmitting detailed information about our faces, homes and voices to our friends, family and colleagues; the notion that such data would be mined to train an AI that could be used for any purpose a tech company sees fit is disconcerting, to say the least.
And it raises the question: What kind of info are we comfortable forking over to the AIs, if any? Right now we are in the midst of a destabilizing moment. It’s alarming, yes, but it’s also an opportunity to renegotiate what we do and do not want to hand over to tech giants that have been gathering our personal data for decades now. But to make those sorts of decisions, first we have to know where we stand. What are the websites and apps we use every day doing with our data? Are they using it to train their AI systems? What can we do about it if so?
A good rule of thumb, to begin with: If you are posting pictures or words to a public-facing platform or website, chances are that information is going to be scraped by a system crawling the internet to gather data for AI companies, and very likely used to train an AI model of one kind or another — if it hasn’t been already.
WEBSITES
If you have a website for your business, a personal blog, or write for a company that publishes stories or copy online, that information is getting hoovered up and put to work training an AI, no doubt about it. Unless, that is, the website owner has put in certain safeguards to keep AI crawlers out, but more on that in a second.
The sort of AI that has made headlines this year — OpenAI’s ChatGPT and DALL-E, Google’s Bard, Meta’s LLaMA — is built on what’s more technically known as large language models, or LLMs. Simply put, LLMs work by “training” on large data sets of words and images. Very large data sets: Google’s “Colossal Clean Crawled Corpus,” or C4, spans 15 million websites.
Earlier this year, investigative reporters at the Washington Post teamed up with the Allen Institute for AI to analyze the kinds of websites that were scraped to build that data set, which has played a major role in training many of the AI products you’re most familiar with. (Newer AI products have been trained on data sets even bigger than that.)
Everything from Wikipedia entries to Kickstarter projects to New York Times stories to personal blogs was scanned for use in amassing the data set. Perhaps we should see it as a badge of honor that we here at the Los Angeles Times provided C4 with the sixth-largest amount of training data of any site on the web. (Or maybe we should, you know, ask for some compensation for our contributions.) The largest source of data in C4, by some margin, is the U.S. patent office. My own embarrassing personal website, brianmerchant.org, was scraped by the AI crawler and deposited into C4 — when you chat with an AI bot, just bear in mind that roughly 1/15,000,000th of what it learned may be the online CV of Brian Merchant.
OK, so let’s say you don’t want OpenAI building ChatGPT-7 with fresh posts from your personal blog, or your copywriters’ finely crafted prose. What can you do?
Well, just this week, OpenAI announced its latest web-crawling tool, GPTBot, along with instructions on how to block it. Website owners and admins who want to shut out future crawling should add an entry for GPTBot to their site’s robots.txt file telling it “Disallow: /” — a sample entry is below. As some have noted, not all crawlers obey such commands, but it’s a start. Still, any data that have already been scraped will not be removed from those data sets.
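Per OpenAI’s published instructions, the whole entry is just two lines — one naming the crawler’s user agent, one telling it what it may not touch (a lone slash means the entire site):

User-agent: GPTBot
Disallow: /

Swap that slash for a specific path — “Disallow: /drafts/”, to use a made-up folder name — and the bot is barred only from that corner of the site; everything else remains fair game.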
Furthermore, the web crawlers looking for data aren’t supposed to penetrate paywalls or websites that require a password for entry, so putting your site under lock and key should keep it out of the training sets.
So that’s the open web — what about apps?
First off, the same principle that applies to the web applies to 99% of apps out there — if you are creating something to post publicly on a digital platform, chances are it’s going into one AI training set or another, or already has. Remember, most social media apps have, from the beginning, predicated their entire business models on encouraging you to produce content that they can analyze and use to sell you ads via automated systems. Nothing is sacred here, or even truly private, unless the service in question offers end-to-end encryption or particularly good privacy settings.
TIKTOK
Take TikTok, which is one of the most-downloaded apps in the world and boasts over a billion users. It has run on AI and machine learning from the start: its much-discussed algorithm, which serves users the content it thinks they’ll want most, is built on battle-tested techniques such as computer vision and machine learning. Every post submitted to TikTok is scanned, stored and analyzed by AI, and trains the algorithm to get better at sending you content it thinks you’ll like.
Beyond that, we don’t have much information about what ByteDance, the Chinese company that owns TikTok, might plan to do with all the data it has processed. But it has a vast trove of it — from users and creators alike — and a lot is possible.
INSTAGRAM
Now, with Instagram, we know that your posts have been fed into an AI training system operated by Meta, the company that owns Instagram and Facebook. News broke in 2018 that the company had scraped billions of Instagram posts for AI training purposes. The company said it was using those data to improve object recognition and its computer vision systems, but who knows.
Technically, Facebook prohibits scraping, so the biggest crawlers probably haven’t scooped up your posts for wider use in products like ChatGPT. But Meta itself is very much in the AI game, just like all the major tech giants — it has trained its own proprietary system, LLaMA — and it’s not clear what the company is doing with your posts. We do know that it has earmarked user posts for AI processing in the recent past: In 2019, Reuters reported that Facebook contractors were looking at posts, even those set as private, in order to label them for AI training.
TWITTER/X
Like Facebook, X-née-Twitter technically prohibits scraping of its posts, making it harder for bots to get at them. But owner Elon Musk has said that he’s interested in charging AI companies for access to that data, and in using it to train X’s own nascent AI efforts.
“We will use the public tweets — obviously not anything private — for training,” Musk said in a Twitter Spaces chat in July, “just like everyone else has.”
REDDIT
The popular and massive web forum Reddit has been scraped for data plenty. But recently its CEO, Steve Huffman, said that he intends to start charging AI companies for access. So, yes, if you post on Reddit, you’re feeding the bots.
We could keep going down the line — but this sampling should help make the gist of the matter clear: Almost everything is up for grabs if you’re creating content online for public consumption.
So that leaves at least one big question: What about messages, posts and work you make with digital tools for private consumption?
The Zoom issue turned into a mini-scandal because it’s a service not usually meant for public-facing use. And this is where it gets more complicated. It’s case by case, and if you really want to be sure whether the products you’re using are harvesting your words or work for AI training, you’re going to have to dive into some terms of service yourself — or seek out products built with privacy in mind.
GOOGLE / GMAIL
Let’s start with a big one. It’s easy to forget that until a few years ago, Google’s AI read your email: To serve you better ads, the search giant’s automated systems combed your Gmail for data. Google says it doesn’t do that anymore, and claims that the Workspace products you might use, such as Docs or Sheets, won’t be used to train AI without your consent. Nonetheless, authors are uneasy about the prospect that their drafts will wind up training an AI, and quite reasonably so.
GRAMMARLY
Grammarly, the popular grammar and spell-checking tool, explicitly states that any text you place in its system can be used to train AI systems in perpetuity. Every customer, its terms of service say, “acknowledges that a fundamental component of the Service is the use of machine learning…. Customer hereby grants us the right to use, during and after the Subscription Term, aggregated and anonymized Customer Data to improve the Services, including to train our algorithms internally through machine learning techniques.”
In other words, you’re handing Grammarly AI training material every time you check your spelling.
APPLE MESSAGES
Apple’s in the AI game too, though it doesn’t publicly flaunt it as much. And it insists that the kind of machine learning it’s interested in is what’s known as on-device AI — instead of taking your data and adding it to large data sets stored on the cloud, its automated systems live locally on the chips in your device.
Apple harnesses machine learning to do things like improve autocorrect in your text messages, recognize the shape of your face, pick out friends and family members in your camera roll, automatically adjust noise cancellation on your AirPods when it’s loud, and ID that plant you just snapped on a hike. So Apple’s machine learning systems are reading your texts and scanning your photos — but only within the confines of your iPhone. Unlike most of its competitors, it’s not sending that information to the cloud.
ZOOM
And finally, we return to Zoom, because I have one last point to add to the dust-up that got us started here. While Zoom may have added one little line to its terms of service indicating that it will not use your in-call data for its AI services — unless the host of your call has consented, which is a pretty major exception — it can still keep your data for just about everything else.
Here’s the part that still remains very much in effect, every time you boot up Zoom:
“You agree to grant and hereby grant Zoom a perpetual, worldwide, non-exclusive, royalty-free, sublicensable, and transferable license and all other rights required or necessary to redistribute, publish, import, access, use, store, transmit, review, disclose, preserve, extract, modify, reproduce, share, use, display, copy, distribute, translate, transcribe, create derivative works, and process Customer Content.”
In other words, the company can do just about anything it wants with our private recorded conversations, except train AI on them without our consent. That still seems rather onerous!
And therein, ultimately, lies the rub.
So much of what the tech industry is doing with AI is not orders of magnitude more invasive or exploitative than what it’s been doing all along — these are incremental amplifications. The tech giants have harvested, hoarded, scraped and sold our personal data for well over a decade now, and this is just another step.
But we should be grateful that it’s a genuinely unnerving one: It gives us a chance to demand more from the companies that have erected the digital infrastructure, services and playgrounds we spend so much of our time on, even depend on. It gives us the opportunity to renegotiate what we should consider socially — and economically — acceptable in how our data are taken and used.
Adobe, for instance — whose beta users are automatically opted in to having their work help train AI — has promised to pay creators who opt into a program that trains AI on their works. Few have seen any returns as of yet, but it’s an idea, at least.
The best solution right now, if you want to keep your words, images and likeness away from AI, is to use encrypted apps and services that take privacy seriously.
Instead of using Zoom for video calls and texting, use Signal, which is widely available, popular and boasts end-to-end encryption. For email, try a service like Proton Mail, which does not rely on harvesting your data to sell ads, and puts privacy first. If you have a blog or a personal site, you can tell OpenAI’s crawler not to scrape it through robots.txt, as shown above. You can put up a paywall, or require a password to enter — a sketch of the password option follows below.
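As a minimal sketch of that lock-and-key approach — assuming a site served by Apache, with the file path and username as placeholders — an .htaccess file can demand a login for everything behind it:

# .htaccess — require a password for this directory and everything below it
AuthType Basic
AuthName "Private"
AuthUserFile /home/example/.htpasswd
Require valid-user

# Create the password file once on the server:
#   htpasswd -c /home/example/.htpasswd yourname

Other servers and hosting platforms have their own password-protection switches; the point is that anything behind a login is supposed to be off-limits to the crawlers.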
If you’re a developer or a product manager working in good faith on a project that relies on gathering other people’s data, seek consent first. And by all means, keep making noise when other folks don’t. We have a real chance to reevaluate and reestablish a true doctrine of consent online, and to set new standards — before our words are sucked up, mutated and integrated into the chat-borg bots of the future.