What about the data we generate as content and art? It’s certainly not legal to copy someone else’s work and present it as your own. But there are AI systems built by scraping as much human-generated content from the web as possible, precisely so they can generate similar content. Can GDPR or any other EU-centered policy protect this kind of work? As it turns out, like most things in the machine learning world, it depends on the data.
Privacy versus ownership
GDPR’s primary purpose is to protect European citizens from harmful actions and consequences related to the misuse, abuse, or exploitation of their private information. It’s not much use to citizens (or organizations) when it comes to protecting intellectual property (IP). Unfortunately, the policies and regulations put in place to protect IP are, to the best of our knowledge, not equipped to cover data scraping and anonymization. That makes it difficult to understand exactly where the regulations apply when it comes to scraping the web for content.

These techniques, and the data they obtain, are used to create massive datasets for training large AI models such as OpenAI’s GPT-3 and DALL-E 2. The only way to teach an AI to imitate humans is to expose it to human-generated data, and the more data you feed an AI system, the more robust its output tends to be.

It works like this: imagine you draw a picture of a flower and post it to an online forum for artists. Using scraping techniques, a tech outfit sucks up your image along with billions of others to create a massive dataset of artwork. The next time someone asks the AI to generate an image of a “flower,” there’s a greater-than-zero chance your work will feature in the AI’s interpretation of the prompt. Whether such use is ethical remains an open question.
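For the curious, here’s roughly what that kind of scraping looks like in code. This is a minimal sketch in Python; the forum URL and page structure are hypothetical, and real dataset-building pipelines operate across billions of pages (and are subject to each site’s terms of service and robots.txt).

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical forum gallery; real pipelines crawl billions of pages.
FORUM_URL = "https://example-art-forum.test/gallery"
OUT_DIR = "scraped_images"


def scrape_gallery(url: str, out_dir: str) -> None:
    """Download every image linked from a single gallery page."""
    os.makedirs(out_dir, exist_ok=True)
    page = requests.get(url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    for i, img in enumerate(soup.find_all("img")):
        src = img.get("src")
        if not src:
            continue
        # Resolve relative image paths against the page address.
        image = requests.get(urljoin(url, src), timeout=10)
        if image.ok:
            with open(os.path.join(out_dir, f"img_{i}.jpg"), "wb") as f:
                f.write(image.content)


if __name__ == "__main__":
    scrape_gallery(FORUM_URL, OUT_DIR)
```

Multiply that one page by the whole public web, and you have the raw material for a training set. Your flower drawing is just one of the files in `scraped_images`.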
Public data versus PII
While the GDPR’s regulatory oversight could be described as far-reaching when it comes to protecting private information and giving Europeans the right to erasure, it seemingly does very little to protect content from scraping. That doesn’t mean GDPR and other EU regulations are entirely toothless in this regard, however. Individuals and organizations have to follow very specific rules for scraping PII, lest they fall afoul of the law, and that can become quite costly.

As an example, it’s becoming nigh impossible for Clearview AI, a company that builds facial recognition databases for government use by scraping social media data, to conduct business in Europe. EU watchdogs in at least seven nations have either issued hefty fines or recommended them over the company’s refusal to comply with GDPR and similar regulations.

At the other end of the spectrum, companies such as Google, OpenAI, and Meta employ similar data scraping practices for many of their AI models, either directly or via the purchase or use of scraped datasets, without repercussion. And while big tech has faced its fair share of fines in Europe, very few of the infractions have involved data scraping.
Why not ban scraping?
Scraping, on the surface, might seem like a practice with too much potential for misuse not to ban outright. However, for many organizations that rely on scraping, the data being obtained isn’t necessarily “content” or “PII,” but information that can serve the public.

We reached out to the UK’s data privacy watchdog, the Information Commissioner’s Office (ICO), to find out how it regulates scraping techniques and internet-scale datasets, and to understand why it’s so important not to over-regulate. A spokesperson for the ICO told TNW:

In other words, it’s more about the kind of data being used than how it’s gathered. Whether you copy and paste images from Facebook profiles or use machine learning to scrape the web for labeled images, you’re likely to run afoul of GDPR and other European privacy regulations if you build a facial recognition engine without consent from the people whose faces are in its database. But it’s generally acceptable to scrape the internet for massive amounts of data as long as you either anonymize it or ensure there’s no PII in the dataset.
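What adequate anonymization looks like is itself a judgment call. In its simplest form, it means stripping identifiers before a record lands in a dataset. Here’s a minimal sketch, assuming regex patterns for emails and phone numbers are good enough; real pipelines pair patterns like these with named-entity recognition to catch names and street addresses too.

```python
import re

# Illustrative patterns only; production anonymizers are far more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace obvious PII with placeholders before a record enters a dataset."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(scrub("Reach me at jane.doe@example.com or +44 20 7946 0958."))
# -> Reach me at [EMAIL] or [PHONE].
```

Note that “Jane” sails straight through this filter, which hints at why regulators focus on outcomes rather than mandating any particular scrubbing technique.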
Further gray areas
However, even within the allowed use cases, there are still gray areas that concern private information. GPT-2 and GPT-3, for example, are known to occasionally output PII in the form of addresses, phone numbers, and other information that’s apparently baked into their corpora via large-scale training datasets. Here, where it’s evident that OpenAI, the company behind GPT-2 and GPT-3, is taking steps to mitigate the problem, GDPR and similar regulations are doing their job. Simply put, we can either choose not to train large AI models at all, or give the companies training them the opportunity to explore edge cases and attempt to mitigate concerns.

What might be needed is a GDUR, a General Data Use Regulation: something that could give clear guidelines on how human-generated content can legally be used in large datasets. At a minimum, it seems worth having a conversation about whether European citizens should have as much right to have the content they create removed from datasets as they do their selfies and profile pics.

For now, in the UK and throughout the rest of Europe, it seems the right to erasure only extends to our PII. Anything else we put online is likely to end up in some AI’s training dataset.