Date: 2023-07-30

Since the rise of ChatGPT, the tech industry has been obsessed with the power of generative AI. While coverage has focused on the impact of generative AI on creative work, little has been written about its long-term impact on itself.

How AI works

At their core, tools like ChatGPT, Google Bard, and Llama are advanced auto-complete systems. Looking at what is being typed, the AI system guesses the next word and how it should be sequenced with the previous content to sound understandable.
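To make the auto-complete idea concrete, here is a minimal sketch in Python. It is a toy that only counts which word most often follows another in a tiny sample text. The function names (`train_bigrams`, `guess_next`) and the sample sentence are my own illustrations, and real systems like ChatGPT are vastly more sophisticated, but the "guess the next word from what came before" principle is the same.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def guess_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)
print(guess_next(model, "the"))    # "cat" follows "the" more often than "mat" or "fish"
print(guess_next(model, "zebra"))  # None: the toy model has never seen this word
```

Note that the toy model can only repeat what it has seen: whatever is most common in its training text is what it will suggest, whether or not it is right.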

To do so, a generative AI has to be “trained.” The basic concept of training is a simple rote business of pointing the algorithm at a large set of information and saying “this is what that information means.” So feed an AI millions of pictures of a bicycle and it will be able to identify bicycles consistently in the next round of images it is fed. These days, when you click on a CAPTCHA to prove you’re human, there is a high chance that what you are doing is helping to train an AI to understand a concept (or to better its understanding of that concept).

Just as auto-correct often fails on your phone, generative AI is prone to mistakes. The AI PR machine would like you to think of those mistakes as “hallucinations,” a term it has hijacked to cover up the fact that AI output is not always to be trusted, and part of a broader campaign to convince you that current AI is working very well (I will cover this topic separately in a future piece). What you have to remember is that the mistakes are due to the AI’s “understanding” of what it has been fed.

Those mistakes, when they happen in large enough numbers, can be caught by an AI engineer who can then “retrain” the AI to avoid making them in the future. For example, when Microsoft was initially training some of its AIs, the AI had an issue identifying black women as women. In their analysis, the engineers discovered that the AI was using the presence of make-up on a face as the way to identify a “woman” vs. “not a woman” (the type of example that was easily spoofed in the “Silicon Valley” TV series on HBO):

Hot Dog Identification

The Microsoft AI was sent back to school, trained on a data set of non-US women, and was able to develop new criteria for identifying a woman that were not embedded in the subconscious of the US image data set.

How Generative AI works

The rise of the cloud as a concept made it possible to tie a large number of machines together to do that analysis in parallel at fairly high speed. Generative AI emerged by training the code across multiple machines using large amounts of information (for text-based systems, the resulting models are called Large Language Models, or LLMs for short). For example, OpenAI, which created ChatGPT, is known to have used data from Common Crawl, a non-profit that aggregates the content of billions of web pages to give researchers, or anyone interested in using web data, an organized way of doing so. Google, when developing Google Bard, used the index of the web it uses for search to train its system.

Pointing the software at a large portion of the web is a great way to train an algorithm, as it uses decades of human-created content to learn. Think of it as teaching a kid by getting them to read a whole library. The knowledge embedded in such a large collection lets the algorithm develop a basic “understanding” and make “connections” between different items.

This “Big Data” approach is why so many companies over the last couple of decades have been busy building tools that aggregate data points. It’s how online advertising works (it looks at what you’ve done in the past, what other people have done, and based on that sends you an ad it thinks you’re likely to react to); it’s how search works (it looks at how well people have reacted to a piece of content on the internet for a given topic and points you to that); it’s how credit scores and mortgages work (they look at how good you are at repaying things, and how good people with your education, your zip code, and your income are at repaying things); it is basically the foundation of most “modern” knowledge.

When you think of it in terms of human science and creativity, it is also how we, as humans, function. We learn from the people who have preceded us and build on that knowledge and creativity. A decade ago, Kirby Ferguson presented this in a great TED talk (and an even better series online and later on TV):

Everything is a remix

So a generative AI model basically takes the content of the web and “remixes” it based on what it’s been asked. This leads to some interesting assumptions:

The web is trustworthy

The web will always be trustworthy

The first assumption is mostly valid. Content that has been created on the internet for the past 3 decades is worth our trust, for the most part. The qualifier is important.

Over the last decade, as the web increasingly became a first-party source of information, bad actors started to realize that content on the web could be modified to subtly affect how people thought of certain things. In 2005, satirist Stephen Colbert coined the term “truthiness,” a version of the truth not based in reality, and highlighted that books may be the last place for fact. Around the same time, Harry Frankfurt published “On Bullshit,” a short volume on how the foundation of truth that had served our civilization was starting to erode.

This is the source of a lot of the mistakes generative AI makes. When it makes them (and, again, they are mistakes, not hallucinations), it is because the AI has been trained on faulty data. And AI-generated data is already a challenge, as AI researcher Emily Bender has demonstrated in an excellent co-authored paper that coined the term “stochastic parrot.”

Generative AI in the future

The challenge with that faulty foundation is that it will need a strong corrective. And considering the amount of data that is required to generate a solid data set, that corrective may be hard to find. GPT-4, the most recent model behind ChatGPT, is said to draw on 1 Petabyte of data. By comparison, the US Library of Congress has around 20 Petabytes of data. Or if you were to use a smaller data set (let’s say what’s on Amazon’s book database), you’d have about 750 million books. In other words, it’s a lot of content and yet it needs more to get better.

To get at more content, the easy way is to look at the internet. Every day, billions of people and businesses are posting on social media, sharing pictures and sounds, and creating more content. That is what the second assumption of generative AI (the web will always be trustworthy) is based on. All that content is what the next generation of AI will learn from.

Embedded in that model is the idea that humans will continue generating that content at the same speed or faster… except humans can be lazy, which leads us to the Generative AI Downward Loop.

Generative AI Downward Loop

One of the wonderful parlor tricks of generative AI is how quickly it can generate content that is “close enough” to good that it can pass for quality content. And in order to understand why I’m pointing to a downward loop, you have to consider humans again, and how a lot of content on the web is generated.

In 2009, I asserted that media could be classified across 3 dimensions:

What my model did not account for is the rise of semi-professionalized “influencers,” who are making a living selling content that emerged from a subsidy. This is an area where truth is less important than the presentation of a certain viewpoint as fact. In other words, the social web has turned truthiness into a business model, and the data generated from it has made it that much more difficult to tell truthiness from fact, as editors were largely relegated to the side in favor of a more “market-driven” approach.

But as more and more influencers emerged, each influencer has been forced to generate more and more content in order to get heard. With volume of content being more important than quality of content, it has become more difficult for truly “informative” content to break through.

The emergence of generative AI has presented a new tool for influencers to create larger amounts of content more quickly. This, in itself, would not be a bad thing if one could always trust the output of generative AI, or if it were used in ways that did not lead to more issues.

Marketers (another group generating content on the web) are finding generative AI to be a cheaper source of content than actual humans. So they will invest more heavily in AI-generated content.

Publications are also experimenting with replacing journalists with AI, as their business is not to generate content but to generate views on ads. AI can generate more content more cheaply (regardless of truth) against which more ads can be put, which generates more revenue. Because the AI PR machine has convinced people that AI is now “good enough,” the number crunchers are not paying close attention to the quality of the content against which these ads are attached.

Remember that generative AI models are getting their data FROM the web.

And influencers, marketers, and publications are posting this new AI generated content TO the web.

THAT is creating a perfect loop for problems in the long run.

With the quality of content on the web degrading, the quality of the content that generative AI systems produce will degrade too. As those systems use the most common (volume) as opposed to the highest quality (vetted) content to decide how to move forward, this will turn those AI algorithms into idiots (or AIdiots, for short) spewing out garbage content.
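The arithmetic of that loop can be sketched in a few lines of Python. This is a toy model, not a measurement: every number in it (a web that starts out 90% accurate, a model that reproduces its training data with 95% fidelity, three units of AI content posted for every unit of human content) is a made-up assumption of mine, and the function name `simulate_loop` is purely illustrative.

```python
def simulate_loop(generations, human_accuracy=0.9, fidelity=0.95,
                  human_volume=1.0, ai_volume=3.0):
    """Track the share of accurate content on a toy 'web' over time.

    All parameters are hypothetical illustrative values, not real data.
    """
    web_size = 10.0              # arbitrary starting volume of human-written content
    web_accuracy = human_accuracy
    history = [web_accuracy]
    for _ in range(generations):
        # A model trained on the current web reproduces it imperfectly.
        model_accuracy = web_accuracy * fidelity
        # New content posted this round: some human-made, more AI-generated.
        accurate = (web_size * web_accuracy
                    + human_volume * human_accuracy
                    + ai_volume * model_accuracy)
        web_size += human_volume + ai_volume
        web_accuracy = accurate / web_size
        history.append(web_accuracy)
    return history


history = simulate_loop(10)
for generation, accuracy in enumerate(history):
    print(f"generation {generation}: {accuracy:.3f} of web content is accurate")
```

Under these assumptions, the accurate share of the web slips a little with every generation, because each new model learns from a web that already contains the previous model's mistakes. Setting `ai_volume=0` keeps the share flat, which is the point: the decline comes from feeding the output back in.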

Avoiding the Gen AI Doom Loop

At this point, you may say, “OK, great, but can we avoid this?”

The answer is yes, we can. In some quarters, it is already starting to happen. I stumbled on one of the answers while looking something up on Wikipedia a few months ago. The article had a warning that it needed work because it used “too many primary sources.” (The discussion of primary sources on Wikipedia has been archived, but I believe it may be high time to re-open it.) I did not think much about it at the time and went on to do further research on more reliable sites, places where an editor had been involved in creating the content. So books, trusted magazines, or research papers published in vetted areas were my other sources.

What I failed to appreciate at the time is that a primary source is one that is still up to interpretation. One can make a fake artifact (for example, Stern magazine published a forged set of “Hitler Diaries” in 1983, in a scandal that is still remembered 40 years later). In a world where it is easier to create deepfakes (nowadays, deepfake audio or video is just an app away) and where generative AI can create “close enough” fake primary content, a suspicious eye toward any primary source is increasingly important.

That eye has existed for centuries: it is the gatekeeper, the editor, the critic. So while I’ve been advocating for a few decades for a primary source approach that gives everyone free access to speech, I now believe that there are two components to free speech. First, there is a right of speech, which should remain the same. The internet has given everyone the ability to be that primary source and to interact with primary sources around the globe. Second, however, there is increasingly a need for a return to the “vetted” model.

The blue checkmarks on Twitter had initially created an environment where such validation was possible. The paid model of validation that is emerging (whether it is “X Blue” or “Verified by Meta”) solves part of the problem but is not the ultimate solution. The reason is that “vetted parties” are also increasingly engaging in truthiness. In a world where the primary source is no longer necessarily trustworthy, it is now important to create a second layer of truth seekers to verify the primary content.

That, in itself, returns us to the eternal challenge of who watches the watchers.

Again, here, the model of Wikipedia can be instructive. There are people who obsess about specific topics and get very good at understanding what is and isn’t true within a topic. Creating a world that balances the wisdom of the crowd with paid editors, unpaid editors, algorithmic tools for proofing, and other yet-to-be-invented AI-driven technologies to weigh truth vs. truthiness may be a way to a better future.

I would also recommend that the generative AI manufacturers look critically at the data they use for their systems and be transparent about where their AI learned from. This would go a long way towards creating a world where AI and humans can continue to co-create without creating a loop that would doom content on the internet.