Artificial intelligence is nothing new, but it seems like we made an overnight jump from rudimentary tools like Alexa and Siri to mindblowing AIs like ChatGPT and Midjourney that can generate elaborate content on the fly. The AI revolution is sure to change the fields of customer service, sales, medicine, law, programming, and many others, likely in ways we can’t predict.
There’s no question we’re in the midst of an AI revolution, but how did we get here? AI isn’t a new technology; its origins date back to the 1940s. So what caused the sudden explosion in powerful AI apps?
To find out, we spoke to Jon Stokes, co-founder of Ars Technica, who now writes about AI on his Substack. Stokes is no newcomer to the field: he has been studying AI since the early 1990s.
“I was an electrical engineering major with a computer focus and a math minor. I took neural network and computer vision classes. And so we did a lot of pattern recognition stuff like that,” Stokes recalls.
Here’s what Stokes told us about how we got to the current AI revolution.
Early days: debate over neural networks
“There’s been a sequence of innovations with neural nets that have brought us to this point. So probably the first was backpropagation in the 80s, and so I learned backpropagation when I did neural networks in undergrad in the 90s,” Stokes says.
Neural networks (or neural nets) are computer systems that attempt to mimic how the human brain functions. They’re not at all new: the idea dates back to the 1940s. Neural nets learn by example. For instance, an image-recognition neural network might be fed thousands of car images to learn what a car looks like.
“It wasn’t clear that that was actually going to work—just add more layers—but it did work,” Stokes says.
The popularity of neural networks has risen and fallen over the decades, but they have ultimately proved to be key to the AI revolution. You may have heard the term “deep learning,” which refers to a neural network with multiple layers. Another popular term these days is artificial neural network (ANN).
Stokes also mentioned backpropagation. In the simplest terms, backpropagation trains a neural network by measuring how far its output is from the correct answer, then working backward through the layers and adjusting each connection so the error shrinks with every pass over the training data.
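To make that concrete, here’s a minimal sketch in Python using NumPy (our illustration, not code Stokes described): a tiny two-layer network that learns the XOR pattern by running examples forward and pushing the error back through its layers.

```python
# A toy two-layer neural network trained with backpropagation.
# Illustrative sketch only, not production AI code.
import numpy as np

rng = np.random.default_rng(0)

# Training examples: inputs and the answers we want the network to learn (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two layers of weights: input -> hidden, hidden -> output.
W1 = rng.normal(size=(2, 8))
W2 = rng.normal(size=(8, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for step in range(10000):
    # Forward pass: run every example through both layers.
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Measure the error, then propagate it backward through the layers.
    error = output - y
    grad_output = error * output * (1 - output)                  # output layer
    grad_hidden = (grad_output @ W2.T) * hidden * (1 - hidden)   # hidden layer

    # Nudge every weight in the direction that shrinks the error.
    W2 -= learning_rate * hidden.T @ grad_output
    W1 -= learning_rate * X.T @ grad_hidden

print(np.round(output, 2))  # should end up close to [[0], [1], [1], [0]]
```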
Stokes says that neural networks and backpropagation alone would eventually have gotten us to this point through sheer scale: more neural net layers, increasingly powerful hardware, and massive amounts of online training data. But a breakthrough in 2017 accelerated the current AI revolution.
The AI revolution breakthrough: transformers
A transformer is a deep-learning architecture developed by Google researchers in 2017. The GPT in ChatGPT stands for Generative Pre-trained Transformer.
“The transformer is an architecture that was created for natural language processing. It’s a type of neural network architecture that is good at keeping track of sequences. It’s good at finding structure in input, whether that’s code or language,” Stokes says.
For example, previous neural network models were not good at figuring out that a word at the start of a sentence could change the meaning of the last word. The transformer lets neural networks understand connotation and context, making them far more efficient.
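The mechanism behind that is called attention. Here’s a rough Python/NumPy sketch of the scaled dot-product attention at the heart of the transformer; it’s an illustration of the idea only, since real models add learned projections, multiple attention heads, and many stacked layers.

```python
# Scaled dot-product attention: the core trick inside a transformer.
# Illustrative sketch only.
import numpy as np

def attention(Q, K, V):
    # Compare every position's query against every position's key...
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # ...turn the scores into weights that sum to 1 for each position...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # ...and blend every position's value according to those weights.
    return weights @ V

# Four tokens, each represented by an 8-number vector (made-up values).
tokens = np.random.default_rng(1).normal(size=(4, 8))

# In a real transformer, Q, K, and V come from learned projections of the
# tokens; here we reuse the raw token vectors to keep the sketch short.
out = attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): each token's output now depends on every other token
```

The key point is that every position looks at every other position in a single step, which is how the first word of a sentence can directly shape the interpretation of the last.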
“The transformer was a software optimization that let us do more with smaller networks and let us do things more efficiently,” Stokes says.
OpenAI, founded in 2015, took the concept and ran with it. The original GPT, released in 2018, was a research model that attracted little attention outside the field, and GPT-2, announced in 2019, was largely experimental. When ChatGPT launched in November 2022, putting a user-friendly UI on top of GPT-3.5 (and now GPT-4), the technology began to reach critical mass.
ChatGPT reached 100 million monthly users in just a couple of months, making it the fastest-growing consumer application in history.
“GPT-2 was the first language model where you could still tell it’s a bot, but it’s kind of creepily good. It’s weird that the dog can, like, walk on two legs, and look, they’ve trained it to ride a unicycle,” Stokes says.
“And then GPT-3 came along, and it’s like, ‘Okay, now the dog plays Bach and is a master organist.’ While riding the unicycle,” Stokes continues.
Massive scale in the AI revolution
While the transformer was… transformative, we can’t overlook hardware advances in the AI revolution.
“The difference between GPT-2 and GPT-3 was scale. They kept turning up the scaling on the hardware, the training hardware, and the data,” Stokes says.
While there’s been a lot of hype around machine-learning-specific hardware, like Apple’s Neural Engine, Stokes says that much of the hardware that powers AI is off-the-shelf.
“There’s specialty hardware for this, but we’re using graphics processing hardware that works really well for games,” Stokes says.
Standalone GPUs (graphics processing units) gained a foothold in the gaming market in the 1990s, but they have since been adapted for other tasks, like mining cryptocurrency and artificial intelligence.
Stokes explains that GPUs are excellent at “embarrassingly parallel” tasks, a term for computing jobs that can easily be split into many pieces and worked on independently.
Imagine you’re using a food processor to shred cabbage for coleslaw, and it’s a big potluck, so you’re making a lot of slaw. If you only have one food processor, you have to feed the heads of cabbage into it one at a time.
But now, imagine that you have ten food processors and nine assistants to help you. You can shred ten heads of cabbage at once instead of one at a time, getting the job done in a tenth of the time it would have taken otherwise. Shredding cabbage is embarrassingly parallel.
That, in a nutshell, is how parallelization works, and it’s especially effective for AI operations.
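Here’s what that looks like in a few lines of Python, using the standard multiprocessing module as an illustration (CPU processes standing in for the thousands of small cores on a GPU):

```python
# "Embarrassingly parallel" in miniature: the chunks don't depend on each
# other, so a pool of workers can handle them all at once.
from multiprocessing import Pool

def shred(head_of_cabbage):
    # Stand-in for an independent chunk of work.
    return head_of_cabbage * 2

if __name__ == "__main__":
    cabbages = list(range(10))

    # One "food processor": handle the heads one at a time.
    serial = [shred(c) for c in cabbages]

    # Ten "food processors": hand the heads out to ten workers at once.
    with Pool(processes=10) as pool:
        parallel = pool.map(shred, cabbages)

    print(serial == parallel)  # True: same result, just done in parallel
```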
On the technical front, Stokes explains that Nvidia, a producer of video cards, makes a parallel-computing platform and developer toolkit called CUDA (Compute Unified Device Architecture) that underpins most AI software. Developers use Python, an interpreted scripting language, as a front end for CUDA.
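In practice, that front end is usually a Python library such as PyTorch (a common example; Stokes didn’t name a specific one) that calls into CUDA behind the scenes. A hedged sketch of what that looks like:

```python
# Python as a front end to CUDA: a library like PyTorch hides the GPU details
# behind a few calls. Illustrative sketch; using the GPU path requires an
# Nvidia card and a CUDA-enabled PyTorch install.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two large random matrices, created directly on the chosen device.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# On a GPU, this matrix multiply runs as a massively parallel CUDA kernel.
c = a @ b
print(c.shape, device)
```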
The sheer scale made possible by GPUs, cloud computing, and broadband not only powers the “thinking” of an AI but also lets developers scale up the training data.
“We also discovered the value of scale in neural networks where you train it on the entire internet, and you get results that you wouldn’t get if you just trained it on the Encyclopedia Britannica,” Stokes says.
ChatGPT does not crawl the internet directly. OpenAI is being vague about the training data used for its just-released GPT-4, but we know that GPT-3 used pre-assembled sets of crawled web data, such as Common Crawl and the text of Wikipedia. That means that while ChatGPT has access to an enormous amount of data, it’s not always up to date.
AI, in short, is like a million monkeys at a million typewriters.
“But in this case, the monkeys have been trained on Shakespeare,” Stokes says.