05 – Got You


The Technique Required for Scale

Sometime in June 2023, George Hotz—founder of Comma.ai and Tinycorp—leaked information about GPT-4’s architecture during a podcast interview. He claimed that this model was not a “normal” GPT, but rather a Mixture of Experts (MoE), with 8 experts of 220B parameters each, totaling 1.8T parameters.

The Mixture-of-Experts (MoE) approach was neither new nor invented by OpenAI. Its appeal is that it increases a model’s capacity (its total parameter count) while keeping the compute per token constant, or even reducing it at comparable quality.

How MoE Works

It achieves this by replacing the Transformer’s feed-forward (FFN) layer—the one that comes after the attention mechanism—with multiple FFN “experts” and a router that, for each token, selects only a subset (top-k) and then combines their outputs.
Assume a model with 64 experts of 16B parameters each and only 8 experts active per token. During the forward pass, we don’t need to compute over the full 1,024B expert parameters, only over the 128B belonging to the 8 active experts (shared components such as attention are always computed).
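
To make the mechanism concrete, here is a minimal sketch of a top-k MoE feed-forward layer in PyTorch. This is an illustrative toy, not GPT-4’s, Mixtral’s, or any production implementation: the class name, the plain softmax-over-top-k routing, and the tiny dimensions are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """A sparse MoE layer: route each token to its top-k experts (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary FFN block.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to (n_tokens, d_model)
        tokens = x.reshape(-1, x.shape[-1])
        scores = self.router(tokens)                        # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)             # mix only the chosen experts
        out = torch.zeros_like(tokens)
        # Each expert runs only on the tokens routed to it, so most expert
        # parameters are never touched for any given token.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Toy dimensions so the example runs anywhere; the 64-experts / 8-active split
# mirrors the arithmetic above.
layer = MoEFeedForward(d_model=32, d_hidden=64, n_experts=64, top_k=8)
y = layer(torch.randn(2, 10, 32))   # (batch=2, seq_len=10, d_model=32)
```

Real systems add refinements such as load-balancing losses and fused expert kernels, but the routing idea is the same.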

Building and serving these models is more complex, but it’s worth it: you get the behavior of a massive model while activating only a fraction of its parameters for each token.

GPT-3.5-Turbo Captured

After this leak, it would take several months before other labs managed to train and serve MoEs at scale. The first to succeed in the open was the French startup Mistral AI, which enjoyed significant popularity at the time. The team was composed primarily of former Meta employees who had worked on LLaMA, along with researchers from other labs, and since its founding in April 2023 it had presented itself as “the European answer” to U.S. AI development.

Our French friends gave us a major reason to celebrate with the release of Mixtral 8×7B in December. Mixtral was itself a sparse MoE, with 8 experts per layer and only 2 active per token, and for the first time there was broad consensus that we finally had an open model superior to GPT-3.5-Turbo. On top of that, it was open source and, while not exactly small by the standards of the time, it was still within reach for many to run locally.

GPT-4-Turbo Captured

As 2023 came to a close, more than a year after ChatGPT’s release, OpenAI still sat at the top, and GPT-4 remained untouchable.

In February 2024, the Allen Institute for AI (AI2), a non-profit lab founded by Microsoft co-founder Paul Allen, made an unprecedented move that deserves an honorable mention. They introduced OLMo, one of the few model series to date (as of 2025) that is fully open source and reproducible, complete with a detailed paper, weights, training scripts, and public datasets covering every phase of LLM creation (pre- and post-training).
They did not match GPT-4-Turbo—a smaller, more polished, and cheaper version of GPT-4—on any metric, but their transparency and commitment to open science were remarkable.

The lab that would achieve this feat just one month later, in March, was Anthropic. They released the Claude 3 model family in three sizes (Haiku, Sonnet, and Opus), with Haiku being the smallest and Opus the largest.

As for whether Opus 3 was a better model than the GPT-4-Turbo of that time—it was debatable. But that’s precisely the point: it was debatable. OpenAI’s supremacy was no longer unquestionable.

And as clear evidence of the model’s capabilities, many people canceled their ChatGPT subscriptions and switched to Claude. Its highly polished UI later gained a new feature so useful that every competitor eventually copied it: Artifacts, a side panel next to the chat that renders the code generated by the model, offering a glimpse into the company’s future focus (code).