- The first open source models to top Chatbot Arena were built for as little as $300 in compute costs.
- Vicuna-13B, Guanaco-33B, and WizardLM-70B proved open source models could compete on the benchmarks of the GPT-4 era.
- QLoRA, developed by Tim Dettmers, made fine-tuning 65B parameter models possible on a single 48GB GPU.
- These projects laid the foundation for today’s open-weight AI ecosystem, from Mistral to DeepSeek-R1.
The Open Source Models Nobody Remembers — But Everyone Owes
Right now, the conversation about open source models is dominated by Llama 3, Mistral, and DeepSeek-R1. But in 2023, before any of those existed, four projects quietly proved that you didn’t need OpenAI’s budget to build something extraordinary. Vicuna-13B, Guanaco-33B, Vicuna-33B, and WizardLM-70B were the first open source models to reach the top of Chatbot Arena, and most people have already forgotten their names.
That’s worth fixing. Because the story of how they got there is one of the most important chapters in the recent history of AI, and it starts with $300 and eight A100s.
Vicuna: $300 and a ShareGPT Scraper
In February 2023, Meta released LLaMA — not as a product, but as a research artefact. Stanford researchers quickly grabbed LLaMA-7B and built Alpaca, a fine-tuned instruction model that got people excited. But a team at UC Berkeley, led by Wei-Lin Chiang and Lianmin Zheng, thought they could go further.
Their insight was simple and slightly audacious: scrape ShareGPT.com for roughly 70,000 real conversations people had shared from ChatGPT, use those as training data, fine-tune LLaMA-13B with supervised fine-tuning, and spin it all up on eight A100 GPUs using SkyPilot. Total compute cost: around $300.
On March 30, 2023, they released Vicuna-13B and immediately ran it through a GPT-4-judged evaluation against ChatGPT, Bard, Alpaca, and the base LLaMA. The result? Vicuna scored roughly 92% of ChatGPT’s quality according to that automated judge. Was GPT-4-as-evaluator a perfect methodology? No — the community pushed back on that almost immediately. But the demo went viral anyway. Over 500,000 people tried it within days. HuggingFace’s servers felt it.
When LMSYS launched Chatbot Arena on May 3, 2023 (a blind, human-preference leaderboard where users vote on which of two anonymous models gives the better answer), Vicuna-13B debuted at the top with an Elo rating of 1,169. That first leaderboard covered only open models; proprietary models such as GPT-4 were added shortly afterwards. For a model that cost less to train than a weekend flight to Vegas, that was remarkable. Vicuna proved that open source models could punch well above their weight class.
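For readers unfamiliar with Elo, the rating behind that number can be sketched in a few lines. This is the textbook Elo update, not LMSYS's actual code; the step size k = 32, the starting ratings, and the model names are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote.

    outcome_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    k is an assumed step size, not necessarily what Chatbot Arena used.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: model_a wins twice, then loses once.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", 1.0),
         ("model_a", "model_b", 1.0),
         ("model_a", "model_b", 0.0)]
for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
```

Under this model, a 100-point gap (say 1,169 versus 1,069) corresponds to an expected win rate of roughly 64% for the higher-rated model, which is why even modest Elo differences on the leaderboard were treated as meaningful.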
Guanaco and the QLoRA Breakthrough
Vicuna showed you could fine-tune cheaply. Then Tim Dettmers came along and showed you could do it on almost nothing.
Dettmers, who built the bitsandbytes quantization library that had already become essential plumbing for anyone running LLMs on consumer hardware, published QLoRA in mid-2023. The paper — which landed at NeurIPS 2023 and accumulated over 650 citations — described a technique for fine-tuning a 65B parameter model on a single GPU. Three innovations made that possible:
- 4-bit NormalFloat (NF4): a new quantization data type optimised specifically for normally-distributed neural network weights
- Double Quantization: quantizing the quantization constants themselves, squeezing out extra memory
- Paged Optimizers: using NVIDIA’s unified memory to absorb the memory spikes that occur during gradient checkpointing, instead of running out of GPU memory
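The first two ideas can be illustrated with a toy, standard-library-only sketch. It is not the bitsandbytes implementation: the codebook below uses evenly spaced normal quantiles as a simplification of NF4, and every function name is hypothetical. The block sizes (64 and 256) and constant widths in the overhead arithmetic follow the values reported in the QLoRA paper.

```python
from statistics import NormalDist


def quantile_codebook(bits: int = 4) -> list[float]:
    """Toy codebook: evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Simplification: the real NF4 type is constructed differently (for
    example, it represents zero exactly). This only shows the quantile idea.
    """
    n = 2 ** bits
    normal = NormalDist()
    levels = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in levels)
    return [v / peak for v in levels]


def quantize_block(weights, codebook):
    """Scale one block by its absolute maximum, then store nearest-level indices."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
             for w in weights]
    return codes, absmax


def dequantize_block(codes, absmax, codebook):
    """Map stored indices back to approximate weights."""
    return [codebook[c] * absmax for c in codes]


def constant_overhead_bits(block=64, const_bits=32, second_block=None, second_bits=32):
    """Extra bits per parameter spent on storing quantization constants."""
    if second_block is None:
        return const_bits / block
    # Double quantization: first-level constants are stored in const_bits,
    # plus one second-level constant per second_block first-level constants.
    return const_bits / block + second_bits / (block * second_block)


codebook = quantile_codebook()
block = [0.02, -0.31, 0.11, 0.45, -0.07, 0.0, 0.26, -0.18]  # made-up weights
codes, absmax = quantize_block(block, codebook)
restored = dequantize_block(codes, absmax, codebook)

plain = constant_overhead_bits(block=64, const_bits=32)
double = constant_overhead_bits(block=64, const_bits=8, second_block=256, second_bits=32)
saved_gb = 65e9 * (plain - double) / 8 / 1e9  # savings for a 65B-parameter model
```

With these settings the constants cost 0.5 bits per parameter without double quantization and about 0.127 bits with it, which works out to roughly 3 GB saved on a 65B model. That is a small share of the total, but it matters when the goal is to fit on one card.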
The result was Guanaco — a family of open source models fine-tuned using QLoRA on the OASST1 dataset. The 33B version hit Chatbot Arena in June 2023 with an Elo score of 1,065, briefly edging out Vicuna-13B (Elo 1,061). By July, it had been dethroned by Vicuna-33B, but the point had been made.
QLoRA’s real legacy wasn’t Guanaco the model. It was the democratisation of fine-tuning itself. Before QLoRA, serious LLM fine-tuning required enterprise GPU clusters. After it, a researcher with a single RTX 3090 could meaningfully experiment with open source models at scales that would have seemed inaccessible months earlier. The technique directly influenced Orca, Phi, and the entire wave of efficient fine-tuning research that followed. Luke Zettlemoyer, a leading NLP researcher who collaborated on Guanaco, described QLoRA as fundamentally reshaping what academic labs could do in the space.
Vicuna-33B and the Bigger-is-Better Phase
With QLoRA making larger open source models trainable, LMSYS took the obvious next step. On June 22, 2023, they released Vicuna-33B, trained on the same ShareGPT data but starting from LLaMA-33B instead of 13B. They evaluated it on MT-Bench — a set of 80 multi-turn questions designed to probe reasoning, coding, and instruction-following — and the results were solid enough to justify the release.
By July, Vicuna-33B sat atop the Arena leaderboard with an Elo of 1,096, displacing Guanaco-33B. The gap between 13B and 33B was real and measurable, even if it didn’t make Vicuna-33B a GPT-4 killer. It held that position until October 2023, when WizardLM-70B finally pushed it aside.
This era also spawned a small ecosystem. StableVicuna merged Vicuna with RLHF techniques from Stability AI. The 16K context version of Vicuna-13B-v1.5 addressed one of the original model’s biggest practical limitations. Meanwhile, the FastChat framework — the serving infrastructure Chiang and Zheng built around Vicuna — became the backbone of Chatbot Arena itself and remains in use today.
WizardLM and the Art of Synthetic Data
WizardLM took a different approach to the problem. Where Vicuna mined real human-ChatGPT conversations and Guanaco used human-written OASST1 conversations, WizardLM’s team asked a more provocative question: what if you let the LLM generate its own training instructions?
Their answer was Evol-Instruct, a pipeline that uses GPT-4 to take simple seed instructions and evolve them — making them more complex, adding constraints, deepening the reasoning required, branching into harder variants. The generated dataset is entirely synthetic, but the complexity gradients baked into it turn out to matter enormously for training instruction-following models.
In October 2023, WizardLM-70B debuted on Chatbot Arena and immediately took the top open-source spot, displacing Vicuna-33B. The jump to 70B parameters (using LLaMA 2 as a base) helped, but so did the training methodology. Evol-Instruct produced richer, more varied instruction distributions than scraping ShareGPT could provide at scale. Among open source models of that era, WizardLM-70B represented the clearest evidence yet that synthetic data could rival human-curated alternatives.
Microsoft Research, which backed WizardLM, then extended the approach to coding (WizardCoder) and mathematics (WizardMath), both of which became influential models in their respective domains.
Then came April 2024, and things got strange. WizardLM-2 launched — three variants, including an 8x22B mixture-of-experts model built on Mixtral that reportedly matched GPT-4 on several benchmarks. The reception was enormous. And then, within days, Microsoft quietly pulled it from HuggingFace and GitHub. The stated reason involved ongoing safety testing. The community noticed a Reddit thread suggesting the models had failed toxicity tests at low severity levels. The HuggingFace weights disappeared. The GitHub repo went dark. WizardLM-2 has never been officially re-released.
It’s a strange end for a project that was genuinely ahead of its time. The Evol-Instruct methodology didn’t disappear with the models — it’s visible in the DNA of Microsoft’s Phi series and in the broader shift toward synthetic data generation that now defines how companies like Anthropic and Google approach post-training. But WizardLM as a living project appears to be over.
What These Four Models Actually Changed
It’s easy to look back at Vicuna, Guanaco, Vicuna-33B, and WizardLM-70B and see them as historical footnotes — open source models that scored well on a leaderboard before the real competition arrived. That reading undersells what happened.
These open source models established several things the industry now takes for granted. Chatbot Arena itself — which LMSYS spun into a standalone company that raised $17 million in 2025 and continues to operate as the de facto benchmark for conversational AI — grew directly out of Vicuna’s demo infrastructure. QLoRA became foundational tooling. Evol-Instruct seeded the synthetic data paradigm. FastChat became production serving infrastructure.
More broadly, they demonstrated that open source models could participate meaningfully in the same conversation as GPT-4 and Claude, not as toys but as legitimate alternatives. That proof of concept attracted the researchers, the funding, and the compute that eventually produced the models we’re talking about today.
In 2026, Claude Opus 4.6 and GPT-5.5 dominate the Arena leaderboard. The gap between frontier closed models and the best open source alternatives has narrowed dramatically, and in some specialised domains it’s closed entirely. You can trace a direct line from Wei-Lin Chiang scraping ShareGPT conversations and spending $300 on A100 time to the current moment where DeepSeek-R1 challenges OpenAI on reasoning benchmarks and Llama 3 runs comfortably on a MacBook Pro. The pioneers rarely get the credit. These four deserve it.

