OpenSCAD LLM Benchmark: Antigravity 2.0 Takes the Top Spot

  • The OpenSCAD LLM benchmark tasked six AI coding tools with building Rome’s Pantheon in parametric CAD code.
  • Antigravity 2.0 topped the OpenSCAD LLM benchmark, outperforming rivals from Cursor, Claude Code, and Codex Desktop.
  • ModelRift built the benchmark because LLM geometry quality directly determines what ships on its 3D platform.
  • Codex Desktop showed the clearest visual workflow, while Cursor had the fastest interaction loop of any tested client.

Why an OpenSCAD LLM Benchmark Built Around the Pantheon?

The OpenSCAD LLM benchmark that ModelRift published this week isn’t a lab curiosity. It is a production concern formalized as a test. The company generates OpenSCAD code for every 3D model on its platform, so the geometric judgment of the underlying language model directly affects what ships. Bad spatial reasoning doesn’t just produce ugly previews; it breaks the whole pipeline. ModelRift therefore decided to formalize what it was already tracking informally, and the result is one of the more practically motivated AI benchmarks we’ve seen in the 3D space.

The choice of the Pantheon as the test subject is cleverer than it first appears. It would have been easy to pick something trivial, such as a box with a hole punched through it or a basic threaded bolt, but that kind of prompt mostly checks whether a model has memorized OpenSCAD’s difference(), cube(), and cylinder() primitives. Every serious coding model passes that test easily. The Pantheon sits in a more interesting middle zone. It has radial symmetry in the rotunda and dome, a central oculus that requires a clean Boolean subtraction, 28 repeated columns that benefit from loop logic, a rectangular portico attached to a circular drum, and a triangular pediment up front. That’s a genuine mix of constructive geometry challenges: not impossible, but demanding enough that the differences between models become visible fast, which is why the benchmark uses it as a reference structure rather than a simpler shape.

It’s also, crucially, recognizable. A weak result still looks vaguely like a domed classical building. A strong result has to get the proportional relationship between the drum, the dome rings, the portico, and the front facade roughly right. That gives the OpenSCAD LLM benchmark a built-in qualitative gut-check that a purely numeric score can’t fully capture.

Why OpenSCAD — and Not Blender or Another Tool?

The tool choice isn’t obvious to everyone, so it is worth a moment. OpenSCAD is a script-based CAD tool in which geometry is defined entirely in plain-text code: no mouse dragging, no hidden scene state, no GUI operations to simulate. You describe shapes as nested transformations and Boolean operations, and the renderer turns that into a mesh. That text-first structure makes it a natural fit for language models, and it is the same property that makes an OpenSCAD LLM benchmark a meaningful signal of geometric reasoning ability.

When an LLM needs to place 28 columns evenly around a circle, it can write a for loop around a rotate() and a cylinder() and be done. When it needs to carve an oculus into a dome, it writes a difference() block. The geometry lives in the source file as readable, inspectable, version-controllable code. If a column spacing comes out wrong, the fix is a parameter change, not hunting through a Blender scene graph for a hidden object offset.
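The loop-and-subtract pattern described above can be sketched as a small Python helper that writes out the OpenSCAD source. This is only an illustration, not code from ModelRift or from any benchmarked model: the function names, parameter names, and every dimension are hypothetical placeholders, and 28 is simply the column count quoted in this article.

```python
def column_angles(columns):
    """Angles in degrees for evenly spaced columns around a circle."""
    return [i * 360 / columns for i in range(columns)]


def rotunda_scad(columns=28, ring_radius=20.0, column_radius=1.0,
                 column_height=12.0, dome_radius=22.0, wall=1.5,
                 oculus_radius=4.0):
    """Return OpenSCAD source: a ring of columns under a dome with an oculus.

    All dimensions are illustrative, not measurements of the Pantheon.
    """
    return f"""\
// Hypothetical parameters; changing one value re-spaces or re-sizes the model.
columns = {columns};
ring_radius = {ring_radius};
column_radius = {column_radius};
column_height = {column_height};
dome_radius = {dome_radius};
wall = {wall};
oculus_radius = {oculus_radius};

// Evenly spaced columns: one rotate() per loop step.
for (i = [0 : columns - 1])
    rotate([0, 0, i * 360 / columns])
        translate([ring_radius, 0, 0])
            cylinder(h = column_height, r = column_radius);

// Dome shell with the oculus removed by difference().
translate([0, 0, column_height])
    difference() {{
        // Solid upper hemisphere.
        intersection() {{
            sphere(r = dome_radius);
            translate([0, 0, dome_radius / 2])
                cube([2 * dome_radius, 2 * dome_radius, dome_radius], center = true);
        }}
        // Hollow interior.
        sphere(r = dome_radius - wall);
        // Oculus: a vertical cylinder through the crown.
        cylinder(h = 2 * dome_radius, r = oculus_radius, center = true);
    }}
"""


if __name__ == "__main__":
    print(rotunda_scad())
```

Because the column count is a single parameter, correcting the spacing means changing one number and regenerating the file, which is the kind of fix the paragraph above has in mind.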

ModelRift contrasted this with what it calls the “Blender MCP” approach: using Model Context Protocol (MCP) tool calls to drive a 3D application through its API. That works for some workflows, but it introduces a layer of indirection that compounds quickly for CAD-like tasks. The agent has to translate architectural intent into a sequence of application operations, keep track of the accumulated scene state, and hope nothing silently drifts between commands. OpenSCAD collapses all of that: the file is the scene. For the parametric, print-ready geometry that ModelRift cares about, where output eventually goes to STL or 3MF for 3D printing, that directness matters enormously. It’s also why an OpenSCAD LLM benchmark is a more reliable differentiator than benchmarks built on GUI-driven tools.

Antigravity 2.0 Wins the OpenSCAD LLM Benchmark

Six AI systems ran the same prompt: look at two reference images of the Pantheon (a front facade shot and an aerial view), produce a .scad file implementing the building, use the OpenSCAD CLI to render PNG previews, and iterate until satisfied. Every tested system had access to a locally installed OpenSCAD binary on a Mac, available on PATH — and every one of them used it successfully. Tool access wasn’t the bottleneck. Geometric judgment was.
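The render step in that loop might look roughly like the sketch below. It assumes an `openscad` executable on PATH, as in the benchmark setup, and uses the CLI’s `-o`, `--imgsize`, and `-D` options. The helper names and file names are hypothetical, and the exact commands each agent ran are not given in the source.

```python
import shutil
import subprocess


def openscad_command(scad_path, out_path, defines=None, image_size=None):
    """Build an OpenSCAD CLI invocation as an argument list."""
    cmd = ["openscad", "-o", out_path]
    if image_size is not None:
        width, height = image_size
        cmd.append(f"--imgsize={width},{height}")
    # -D overrides a top-level variable in the .scad file for this run.
    for name, value in (defines or {}).items():
        cmd += ["-D", f"{name}={value}"]
    cmd.append(scad_path)
    return cmd


def render(scad_path, out_path, **kwargs):
    """Run OpenSCAD if available; return (exit code, stderr), else None."""
    if shutil.which("openscad") is None:
        return None
    result = subprocess.run(
        openscad_command(scad_path, out_path, **kwargs),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Hypothetical file names for a preview render.
    print(render("pantheon.scad", "front.png", image_size=(800, 600)))
```

OpenSCAD picks the output format from the file extension, so the same helper could in principle request an STL export instead of a PNG preview; whether 3MF is available depends on how the local binary was built.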

Antigravity 2.0 came out on top. ModelRift is careful to note that the scores are relative to this specific OpenSCAD LLM benchmark only — they’re not general model capability rankings — and that even the winning result isn’t close to a perfect Pantheon reproduction. The quality scores were deliberately conservative. That’s the right call. What the benchmark is actually measuring is the gap between models on a specific class of task: translating visual architectural reference into structured parametric code with iterative self-correction.

The runner-up entries from Cursor, Claude Code CLI, and Codex Desktop were all recognizable buildings, but they differed meaningfully in how well they handled the dome-to-portico relationship, column distribution, and mesh cleanliness on export. One recurring issue across multiple runs was what ModelRift describes as a problematic roof and entablature export: the geometry looked acceptable in preview but didn’t export into a clean final mesh. After the OpenSCAD LLM benchmark was published, Codex made a follow-up attempt to diagnose and fix that issue, but ModelRift excluded it from the official comparison to keep the results consistent across all tested systems.
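One way to make “clean final mesh” concrete is an edge-sharing test on the exported triangles. The sketch below is not ModelRift’s check and not part of OpenSCAD. It is a generic, minimal test that assumes the mesh has already been loaded as triples of vertex indices, and it covers only one necessary condition: every edge must belong to exactly two triangles. It does not detect self-intersections or flipped faces.

```python
from collections import Counter


def is_watertight(triangles):
    """Return True if every undirected edge is shared by exactly two triangles.

    `triangles` is a list of 3-tuples of vertex indices.
    """
    edges = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store each edge with sorted endpoints so direction is ignored.
            edges[(min(u, v), max(u, v))] += 1
    return bool(edges) and all(count == 2 for count in edges.values())


if __name__ == "__main__":
    tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
    print(is_watertight(tetrahedron))       # closed surface
    print(is_watertight(tetrahedron[:-1]))  # one face missing
```

A roof that looks fine in a preview but has a missing or doubled face would fail this test, which is one plausible way a preview and an export can disagree.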

The Client Workflow Matters as Much as the Model

One of the more interesting asides in ModelRift’s writeup is the observation that the client, meaning the interface through which you interact with the model, had almost as much influence on the experience as the underlying model itself. That point is underweighted in most AI tooling discussions, where the model gets all the credit or blame. Teams interpreting any OpenSCAD LLM benchmark result should keep it in mind: the same underlying model can perform noticeably differently depending on which client surfaces its outputs.

Codex Desktop, for instance, displayed the reference images that the model had loaded directly inside the conversation thread, alongside the OpenSCAD file edits and the rendered PNG previews. For visual CAD work, that’s genuinely useful. You can see at a glance whether the agent is actually working from the reference you intended, rather than having to trust that the image made it into context. Cursor Agent and Claude Code CLI were described as workable but less transparent about visual context.

Cursor’s advantage was speed. It had the fastest interaction loop of any tested client, and its UI showed a plan and the generated OpenSCAD code side by side, a layout that makes it easier to catch reasoning errors before they propagate into a broken render. When you’re doing iterative geometry work where each render cycle takes a few seconds, that interaction speed compounds quickly over a full session.

The takeaway here isn’t that one client is universally better. It’s that the tooling layer is a meaningful variable in AI-assisted CAD workflows, and teams building on top of these models should be thinking carefully about which client surfaces the right information at the right time — not just which model scores highest on a leaderboard.

What This Tells Us About AI-Generated 3D Geometry

Benchmarks like this one are valuable precisely because they’re grounded in a real production use case rather than constructed to flatter any particular model. ModelRift isn’t running this OpenSCAD LLM benchmark to generate a press release — it’s running it because the results determine engineering decisions about which models to route geometry tasks through.

The broader picture here is that AI-generated 3D geometry is maturing in a very uneven way. Text-to-code approaches like OpenSCAD generation are advancing fast, because language models are already good at structured code and the feedback loop — render, inspect, adjust — is tight and machine-readable. Direct mesh generation from text prompts, by contrast, still struggles badly with precision, parametric relationships, and printability. The gap between “looks roughly right in a viewport” and “exports a watertight mesh you can send to a printer” remains significant across most text-to-mesh systems. An OpenSCAD LLM benchmark is one of the cleaner ways to expose exactly where that gap sits for any given model.

That’s exactly the gap that a code-first approach like ModelRift’s is trying to close. And as more platforms build AI-assisted 3D generation around programmatic representations rather than raw mesh outputs, benchmarks like this one — practical, task-specific, and honest about the limitations of current results — will become more important tools for navigating a field that’s still finding its footing.

Source: https://modelrift.com/blog/openscad-llm-benchmark/

Sara Ali Emad
I’m Sara Ali Emad. I have a strong interest in both science and the art of writing, and I find creative expression to be a meaningful way to explore new perspectives. Beyond academics, I enjoy reading and crafting pieces that reflect curiosity, thoughtfulness, and a genuine appreciation for learning.