Here's a post claiming that only Gemini 2.5 Flash can create a Galton Board, and that no other thinking model can. This intuitively sounds wrong to me. What's worse is that they then go and provide a big-ass prompt with waaaay too much yap.
Gemini 2.5 Flash demolishes my Galton Board test, I could not get 4omini, 4o mini high, or 03 to produce this. I found that Gemini 2.5 Flash understands my intents almost instantly, code produced is tight and neat. The prompt is a merging of various steps. It took me 5 steps to… pic.twitter.com/DN8IKXBn54
— RameshR (@rezmeram) April 17, 2025
That's not the future we're aiming for, so I ran some tests across models. I want a reasonably good starting point with a very simple prompt:
Can you write a physically accurate Galton Board simulation in a single-page self-contained html+js?
That's all. For some models, I was curious and tried a couple of follow-up prompts that I'll share later. But first, here are all the models I tried with that simple single-sentence prompt.
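For calibration, here's roughly what the prompt is asking for, as a minimal hand-written sketch of my own (not any model's output; all constants are illustrative and untuned, and there's no sub-stepping, so fast balls can tunnel): balls fall under gravity, reflect off a triangular peg grid with some restitution, and get tallied into bins at the bottom.

```html
<!DOCTYPE html>
<html>
<body>
<canvas id="board" width="400" height="600"></canvas>
<script>
const canvas = document.getElementById("board");
const ctx = canvas.getContext("2d");
const GRAVITY = 0.15;      // px/frame^2, illustrative
const RESTITUTION = 0.4;   // fraction of speed kept after a bounce
const PEG_R = 4, BALL_R = 3, ROWS = 10, SPACING = 30, TOP = 80;

// Triangular peg grid: row r has r+1 pegs, centered on the canvas.
const pegs = [];
for (let r = 0; r < ROWS; r++)
  for (let c = 0; c <= r; c++)
    pegs.push({ x: canvas.width / 2 + (c - r / 2) * SPACING,
                y: TOP + r * SPACING });

const balls = [];
const bins = new Array(ROWS + 1).fill(0);

function step(b) {
  b.vy += GRAVITY;
  b.x += b.vx;
  b.y += b.vy;
  for (const p of pegs) {
    const dx = b.x - p.x, dy = b.y - p.y;
    const dist = Math.hypot(dx, dy), minDist = PEG_R + BALL_R;
    if (dist > 0 && dist < minDist) {
      // Push the ball out of the peg, then reflect its velocity about
      // the collision normal, damped by the restitution coefficient.
      const nx = dx / dist, ny = dy / dist;
      b.x = p.x + nx * minDist;
      b.y = p.y + ny * minDist;
      const dot = b.vx * nx + b.vy * ny;
      b.vx = (b.vx - 2 * dot * nx) * RESTITUTION;
      b.vy = (b.vy - 2 * dot * ny) * RESTITUTION;
    }
  }
}

function frame() {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  if (Math.random() < 0.1)  // drop a new ball with a tiny random offset
    balls.push({ x: canvas.width / 2 + Math.random() - 0.5, y: 10, vx: 0, vy: 0 });
  for (let i = balls.length - 1; i >= 0; i--) {
    const b = balls[i];
    step(b);
    if (b.y > canvas.height - 20) {
      // Ball reached the floor: tally it into the nearest bin.
      const bin = Math.round((b.x - canvas.width / 2) / SPACING + ROWS / 2);
      bins[Math.max(0, Math.min(ROWS, bin))]++;
      balls.splice(i, 1);
      continue;
    }
    ctx.beginPath(); ctx.arc(b.x, b.y, BALL_R, 0, 2 * Math.PI); ctx.fill();
  }
  for (const p of pegs) {
    ctx.beginPath(); ctx.arc(p.x, p.y, PEG_R, 0, 2 * Math.PI); ctx.fill();
  }
  bins.forEach((n, i) => {  // histogram bars under the pegs
    const x = canvas.width / 2 + (i - ROWS / 2) * SPACING;
    ctx.fillRect(x - SPACING / 4, canvas.height - 20 - n, SPACING / 2, n);
  });
  requestAnimationFrame(frame);
}
frame();
</script>
</body>
</html>
```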
I'll first show the results of each "main" coding model, and then a few more variants of each provider. Make sure to scroll to the end, where I give a few take-aways.
Click on the videos to go to the interactive html page, exactly as output by the LLM:
Main models
ChatGPT o3
Gemini 2.5 Pro
Claude Sonnet 3.7
Grok3 (with thinking)
Extras
Gemini 2.5 Flash
ChatGPT o4-mini (high)
ChatGPT o4-mini
Grok3 (no thinking)
ChatGPT 4o
Take-aways
- All models generate perfectly valid html and js code.
  - This was not true a year ago! They would often write code containing big errors.
- Most of the main models generate reasonable but imperfect physics and logic.
- Don't trust shitfluencers. Most models can do this task just fine without the need for a huge-ass prompt.
- Only Sonnet 3.7 had default parameters at "enjoyable" physics values; all others start off way too bouncy (the RESTITUTION constant in my sketch above is the relevant knob).
- Sonnet 3.7 unfortunately miscalculated its viewport. If you set peg rows to 6 and reset it, you'll see that it actually implemented histogram bins too; they are just outside the viewport by default (see the layout sketch right after this list).
- Even non-thinking models come close now.
- Grok3 did better without thinking than with; it's actually pretty hit-or-miss.
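To make that viewport bug concrete, here's the arithmetic that has to hold. This is a hypothetical sketch of mine (names and constants are made up, not Sonnet's actual code); the failure mode is sizing the canvas for the pegs alone and forgetting the bin section below them.

```js
// Hypothetical layout arithmetic: every vertical section of the board
// must be counted when sizing the canvas.
const FUNNEL_H = 80, PEG_SPACING = 30, BIN_H = 150, MARGIN = 20;

function boardHeight(pegRows) {
  // Drop BIN_H from this sum and the histogram is laid out below the
  // visible viewport: exactly the symptom described above.
  return MARGIN + FUNNEL_H + pegRows * PEG_SPACING + BIN_H + MARGIN;
}

const canvas = document.querySelector("canvas");
canvas.height = boardHeight(12);  // recompute whenever the row count changes
```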
What I have not explored enough is the steerability of each model: how well they react to and integrate feedback, and how well they let me iterate. Do the models mostly stick to their original output and require excruciatingly detailed instructions in follow-ups, or can they run with vague feedback like we'd give people? "yeah but make it prettier" or "sure, but in a triangle, not a square", etc.
I feel like each of these is missing a lot of the polish I'd want for anything I publish, though it may be fine for personal exploration of the topic. So as of right now, I might use them to bootstrap the initial code, but pretty soon I'd dive in myself.
Some more fun
ChatGPT o3 follow-up request
Looking at real Galton board pictures, I see that it's indeed a triangle of pegs (unlike OP's rectangle), but also that there are walls close to the border. So I asked o3 to add that with this follow-up message:
ok, this is almost great. But usually Galton board has an oblique wall just outside the pins, so that there's not tons of empty space around them. Can you add that?
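For reference, here's what that request amounts to geometrically, as a sketch of my own (not o3's code; constants are illustrative): two line-segment walls hugging the outer edges of the peg triangle, with balls reflected off them the same way they reflect off the pegs.

```js
// Two oblique walls just outside the peg triangle, so balls can't
// drift into the empty corners. Layout constants mirror my sketch
// above; `pad` is how far outside the pegs the walls sit.
const cx = 200, top = 80, rows = 10, spacing = 30, pad = 12;
const walls = [
  // left edge of the triangle, offset outward
  { x1: cx - pad, y1: top,
    x2: cx - (rows / 2) * spacing - pad, y2: top + rows * spacing },
  // mirrored right edge
  { x1: cx + pad, y1: top,
    x2: cx + (rows / 2) * spacing + pad, y2: top + rows * spacing },
];

function collideWall(ball, w, radius, restitution) {
  // Closest point on the wall segment to the ball's center.
  const dx = w.x2 - w.x1, dy = w.y2 - w.y1;
  const t = Math.max(0, Math.min(1,
    ((ball.x - w.x1) * dx + (ball.y - w.y1) * dy) / (dx * dx + dy * dy)));
  const px = w.x1 + t * dx, py = w.y1 + t * dy;
  const nx = ball.x - px, ny = ball.y - py;
  const dist = Math.hypot(nx, ny);
  if (dist === 0 || dist >= radius) return;  // no contact
  // Push the ball out along the normal and reflect its velocity,
  // but only if it's actually moving into the wall.
  const ux = nx / dist, uy = ny / dist;
  ball.x = px + ux * radius;
  ball.y = py + uy * radius;
  const dot = ball.vx * ux + ball.vy * uy;
  if (dot < 0) {
    ball.vx = (ball.vx - 2 * dot * ux) * restitution;
    ball.vy = (ball.vy - 2 * dot * uy) * restitution;
  }
}
```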
ChatGPT o3 with the long-ass prompt
I simply copy-pasted the full prompt from the tweet, without any changes.
Iterating with ChatGPT 4o
I also wanted to see how far we can get a non-reasoning model by providing feedback. Turns out we can get it quite far, but it's a bit tedious: I had to give 4o eight rounds of feedback until it finally had all the pieces I wanted.
Of course, this is still far from the pretty demo one would want for a blog post or similar.
Sources
Here are links to my chats with each of these, both for proof, and for reference:
- ChatGPT o3
- Gemini 2.5 Pro
- Claude Sonnet 3.7: It seems that I can't share chats with Claude? But here's the artifact.
- Grok3 (with thinking)
- Gemini 2.5 Flash
- ChatGPT o4-mini (high)
- ChatGPT o4-mini
- Grok3 (no thinking)
- ChatGPT 4o