On the same day Claude Opus 4.6 launched, OpenAI released GPT-5.3 Codex about ten minutes later. They were clearly waiting to drop it right after and pull attention away from Anthropic. OpenAI is playing hardball, even though Anthropic already had those big Super Bowl-style commercials out first. It feels like a direct clapback.
We’re going to look at this new model now.
Quick Look at the Official Announcement
We won’t take much time on the announcement because they always call it the greatest thing ever.
Just a couple of interesting points.
They said GPT-5.3 Codex is their first model that was instrumental in creating itself.
Early versions were used to debug its own training, manage development, and diagnose test results and evaluations. The team was blown away by how much it accelerated its own development.
Any company using its own tool heavily could say something similar today.
So this feels like a pretty empty statement.
But if they ever say the model built a clearly better version of itself with zero human input — that’s real singularity territory.
Maybe we’re closer than we think if it can already meaningfully improve itself.
That part feels a bit scary.
Interesting Graph on Token Efficiency
One graph shows how many output tokens it needs for a certain SWE-bench score.
Previously it needed around 100,000 output tokens in extra high mode.
Now it's down to roughly 40,000–50,000.
So I’m thinking of running today’s tests in extra high mode for the strongest results.
But the graph shows only about a 1% score difference between high and extra high, while high runs almost twice as fast.
Not sure what to choose.
Should I always run extra high even if it takes hours sometimes?
Claude finishes in 20 minutes max.
Still, most people will probably use high mode because the quality drop between high and medium feels noticeable.
Terminal Bench Score Jump
Terminal bench result looks strong.
Opus scored around 64–65% this morning, only about 1% better than the previous GPT-5.2 Codex.
Now GPT-5.3 Codex hits 77%.
Very good number on this benchmark.
Maybe they trained specifically for it; we don't know.
But impressive anyway.
Their Demo Games and Claims
They showed a couple of playable test games. I tried both; nothing amazing. One is a simple Mario Kart-style racer. It plays about how the preview looks. Still, it's cool that it made that in probably one prompt. The other is Dave the Diver style (they called it a diving game): click fish to collect them. Not a great demo, but okay.
They also showed better UI design. The new version makes noticeably cleaner, more polished websites. The rest is standard, except they claim much better computer-use ability, which will be interesting.
Time to Run Our Own Tests
Now we test it ourselves. Using the same three prompts from the Opus test. These have become quite easy now, so this model should match or beat Opus.
Prompt 1 – Evolution Simulator with Neural Networks
Creates creatures that survive by evolving simple neural networks to find food.
They learn only through reproduction: offspring brains get small random mutations.
The process repeats and the networks improve over time.
Usually the models give the creatures raycasts to sense the distance to food.
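To make that concrete, here's a rough sketch of the mutation-only neuroevolution loop the prompt is asking for. Everything in it (the net size, the mutation numbers, the names) is my own illustration, not code from either model:

```ts
// Sketch only: a tiny feedforward brain, raycast distances in, movement out.
// "Learning" is nothing but copying the parent's weights with a little noise.
type Brain = { w1: number[][]; w2: number[][] }; // input->hidden and hidden->output weights

const rand = (scale: number) => (Math.random() * 2 - 1) * scale;

function makeBrain(inputs: number, hidden: number, outputs: number): Brain {
  return {
    w1: Array.from({ length: hidden }, () => Array.from({ length: inputs }, () => rand(1))),
    w2: Array.from({ length: outputs }, () => Array.from({ length: hidden }, () => rand(1))),
  };
}

// Forward pass: raycast distances to food go in, steering/thrust values come out.
function think(brain: Brain, rayDistances: number[]): number[] {
  const hidden = brain.w1.map(row => Math.tanh(row.reduce((s, w, i) => s + w * rayDistances[i], 0)));
  return brain.w2.map(row => Math.tanh(row.reduce((s, w, i) => s + w * hidden[i], 0)));
}

// The only learning mechanism: offspring inherit slightly mutated copies of the parent's weights.
function mutateBrain(parent: Brain, rate = 0.1, strength = 0.3): Brain {
  const jitter = (m: number[][]) =>
    m.map(row => row.map(w => (Math.random() < rate ? w + rand(strength) : w)));
  return { w1: jitter(parent.w1), w2: jitter(parent.w2) };
}
```

Creatures that find food survive long enough to reproduce, so over many generations the mutated weights drift toward brains that actually steer at food.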
Prompt 2 – Transcription App
Builds an installable Electron transcription app (like Whisper Flow / Super Whisper).
Something I can share with anyone.
I’m using my own Python server for the transcription model, so not fully standalone.
But the app handles everything else.
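For a sense of how little the app actually has to do on its side, here's a sketch of the record-and-send flow. The endpoint URL, form field, and response shape are placeholders I made up for illustration, not what my server actually exposes:

```ts
// Sketch of the renderer-side flow in an Electron app: capture audio, post it to a
// local transcription server, return the text. Endpoint details are assumptions.
async function recordAndTranscribe(durationMs: number): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = e => chunks.push(e.data);

  const stopped = new Promise<void>(resolve => (recorder.onstop = () => resolve()));
  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;
  stream.getTracks().forEach(t => t.stop()); // release the microphone

  const body = new FormData();
  body.append("audio", new Blob(chunks, { type: "audio/webm" }), "clip.webm");

  const res = await fetch("http://localhost:8000/transcribe", { method: "POST", body });
  const { text } = await res.json();
  return text;
}
```

In the real app the recording is toggled by a key rather than a fixed timer, but the shape of the flow is the same.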
Prompt 3 – 3D Vampire Survivors / Mega Bonk Style Game
Create a Vampire Survivors or Mega Bonk style game, usually web-based. I want 3D. So far every model makes a top-down, 2D-feeling game even when it's technically 3D. I want the Mega Bonk feel, not the Vampire Survivors camera.
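To be concrete about the camera complaint: what I want is a chase camera that sits behind and above the player and follows where the player is facing, not a fixed overhead view. Something in the spirit of this Three.js sketch, where the offsets and smoothing values are arbitrary and none of it comes from the generated games:

```ts
import * as THREE from "three";

// Chase camera sketch: hover above and behind the player and track their facing direction.
function updateChaseCamera(camera: THREE.PerspectiveCamera, player: THREE.Object3D, dt: number) {
  const offset = new THREE.Vector3(0, 3, -6); // behind and above, assuming the player model faces +Z
  const desired = offset.applyQuaternion(player.quaternion).add(player.position);
  camera.position.lerp(desired, 1 - Math.exp(-8 * dt)); // frame-rate independent smoothing
  camera.lookAt(player.position.clone().add(new THREE.Vector3(0, 1.5, 0))); // aim at roughly head height
}
```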
Starting the Tests
Kicking them off now. GPT-5.3 Codex selected, extra high mode. Running do @prompt.txt for all three. They're starting. Usually the evolution simulator finishes first.
Test Results Are In
All three finished surprisingly fast: under 10 minutes total even on extra high. Last time, GPT-5.2 took 40–60 minutes on the same prompts. Big speed improvement.
Evolution Simulator Result
Nothing shows on the map at first. The buttons don't work. The console shows a Three.js issue. Already weaker than Opus (which worked first try).
Transcription App Result
npm run start → right control to record.
Got error immediately.
Pasted console error back.
After that it worked well.
Evolution Simulator – Second Attempt
Very similar to Opus's output: too many creatures at first.
Lowered food spawn rate → population behaves nicely.
Live network update is cool.
But movement feels wrong: creatures slide around without turning to face the direction they're moving.
It looks unnatural.
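My guess is the meshes are being translated without ever being rotated. A hedged sketch of the usual fix (not anything from the model's code): turn each creature toward its velocity every frame.

```ts
import * as THREE from "three";

// Turn a creature to face its direction of travel (assumes Y-up and a model whose forward is +Z).
function faceVelocity(creature: THREE.Object3D, velocity: THREE.Vector3, dt: number) {
  if (velocity.lengthSq() < 1e-6) return; // don't spin while standing still
  const yaw = Math.atan2(velocity.x, velocity.z); // heading in the XZ plane
  const target = new THREE.Quaternion().setFromEuler(new THREE.Euler(0, yaw, 0));
  creature.quaternion.slerp(target, 1 - Math.exp(-10 * dt)); // smooth, frame-rate independent turn
}
```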
Very similar to Opus — same border style, creature look, network visualization.
I checked the thinking trace and asked it directly; it said it only read the prompt.
So the models are just thinking very similarly.
Side by side: Opus looks more polished.
More natural movement, more controls, better graphs (best creature, average energy etc.).
I’d give Opus 8–9, this one ~7.
Opus wins this round.
Transcription App – Second Attempt
After feedback it worked flawlessly.
Nice recording/processing indicator in bottom right.
Opus had zero errors on the first try, but this one was fixed fast and added a clean detail.
Pretty much a tie.
Game Result – Neon Relic Survivors
Named it Neon Relic Survivors.
Picked Vanguard.
Movement and mouse look are nice.
Camera lags a bit.
Mouse capture for looking around is actually good.
No visual health bar, just a number at the top.
Poor UI choice.
But this is the first properly working 3D game from any model — camera and controls feel correct.
Still, UI is weak compared to Opus.
Opus had a clear gold display, visual health/XP bars, a mini-map, and an item display: much more polished even though it was top-down.
Opus wins this round too.
Final Score on the Three Tests
- Evolution simulator → Opus cleaner, with better movement, controls, and graphs. GPT-5.3 Codex decent but with quirks. Opus wins.
- Transcription app → Opus perfect on the first try. GPT-5.3 Codex needed one fix but added a nice UI detail. Tie.
- Roguelike game → Opus superior UI and polish. GPT-5.3 Codex got real 3D first, but the presentation was weaker. Opus wins.
