An opinionated LLM timeline for code generation
-
November 2022 - GPT-3.5
Wow effect! I can write to a computer, and it understands and answers. GPT-3.5 feels like Wikipedia on steroids. However, the disappointment came quickly for coding: too many errors and too many issues following instructions when going beyond programming 101.
-
March 2023 - GPT-4
Now it seriously works for coding. It still hallucinates some libraries, but it's perfect for writing code on a small scope: a function, a test, a minor refactoring. Between May and June 2023, with AI as my programming buddy, I developed a complete website for my climbing club in 8 weeks (4-5 hours per weekend) with online registration, course subscription, health attestation, vouchers, and bank card payment.
-
March 2024 - Claude 3, then June 2024 - Claude 3.5 Sonnet
After 18 months of domination, OpenAI loses the vibe coding battle to Anthropic's Claude models. It's difficult to explain the subtle differences, but Claude was just better at code. This is also when I switched from a monthly ChatGPT subscription to pay-as-you-go API credits, thanks to LibreChat. My monthly AI costs dropped from $20 to $5 or $6 on average.
-
February 2025 - Claude Code
Switching from copy-pasting into my IDE to watching Claude Code search through a large codebase, update code, and commit was exhilarating… but also insanely expensive. A handful of prompts quickly reached the $5 range, which was my previous monthly spend. What? The amount of data sent into the context window was just too much, and I had no control over it. Plus, it sometimes got stuck in error loops from which it couldn't recover. I didn't use it much beyond a few experiments.
-
June 2025 - Gemini CLI
Since the beginning of the LLM hype, I didn't care much about Google's Gemini range of products. First, the initial branding (Google Bard) was a huge mistake. As a native French speaker, I couldn't relate "bard" to anything other than Assurancetourix, the bard and punching bag of Asterix's Gallic village. I have changed my mind since I started using Gemini CLI a week ago. It's open-source (Claude Code is closed-source), it's fast, and the code is really good. So far, the free plan is very generous, so I'm not anxious to run /stats to check my token usage like with Claude Code.
Where are we now?
The last 32 months since the release of GPT-3.5 have been insane. The competition between LLM providers has delivered incredible improvements every 4 to 6 months. In my latest experiment with Gemini CLI, I added a new feature (with tests) by just writing a product specification and guiding the agent a little. It was flawless, and I didn't have to get my hands dirty fixing code. For code generation, guided by a human software architect, LLMs are an awesome tool.
What are their limitations so far?
The planning of tasks is getting better and better with the agentic approach of Claude Code and Gemini CLI. It seems obvious it will continue to improve with increasingly fine-tuned agentic coding workflows.
In my view, aesthetics is the main issue with LLMs: I think they s*ck at generating beautiful HTML/CSS design. The HTML is syntactically perfect, and so is the CSS. But they're really bad at delivering an eye-appealing website. I don't know if I'm missing some prompting techniques or pro tips to steer them in the right direction. Or is it just the downside of emotionless machine generation? I'm sure it will get better as multimodal LLMs (including vision) become able to render, self-analyze, and improve in an iterative feedback loop. I hope so.
And you, what do you think about LLMs? What will we see next year?