I put GPT-4o through my coding tests and it aced them - except for one weird result

April 17, 2025

If you've been keeping up with the tech world, you're likely aware that OpenAI has just dropped its latest large language model, GPT-4o, where the "o" signifies "omni." This new model promises versatility across text, graphics, and voice, and I couldn't wait to put it through its paces with my standard set of coding tests. These tests have been run against a wide array of AI models, yielding some pretty fascinating results. Stick with me until the end because there's a twist you won't want to miss.

If you're interested in conducting your own experiments, check out this guide: How I test an AI chatbot's coding ability - and you can too. It outlines all the tests I use, along with detailed explanations of how they work and what to look for in the outcomes.

Now, let's dive into the results of each test and see how GPT-4o stacks up against previous contenders like Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and the earlier versions of ChatGPT.

1. Writing a WordPress Plugin

Here's a glimpse of the user interface that GPT-4o's plugin produced:

Interestingly, GPT-4o took the liberty of including a JavaScript file, which dynamically updates the line count in both fields. While the prompt didn't explicitly rule out JavaScript, this creative approach was unexpected and effective. The JavaScript also enhances the Randomize button's functionality, allowing for multiple result sets without a full page refresh.

The lines were arranged correctly, and duplicates were appropriately separated according to the specifications. It's a solid piece of code, with just one minor quibble: the Randomize button wasn't placed on its own line, though I hadn't specified that in the prompt, so no points off for that.
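
To give a feel for what "duplicates appropriately separated" involves, here is a minimal Python sketch. It is not the PHP and JavaScript that GPT-4o produced, and it assumes that "separated" means identical lines never end up next to each other. The function name randomize_lines and the sample input are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def randomize_lines(text, rng=None):
    """Return the non-blank lines of `text` in random order,
    keeping identical lines apart whenever that is possible.

    Assumption: "separated" means no two equal lines are adjacent.
    """
    if rng is None:
        rng = random.Random()
    counts = Counter(ln for ln in text.splitlines() if ln.strip())
    result = []
    while counts:
        # Only consider lines that differ from the one just placed.
        candidates = [ln for ln in counts if not result or ln != result[-1]]
        if not candidates:
            # One line dominates the input, so adjacency is unavoidable.
            candidates = list(counts)
        # Place the most frequent remaining line first; break ties randomly.
        top = max(counts[ln] for ln in candidates)
        choice = rng.choice([ln for ln in candidates if counts[ln] == top])
        result.append(choice)
        counts[choice] -= 1
        if counts[choice] == 0:
            del counts[choice]
    return result


if __name__ == "__main__":
    sample = "Ann\nBob\nAnn\nCy\nBob\nAnn"  # hypothetical input
    print("\n".join(randomize_lines(sample)))
```

When every line is unique, this reduces to a plain shuffle. The "most frequent first" rule only matters when duplicates exist, which is where a naive shuffle tends to leave repeats side by side.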

Here are the aggregate results for this and previous tests:

  • ChatGPT GPT-4o: Interface: good, functionality: good
  • Microsoft Copilot: Interface: adequate, functionality: fail
  • Meta AI: Interface: adequate, functionality: fail
  • Meta Code Llama: Complete failure
  • Google Gemini Advanced: Interface: good, functionality: fail
  • ChatGPT 4: Interface: good, functionality: good
  • ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a String Function

This test evaluates the model's ability to handle dollars and cents conversions. GPT-4o successfully rewrote the code to reject inputs that could cause issues with subsequent lines, ensuring only valid dollar and cent values are processed.

I was a bit disappointed that it didn't automatically add a leading zero to values like .75, converting them to 0.75. However, since I didn't explicitly request this feature, it's not a fault of the AI. It's a reminder that even when an AI delivers functional code, you might need to refine the prompt to get exactly what you need.
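
The kind of check described here can be sketched in a few lines of Python. This is not the code from the test; the function name normalize_amount is hypothetical, and the rules it enforces (digits only, at most two cent digits, cents padded to two places) are assumptions made for illustration. It also adds the leading zero that GPT-4o left out.

```python
import re

# Assumed format: optional dollar digits, then optionally a dot
# followed by one or two cent digits.
_AMOUNT = re.compile(r"(\d*)(?:\.(\d{1,2}))?")


def normalize_amount(raw):
    """Return a dollars-and-cents string such as '0.75', or None if invalid."""
    match = _AMOUNT.fullmatch(raw.strip())
    if match is None:
        return None
    dollars, cents = match.group(1), match.group(2)
    if not dollars and cents is None:
        return None  # empty input
    # A missing dollar part becomes 0, so '.75' turns into '0.75'.
    return f"{int(dollars or '0')}.{(cents or '').ljust(2, '0')}"


if __name__ == "__main__":
    for value in [".75", "12", "12.5", "1.234", "abc"]:
        print(repr(value), "->", normalize_amount(value))
```

Spelling out the leading-zero rule in code mirrors the point about prompts: a behavior that is not stated explicitly is a behavior you may not get.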

Here are the aggregate results for this and previous tests:

  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Succeeded
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

3. Finding an Annoying Bug

This test is intriguing because the solution isn't immediately apparent. I was initially stumped by this error during my own coding, so I turned to the first ChatGPT model for help. It found the error instantly, which was mind-blowing at the time.

By contrast, three of the other LLMs I tested missed the misdirection in this problem. The error message points to one part of the code, but the actual issue lies elsewhere, and identifying it requires deep knowledge of the WordPress framework.

Fortunately, GPT-4o correctly identified the problem and described the fix accurately.

Here are the aggregate results for this and previous tests:

  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
  • Meta AI: Succeeded
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

So far, GPT-4o is three for three. Let's see how it does with the final test.

4. Writing a Script

In response to this test, GPT-4o actually provided more than I asked for. The test involves using the obscure Mac scripting tool Keyboard Maestro, Apple's AppleScript, and Chrome scripting behavior. Keyboard Maestro, by the way, is a game-changer for me, making Macs my go-to for productivity due to its ability to reprogram the OS and applications.

To pass, the AI needs to correctly outline a solution using a combination of Keyboard Maestro code, AppleScript, and Chrome API functionality.

Surprisingly, GPT-4o gave me two different versions:

Both versions correctly interacted with Keyboard Maestro, but they differed in handling case sensitivity. The left version was incorrect because AppleScript doesn't support "as lowercase." The right version, which used "contains" and was case-insensitive, worked fine.

I'm giving GPT-4o a pass, albeit cautiously, because it did deliver working code. However, returning two options, one of which was incorrect, made me do extra work to evaluate and choose the right one. That could have been as time-consuming as writing the code myself.

Here are the aggregate results for this and previous tests:

  • ChatGPT GPT-4o: Succeeded, but with reservations
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Succeeded
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Failed

Overall Results

Here's how all the models fared across the four tests:

  • ChatGPT GPT-4o: 4 out of 4 succeeded, but with that one odd dual-choice answer
  • Microsoft Copilot: 0 out of 4 succeeded
  • Meta AI: 1 out of 4 succeeded
  • Meta Code Llama: 1 out of 4 succeeded
  • Google Gemini Advanced: 1 out of 4 succeeded
  • ChatGPT 4: 4 out of 4 succeeded
  • ChatGPT 3.5: 3 out of 4 succeeded
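
The totals above follow directly from the four per-test lists. As a quick cross-check, here is a small Python sketch that recomputes them. The pass/fail values are copied from the results reported in each section, and it assumes that "functionality: good" in the first test counts as a pass; the names RESULTS and tally are just illustrative.

```python
# One entry per test, in order: plugin, string function, bug hunt, script.
RESULTS = {
    "ChatGPT GPT-4o":         [True,  True,  True,  True],
    "Microsoft Copilot":      [False, False, False, False],
    "Meta AI":                [False, False, True,  False],
    "Meta Code Llama":        [False, True,  False, False],
    "Google Gemini Advanced": [False, False, False, True],
    "ChatGPT 4":              [True,  True,  True,  True],
    "ChatGPT 3.5":            [True,  True,  True,  False],
}


def tally(results):
    """Count how many tests each model passed."""
    return {model: sum(passes) for model, passes in results.items()}


if __name__ == "__main__":
    for model, score in tally(RESULTS).items():
        print(f"{model}: {score} out of 4 succeeded")
```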

Up until now, ChatGPT has been my go-to for coding assistance. It's always delivered (except when it hasn't). The other AIs mostly fell short in my tests. But GPT-4o threw me a curveball with that last dual-answer response. It made me question what's going on inside this model that could cause such a hiccup.

Despite this, GPT-4o remains the top performer in my coding tests, so I'll likely keep using it and get more familiar with its quirks. Alternatively, I might revert to GPT-3.5 or GPT-4 in ChatGPT Plus. Stay tuned; the next time ChatGPT updates its model, I'll definitely rerun these tests to see if it can consistently pick the right answer across all four tests.

Have you tried coding with any of these AI models? What's been your experience? Let us know in the comments below.

Comments (22)
RoyMartínez April 30, 2026 at 10:01:09 PM EDT

GPT-4o sounds impressive, but that 'one weird exception' makes me curious. What was the weird result? Maybe a sign that AI can still fail in surprisingly 'human' ways on certain logic tasks? 🤔 The omni capabilities are cool, but I wonder how stable the performance really is across all modes.

PaulYoung March 14, 2026 at 8:00:58 PM EDT

Good article! Coding tests are always revealing. I wonder whether there are biases depending on the languages used for training… Or maybe it comes down to how the prompt is worded? 🤔

JonathanAllen April 26, 2025 at 7:46:22 AM EDT

GPT-4o is impressive, passing most of my coding tests! But that weird result left me confused. Still, it's versatile across text, graphics, and voice. If only it could explain that weird result, it would be perfect! 🤔

WillHarris April 25, 2025 at 2:21:39 PM EDT

GPT-4o is really impressive, passing most of my coding tests! But that strange result puzzled me. Even so, it's very flexible across text, graphics, and voice. If only it could explain that strange result, it would be perfect! 🤔

DonaldGonzález April 24, 2025 at 7:41:59 AM EDT

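I was impressed that GPT-4o handled most of my coding tests perfectly! But that one strange result bothered me. Even so, its versatility across text, graphics, and voice is wonderful. If only it could have explained that strange result, it would have been perfect! 🤔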

JustinAnderson April 23, 2025 at 1:12:28 AM EDT

GPT-4o impressed me with its coding skills! It passed all my tests except for one weird result that left me thinking. Its versatility across text, graphics, and voice is great! But that failure needs fixing, OpenAI! 😎
