I put GPT-4o through my coding tests and it aced them - except for one weird result

If you've been keeping up with the tech world, you're likely aware that OpenAI has just dropped its latest large language model, GPT-4o, where the "o" signifies "omni." This new model promises versatility across text, graphics, and voice, and I couldn't wait to put it through its paces with my standard set of coding tests. These tests have been run against a wide array of AI models, yielding some pretty fascinating results. Stick with me until the end because there's a twist you won't want to miss.
If you're interested in conducting your own experiments, check out this guide: How I test an AI chatbot's coding ability - and you can too. It outlines all the tests I use, along with detailed explanations of how they work and what to look for in the outcomes.
Now, let's dive into the results of each test and see how GPT-4o stacks up against previous contenders like Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and the earlier versions of ChatGPT.
1. Writing a WordPress Plugin
Here's a glimpse of the user interface GPT-4o generated for the plugin:
Interestingly, GPT-4o took the liberty of including a JavaScript file, which dynamically updates the line count in both fields. While the prompt didn't explicitly rule out JavaScript, this creative approach was unexpected and effective. The JavaScript also enhances the Randomize button's functionality, allowing for multiple result sets without a full page refresh.
The lines were arranged correctly, and duplicates were appropriately separated according to the specifications. It's a solid piece of code, with just one minor quibble: the Randomize button wasn't placed on its own line, though I hadn't specified that in the prompt, so no points off for that.
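To make the setup concrete, here's a rough sketch of my own, not GPT-4o's actual output, showing how a plugin along these lines might register its admin page and enqueue that separate JavaScript file for the live line counts. The plugin name, slug, handles, and file names are all hypothetical:

```php
<?php
/**
 * Plugin Name: Line Randomizer (illustrative sketch)
 *
 * A hypothetical outline, not GPT-4o's output: it shows how a plugin like
 * this might register its admin page and enqueue a separate JavaScript file
 * so the line counts can update as the user types.
 */

// Register the admin page that holds the two text fields and the Randomize button.
add_action( 'admin_menu', function () {
    add_menu_page(
        'Line Randomizer',             // page title
        'Line Randomizer',             // menu title
        'manage_options',              // capability
        'line-randomizer',             // menu slug (hypothetical)
        'line_randomizer_render_page'  // render callback
    );
} );

// Load the plugin's JavaScript only on its own admin page.
add_action( 'admin_enqueue_scripts', function ( $hook ) {
    if ( 'toplevel_page_line-randomizer' === $hook ) {
        wp_enqueue_script(
            'line-randomizer',                             // handle (hypothetical)
            plugins_url( 'line-randomizer.js', __FILE__ ), // the kind of JS file GPT-4o chose to add
            array(),                                       // no dependencies
            '1.0',
            true                                           // load in the footer
        );
    }
} );

// The two text areas and the Randomize button; the enqueued script would keep
// the per-field line counts in sync and re-randomize without a page reload.
function line_randomizer_render_page() {
    echo '<div class="wrap"><h1>Line Randomizer</h1>';
    echo '<textarea id="lr-input" rows="10" cols="60"></textarea>';
    echo '<textarea id="lr-output" rows="10" cols="60" readonly></textarea>';
    echo '<button id="lr-randomize" class="button button-primary">Randomize</button></div>';
}
```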
Here are the aggregate results for this and previous tests:
- ChatGPT GPT-4o: Interface: good, functionality: good
- Microsoft Copilot: Interface: adequate, functionality: fail
- Meta AI: Interface: adequate, functionality: fail
- Meta Code Llama: Complete failure
- Google Gemini Advanced: Interface: good, functionality: fail
- ChatGPT 4: Interface: good, functionality: good
- ChatGPT 3.5: Interface: good, functionality: good
2. Rewriting a String Function
This test evaluates the model's ability to handle dollars and cents conversions. GPT-4o successfully rewrote the code to reject inputs that could cause issues with subsequent lines, ensuring only valid dollar and cent values are processed.
I was a bit disappointed that it didn't automatically add a leading zero to values like .75, converting them to 0.75. However, since I didn't explicitly request this feature, it's not a fault of the AI. It's a reminder that even when an AI delivers functional code, you might need to refine the prompt to get exactly what you need.
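For context, here's a minimal sketch of the kind of guard this test calls for, written by me as an illustration rather than taken from GPT-4o's answer. It also adds the leading-zero normalization I just mentioned wishing for; the function name and regex are my own assumptions about how such a check might look:

```php
<?php
// My own illustration of the kind of input guard this test calls for, not
// GPT-4o's answer. It rejects anything that isn't a plain dollars-and-cents
// value and also adds the leading zero to inputs like ".75" -- the nicety I
// hadn't asked for. The function name is hypothetical.
function normalize_currency_input( string $value ): ?string {
    $value = trim( $value );

    // Allow optional dollar digits, then an optional decimal point with one or
    // two cents digits. Anything else (letters, extra dots, negatives) is rejected.
    if ( '' === $value || ! preg_match( '/^\d*(\.\d{1,2})?$/', $value ) ) {
        return null; // invalid input; the caller decides how to report it
    }

    // Convert ".75" into "0.75" so later formatting code sees a leading digit.
    if ( '.' === $value[0] ) {
        $value = '0' . $value;
    }

    return $value;
}

// Examples: "3.50" stays "3.50", ".75" becomes "0.75", "12" stays "12",
// and "1.2.3" or "$5" come back as null.
```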
Here are the aggregate results for this and previous tests:
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Succeeded
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
3. Finding an Annoying Bug
This test is intriguing because the solution isn't immediately apparent. I was initially stumped by this error during my own coding, so I turned to the first ChatGPT model for help. It found the error instantly, which was mind-blowing at the time.
By contrast, three of the other LLMs I tested missed the misdirection in this problem. The error message points to one part of the code, but the actual issue lies elsewhere, requiring deep knowledge of the WordPress framework to identify.
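To see why the misdirection is so easy to fall for, here's a made-up WordPress-style example of my own, not the actual test code; the hook and function names are invented. The complaint shows up on one line, while the real defect, a filter callback that never returns its value, sits somewhere else entirely.

```php
<?php
// A made-up illustration of the misdirection, not the actual test code.
// The hook name and functions are hypothetical.

// Bug: this filter callback modifies $roles but never returns it, so the
// filter hands back null to whoever calls it.
add_filter( 'my_plugin_allowed_roles', function ( $roles ) {
    $roles[] = 'editor';
    // missing: return $roles;
} );

function my_plugin_user_can_export( $role ) {
    $allowed = apply_filters( 'my_plugin_allowed_roles', array( 'administrator' ) );

    // PHP's complaint ("in_array() expects parameter 2 to be array, null given",
    // or a TypeError on newer PHP versions) points at this line, even though this
    // code is fine -- the real defect lives in the filter callback above.
    return in_array( $role, $allowed, true );
}
```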
Fortunately, GPT-4o correctly identified the problem and described the fix accurately.
Here are the aggregate results for this and previous tests:
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
- Meta AI: Succeeded
- Meta Code Llama: Failed
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
So far, GPT-4o is three for three. Let's see how it does with the final test.
4. Writing a Script
In response to this test, GPT-4o actually provided more than I asked for. The test involves using the obscure Mac scripting tool Keyboard Maestro, Apple's AppleScript, and Chrome scripting behavior. Keyboard Maestro, by the way, is a game-changer for me, making Macs my go-to for productivity due to its ability to reprogram the OS and applications.
To pass, the AI needs to correctly outline a solution using a combination of Keyboard Maestro code, AppleScript, and Chrome API functionality.
Surprisingly, GPT-4o gave me two different versions:
Both versions correctly interacted with Keyboard Maestro, but they differed in handling case sensitivity. The left version was incorrect because AppleScript doesn't support "as lowercase." The right version, which used "contains" and was case-insensitive, worked fine.
I'm giving GPT-4o a pass, albeit cautiously, because it did deliver working code. However, returning two options, one of which was incorrect, made me do extra work to evaluate and choose the right one. That could have been as time-consuming as writing the code myself.
Here are the aggregate results for this and previous tests:
- ChatGPT GPT-4o: Succeeded, but with reservations
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Failed
- Google Gemini Advanced: Succeeded
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Failed
Overall Results
Here's how all the models fared across the four tests:
- ChatGPT GPT-4o: 4 out of 4 succeeded, but with that one odd dual-choice answer
- Microsoft Copilot: 0 out of 4 succeeded
- Meta AI: 1 out of 4 succeeded
- Meta Code Llama: 1 out of 4 succeeded
- Google Gemini Advanced: 1 out of 4 succeeded
- ChatGPT 4: 4 out of 4 succeeded
- ChatGPT 3.5: 3 out of 4 succeeded
Up until now, ChatGPT has been my go-to for coding assistance. It's always delivered (except when it hasn't). The other AIs mostly fell short in my tests. But GPT-4o threw me a curveball with that last dual-answer response. It made me question what's going on inside this model that could cause such a hiccup.
Despite this, GPT-4o remains the top performer in my coding tests, so I'll likely keep using it and get more familiar with its quirks. Alternatively, I might revert to GPT-3.5 or GPT-4 in ChatGPT Plus. Stay tuned; the next time ChatGPT updates its model, I'll definitely rerun these tests to see if it can consistently pick the right answer across all four tests.
Have you tried coding with any of these AI models? What's been your experience? Let us know in the comments below.
Comments (20)
JonathanAllen
April 26, 2025 at 7:46:22 AM EDT
GPT-4o is impressive, passing most of my coding tests! But that weird result left me confused. Still, it's versatile across text, graphics, and voice. If only it could explain that strange result, it would be perfect! 🤔
WillHarris
April 25, 2025 at 2:21:39 PM EDT
GPT-4o is really impressive, passing most of my coding tests! But that odd result left me puzzled. Still, it's very flexible across text, graphics, and voice. If only it could explain that odd result, it would be perfect! 🤔
DonaldGonzález
April 24, 2025 at 7:41:59 AM EDT
I was impressed that GPT-4o handled most of my coding tests flawlessly! But that one strange result bothered me. Even so, its versatility across text, graphics, and voice is wonderful. If only it could explain that strange result, it would have been perfect! 🤔
JustinAnderson
April 23, 2025 at 1:12:28 AM EDT
GPT-4o impressed me with its coding skills! It passed all my tests except for one strange result that left me thinking. Its versatility across text, graphics, and voice is great! But that glitch needs fixing, OpenAI! 😎
NicholasClark
April 22, 2025 at 10:12:49 PM EDT
I was impressed by GPT-4o's coding skills! It cleared almost all of my tests, but that one weird result bothers me. Its versatility across text, graphics, and voice is great! But please fix that one bug, OpenAI! 😅
DavidThomas
April 22, 2025 at 1:04:24 PM EDT
GPT-4o is impressive, acing most of my coding tests! But that one weird result threw me off. Still, it's versatile across text, graphics, and voice. If only it could explain that odd outcome, it'd be perfect! 🤔