Claude 3.5 Sonnet Gets Creative but Struggles in AI Coding Tests Dominated by ChatGPT
Testing the Capabilities of Anthropic's New Claude 3.5 Sonnet
Last week, I received an email from Anthropic announcing the release of Claude 3.5 Sonnet. They boasted that it "raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations." They also claimed it was perfect for complex tasks like code generation. Naturally, I had to put these claims to the test.
I've run a series of coding tests on various AIs, and you can too. Just head over to "How I test an AI chatbot's coding ability - and you can too" to find all the details. Let's dive into how Claude 3.5 Sonnet performed against my standard tests, and see how it stacks up against other AIs like Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.
1. Writing a WordPress Plugin
Initially, Claude 3.5 Sonnet showed a lot of promise. The user interface it generated was impressive, with a clean layout that placed the data fields side by side, a first among the AIs I've tested.
Screenshot by David Gewirtz/ZDNET
What caught my attention was how Claude approached the code generation. Instead of the usual separate files for PHP, JavaScript, and CSS, it provided a single PHP file that wrote the JavaScript and CSS files into the plugin's directory. It was an innovative approach, but a fragile one: it only works if file permissions allow a plugin to write to its own folder, which is a major security concern in a production environment.
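To illustrate the pattern, here's a minimal sketch of my own, not Claude's actual output; the file names, function name, and generated contents are hypothetical stand-ins:

<?php
/*
Plugin Name: Randomizer Sketch (hypothetical)
*/

// Sketch of the single-file pattern: on activation, the plugin writes
// its own JavaScript and CSS assets into its directory.
register_activation_hook( __FILE__, 'rnd_sketch_write_assets' );

function rnd_sketch_write_assets() {
    $dir = plugin_dir_path( __FILE__ );

    // The fragile part: this only succeeds if the web server's user
    // has write permission on the plugin's own folder.
    file_put_contents( $dir . 'randomizer.js', '/* generated JS */' );
    file_put_contents( $dir . 'randomizer.css', '/* generated CSS */' );
}

The conventional approach is to ship the JavaScript and CSS as static files alongside the PHP and load them with wp_enqueue_script() and wp_enqueue_style(), which never requires the plugin to have write access to its own directory.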
Unfortunately, despite the creative solution, the plugin didn't work. The "Randomize" button did nothing, which was disappointing given its initial promise.
Here are the aggregate results compared to previous tests:
- Claude 3.5 Sonnet: Interface: good, functionality: fail
- ChatGPT GPT-4o: Interface: good, functionality: good
- Microsoft Copilot: Interface: adequate, functionality: fail
- Meta AI: Interface: adequate, functionality: fail
- Meta Code Llama: Complete failure
- Google Gemini Advanced: Interface: good, functionality: fail
- ChatGPT 4: Interface: good, functionality: good
- ChatGPT 3.5: Interface: good, functionality: good
2. Rewriting a String Function
This test evaluates how well an AI can rewrite code to meet specific needs, in this case converting dollar and cent amounts. Claude 3.5 Sonnet did a good job removing leading zeros, handling integers and decimals correctly, and preventing negative values. It also smartly returned "0" for unexpected inputs, which helps avoid errors.
However, it failed to allow entries like ".50" for 50 cents, which was a requirement. This means the revised code wouldn't work in a real-world scenario, so I have to mark it as a fail.
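My test code and Claude's revision aren't reproduced here, but to make the requirements concrete, here's a hypothetical sketch of my own (the function name is made up) that satisfies all of them, including the ".50" case Claude missed:

<?php
// Hypothetical sketch of a validator meeting the test's stated rules:
// strip leading zeros, accept integers and decimals (including ".50"),
// reject negatives, and return "0" for anything unexpected.
function sketch_clean_amount( $input ) {
    $input = trim( (string) $input );

    // Accept "5", "5.25", and ".50"; the pattern rejects negatives,
    // stray characters, and empty input outright.
    if ( ! preg_match( '/^\d*\.?\d+$/', $input ) ) {
        return '0';  // safe fallback for unexpected input
    }

    $value = (float) $input;  // ".50" parses as 0.5, "007" as 7.0

    // Whole-dollar inputs stay as integers; anything with a decimal
    // point is normalized to two places (".50" becomes "0.50").
    return ( strpos( $input, '.' ) === false )
        ? (string) (int) $value
        : number_format( $value, 2, '.', '' );
}

The bare-decimal form is the easy one to overlook: a stricter pattern like /^\d+(\.\d+)?$/ requires a digit before the decimal point and would reject ".50" outright.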
Here are the aggregate results:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Succeeded
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
3. Finding an Annoying Bug
This test is tricky because it requires the AI to find a subtle bug, one whose fix demands specific knowledge of how WordPress works. It's a bug I missed myself; I originally had to turn to ChatGPT to solve it.
Claude 3.5 Sonnet not only found and fixed the bug but also noticed an error introduced during the publishing process, which I then corrected. This was a first among the AIs I've tested since publishing the full set of tests.
Here are the aggregate results:
- Claude 3.5 Sonnet: Succeeded
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
- Meta AI: Succeeded
- Meta Code Llama: Failed
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
So far, Claude 3.5 Sonnet has failed two out of three tests. Let's see how it does with the last one.
4. Writing a Script
This test checks the AI's knowledge of specialized programming tools like AppleScript and Keyboard Maestro. While ChatGPT had shown proficiency in both, Claude 3.5 Sonnet didn't fare as well. It wrote an AppleScript that attempted to interact with Chrome but completely ignored the Keyboard Maestro component.
Moreover, the AppleScript contained a syntax error. In trying to make the match case-insensitive, Claude generated a line that won't even compile:
if theTab's title contains input ignoring case then
The "contains" statement is already case-insensitive, and the "ignoring case" phrase was misplaced, resulting in an error.
Here are the aggregate results:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded but with reservations
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Failed
- Google Gemini Advanced: Succeeded
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Failed
Overall Results
Here's how Claude 3.5 Sonnet performed overall compared to other AIs:
- Claude 3.5 Sonnet: 1 out of 4 succeeded
- ChatGPT GPT-4o: 4 out of 4 succeeded, but with one weird dual-choice answer
- Microsoft Copilot: 0 out of 4 succeeded
- Meta AI: 1 out of 4 succeeded
- Meta Code Llama: 1 out of 4 succeeded
- Google Gemini Advanced: 1 out of 4 succeeded
- ChatGPT 4: 4 out of 4 succeeded
- ChatGPT 3.5: 3 out of 4 succeeded
I was pretty disappointed with Claude 3.5 Sonnet. Anthropic promised it was suited for programming, but it didn't meet those expectations. It's not that it can't program; it just can't program correctly. I keep hoping to find an AI that can outperform ChatGPT, especially as these models get integrated into programming environments. But for now, I'm sticking with ChatGPT for programming help, and I recommend you do the same.
Have you used an AI for programming? Which one, and how did it go? Share your experiences in the comments below.
Follow my project updates on social media, subscribe to my weekly newsletter, and connect with me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
Comments
ScottMitchell
May 5, 2025 at 9:17:31 AM EDT
Claude 3.5 Sonnet is pretty good, but it's no match for ChatGPT in coding tests. It's like bringing a knife to a gunfight! 😂 Still, it's an improvement over the last version, so kudos to Anthropic for trying to keep up. Maybe next time, they'll surprise us!
JamesMiller
May 5, 2025 at 4:59:50 AM EDT
Claude 3.5 Sonnet is good, but it doesn't come close to ChatGPT in coding tests. It's like bringing a knife to a gunfight! 😂 Still, it's an improvement over the previous version, so kudos to Anthropic for trying to keep up. Maybe next time they'll surprise us!
StevenNelson
May 5, 2025 at 3:23:24 AM EDT
Claude 3.5 Sonnet is no match for ChatGPT in coding tests. It's like walking into a gunfight with a knife! 😂 Still, it's improved over the previous version, so I respect Anthropic's effort. Maybe they'll surprise us next time?
JoseDavis
May 5, 2025 at 2:46:04 AM EDT
Claude 3.5 Sonnet struggling with coding is a bit disappointing given Anthropic's promises. 😐 ChatGPT keeps the edge, but the AI race is fascinating!
HaroldLopez
May 5, 2025 at 12:06:54 AM EDT
Claude 3.5 Sonnet falls well short of ChatGPT in coding tests. It feels like heading into a gunfight with a knife! 😂 Still, it's better than the previous version, so Anthropic's effort deserves applause. Maybe they'll surprise us next time!
AveryThomas
May 4, 2025 at 6:30:08 PM EDT
Claude 3.5 Sonnet only managing a so-so showing in coding tests? A bit disappointing; it feels like ChatGPT still holds the throne. 😕 With AI competition this fierce, Anthropic needs to step it up!