GitHub Copilot's AI Tested: Mixed Coding Success Leaves Me Baffled
Exploring the Inconsistencies in AI Coding Tools
It's truly puzzling how AI tools, all built on the same foundational large language model, can yield such varied results. For instance, ChatGPT, Perplexity, and GitHub Copilot all leverage OpenAI's GPT-4 model. Yet, my recent tests showed stark differences in performance: while ChatGPT and Perplexity's pro plans excelled, GitHub Copilot had a 50% success rate.
I conducted these tests using GitHub Copilot integrated within a VS Code environment. I'll share a detailed guide on setting this up in an upcoming article. For now, let's dive into the specifics of the tests I ran.
If you're curious about my testing methodology and the prompts used, you can check out my detailed guide on evaluating an AI chatbot's coding capabilities.
TL;DR: GitHub Copilot managed to pass two out of the four tests I conducted.
Test 1: Writing a WordPress Plugin
This test was a complete disappointment. Because it was the first test I ran, I couldn't tell whether GitHub Copilot simply struggles with coding or whether the constraints of interacting with it inside VS Code limit what it can do.
Here's the context: I asked the AI to develop a fully functional WordPress plugin that includes an admin interface and operational logic. The plugin's task was to accept a list of names, sort them, and rearrange any duplicates so that identical names do not end up next to each other.
This task grew out of a real-world need in my wife's digital goods e-commerce business, where she manages an active Facebook group.
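To make the core task concrete, here is a minimal sketch of the "separate duplicates" step. It is written in Python for brevity, whereas the plugin in the test was PHP, and it is not the code any of the tested AIs produced. The function names and the greedy strategy (always place the name with the most remaining copies, alphabetical on ties) are assumptions chosen for illustration; the test prompt may have expected a different ordering.

```python
import heapq
from collections import Counter


def separate_duplicates(names):
    """Order names so that equal names are not adjacent, where that is possible.

    Hypothetical helper: a greedy pass that always places the name with the
    most remaining copies, skipping the name that was placed just before.
    """
    counts = Counter(names)
    # heapq is a min-heap, so counts are stored negated to pop the largest first.
    # Ties on count fall back to alphabetical order of the name.
    heap = [(-count, name) for name, count in counts.items()]
    heapq.heapify(heap)

    result = []
    held = None  # the name placed last, kept out of the heap for one step
    while heap:
        neg_count, name = heapq.heappop(heap)
        result.append(name)
        if held is not None:
            heapq.heappush(heap, held)
        remaining = neg_count + 1  # one copy used; still negative if copies remain
        held = (remaining, name) if remaining < 0 else None

    # If one name makes up more than half the list, separation is impossible;
    # the leftover copies are appended at the end rather than dropped.
    if held is not None:
        result.extend([held[1]] * -held[0])
    return result


def has_adjacent_duplicates(names):
    """Return True if any two neighbouring entries are equal."""
    return any(a == b for a, b in zip(names, names[1:]))


if __name__ == "__main__":
    print(separate_duplicates(["Bob", "Alice", "Bob", "Carol"]))
```

A plain alphabetical sort would leave the two "Bob" entries side by side, which is exactly what the plugin was meant to avoid. The sketch keeps every input name and changes only the order.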
Of the ten AI models tested, five passed this test entirely, three passed partially, and two, including Microsoft Copilot, failed completely. Given the same prompt, GitHub Copilot produced only PHP code. The problem can indeed be solved with PHP alone, but Copilot's code referenced a JavaScript file that it never generated.

Screenshot by David Gewirtz/ZDNET
When I tried to prompt GitHub Copilot from within a JavaScript file to complete the task, it bizarrely responded with more PHP code, still referencing a non-existent JavaScript file.

Screenshot by David Gewirtz/ZDNET
Test 2: Rewriting a String Function
This test was relatively straightforward: I provided a function that was meant to validate dollars and cents but only checked for whole dollars. The challenge was for the AI to correct the function.
GitHub Copilot did modify the code, but the result was problematic. It assumed that any input string was valid, which would cause errors if the string was empty. Additionally, the updated regular expression couldn't handle various edge cases, such as inputs like "3.", ".3", or "00.30". For a function meant to validate currency, such oversights are unacceptable, marking another fail for GitHub Copilot.
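For illustration, here is a rough Python sketch of a validator that handles the empty string and the edge cases listed above. It is not the function from the test, which is not reproduced in this article. The article also does not say whether "3.", ".3", and "00.30" should be accepted or rejected, so the policy below (accept them and normalize to cents) is an assumption, as are the function name and the optional leading dollar sign.

```python
import re

# Assumed policy (not stated in the article): optional leading "$", at least
# one digit somewhere, at most two digits after the decimal point, and
# leading zeros tolerated.
_AMOUNT = re.compile(r"\$?([0-9]*)(?:\.([0-9]{0,2}))?")


def parse_dollars_and_cents(text):
    """Return the amount in whole cents, or None if text is not a valid amount."""
    if not isinstance(text, str):
        return None
    match = _AMOUNT.fullmatch(text.strip())
    if match is None:
        return None
    dollars = match.group(1)
    cents = match.group(2) or ""
    if not dollars and not cents:
        return None  # rejects "", "." and "$", which contain no digits
    # Pad the cents so that ".3" means 30 cents rather than 3 cents.
    return int(dollars or "0") * 100 + int(cents.ljust(2, "0"))


if __name__ == "__main__":
    for sample in ["", "3.", ".3", "00.30", "12", "3.456", "abc"]:
        print(repr(sample), "->", parse_dollars_and_cents(sample))
```

Returning cents as an integer, or None on failure, forces the caller to handle invalid and empty input explicitly instead of assuming the string is well formed. A stricter policy that rejects "3." or leading zeros would only need a tighter pattern.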
Test 3: Finding an Annoying Bug
Here, GitHub Copilot shone. This test was based on a real coding challenge I faced, where the error message didn't directly point to the actual issue. It's a bit like a coding riddle, requiring deep understanding of WordPress API calls to solve.
While Microsoft Copilot, Gemini, and Meta Code Llama stumbled on this test, GitHub Copilot nailed it, showcasing its capability to tackle complex, real-world problems.
Test 4: Writing a Script
GitHub Copilot also succeeded in this test, where Microsoft Copilot fell short. The task involved creating a script that needed to integrate AppleScript, the Chrome object model, and a Mac-specific utility called Keyboard Maestro.
To pass, the AI needed to recognize and address the nuances of all three environments, and GitHub Copilot did just that.
Final Thoughts
It's disheartening to see GitHub Copilot, which uses the advanced GPT-4 model, fail half of the tests. Given GitHub's status as a leading source code management platform, one would expect its AI coding support to be more dependable.
However, the world of AI is ever-evolving, and I'm optimistic that GitHub Copilot's performance will improve over time. We'll revisit this in a few months to see how it's progressed.
Do you rely on AI for coding assistance? Which AI tool is your go-to? Have you given GitHub Copilot a try? Share your experiences in the comments below.
Stay updated with my daily project progress on social media. Don't forget to sign up for my weekly newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.
Comments (40)
Honestly, this doesn't surprise me. Even with the same underlying model, the way each tool fine-tunes prompts and handles context makes a huge difference. Copilot's mixed results probably come from its integration with IDE specifics. Still, it's baffling why the same model can give such inconsistent outputs for similar tasks. 🤔
I tried it too and find it really strange that the results differ so much even though the foundation is similar. Sometimes Copilot writes great code, sometimes total nonsense. Maybe it's down to the IDE integration? 🤔 Either way, a lot still needs to improve before I can fully rely on it.
Interesting: why do AI tools built on the same GPT-4 base model behave so differently? GitHub Copilot sometimes generates code that looks logical, and then produces complete nonsense 😅 Maybe it comes down to fine-tuning or context? It reminds me of a temperamental programmer colleague who is a genius one moment and helpless the next.
Reading this article, the variance among AI coding tools is really fascinating. It's the same technology, yet the results can differ this much... Working as a developer, I've often been caught off guard when Copilot writes perfect code and then suddenly suggests something completely off. 🤔 I hope AI tools become more stable going forward!
I find it frustrating that tools like Copilot and ChatGPT use the same base model but perform so differently. It makes me wonder whether the implementation is really well done or whether they're just slapping on a famous name to sell more. 🤔