When OpenAI released the first iteration of ChatGPT in late 2022, it quickly became the fastest-growing app ever, amassing over one hundred million users in its first two months. Of all the competing large language models (LLMs) ChatGPT has inspired—and there are many—its closest rival in terms of performance is Claude, which launched in 2023.
When I first compared them head-to-head in April 2024, Claude's Opus model held a slight edge over GPT-4. But in May 2024, ChatGPT closed the gap again by launching GPT-4o, a multimodal AI model; Claude quickly followed with the release of Claude 3.5 in June 2024.
I've used ChatGPT and Claude regularly since each was released. And to compare these two AI juggernauts, I ran over a dozen tests to gauge their performance on different tasks, paying close attention to areas where GPT-4o and Claude 3.5 showed better—or worse—performance than their predecessors.
Here, I'll explain the strengths and limitations of Claude and ChatGPT, so you can decide which is best for you.
Note: OpenAI recently released GPT-4o mini—a smaller model that's faster and cheaper than 4o—along with o1—its newest series of models that's better at working through complex tasks. Because these models are still so new, this article focuses on comparing GPT-4o and Claude 3.5.
Claude vs. ChatGPT at a glance
Claude and ChatGPT are powered by comparably capable LLMs and LMMs (large multimodal models). They differ in some important ways, though: ChatGPT is more versatile, with features like image generation and internet access, while Claude offers cheaper API access and a larger context window (meaning it can process more data at once).
Here's a quick rundown of the differences between these two AI models.
| Claude | ChatGPT |
---|---|---|
Company | Anthropic | OpenAI |
AI models | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | GPT-4, GPT-4o, GPT-4o mini |
Context window | 200,000 tokens (and up to 1,000,000 tokens for certain use cases) | 128,000 tokens (GPT-4o) |
Internet access | No | Yes |
Image generation | No | Yes (DALL·E) |
Supported languages | Officially, English, Japanese, Spanish, and French, but in my testing, Claude supported every language I tried (even less common ones like Azerbaijani) | 95+ languages |
Paid tier | $20/month for Claude Pro | $20/month for ChatGPT Plus |
Team plans | $30/user/month; includes Projects feature for collaboration | $30/user/month; includes workspace management features and shared custom GPTs |
API pricing | $15 per 1M input tokens and $75 per 1M output tokens (Claude 3 Opus); $3 per 1M input tokens and $15 per 1M output tokens (Claude 3.5 Sonnet); $0.25 per 1M input tokens and $1.25 per 1M output tokens (Claude 3 Haiku) | $5 per 1M input tokens and $15 per 1M output tokens (GPT-4o); $0.50 per 1M input tokens and $1.50 per 1M output tokens (GPT-3.5 Turbo); $30 per 1M input tokens and $60 per 1M output tokens (GPT-4) |
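To make the API prices in the table easier to compare, here's a minimal Python sketch that estimates what a single request would cost on each model. The prices are copied from the table as it stood when this article was written. The dictionary keys are just labels I made up for this example (not official API model identifiers), and `estimate_cost` is a hypothetical helper, not part of either company's SDK.

```python
# Per-million-token prices in USD as (input, output), copied from the table above.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "claude-3-opus": (15.00, 75.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4o": (5.00, 15.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "gpt-4": (30.00, 60.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the estimated cost in USD for one request."""
    input_price, output_price = PRICES[model]
    # Prices are quoted per 1M tokens, so scale the total down at the end.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Example request: a 10,000-token prompt with a 1,000-token reply.
    for model in PRICES:
        cost = estimate_cost(model, input_tokens=10_000, output_tokens=1_000)
        print(f"{model}: ${cost:.4f}")
```

For that example request, Claude 3.5 Sonnet works out to about $0.045 and GPT-4o to about $0.065, which is the gap behind the "cheaper API access" point above.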
To compare the performance of one LLM to another, AI firms use benchmarks, which work much like standardized tests. OpenAI's benchmarking of GPT-4o shows impressive performance on LLM-specific tests like the MMLU, which measures undergraduate-level knowledge, and HumanEval, which measures coding ability. Meanwhile, Anthropic has published a head-to-head comparison of Claude, ChatGPT, Llama, and Gemini that shows its Claude 3.5 Sonnet model edging out GPT-4o on most tests.
While these benchmarks are undoubtedly useful, some machine learning experts speculate that this kind of testing overstates the progress of LLMs. As new models are released, they may (perhaps accidentally) be trained on their own evaluation data. As a result, they get better and better at standardized tests—but when asked to figure out new variations of those same questions, they sometimes struggle.
To get a sense for how each model performs on common daily-use tasks, I devised my own comparisons. Here's a high-level overview of what I found.
Task | Winner | Observations |
---|---|---|
Creativity | Claude | Claude's default writing style is more human-sounding and less generic. |
Proofreading and fact-checking | Claude | Both do a good job spotting errors, but Claude is a better editing partner because it presents mistakes and corrections more clearly. |
Image processing | Tie | Neither Claude nor ChatGPT is 100% accurate at identifying objects in images, and both have issues with counting. As long as you don't need absolute precision, both models provide remarkable insights into uploaded images. |
Logic and reasoning | ChatGPT | From math to physics to riddles, both LLMs perform capably. But GPT-4o is a more trustworthy partner than Claude 3.5 for complex equations. |
Emotion and ethics | Tie | Earlier iterations of Claude felt more "human" and empathetic, but Claude 3.5 and GPT-4o take an equally robotic approach. |
Analysis and summaries | ChatGPT | While Claude 3.5 officially has a larger context window, in my tests, GPT-4o went far beyond its stated limits and was able to process much larger documents than Claude. GPT-4o also provided more accurate analysis. |
Coding | Claude | Claude 3.5 is a more capable coding assistant, and its Artifacts feature provides a handy (and interactive) user interface that lets you immediately see the results of your code. |
Integrations | ChatGPT | From its native DALL·E image generation tool to its internet access and third-party GPTs, ChatGPT's capabilities go beyond Claude's standard offering. |
Read on to learn more about how Claude and ChatGPT performed on each task.
Claude is a better partner for creativity
When ChatGPT first came out, I started where everyone else did: generating goofy Shakespeare sonnets. (Like this one about avoiding the gym: "How oft I vow to break this slothful spell, To don the garb of toil and sweat; But when the morrow comes, alas, I dwell, In lethargy, and naught but regret.")
But as strong a creativity assistant as ChatGPT is, its output can feel generic and flowery. It leans too heavily on certain words; as a result, phrases like "Let's delve into the ever-changing tapestry of…" are now dead giveaways of AI-produced content. While clever prompting can avoid this, Claude tends to sound more human out of the box.
Test #1: Brainstorming
I've got a baby who occasionally struggles with sleep, so I wondered what Claude and ChatGPT might have in the way of nifty product ideas. Both were effective at brainstorming for this sort of task. I particularly liked Claude's Lullaby Lamb idea (though I'm pretty confident a "gentle, pulsing light" would keep our girl wide awake).
While I'm not a big fan of ChatGPT's idea for a "temperature-responsive sleep mat" for babies (sounds like a lawsuit waiting to happen), it certainly followed my directive to create unique product ideas.
Winner: Tie
Test #2: Creative writing
I'll fess up: as a writer, I don't want AI to be good at creative writing. But here we are. My saving grace is that—so far, at least—the default LLM writing style continues to be rather generic (especially ChatGPT, which often sounds like a Hallmark card).
I asked both LLMs to write me a short story with a dramatic twist. While Claude's story featured more or less the same caliber of writing as ChatGPT's, its twist was much more dramatic.
ChatGPT's idea of a surprising twist—a dog following someone around for a bit—isn't nearly as dramatic as a meet-cute with your future spouse at the grocery store. (Quick AI bias side note: what are the chances the main character would be named Sarah in both stories?)
Winner: Claude
Claude is a superior editing assistant
Proofreading and fact-checking is an AI use case with enormous potential; theoretically, it could free human editors from hours of tedious review. But so far, its usefulness has been limited by hallucinations: since LLMs would rather give you any answer than no answer, they sometimes end up making things up. I tested Claude and ChatGPT with this in mind, and I found Claude to be a more reliable and trustworthy editing partner.
Test #3: Proofreading
I gave Claude and ChatGPT a passage with intentional factual errors and misspellings. Claude caught all of my mistakes, from factual errors to spelling errors. Its presentation of the proofreading process, with each error listed individually, was easier for me to grasp than ChatGPT's output.
ChatGPT got everything right too. But it seemed to misunderstand my prompt, taking it more as a directive to edit the passages directly rather than proofread them. Since ChatGPT rewrote each sentence, rather than calling out the mistakes one by one, it was harder to figure out exactly where the errors were. Of course, I could fix this with a little prompt engineering, but I like that Claude knew what I wanted out of the box.
Winner: Claude
Test #4: Factual questions
Both ChatGPT and Claude are fairly reliable as long as you ask them fact-based questions that are covered within their training data (i.e., nothing from the last 6-18 months). I asked Claude and ChatGPT to give me a short "explain like I'm five" rundown of the history of the woolly mammoth, and both handled the task accurately.
After fact-checking the output of both LLMs with the Encyclopedia Britannica, I was satisfied with their accuracy. (Though if I wanted to nitpick, it'd be better to give the context that although some evidence suggests a small population of woolly mammoths remained until 4,300 years ago, most were extinct by 10,000 years ago.)
Winner: Tie
Both are decent at image processing, but neither is reliable
Claude 3.5 and GPT-4o are both relatively proficient at analyzing photos. If you're asking general questions about your photo (as in my interior design example below), you'll probably be satisfied with the outcome. That said, neither model is perfect at identifying objects and both consistently struggle with counting objects.
Test #5: Interior design suggestions
I submitted my living room for a "roasting" by Claude and ChatGPT. (Style feedback: too many neutrals, not enough color, apparently.) In my instructions, I asked each LLM to specifically call out the parts of the current image that they'd change. Claude did a good job of following those instructions, mentioning the geometric wall art and noticing the lack of a centerpiece on the coffee table.
While Claude started its roasting without any niceties, ChatGPT repaired my bruised ego by first complimenting my current setup ("Your living room has a modern, clean look with some lovely elements already in place") before making helpful suggestions for each part of the room.
Winner: Tie
Test #6: Counting objects
You know those CAPTCHA tests we all take to prove we're not robots? We've spent a decade or more clicking on bicycles, crosswalks, and buses—and training algorithms in the process—but despite our hard work, today's LLMs still struggle with counting.
I first ran this test in April 2024, pitting Claude 3 Opus and GPT-4 against each other. Claude miscategorized a red chili pepper as a bell pepper in one photo, while ChatGPT woefully undercounted the number of oranges in another. Although the latest models (Claude 3.5 Sonnet and GPT-4o) are supposed to have enhanced object recognition, my latest tests show that counting and identifying objects is still a challenge for both of them.
I asked Claude and ChatGPT to identify and count objects in a photo of various fruits. Claude 3.5 Sonnet seems to have slightly improved its counting accuracy by hedging its bets: it now often gives a range of numbers when it's unsure of the quantity. For example, it told me there "appears to be 4-5 individual bananas" when there are clearly four.
GPT-4o struggled with counting more than I expected, given the hype around the advances in its multimodal capabilities. It consistently miscounted objects, identifying five bananas where there are only four and finding seven blueberries instead of two dozen.
Winner: Tie
ChatGPT is a more trustworthy partner for complex logic and reasoning
Math and science have always been a struggle for me; I would have loved having an AI agent as an all-knowing study partner back in my high school days. It's astonishing to watch Claude and ChatGPT calculate answers to complex problems in seconds, but they can still make mistakes—so be careful.
Test #7: Solving riddles
I took one look at this riddle and quickly gave up, but Claude handled it easily. That said, Claude 3.5 Sonnet generated a more confusing explanation than the one Claude 3 Opus gave me during my last round of tests.
GPT-4o, on the other hand, got straight to the point.
Winner: Tie
Test #8: Physics equations
Claude handled this physics problem without issue, laying out its approach clearly and showing its work at each step.
I liked ChatGPT's answer formatting better. Since this is a multi-part question, it made it easier to jump to each relevant answer.
Winner: Tie
Test #9: Math word problems
Claude 3.5 Sonnet and GPT-4o both improved on their predecessors' mathematics performance.
When I tested Claude 3 Opus, it didn't even bother to answer the question, instead leaving me with a final equation to sort out myself. Claude 3.5 Sonnet got much closer, but still ended up with the wrong answer.
GPT-4o managed to provide the right answer—something GPT-4 wasn't able to do.
Winner: ChatGPT
Both models take a logical—and somewhat robotic—approach to emotion and ethics
After hoovering up terabytes of human-generated text, LLMs have gotten quite good at simulating human emotions and decision-making. Here's where things currently stand between Claude and ChatGPT.
Test #10: Sentiment analysis
Sentiment analysis—the art of gauging audience perceptions—is used for everything from reputation management to analyzing call center conversations. To test Claude and ChatGPT on this task, I asked them to gauge the sentiment of a handful of opinions including difficult-to-process elements like sarcasm, ambiguity, and slang.
Both Claude and ChatGPT got each of the sentiments right, navigating the ambiguity with ease and even nailing the sarcasm.
Winner: Tie
Test #11: Ethical dilemmas
The go-to ethical challenge for both college students and AI models is the "trolley problem," a classic philosophy dilemma in which you're offered the chance to sacrifice one person to save the lives of five. But since it's so well-known, both Claude and ChatGPT regurgitated existing thoughts on the topic.
To provoke a more interesting response, I offered up a "Robin Hood"-esque thought experiment. In my original tests, Claude 3 Opus sided with the antihero, encouraging me not to report a bank robbery since the thief gave the money to an orphanage. While you might not find this approach in an ethics textbook, Claude's unexpected contrarian take felt more human to me. But Claude 3.5 Sonnet appears to be more of a rule-follower, arguing that the ends never justify the means.
Meanwhile, ChatGPT used a more academic (and perhaps hyperbolic) approach to reach the same conclusion, arguing that by not reporting a crime, you'd be undermining trust in the legal system. Hedging its bets, ChatGPT goes on to say it might be nice to recruit charities to help the orphanage.
Winner: Tie
ChatGPT is better at analysis and summaries, even for large documents
Based on each model's officially published context window, Claude is theoretically the go-to choice for larger documents: Claude 3.5 Sonnet can process up to 200k tokens (~150,000 words), while GPT-4o's official limit is 128k tokens (~96,000 words). In my tests, however, GPT-4o was able to process even larger documents than Claude 3.5 Sonnet and give more accurate answers.
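The token-to-word figures above imply a rule of thumb of roughly 0.75 words per token (200k tokens ≈ 150,000 words). Here's a rough Python sketch that uses that ratio to guess whether a document of a given length should fit in each model's published context window. The ratio is an approximation taken from the numbers in this article, real tokenizers vary by model and by text, and the function names are hypothetical helpers I made up for illustration.

```python
import math

# Assumption: ~0.75 words per token, implied by the figures quoted in this article.
WORDS_PER_TOKEN = 0.75

# Published context windows, in tokens, as quoted in this article.
CONTEXT_WINDOWS = {
    "Claude 3.5 Sonnet": 200_000,
    "GPT-4o": 128_000,
}


def estimate_tokens(word_count):
    """Roughly convert a word count into a token count (rounded up)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count, context_tokens):
    """Return True if the estimated token count fits within the window."""
    return estimate_tokens(word_count) <= context_tokens


if __name__ == "__main__":
    # Word counts for the two novels come from the tests later in this section.
    for title, words in [("The Wonderful Wizard of Oz", 40_000), ("Dracula", 165_000)]:
        for model, window in CONTEXT_WINDOWS.items():
            verdict = "fits" if fits_in_context(words, window) else "too long"
            print(f"{title} in {model}: ~{estimate_tokens(words):,} tokens, {verdict}")
```

By this estimate, a 165,000-word book comes to about 220,000 tokens, which is over both published limits. That's the theory, anyway; as the tests below show, what each chatbot actually accepts can differ from what the numbers predict.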
Test #12: Summarizing text
Both ChatGPT and Claude summarize shorter texts without a problem. For example, they were equally effective at summarizing Martin Luther King Jr.'s 6,900-word "Letter from Birmingham Jail."
I felt like Claude provided a bit more context than ChatGPT did here, but both responses were accurate.
When I uploaded the 40,000-word text of The Wonderful Wizard of Oz by L. Frank Baum, both Claude 3.5 Sonnet and GPT-4o were able to analyze it. Claude's analysis was noticeably less accurate, though, undercounting the number of times "Dorothy" was mentioned in the text by almost 50%.
GPT-4o was more accurate, expertly manipulating the document I uploaded into a more readable format and providing solid answers.
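If you want to double-check a chatbot's count yourself, a few lines of Python will do it. This is just a minimal sketch of one way to verify a word count, not the method either tool uses internally. `count_word` is a hypothetical helper, and the sample text is a made-up stand-in for the full book.

```python
import re


def count_word(text, word):
    """Count case-sensitive, whole-word occurrences of `word` in `text`.

    The word-boundary match also counts possessives such as "Dorothy's".
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return len(pattern.findall(text))


if __name__ == "__main__":
    # Made-up sample text; in practice, read the full book from a text file instead.
    sample = 'Dorothy lived in Kansas. Dorothy\'s dog was Toto. "Come, Toto," said Dorothy.'
    print(count_word(sample, "Dorothy"))  # prints 3
    print(count_word(sample, "Toto"))     # prints 2
```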
I kept asking Claude and ChatGPT to process larger and larger documents, hoping to find their respective limits. Claude 3.5 Sonnet was the first to cut me off: it declined to process Dracula by Bram Stoker (at 165,000 words, it's slightly too long for Claude's context window).
Meanwhile, GPT-4o kept surprising me by handling documents much larger than its theoretical context window of ~96,000 words. It even managed to analyze War and Peace, Leo Tolstoy’s famously lengthy 567,000-word novel.
Winner: ChatGPT
Test #13: Analyzing documents
Sometimes it feels like AI is taking all of the creative tasks we humans would rather do ourselves, like art, writing, and creating videos. But when I use an LLM to analyze a 90-page PDF in seconds, I'm reminded that AI can also save us from immense drudgery.
To test Claude and ChatGPT's time-saving document analysis capabilities, I uploaded a research document about chinchillas.
While ChatGPT's insights are better organized, both LLMs extracted helpful and accurate insights.
However, when analyzing large documents, ChatGPT was again—surprisingly—the better tool. When I asked Claude 3.5 Sonnet to analyze a 271-page physics PhD thesis, it declined because the document was too large. GPT-4o was able to process it without issue.
Winner: ChatGPT
Both are powerful coding assistants, but Claude is better
I'll have to start this section with a disclaimer: I'm not a developer, which makes me poorly equipped to fairly compare the coding abilities of two hyper-intelligent AI tools. But Claude 3.5 comes with a new feature that intrigued me enough to try coding as a beginner. It's called Artifacts, and it brings up a preview window so you can see the results of your code in real time. (At the time of writing, Artifacts was still in beta, but you can enable it in Claude's settings.)
Test #14: Coding
As a newbie coder, I did what anyone else would do: try to make a video game. Claude's instructions for its Artifacts feature make it clear that you can create characters for a video game one at a time, and then put them together in an interactive video game. While I couldn't quite get that approach to work, with a few prompts, I was able to recreate a version of the classic game Frogger—and play it right from within Claude's interface.
Since you can instantly see the results of your code, it's easy to request changes to the graphics and the gameplay. I asked Claude to make the colors of the cars brighter, and to gradually increase their speed over time to make the game more challenging—and it handled both without a problem.
GPT-4o's coding abilities were harder for me to judge as a beginner. But based on reviews from programmers, GPT-4o—while powerful—now lags behind Claude 3.5. And the lack of a user interface like Claude's Artifacts definitely makes GPT-4o less user-friendly: while it was able to generate code for a Frogger-like game, GPT-4o couldn't give me a way to preview it or play it from within ChatGPT's interface.
Winner: Claude
ChatGPT's integrations make it a more flexible tool
Claude 3.5 and GPT-4o perform nearly the same on official benchmarks, and based on my hands-on testing, it's clear that each tool has advantages depending on the task at hand. But ChatGPT is a more flexible tool overall due to its extra features and integrations.
Here are some of the most useful ones: image generation, internet access, third-party GPTs, and custom GPTs.
Image generation
DALL·E 3, an image generation tool also developed by OpenAI, is accessible from directly within ChatGPT. While DALL·E 3's capacity to generate photorealistic images has been throttled since its launch (probably due to concerns about the misuse of AI images), it's still one of the most powerful AI image generators available.
Internet access
ChatGPT can access the internet through WebPilot, among other GPTs. To test this feature, I asked a question about a soccer match that was still underway at the time of my query; WebPilot was able to give me an accurate summary without issue.
Third-party GPTs
ChatGPT offers a marketplace of sorts where anyone can release their own specialized GPT. Popular GPTs include a coloring book image generator, an AI research assistant, a coding assistant, and even a "plant care coach."
Custom GPTs
You can also create your own custom GPT for others to interact with, tweaking settings behind the scenes to train it to generate responses in a certain way. You can also adjust how it interacts with users: for example, you can instruct it to use casual or formal language.
To test this feature, I created Visual Pool Designer, a GPT specializing in creating fantastical images of pools. (Is there anything better than a s'mores pool on a chilly fall evening?)
Zapier integrations
The good news: both Claude and ChatGPT integrate directly with Zapier, which means you can connect them to all the other apps you use most. Automatically start AI conversations from wherever you spend your time, and send the results where you need them. Learn more about how to automate Claude or how to add ChatGPT into your workflows, or get started with one of these pre-made Zapier templates.
Write AI-generated email responses with Claude and store in Gmail
Create AI-generated posts in WordPress with Claude
Generate conversations in ChatGPT with new emails in Gmail
Create ChatGPT conversations from new tl;dv transcripts
ChatGPT vs. Claude: Which is better?
Claude and ChatGPT have much in common: both are powerful AI chatbots well-suited to tasks like text analysis, brainstorming, and data-crunching. (Watching either tool work its way through a complex physics equation is a marvel.) But depending on your intended AI use case, you may find one more helpful than the other.
If you want an AI tool to use as a sparring partner for creative projects—writing, editing, brainstorming, or proofreading—Claude is your best bet. Your default output will sound more natural and less generic than ChatGPT's, and you'll also benefit from Claude 3.5's superior coding abilities and cheaper API costs.
If you're looking for a jack-of-all-trades tool, ChatGPT is a better choice. Generating text is just the start: you can also create images, browse the web, or connect to custom-built GPTs that are trained for niche purposes like academic research. And with the release of GPT-4o, a multimodal model, it's even more powerful and quicker than before.
Or, if you're looking for something that can take it one step further—an AI assistant that can help you automate all your business workflows—try Zapier Central.
This article was originally published in April 2024. The most recent update was in July 2024.