Large language models (LLMs) are incredibly good at stating things confidently, even if they aren't always correct. OpenAI's reasoning models are an attempt to fix that, by getting the AI to slow down and work through complex problems, rather than just running with the first idea it has. So far, OpenAI has released the o1 model family and o3-mini from the o3 model family.
It's a really interesting approach, and I feel that reasoning models have already demonstrated that they're the next leap forward for LLMs. In addition to o1 and o3-mini, there's now DeepSeek-R1. It's only a matter of time before other AI companies create their own similar models. For now, though, let's focus on OpenAI's reasoning models.
What are OpenAI o1 and o3?
OpenAI o1 and o3 are two series of reasoning models from OpenAI. (o2 doesn't exist because of potential trademark issues.)
While these model families are similar to other OpenAI models, like GPT-4o, in many respects—and still use the same major underlying technologies, like transformer-based neural networks—the o1 and o3 models are significantly better at working through complex tasks and harder problems that require logical reasoning.
That's why OpenAI said it was "resetting the counter back to 1" rather than releasing o1 as GPT-5. (And yes, the weird letter casing and hyphenation of all this drives me mad, too.)
Right now, there are two reasoning models available:
OpenAI o1: The largest and most capable o1 model. There's also o1 pro mode, which uses more computing resources, though it's the same underlying model.
OpenAI o3-mini: A version of o3 optimized for speed. It has three reasoning effort options: low, medium, and high.
Both are available to ChatGPT Plus subscribers and through the OpenAI API, and o3-mini is also available to free users (mostly in response to DeepSeek's release of its free reasoning model). ChatGPT Pro users have access to o1 pro mode. Earlier models, o1-mini and o1-preview, have been replaced by o3-mini, and presumably, o3 is in development.
These reasoning models aren't meant as a replacement for GPT-4o and GPT-4o mini: they offer a different price-to-performance tradeoff that makes sense for more advanced tasks. Let's dig into what that looks like.
How do reasoning models like OpenAI o1 and o3 work?
According to OpenAI, its reasoning models were trained to "think" through problems before responding. In effect, this means they integrate a prompt engineering technique called chain-of-thought (CoT) reasoning directly into the model.
When you give o1 or o3-mini a complex prompt, rather than immediately trying to generate a response, it breaks down what you've asked it to do into multiple simpler steps. It then works through this chain of thought step by step before creating its output. How long and how much effort it puts into this process depends on what model you use and what reasoning effort you instruct it to use.
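To make this concrete, here's a minimal sketch of what choosing a reasoning effort looks like through the OpenAI Python SDK. Treat it as illustrative: the `reasoning_effort` parameter applies to o3-mini at the time of writing, and model names and options may change, so check OpenAI's API docs for current details.

```python
# A minimal sketch of calling a reasoning model via the OpenAI Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[
        {
            "role": "user",
            "content": "A train leaves at 3:40 p.m. and arrives at 6:15 p.m. "
            "How long is the journey?",
        }
    ],
)

# Only the final answer comes back; the chain of thought itself stays hidden.
print(response.choices[0].message.content)
```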
In o1's introduction post on OpenAI's blog, you can see a few examples of how the o1-preview model uses CoT reasoning to analyze complex problems like decoding a cipher text, solving a crossword, and correctly answering math, chemistry, and English questions. They're worth looking through—they'll give you a much better idea of how these models work. (Alternatively, if you're a ChatGPT Plus subscriber, just give it a spin.)
Unfortunately, OpenAI has decided not to show these chains of thought to users. Instead, you get an AI-generated summary of the key points. It's still useful for understanding how the model is tackling different problems, but it doesn't give you the full detail of what the model is actually trying to do.
While I'm always happy to argue that using an anthropomorphizing word like "think" to describe what AI is doing is a stretch, it does capture the fact that these models take time to process your prompt before responding to you. Research has shown that CoT reliably improves the accuracy of AI models, so it's no surprise that reasoning models that employ it are significantly better at complex challenges than typical models like GPT-4o.
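You can approximate this behavior on a standard model yourself, since classic CoT prompting is just an instruction in the prompt. Here's a rough sketch (the prompt wording is my own illustration, not OpenAI's):

```python
# Chain-of-thought as a plain prompt engineering technique on a standard model.
# Reasoning models bake this behavior in during training; with GPT-4o, you
# have to ask for it explicitly.
from openai import OpenAI

client = OpenAI()

question = "Which gets me there sooner: a 6-hour flight or a 4-hour train ride?"

# The CoT nudge: ask the model to reason before it answers.
cot_prompt = question + "\nThink through it step by step, then give your final answer."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```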
By using reinforcement learning (where the model is rewarded for getting things correct), OpenAI has trained these models to try multiple approaches, recognize and correct mistakes, and take time to work through complex problems to find a good answer.
OpenAI has found that the performance of reasoning models increases with both training time and how long they're allowed to reason before providing an answer. This means that the more computing resources they have access to, the better they perform. This is why o1 pro mode is better than o1 and why o3-mini has three different versions. It also explains why these models are significantly more expensive to run.
Aside from their reasoning abilities, OpenAI o1 and o3 appear to function much the same as other modern LLMs. OpenAI has released no meaningful details about their architecture, parameter counts, or other changes, but that kind of secrecy is now what we expect from major AI companies. Despite the name, OpenAI isn't actually producing open AI models.
GPT-4o vs. OpenAI o1 and o3-mini
When it comes to tasks that require logical reasoning, OpenAI o1 and OpenAI o3-mini are significantly better than GPT-4o (and by extension, almost all other AI models). Even on typical AI benchmarks that involve some logic and where GPT-4o already performs really well, like MMLU, OpenAI o1 still scores higher.
More interestingly, on tasks that require high levels of logical reasoning, GPT-4o tends to do pretty poorly. One example OpenAI uses is the 2024 American Invitational Mathematics Examination (AIME), a qualifying exam for the USA Math Olympiad. Out of its 15 hard math questions, GPT-4o was only able to answer two correctly. o1, however, got 13 correct, which would place it among the top 500 students taking the exam in the U.S. The situation is similar on the competitive coding platform Codeforces: GPT-4o only scores in the 11th percentile, while the full o1 model scores in the 89th percentile.
The story is largely the same with o3-mini. On various benchmarks, o3-mini with low reasoning effort matches or exceeds o1-mini, and o3-mini with high reasoning effort matches o1.
What struck me most, though, were the situations where OpenAI o1 fell short. In human evaluations, the o1-preview model did slightly worse at personal writing and matched GPT-4o's performance at editing text. While not a big deal in and of itself, it is when you compare the cost of the different models (which we'll look at in a bit).
OpenAI o3-mini is a little more specialized: it excels at STEM questions that require logical reasoning and at generating code, but not at broad general knowledge. For those niche tasks, it's fast and effective—but for general tasks, it's worse than GPT-4o.
To see this all in action, here's GPT-4o mini answering a question about how to get to Spain given different options.
While it understands that swimming would be challenging, it also seems to think four hours is longer than six hours.
When you give o3-mini the same prompt, it nails it.
It works through its chain of thought step by step before creating its output: that's why o3-mini knows that six hours is longer than four.
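If you want to run the same comparison yourself through the API, a quick loop does it. The prompt below is my paraphrase of the example above, not the exact wording:

```python
# Send the same prompt to a standard model and a reasoning model to compare.
from openai import OpenAI

client = OpenAI()

prompt = (
    "I can get to Spain by a 6-hour flight, a 4-hour train ride, or by "
    "swimming. Which option gets me there fastest?"
)

for model in ("gpt-4o-mini", "o3-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```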
OpenAI o1 and o3-mini pricing
Through OpenAI's API, GPT-4o costs $2.50 per million input tokens and $10 per million output tokens. GPT-4o mini costs just $0.15 per million input tokens and $0.60 per million output tokens. On the other hand, o1 costs $15 per million input tokens and $60 per million output tokens. Even o3-mini costs $1.10 per million input tokens and $4.40 per million output tokens.
| Model | Price per million input tokens | Price per million output tokens |
|---|---|---|
| GPT-4o mini | $0.15 | $0.60 |
| o3-mini | $1.10 | $4.40 |
| GPT-4o | $2.50 | $10 |
| o1 | $15 | $60 |
All this is to say: the o1 and o3 models' increased logical performance comes at a cost. If you don't need the AI to work through complex problems, o1 will cost far more for no major performance benefit, while o3-mini might be cheaper but just isn't as good at general tasks.
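To put those numbers in perspective, here's a quick back-of-the-envelope comparison using the prices in the table above. The token counts are made-up illustration values, and keep in mind that reasoning models also bill their hidden reasoning tokens as output tokens, so real o1 and o3-mini costs run higher than this naive estimate:

```python
# Rough per-request cost comparison at the listed API prices.
PRICES = {
    # model: (dollars per 1M input tokens, dollars per 1M output tokens)
    "gpt-4o-mini": (0.15, 0.60),
    "o3-mini": (1.10, 4.40),
    "gpt-4o": (2.50, 10.00),
    "o1": (15.00, 60.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 2,000-token prompt that produces a 1,000-token response.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_000):.4f}")
```

At those illustrative counts, o1 works out to 100 times the cost of GPT-4o mini for the same request.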
Are reasoning models worth using?
While I've been super impressed with both the o1 and o3-mini models' ability to solve the kinds of problems that stump most AI models, they otherwise didn't stand out. They're even missing lots of useful features: o1 can work with images but not uploaded files or content from the web, for example, while o3-mini can use ChatGPT Search but doesn't do images. Neither works in voice mode.
OpenAI is also testing a feature called deep research that uses a version of o3 to evaluate large amounts of online data and complete multi-step research projects. Instead of returning an answer in a few moments, it can take 10 or 20 minutes to put together a response. For example, you can use deep research to find, collate, and format a variety of different data sources or find the best version of a product you want to buy. It's currently only available to ChatGPT Pro subscribers.
So, at least for the time being, the o1 and o3 models are a super exciting development—but existing LLMs and large multimodal models will still have their uses. OpenAI says it's working on a system to automatically route your prompts to the most appropriate model, which would certainly make things work more seamlessly.
How to access OpenAI o1 and o3-mini
Right now, you can use the OpenAI o1 and OpenAI o3-mini models through ChatGPT and the API—OpenAI emphasizes that this is the first time a reasoning model (o3-mini) has been available for free through ChatGPT. If you're a ChatGPT Plus or Team subscriber, you're limited to 50 weekly messages for o1 and 150 daily messages for o3-mini. ChatGPT Pro subscribers get unlimited access to the o1 and o3-mini models, plus limited access to o1 pro mode and deep research.
The o1 and o3-mini models are also accessible through the API, but you don't have to be a developer to use them. With Zapier's ChatGPT integration, you can use both the o1 and o3-mini models, connecting them to thousands of other apps. Learn more about how to automate these new models, or get started with one of these pre-made workflows to access the power of ChatGPT from all of the apps you use at work.
This article was originally published in September 2024. The most recent update was in February 2025.