
What is multimodal AI? Large multimodal models, explained

By Harry Guinness · July 24, 2024

Large language models (LLMs), like OpenAI's GPT-4, are the extremely capable, state-of-the-art AI models that have been generating countless headlines for the past couple of years. The best of these LLMs are capable of parsing, understanding, interpreting, and generating text as well as most humans—and are able to ace many standardized tests. 

But there are still plenty of things LLMs can't do by themselves, like understand different forms of input. For example, LLMs can't natively respond to spoken or handwritten instructions, video footage, or anything else that isn't just text. Of course, the world isn't made up of neatly formatted text, so some AI researchers think that training large AI models to understand different "modalities"—like images, videos, and audio—is going to be a big deal in AI research. 

We're already seeing the first of these new large multimodal models, or LMMs. Google, OpenAI, and Anthropic (the makers of Claude) are all talking about how powerful their latest AI models are across different modalities—even if all the features aren't widely available to the public just yet. 

So, if large multimodal models are the next frontier of AI, let's have a look at what they are, how they work, and what they can do.

What is multimodal AI?

Large multimodal models are AI models that can work across multiple "modalities."

In machine learning and artificial intelligence research, a modality is a given kind of data. So text is a modality, as are images, videos, audio, computer code, mathematical equations, and so on. Most current AI models can only work with a single modality or convert information from one modality to another.

For example, large language models, like GPT-4, typically just work with one modality: text. They take a text prompt as an input, do some black box AI stuff, and then return text as an output. 

AI image recognition and text-to-image models both work with two modalities: text and images. AI image recognition models take an image as an input and output a text description, while text-to-image models take a text prompt and generate a corresponding image.

When an LLM appears to work with multiple modalities, it's most likely using an additional AI model to convert the other input into text. For example, before the launch of GPT-4o (a multimodal model), ChatGPT used GPT-3.5 and GPT-4 to power its text features, but it relied on Whisper to parse audio inputs and DALL·E 3 to generate images. 
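To make that pattern concrete, here's a minimal sketch of the "glue separate models together" approach, written with the OpenAI Python SDK. The model names, file name, and prompts are placeholders for illustration; this isn't a description of how ChatGPT is actually wired up internally.

```python
# A rough sketch of the pre-GPT-4o pattern: one model per modality, glued
# together with ordinary code. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; file names and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

# 1. A speech-to-text model (Whisper) converts audio into text.
with open("spoken_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. A text-only LLM answers the transcribed question.
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
print(answer.choices[0].message.content)

# 3. A separate text-to-image model (DALL·E 3) handles image generation.
image = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a Pekingese dog",
)
print(image.data[0].url)
```

Each step works on a single modality; the only thing tying them together is the text passed from one model to the next.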

But that's starting to change. 

Multimodal AI models go mainstream: Gemini, GPT-4o, GPT-4o mini, and Claude 3

When Google announced its Gemini series of AI models, it made a big deal about how they were "natively multimodal." Instead of having different modules tacked on to give the appearance of multimodality, they were apparently trained from the start to be able to handle text, images, audio, video, and more. 

OpenAI recently released GPT-4o, a multimodal model that offers GPT-4 level performance (or better) at much faster speeds and lower costs. It's available to ChatGPT Plus and Enterprise users, and its multimodality means you can quickly create and analyze images, interpret data, and have flowing voice conversations with the AI—among other tasks.

Shortly after the release of GPT-4o, OpenAI launched GPT-4o mini—a smaller language model that's faster and cheaper than GPT-4o. As of this writing, GPT-4o mini doesn't support all the same inputs and outputs as GPT-4o—for example, video and audio—but OpenAI says it plans to roll that out in the future.

Similarly, Anthropic claims that Claude 3 has "sophisticated vision capabilities on par with other leading models." So, while "large multimodal model" is a fancy new term, it basically describes the direction the major LLMs have been heading. 

How do large multimodal models work?

Large multimodal models are very similar to large language models in training, design, and operation. They rely on the same training and reinforcement strategies, and have the same underlying transformer architecture. This article on how ChatGPT works is a good place to start if you want a breakdown of some of these concepts. 

We're in the era of rapid AI commercialization, so a lot of interesting and important information about the various AI models is no longer released publicly. The broad strokes have to be pieced together from the technical announcements, product specs, and general direction of the research. As a result, this is more of an overarching view about how these models work as a whole, rather than a detailed breakdown of how a specific LMM was developed.

In addition to an unimaginable quantity of text, LMMs are also trained on millions or billions of images (with accompanying text descriptions), video clips, audio snippets, and examples of any other modality that the AI model is designed to understand (e.g., code). Crucially, all this training happens at the same time. The underlying neural network—the algorithm that powers the whole AI model—not only learns the word "dog," but it also learns the concept of what a dog is, as well as what a dog looks and sounds like. In theory, it should be just as capable of recognizing a photo of a dog or identifying a woof in an audio clip as it is at processing the word "dog."
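To make the "everything is trained at the same time" idea a little more concrete, here's a toy sketch in PyTorch of how different modalities can be projected into one shared sequence of embeddings and processed by a single transformer. It's a simplified illustration of the general approach, not how Gemini, GPT-4o, or any specific LMM is built; every dimension, layer count, and input shape below is made up.

```python
# A toy sketch of "natively multimodal" processing: each modality is projected
# into a shared embedding space and handled by one transformer. Real LMMs are
# vastly larger and more sophisticated; all numbers here are illustrative.
import torch
import torch.nn as nn

d_model = 512

# Text: token IDs -> embeddings (vocabulary size is a placeholder).
text_embed = nn.Embedding(num_embeddings=32_000, embedding_dim=d_model)

# Images: flattened 16x16x3 pixel patches -> the same embedding dimension.
patch_embed = nn.Linear(16 * 16 * 3, d_model)

# Audio: spectrogram frames (80 mel bins here) -> the same embedding dimension.
audio_embed = nn.Linear(80, d_model)

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

# Dummy inputs: a short text sequence, a few image patches, a few audio frames.
tokens = torch.randint(0, 32_000, (1, 10))   # (batch, text tokens)
patches = torch.rand(1, 4, 16 * 16 * 3)      # (batch, patches, pixel values)
frames = torch.rand(1, 6, 80)                # (batch, frames, mel bins)

# Everything becomes "tokens" in one shared sequence, so the model can learn
# relationships across modalities in a single pass.
sequence = torch.cat(
    [text_embed(tokens), patch_embed(patches), audio_embed(frames)], dim=1
)
output = transformer(sequence)               # shape: (1, 20, 512)
```

Because the word "dog," the image patches of a dog, and the audio of a bark all end up as vectors in the same sequence, the model can learn associations between them during training, rather than treating each modality as a bolt-on.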

Of course, this pre-training is just the first step in creating a functional AI model. It's likely to have incorporated some pretty unhealthy stereotypes and toxic ideas—mainlining the entire internet isn't good for human brains, let alone artificial networks based on them. To get a large multimodal model that behaves as expected and, importantly, is actually useful, the results are fine-tuned using techniques like reinforcement learning from human feedback (RLHF), supervisory AI models, and "red teaming" (deliberately trying to break the model).

Once all that is done, you should have a working large multimodal model that's similar to a large language model, but capable of handling other modalities, too. 

What can a large multimodal model do?

[Screenshot: ChatGPT describing a picture of a Pekingese dog uploaded by the user]

If you want to see a pretty phenomenal demo of where LMMs are headed, check out OpenAI's GPT-4o demo. Google also shared an early vision of what future multimodal AI agents could be like with its Project Astra demo.

To summarize, with some of the LMMs available now, you can do things like:

  • Upload an image and get a description of what's going on, as well as use it as part of a prompt to generate text or images.

  • Upload an image and ask questions about it, as well as follow-up questions about specific elements of the image (see the code sketch after this list for what this looks like through an API).

  • Translate the text in an image of, say, a menu, to a different language, and then use it as part of a text prompt. 

  • Upload charts and graphs and ask complicated follow-up questions about what they show.

  • Upload a design mockup and get the HTML and CSS code necessary to create it. 

  • Voice chat with the AI in a vaguely natural cadence.
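
As one concrete example of the image items above, here's roughly what "upload an image and ask questions about it" looks like through OpenAI's API with GPT-4o. It's a minimal sketch; the image URL and question are placeholders.

```python
# A minimal sketch of asking GPT-4o a question about an image via the
# Chat Completions API. Assumes the OpenAI Python SDK and an OPENAI_API_KEY
# in the environment; the image URL and question are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What breed of dog is this, and what is it doing?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/pekingese.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

# Follow-up questions just continue the same conversation, mixing text and
# images in the message history as needed.
```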

And as LMMs get more widely available, what they're capable of will likely expand. A multimodal medical chatbot, for example, would be able to better diagnose different rashes and skin discolorations.

Multimodal AI models available now

The easiest way to get a feel for a multimodal model is to use GPT-4o in ChatGPT. Claude and Gemini are also worth trying, just with fewer flashy science-fiction-vibe features.

Either way, over the next year or two, we're likely to see significantly more multimodal AI tools capable of working with text, images, video footage, audio, code, and other modalities we probably haven't even considered.

Automate your multimodal AI models

In addition to interacting with some multimodal models via chatbot right now, you can also use them in your everyday workflows. With Zapier's Google Vertex AI and Google AI Studio integrations, you can access Gemini from all the apps you use at work. And with the ChatGPT integration, you can access GPT-4o and send its outputs wherever you need them.

Learn more about how to automate Gemini and how to automate ChatGPT, or get started with one of these pre-built workflows.

  • Send prompts to Google Vertex AI from Google Sheets and save the responses (Google Vertex AI): When you add a new prompt to Google Sheets, this automated workflow will send the prompt to Google Vertex AI, then save the response in the same Google Sheet.

  • Create email copy with Google Vertex AI from new Microsoft Outlook emails and save as drafts (Microsoft Outlook + Google Vertex AI): Whenever you receive a new Microsoft Outlook email, Google Vertex AI will write a response and save it automatically as a draft, so you can send prospects and customers better-worded, faster email responses, powered by AI.

  • Generate draft responses to new Gmail emails with Google AI Studio (Gemini) (Gmail + Google AI Studio (Gemini)): This template automates Gmail replies by extracting the email body, summarizing it with AI, and generating a draft email, ready to be reviewed and sent.

  • Promptly reply to Facebook messages with custom responses using Google AI Studio (Gemini) (Facebook Messenger + Google AI Studio (Gemini)): Use this Zap to instantly respond to Facebook messages with custom replies, creating a better customer experience.

  • Create email copy with ChatGPT from new Gmail emails and save as drafts in Gmail (Gmail + ChatGPT): Whenever you receive a new customer email, ChatGPT will write a response and save it automatically as a Gmail draft.

  • Start a conversation with ChatGPT when a prompt is posted in a particular Slack channel (Slack + ChatGPT): Use this Zap to create a reply bot that lets you have a conversation with ChatGPT right inside Slack when a prompt is posted in a particular channel, so your team can ask questions and get responses without having to leave Slack.
              Zapier is the leader in workflow automation—integrating with thousands of apps from partners like Google, Salesforce, and Microsoft. Use interfaces, data tables, and logic to build secure, automated systems for your business-critical workflows across your organization's technology stack. Learn more.


              This article was originally published in March 2024. The most recent update was in July 2024 with contributions from Jessica Lau.
