Using Multimodal Prompting to Enhance AI Interaction and Collaboration

Explore how AI assistants handle diverse inputs—text, images, data, audio, and video—using multimodal prompting to connect, interpret, and generate across formats through cross-modal understanding, context bridging, and iterative workflows.

AI assistants such as Gemini and ChatGPT can process a wide range of input types beyond text, including images, audio, data, and video. Leveraging these multimodal inputs allows for more integrated, context-aware interactions and enables advanced techniques like cross-modal understanding and iterative workflows.

Key Insights

  • AI assistants now support diverse input types—text, images, audio, video, and structured data—enabling users to interact in more flexible and natural ways.
  • Multimodal prompting techniques like cross-modal understanding, context bridging, and cross-modal generation allow models to interpret and connect information across different formats.
  • Iterative multimodal workflows enable users to refine ideas through a sequence of steps that involve switching between modalities, promoting a more collaborative interaction with AI systems.

This lesson is a preview from our AI Prompt Engineering for the Government Workforce Course. Enroll in a course for detailed lessons, live instructor support, and project-based training.

So there are AI models out there that are specially trained on things like audio, video, or image creation, but your everyday AI assistants, the Geminis, ChatGPTs, or NIPRGPTs you might be dealing with on a daily basis, can also take many different types of input and give you multiple types of output. Certainly, they can take text.

This is what we've done a little bit already in the demo, right? You can ask it questions, upload instructions, stories, documents, or code, and have it summarize them or turn them into something new. That's the typical default place to start with your input. Before we get into more sophisticated prompt engineering with multimodal prompting, I want to first highlight all of the different input types that are available, because it may save you hours, days, or maybe even weeks to notice that, oh look, there's a little plus sign down there in the prompt box.
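
If you ever move from the chat window to code, here is a minimal sketch of that same text-only interaction, using the OpenAI Python SDK as one example. The model name, prompt, and environment setup are assumptions, and Gemini's SDK has an equivalent call.

```python
# pip install openai -- a minimal text-only sketch; the chat UI does the
# equivalent of this behind the scenes. The model name is an assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": "Summarize this report in three bullets: <paste text>"},
    ],
)
print(response.choices[0].message.content)
```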

That plus sign lets you upload many different things. Your input could also be an image, so you don't have to start with text.

You could hit that little plus sign and upload a diagram, a photo, a screenshot, a chart, or a mockup of some kind, and ask it to describe the image. Maybe it's a photo you've taken and you want a caption for it, but you've hit a total mental block.

You could upload it into your AI assistant and have it write a caption for you. Or maybe you upload a chart and say, I want some fresh eyes on this, as it were: what does this chart convey to you? And maybe it answers, this makes it look like ABC.

Well, if ABC isn't the intention of your chart, that's good input to take back to your chart creation, right? So you can upload images and ask the assistant to do different things with them: find trends, suggest improvements, things like that.
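
As a rough sketch of that same "fresh eyes on this chart" request through an API (again using the OpenAI Python SDK; the file name and model are hypothetical), you attach the image alongside your text prompt:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Base64-encode a local chart image so it can travel inside the prompt.
with open("quarterly_chart.png", "rb") as f:  # hypothetical file
    chart_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Fresh eyes, please: what does this chart convey? "
                     "Flag anything misleading and suggest improvements."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```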

Certainly, you can also upload data. You're not relegated to text, where you retype something like "my report says 43% of..." Why not upload the entire table, or the file itself: a spreadsheet, a JSON file, a CSV, something like that? You don't have to write a text description or take a screenshot of a data table. You can upload the whole file, all of the data, directly into your AI assistant and then give it your prompt: analyze this by region, or create a summary for executives, that sort of thing.
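
In code, the simplest version of this is to read the raw file and hand its contents to the model along with your prompt. A hedged sketch follows; the file name and model are placeholders, and very large files would need the provider's file-upload features instead.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Hand the model the raw CSV text instead of retyping or screenshotting it.
csv_text = Path("sales_by_region.csv").read_text()  # hypothetical file

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Analyze this data by region and write a one-paragraph "
                   "summary for executives:\n\n" + csv_text,
    }],
)
print(response.choices[0].message.content)
```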

You're also able to upload audio. Maybe you get sick of typing, or, if you're like me, you like to pace around the office a little. With most AI assistants, there's a little microphone icon where you can speak your input.

Or maybe you have a recording, say from a meeting, or the audio track from a video, and you can upload that MP4 or audio file and say: create a transcript for me, or give this a listen and tell me how they sound in this particular clip, or what else you're hearing. As I note here on the slide, this is an emerging capability. We'll certainly be keeping our eyes on it, but AI models are getting better all the time.
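
The "create a transcript for me" step has a direct API equivalent; here is a sketch using OpenAI's hosted Whisper speech-to-text model, with a hypothetical recording file name.

```python
from openai import OpenAI

client = OpenAI()

# Speech-to-text: turn a meeting recording into a transcript.
with open("staff_meeting.mp3", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
# The transcript can then go back in as an ordinary text prompt,
# e.g. "Summarize this meeting and list the action items."
```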

Video is similar: you can upload a video and ask the same kinds of things you might have asked about your audio. Summarize the key moments of this video. Tell me what's happening. Is this a happy scene, or is something happening in the background that I may have missed? Give me a highlight reel of this really long video.
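
Because direct video input is still uneven across assistants, one common workaround in code is to sample frames and send them as a sequence of images. A sketch follows; the clip name, sampling interval, and frame cap are all assumptions.

```python
# pip install opencv-python openai
import base64
import cv2
from openai import OpenAI

client = OpenAI()

# Grab roughly one frame every five seconds from a local clip.
cap = cv2.VideoCapture("briefing.mp4")  # hypothetical clip
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
step = int(fps * 5)
frames, index = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        _, buf = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    index += 1
cap.release()

# Send the frames, in order, with one prompt covering the whole clip.
content = [{"type": "text",
            "text": "These are frames from a single video, in order. "
                    "Summarize the key moments."}]
content += [{"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b}"}}
            for b in frames[:20]]  # cap how many frames get sent

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```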

It'll be very interesting to see how the industry continues to embrace these input types, because this is where it begins to get really exciting. Once you start to experiment with different input types for your AI assistant, you'll realize there's a real mastery in combining modalities, even within a single prompt, to get richer, more context-aware responses. You could give it your text prompt like you normally would, but also add: here's a graphic that supports what I'm talking about.

And as we've seen in some earlier exercises, maybe it's that extra bit of context that makes all the difference in the response, right? Multimodal prompting can unlock better context, more intuitive communication, and cross-format reasoning, as we'll see in just a moment. There's a lot more you can do with your AI assistant; it doesn't have to be just words.

You can use these different inputs to give the model a truly integrated understanding of the information you're trying to convey. So now that we know the types of media a model can handle (text, images, data, audio, and increasingly video), let's look at what we can do with them. A fairly advanced prompt engineering technique here is something called multimodal prompting.

It's not just adding more media or attaching more files. What I mean by multimodal prompting is getting the model to connect information across formats and reason in a more integrated way. The first technique I'll call out is cross-modal understanding.

This is where we ask a model to align meaning between modalities: for example, here's a paragraph and a chart, or an image and its description. It's about comprehension across forms, not just within one; having a conversation that references both a document and a diagram. That would be cross-modal understanding with your prompts.
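
Here is a sketch of that document-plus-diagram conversation in code, with one prompt asking the model to check whether the two modalities agree. The paragraph text, file name, and model are all hypothetical.

```python
import base64
from openai import OpenAI

client = OpenAI()

with open("findings_chart.png", "rb") as f:  # hypothetical chart
    chart_b64 = base64.b64encode(f.read()).decode("utf-8")

paragraph = ("Satisfaction rose steadily across all four quarters, "
             "with the largest gain in Q3.")  # hypothetical report text

# Ask the model to align meaning across the two modalities.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Does this paragraph accurately describe the attached "
                     f"chart? Note any mismatches.\n\nParagraph: {paragraph}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```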

Another technique is context bridging, where one modality gives context for interpreting another. Let's say you give the model a photo of a product and a customer review, and then ask your AI assistant: now that you've seen the product and know what people think about it, create ad copy that fits both the photo and the review.

It can learn to merge tone, visual mood, and textual intent into one coherent response.
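
The product-photo-plus-review example as a sketch; the photo file and review text are made up, and the prompt explicitly tells the model which modality supplies which context.

```python
import base64
from openai import OpenAI

client = OpenAI()

with open("product_photo.jpg", "rb") as f:  # hypothetical product shot
    photo_b64 = base64.b64encode(f.read()).decode("utf-8")

review = ("Sturdy and lighter than it looks; I take it everywhere. "
          "Wish it came in more colors.")  # hypothetical customer review

# One modality (the photo) sets the visual mood; the other (the review)
# sets the tone. The prompt bridges them.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Using the photo for visual mood and the review below "
                     f"for tone, write two lines of ad copy.\n\n"
                     f"Review: {review}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{photo_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```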

There's also another aspect called cross-modal generation, and this is where things start to get a little creative: you're moving between modalities. You can ask it to turn text into an image, say; or here's a table, turn it into a chart; or here's a photo, describe it for me. So the model isn't just understanding the media you've input; it's generating a new form based on whatever you asked it to output.
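
Text-to-image is the clearest example of generating a new modality. Here is a sketch using OpenAI's image endpoint; the model name and prompt are assumptions.

```python
from openai import OpenAI

client = OpenAI()

# Text in, image out: the model produces a new modality from the prompt.
result = client.images.generate(
    model="dall-e-3",  # image-generation model; name is an assumption
    prompt="A flat-design illustration of four regional offices "
           "connected by flowing data streams, muted blues",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # link to the generated image
```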

The last category of multimodal prompting I wanna talk about is, I think, especially exciting: iterative multimodal workflows. This is about chaining steps together: sketch something, then describe it, then improve it, then visualize or sketch it again.

It's a loop that lets you refine ideas interactively, switching modalities along the way. And this is where multimodal prompting starts to look like collaborating with your AI, not just querying. So we're gonna see this in a demo in a moment, but overall, multimodal prompting is less about inputs and more about the relationships between them.

It's what lets models combine different ways of perceiving and expressing, which moves us toward more human-like interaction with our AI assistants and away from simple call-and-response, let's say. I wanna put this into action for us, especially the iterative multimodal workflow, because seeing it in action will really tie all of those multimodal input types and context techniques into practice. So I'm gonna share my screen out here in a moment, and I'll demonstrate iterative multimodal workflows.
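
If you'd like to see the shape of that loop in code before the demo, here is a hedged sketch that chains generate, critique, and regenerate for two rounds. Every prompt, the model names, and the round count are assumptions, not a prescribed workflow.

```python
from openai import OpenAI

client = OpenAI()

prompt = "A rough concept sketch of a one-page public-services dashboard"

# Sketch -> critique -> improve -> sketch again: two passes of the loop.
for round_number in range(2):
    # 1. Visualize the current idea.
    image_url = client.images.generate(
        model="dall-e-3", prompt=prompt, size="1024x1024", n=1
    ).data[0].url

    # 2. Have the model critique the result and rewrite the prompt.
    improved = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Critique this image against the goal below, then "
                         f"reply with ONLY an improved image prompt.\n\n"
                         f"Goal: {prompt}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    ).choices[0].message.content

    print(f"Round {round_number + 1}: {image_url}")
    prompt = improved  # 3. Feed the refined prompt back into the loop
```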

Brian Simms

Brian Simms teaches for Graduate School USA in the area of Artificial Intelligence, helping federal agencies build the knowledge and skills needed to adopt AI responsibly and effectively. An AI educator and author, he focuses on practical, mission-driven applications of AI for government leaders, program managers, and technical professionals.
