Multimodal AI Apps (Vision + Text) with GPT-4o / Claude-3.5 in Python – 2026

Combine vision and text in one model: describe images, answer visual questions, understand charts & documents, analyze screenshots — all in Python using OpenAI GPT-4o and Anthropic Claude-3.5.

Why Multimodal AI in 2026?

  • GPT-4o & Claude-3.5 Sonnet are among the most capable and widely used vision+text models
  • Use cases exploding: document AI, chart analysis, screenshot debugging, visual search, accessibility tools
  • Low cost: GPT-4o vision ~$0.005–$0.015 per image, Claude-3.5 even cheaper
  • Fast: 1–3 seconds per request
  • Python-first: OpenAI & Anthropic SDKs are excellent

Prerequisites
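
  • Python 3.9+ with the official SDKs installed: pip install openai anthropic (plus fastapi and uvicorn for section 3)
  • API keys exported as environment variables: OPENAI_API_KEY and ANTHROPIC_API_KEY
  • Example files to analyze (a chart image, a PDF invoice, a screenshot)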

1. GPT-4o – Image Description & VQA


from openai import OpenAI
import base64
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Example image
image_path = "chart-example.jpg"
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and explain the key insights."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                }
            ]
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
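
Visual question answering uses the exact same request shape: only the text prompt changes. A short variation on the call above (a sketch; the question and the optional detail setting are illustrative, and base64_image is the chart encoded earlier):

# Ask a targeted question about the same image (VQA)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which month shows the highest value in this chart, and by how much?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                        "detail": "low"  # "low" | "high" | "auto"; lower detail uses fewer tokens
                    }
                }
            ]
        }
    ],
    max_tokens=200
)

print(response.choices[0].message.content)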

2. Claude-3.5 – Document & Screenshot Analysis


from anthropic import Anthropic
import base64
import os

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

with open("invoice.pdf", "rb") as f:
    pdf_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": "Extract key information: invoice number, date, total amount, vendor name."
                }
            ]
        }
    ]
)

print(message.content[0].text)
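
The same Messages API handles screenshots and other images; swap the document block for an image block. A minimal sketch, assuming a local PNG named app-screenshot.png:

with open("app-screenshot.png", "rb") as f:
    screenshot_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_data
                    }
                },
                {
                    "type": "text",
                    "text": "This screenshot shows an error in a web app. What is likely wrong and how would you debug it?"
                }
            ]
        }
    ]
)

print(message.content[0].text)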

3. Build Multimodal API with FastAPI


from fastapi import FastAPI, UploadFile, File
from openai import OpenAI
import base64

app = FastAPI()
client = OpenAI()

@app.post("/describe-image")
async def describe_image(file: UploadFile = File(...)):
    image_data = await file.read()
    base64_image = base64.b64encode(image_data).decode("utf-8")
    # Use the uploaded file's MIME type in the data URL; fall back to JPEG if none was sent
    media_type = file.content_type or "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ]
    )

    return {"description": response.choices[0].message.content}

4. Deployment & Best Practices 2026

  • Use base64 for image input (no public URLs needed)
  • Rate limiting & caching with Redis for cost control (see the caching sketch below)
  • Privacy: Never log images or user data
  • FastAPI + Uvicorn/Gunicorn + Nginx
  • Docker + Railway/Fly.io/Render for free-tier or low-cost hosting
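
Caching is the simplest cost lever: identical images produce essentially the same description, so responses can be keyed by a hash of the image bytes. A minimal sketch with redis-py, assuming a local Redis instance; the key prefix and TTL are illustrative:

import base64
import hashlib

import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 60 * 60 * 24  # keep cached descriptions for 24 hours

def describe_image_cached(image_data: bytes) -> str:
    # Key the cache on the image bytes themselves, not a filename
    key = "vision:describe:" + hashlib.sha256(image_data).hexdigest()

    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")

    base64_image = base64.b64encode(image_data).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }
        ]
    )

    description = response.choices[0].message.content
    cache.setex(key, CACHE_TTL, description)
    return description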

Ready to build multimodal AI?
