Building Around “Nano-Banana”: Real-World Engineering with Gemini 2.5 Flash Image Preview

TL;DR: Wiring Google’s Gemini 2.5 Flash Image Preview (aka “Nano‑Banana”) into a real product meant splitting my agent runtime in two, building a bridge to the Vercel AI SDK stream protocol, and ditching base64‑in‑SSE for Cloudflare R2 presigned URLs. If you’re shipping real image gen/edit, these are the dragons you’ll meet.

(“Nano‑Banana” is Google’s internal codename for the native image generation/editing capability in Gemini 2.5 Flash.)

Why “Nano‑Banana” felt different

It’s truly multimodal. The model emits text and image bytes, and it expects files as proper parts on input. That pushed me to think in two lanes—chat semantics for text, and binary media paths for images. Your plumbing has to respect both worlds.
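To make the two lanes concrete, here's a minimal sketch using the google-genai Python SDK (the model name is the one from this post; exact attribute names may differ slightly between SDK versions):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY / GOOGLE_API_KEY from the environment

# Input lane: a text instruction plus an image attached as a proper binary part.
with open("product-shot.png", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=["Replace the background with a soft gradient", image_part],
)

# Output lane: a mix of text parts and inline image bytes in the same candidate.
for part in response.candidates[0].content.parts:
    if part.text:
        print("text:", part.text)
    elif part.inline_data:  # raw bytes + media type, not a URL
        with open("edited.png", "wb") as out:
            out.write(part.inline_data.data)
```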

The architecture shift I had to make

I started with agents on Pydantic AI (Python), streaming to a TypeScript UI via the Vercel AI SDK. To add Nano‑Banana without duct tape, I split my agent base class in two:

  • BasePydanticAIAgent for text/tooling agents
  • BaseGoogleGeminiAgent for Google’s GenAI SDK, auth, and streaming quirks

That separation gave me clean seams for logging/config—and, most importantly, a home for message conversion.
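In sketch form, the split looks roughly like this (class and attribute names are illustrative, not a drop-in):

```python
class BaseAgent:
    """Shared seams: config, logging, and the message-conversion hook."""

    def __init__(self, name: str, accepts_files: bool = False):
        self.name = name
        self.accepts_files = accepts_files


class BasePydanticAIAgent(BaseAgent):
    """Text/tooling agents built on Pydantic AI; streaming stays in chat-land."""


class BaseGoogleGeminiAgent(BaseAgent):
    """Agents that talk to the Google GenAI SDK directly: auth, streaming
    quirks, and image-part handling live here instead of leaking upward."""
```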

The biggest gotcha: the streams don’t speak the same language

Google streams candidates/parts (text chunks and inline image bytes). The Vercel AI SDK expects Server‑Sent Events in its own data stream protocol so the frontend can assemble a UIMessage with parts. Those universes do not align by default.

My fix was a small bridge that consumes Google’s stream and emits AI SDK‑friendly events with stable part ids. Text goes out as text-start / text-delta / text-end events, and I used AI SDK Data Parts for everything else. For images, I defined a custom data part—data-generated-image-url—that carries { url, mediaType } so the UI can render immediately. If you haven’t seen this pattern, it’s straight from Vercel’s docs on streaming custom data and reconciliation: Vercel AI SDK: Streaming Custom Data.
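In sketch form, the bridge is just a generator that maps Google parts onto SSE events. The event shapes follow my reading of the AI SDK data stream protocol; double-check field names against the docs for your SDK version, and `upload_image` is a placeholder for whatever storage helper you use:

```python
import json
import uuid


def sse(event: dict) -> str:
    # One AI SDK stream chunk per SSE "data:" line.
    return f"data: {json.dumps(event)}\n\n"


def bridge(google_stream, upload_image):
    """Map Google's candidates/parts onto AI SDK-style events.

    `google_stream` is an iterator of generate_content_stream chunks;
    `upload_image(data, media_type) -> url` pushes bytes to storage.
    """
    text_id = None
    for chunk in google_stream:
        candidate = chunk.candidates[0]
        if not candidate.content or not candidate.content.parts:
            continue
        for part in candidate.content.parts:
            if part.text:
                if text_id is None:
                    text_id = str(uuid.uuid4())
                    yield sse({"type": "text-start", "id": text_id})
                yield sse({"type": "text-delta", "id": text_id, "delta": part.text})
            elif part.inline_data:
                url = upload_image(part.inline_data.data, part.inline_data.mime_type)
                yield sse({
                    "type": "data-generated-image-url",
                    "id": str(uuid.uuid4()),
                    "data": {"url": url, "mediaType": part.inline_data.mime_type},
                })
    if text_id is not None:
        yield sse({"type": "text-end", "id": text_id})
```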

Don’t stream megabytes of base64 over SSE

I tried it. It worked—until it didn’t. Browsers choked on huge event buffers and JSON decoding, mobile got sluggish, and any hiccup meant re‑streaming a mountain of base64.

The pragmatic fix: upload generated bytes to Cloudflare R2 and stream a short‑lived presigned URL as a data part. The UI just does <img src={url}>, and life is good. Egress is cheap (free!), retries are cheap, and you can expire aggressively.
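The upload-and-presign step is plain S3-compatible boto3, since R2 speaks the S3 API. Endpoint, bucket, and credentials below are placeholders:

```python
import uuid

import boto3
from botocore.config import Config

# R2 is S3-compatible, so plain boto3 works against its endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id="<r2-access-key-id>",
    aws_secret_access_key="<r2-secret-access-key>",
    config=Config(signature_version="s3v4"),
)


def upload_image(data: bytes, media_type: str) -> str:
    key = f"generated/{uuid.uuid4()}.png"
    s3.put_object(Bucket="chat-images", Key=key, Body=data, ContentType=media_type)
    # Short-lived presigned URL (~5 minutes) that gets streamed to the UI as a data part.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "chat-images", "Key": key},
        ExpiresIn=300,
    )
```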

Frontend tweaks that actually mattered

Because not every agent should accept files, I added an accepts_files flag to agent metadata and gated the paperclip UI on it. For rendering:

  • Prefer URL parts for generated images (data-generated-image-url).
  • Show a “might have expired” state (my presigned URLs last ~5 minutes).
  • Keep a legacy base64 path during migration (useful for debugging).

On the stream side, this maps 1:1 to the AI SDK’s data stream semantics, so useChat assembles messages whose parts include text and my custom data parts, and it reconciles updates by id.
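The capability flag itself is nothing fancy: just metadata the frontend reads before it renders the attachment button. Something like this (names are illustrative):

```python
from pydantic import BaseModel


class AgentMetadata(BaseModel):
    """Illustrative shape of the metadata the frontend reads to gate the paperclip UI."""

    name: str
    accepts_files: bool = False


# The image agent opts in; plain text/tooling agents stay file-free by default.
image_agent = AgentMetadata(name="image-editor", accepts_files=True)
```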

Lessons I’m taking forward

  • Separate runtimes early: Text/tooling and multimodal image work deserve different bases. It reduces special‑case logic.
  • Message conversion is the linchpin: Own the bridge from provider stream → UI protocol. Add metrics and tests around it.
  • URLs > base64 for big media: Push bytes to storage; stream tiny metadata. Your SSE stays fast and resilient.
  • Design for expiration: Assume URLs expire. ChatGPT does exactly this: refresh a conversation even minutes later and any images or files it generated are no longer available.
  • Be explicit about capabilities: Flags like accepts_files keep your UI honest and users unconfused.

What I’d do differently next time

Start with S3‑compatible storage (like R2) on day one. Treat Google and Pydantic agents as first‑class but separate. And build the message converter as a reusable module with reconciliation from the start.

This was a great technical challenge, and I learned a lot. If you’re building something similar and want to trade notes, I’m happy to chat. For deeper context on the Pydantic AI ↔ Vercel AI SDK bridge, I shared my playbook here: Pydantic AI + Vercel AI SDK tech stack.

Stay in touch

Want to Chat About AI Engineering?

I hold monthly office hours to discuss your AI Product, MCP Servers, Web Dev, systematically improving your app with Evals, or whatever strikes your fancy. The times are odd because they fall on weekends or before/after my day job, but I offer this as a free community service. I may create anonymized content from our conversations, as they often make interesting blog posts for others to learn from.

Book Office Hours