
Project · Agent tooling

Reusable Vision & Image Skill for Claude Code

Claude Code can read a screenshot and even rough out an image in code — but both are limited, and Gemini's vision and image generation go well beyond them. They just aren't at hand in the terminal, so each project that wants them re-wires the API from scratch. I built one skill that puts Gemini's see and draw behind a single command — a gateway any agent, skill, or workflow can build on, without touching the API again.

Claude Code skill · Gemini API · Python CLI · Agent tooling
[Image: a LEGO Claude minifigure with Gemini-sparkle eyes, holding a paint palette and brush]

Claude, with Gemini's eyes and a paintbrush — one skill to see and to draw.

2 · See & draw, on tap
1 · Gateway, one API key
Reused · Across every project

The gap

Gemini's eyes and hands — not yet in the terminal.

Claude Code can read an image and even rough one out in code — but only so far. It'll tell you a screenshot shows an error; it won't reliably lift every line of small text from it or judge whether a generated product shot matches the real garment down to the stitching, and its own image output stops at SVG and diagrams. Gemini's vision is far sharper at exactly that, and it renders photographic images Claude can't. Neither is a command away in the terminal, though — and calling Gemini ad hoc means every project re-implements the same client, copies the API key into one more place, and hardcodes model names that quietly break when the provider renames them. The capability isn't hard; owning it cleanly across a dozen projects — one place every agent, skill, and workflow can reach — is.

The architecture

One gateway, in front of everything.

Every agent and skill calls the same small CLI instead of wiring up Gemini on its own. Behind it sits one client and one key — so vision and image generation are a command away, and the messy parts (auth, model names, errors) live in exactly one place.

Caller: any Claude agent or any workspace skill (one shell command)
Gemini CLI: gemini_cli.py (see · draw · models)
google-genai: one client · one key · model constants
Gemini API: vision + image models
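
The layering above could be sketched as a single argparse entry point with three subcommands. This is a minimal illustration, not the skill's actual code: the flags `--prompt`, `--aspect`, `--out`, and `--filter` are taken from the command examples on this page, while `--model`, `--ref`, the default aspect ratio, and the function name `build_parser` are assumptions made for the sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI: one program, three subcommands (see, draw, models)."""
    parser = argparse.ArgumentParser(
        prog="gemini", description="Gateway to Gemini vision and image generation."
    )
    sub = parser.add_subparsers(dest="command", required=True)

    see = sub.add_parser("see", help="answer a question about one or more images")
    see.add_argument("images", nargs="+", help="image files to look at")
    see.add_argument("--prompt", required=True, help="the question to answer")
    see.add_argument("--model", help="hypothetical per-call model override")

    draw = sub.add_parser("draw", help="generate an image from a prompt")
    draw.add_argument("prompt", help="what to draw")
    # Hypothetical flag: repeat --ref once per reference image.
    draw.add_argument("--ref", action="append", default=[], help="reference image")
    draw.add_argument("--aspect", default="1:1", help="aspect ratio, e.g. 1:1")
    draw.add_argument("--out", required=True, help="where to write the image")
    draw.add_argument("--model", help="hypothetical per-call model override")

    models = sub.add_parser("models", help="list available models")
    models.add_argument("--filter", choices=["see", "draw"], help="capability to list")
    return parser
```

Keeping the parser in its own function means an agent-facing wrapper, a test, or a second entry point can reuse the same argument contract without re-declaring it.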
What it does

See, draw, and a way to stay current.

Each is one subcommand with a clean input and output. The commands below are generic illustrations — the skill's own prompts stay inside it.

see · Vision

Hand it one or more images and a question; it answers in plain text. Read a screenshot, pull text out of a picture, or check a generated image against the brief.

Input: images + a question. Output: a plain-text answer.
gemini see shot.png --prompt "What error is shown, and which field is highlighted?"
Model: gemini-2.5-flash
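
A rough sketch of what a `see` call might do underneath, assuming the google-genai Python SDK. The helper names `load_images` and `see` are hypothetical, and the real skill's prompt handling is not shown here. The SDK import is deferred so the file-loading helper can be exercised without the SDK installed.

```python
import mimetypes
from pathlib import Path


def load_images(paths):
    """Read each image file into (bytes, mime_type), guessing the type from the extension."""
    loaded = []
    for path in map(Path, paths):
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None or not mime.startswith("image/"):
            raise ValueError(f"not a recognised image file: {path}")
        loaded.append((path.read_bytes(), mime))
    return loaded


def see(paths, prompt, model="gemini-2.5-flash"):
    """Send the images plus the question to Gemini and return the plain-text answer."""
    # Imported lazily so load_images() works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment
    parts = [
        types.Part.from_bytes(data=data, mime_type=mime)
        for data, mime in load_images(paths)
    ]
    response = client.models.generate_content(model=model, contents=[*parts, prompt])
    return response.text
```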
draw · Image generation

Turn a prompt — plus up to fourteen reference images and a chosen aspect ratio — into a PNG on disk. Generate from scratch, or edit and recombine the references.

Input: prompt + up to 14 refs. Output: a PNG on disk.
gemini draw "a minimalist ink-on-limestone lens icon" --aspect 1:1 --out icon.png
Model: gemini-3.1-flash-image · pro option
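
One way a `draw` call could be structured, again assuming the google-genai SDK. The helper names are hypothetical, the reference images are assumed to be PNG files, and the `image_config` usage reflects my understanding of the current SDK rather than anything confirmed by this project. The two validation helpers are plain Python and run without the SDK.

```python
import re
from pathlib import Path

MAX_REFS = 14  # the reference-image limit stated above


def check_draw_request(aspect, refs):
    """Reject malformed requests before spending an API call."""
    if not re.fullmatch(r"[1-9]\d*:[1-9]\d*", aspect):
        raise ValueError(f"aspect must look like W:H, got {aspect!r}")
    if len(refs) > MAX_REFS:
        raise ValueError(f"at most {MAX_REFS} reference images, got {len(refs)}")


def first_image_bytes(parts):
    """Return the bytes of the first response part that carries inline image data."""
    for part in parts:
        inline = getattr(part, "inline_data", None)
        if inline is not None and inline.data:
            return inline.data
    raise RuntimeError("the response contained no image")


def draw(prompt, out, aspect="1:1", refs=(), model="gemini-3.1-flash-image"):
    """Generate an image from a prompt (plus optional references) and save it."""
    check_draw_request(aspect, refs)
    from google import genai  # lazy import: only needed for the real call
    from google.genai import types

    client = genai.Client()
    # Assumption: every reference image is a PNG file.
    ref_parts = [
        types.Part.from_bytes(data=Path(r).read_bytes(), mime_type="image/png")
        for r in refs
    ]
    response = client.models.generate_content(
        model=model,
        contents=[prompt, *ref_parts],
        config=types.GenerateContentConfig(
            response_modalities=["TEXT", "IMAGE"],
            image_config=types.ImageConfig(aspect_ratio=aspect),
        ),
    )
    # Bytes are written as returned; no format conversion is attempted here.
    Path(out).write_bytes(first_image_bytes(response.candidates[0].content.parts))
    return out
```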
models · Discovery

List what the API actually offers, grouped by capability, so model names are looked up at runtime instead of guessed. When the provider renames one, the listing shows the new name, and the fix is a one-line constant change rather than a breakage discovered downstream.

Input: a capability filter. Output: models that can do it.
gemini models --filter draw
Filters: draw · see
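
A sketch of how the grouping might work. The rule used here (a model that supports `generateContent` and has "image" in its name counts as `draw`, anything else that supports `generateContent` counts as `see`) is an assumed heuristic for illustration, not the skill's actual logic. The function names are hypothetical, and `supported_actions` is the field I believe the google-genai SDK exposes on listed models.

```python
def group_by_capability(models):
    """Sort (name, supported_actions) pairs into 'see' and 'draw' buckets."""
    groups = {"see": [], "draw": []}
    for name, actions in models:
        if "generateContent" not in actions:
            continue  # skip models that cannot answer prompts, e.g. embedding-only
        short = name.removeprefix("models/")
        # Assumed heuristic: image-generation models carry "image" in their name.
        groups["draw" if "image" in short else "see"].append(short)
    return groups


def list_models(capability=None):
    """Ask the API what exists right now, optionally narrowed to one capability."""
    from google import genai  # lazy import: only needed for the real call

    client = genai.Client()
    found = [(m.name, m.supported_actions or []) for m in client.models.list()]
    groups = group_by_capability(found)
    return groups if capability is None else {capability: groups[capability]}
```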
The components

What makes it a gateway, not just a script.

The two commands are the easy part. These are the pieces that make it safe for any agent to shell out to, again and again, without surprises.

One client, one key

A single google-genai client, with the API key resolved from the environment or a git-ignored file — never copied into each project.
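
Key resolution along these lines could be as small as the sketch below. `GEMINI_API_KEY` is the environment variable the google-genai SDK reads. The `.env` file name, its `KEY=value` format, and the function name `resolve_api_key` are assumptions for illustration.

```python
import os
from pathlib import Path


def resolve_api_key(env=os.environ, key_file=Path(".env")):
    """Return the API key from the environment, else from a git-ignored KEY=value file."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if key:
        return key
    if key_file.is_file():
        for line in key_file.read_text().splitlines():
            name, sep, value = line.partition("=")
            if sep and name.strip() == "GEMINI_API_KEY" and value.strip():
                return value.strip().strip("\"'")  # tolerate quoted values
    raise SystemExit("GEMINI_API_KEY is not set and no key file was found")
```

Checking the environment first lets a CI job or a one-off shell session override the file without editing it.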

Model drift, absorbed

Model names live as editable constants and can be overridden per call, so a provider rename is a one-line change instead of a code hunt.
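
In code, that idea might look like the sketch below. The default model names are the ones shown in the command examples on this page. The dictionary layout and the function name `resolve_model` are assumptions.

```python
# One editable place for model names; a provider rename is fixed here.
DEFAULT_MODELS = {
    "see": "gemini-2.5-flash",
    "draw": "gemini-3.1-flash-image",
}


def resolve_model(task, override=None):
    """Use the per-call override when given, otherwise the constant for this task."""
    return override or DEFAULT_MODELS[task]
```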

Errors surfaced, not swallowed

API failures print verbatim with clear exit codes, so the calling agent relays the real reason rather than guessing at it.
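
A minimal sketch of that behaviour, using only the standard library. The specific exit-code values and the split between "usage" and "API" failures are hypothetical, since the skill's real codes are not given here.

```python
import sys

# Hypothetical exit codes for illustration.
EXIT_OK, EXIT_USAGE, EXIT_API = 0, 2, 3


def run(action, stderr=sys.stderr):
    """Run one subcommand; print any failure verbatim and map it to an exit code."""
    try:
        print(action())
        return EXIT_OK
    except (ValueError, FileNotFoundError) as exc:  # bad input from the caller
        print(f"usage error: {exc}", file=stderr)
        return EXIT_USAGE
    except Exception as exc:  # anything else, including errors raised by the API client
        print(f"{type(exc).__name__}: {exc}", file=stderr)
        return EXIT_API
```

The calling agent can then branch on the exit code and quote stderr directly instead of paraphrasing a failure it never saw.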

Secrets stay out of git

The real key and the virtual environment are git-ignored; only a template and a dependency lockfile are committed.

Reproducible by design

A locked dependency set means any machine or agent gets the same environment from a single setup command.

Draws are validated

The contract requires the agent to actually look at a generated image and confirm it matches before calling the job done.

Why it's worth building once

Build the gateway once; every agent after it inherits it.

The work of reaching Gemini cleanly — the client, the key, the model names, the error handling — gets done a single time, in a single place. After that, giving a new project vision or image generation isn't a task; it's a command it already knows. A small piece of tooling that quietly raises the ceiling on everything built next.

What it powers

Already paying for itself.

It runs the product-photography pipeline

Phone photos to studio-grade shots, every image checked against the real product — built entirely on this skill's see and draw.

It generated every cover on this site

The cover image on each case study across this portfolio came out of the same gateway — generated, checked, and dropped in. No stock photos, no design tool, no hand-off.

It lets Claude see its own work and fix it

With draw and see behind one command, Claude generates an image, looks at the result itself, and keeps correcting until it matches the brief — image generation as a self-correcting loop, not a one-shot guess.
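
The loop described above could be sketched as follows. The `draw` and `see` callables are passed in, so the sketch does not depend on how the gateway is invoked. The MATCH/MISMATCH reply convention, the retry limit, and the function name are assumptions for illustration, not the skill's actual contract.

```python
def draw_until_accepted(brief, draw, see, max_attempts=3):
    """Generate, inspect, and retry with the critique folded into the next prompt."""
    prompt = brief
    for attempt in range(1, max_attempts + 1):
        image_path = draw(prompt)
        verdict = see(
            image_path,
            f"Does this image match the brief: {brief!r}? "
            "Start with MATCH or MISMATCH, then explain.",
        )
        if verdict.strip().upper().startswith("MATCH"):
            return image_path, attempt
        # Feed the critique back so the next attempt can correct it.
        prompt = f"{brief}\n\nFix these problems from the previous attempt: {verdict}"
    raise RuntimeError(f"no acceptable image after {max_attempts} attempts")
```

Capping the attempts keeps a brief that can never be satisfied from turning into an unbounded run of API calls.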

Want a skill like this in your stack?

Tell me what you're building, and I'll come back with whether I can help and what a first step looks like.
