We built a tool that lets you augment LLM agents with visual capabilities — like OCR, object detection, and video editing — using just plain English. No need to write computer vision code.
Examples:
> “Blur all faces in this image and preview it.”
> “Extract the invoice ID, email, and totals from this invoice and overlay their locations.”
> "Redact all the sensitive data in this image, and preview the result."
> “Trim this video from 0:30 to 1:10 and add captions.”
It works with any MCP-compatible agent (Claude, OpenAI, Cursor, etc.) and turns natural language into visual AI workflows. No Python. No brittle CV pipelines. Just describe what you want, and your agent handles the rest.
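For devs who want to wire it into their own agents programmatically, here's a rough sketch of what connecting to an MCP server looks like with the official `mcp` Python SDK. The server command and package name below are placeholders, not our actual install command — see the docs link for the real setup:

    # Minimal MCP client sketch using the official `mcp` Python SDK.
    # The server command/package below are placeholders -- check the docs for the real ones.
    import asyncio
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    async def main():
        # Placeholder: launch the VLM Run MCP server however the docs specify.
        params = StdioServerParameters(command="npx", args=["-y", "example-vlmrun-mcp"])
        async with stdio_client(params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                tools = await session.list_tools()      # discover the visual tools the server exposes
                print([t.name for t in tools.tools])

    asyncio.run(main())

If you're on a hosted client like Claude Desktop or Cursor, you don't even need this — just add the server to the client's MCP config as described in the docs.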
Here's the full showcase / our docs:
[1] Colab showcase: https://colab.research.google.com/github/vlm-run/vlmrun-cook...
[2] MCP Intro / Docs: https://docs.vlm.run/mcp/introduction
We’d love feedback — especially from devs building LLM tools, agentic frameworks, or anything that needs visual understanding.