Frontier vision + OCR, wrapped in an extraction API your coding agent installs for you. You sign in once and approve a $5 top-up. That's the whole human part.
If your agent has the first-party Docex skill installed, it picks this prompt up natively. If not, the prompt is explicit enough that any decent coding agent can figure out the CLI path.
Set up Docex in this project for [describe what you want to extract —
e.g., "company details from trade licenses during onboarding"].
Take me through the GitHub approval and the $5 wallet top-up, wire the
API key into my env, scaffold a server-side extraction endpoint for my
stack, and run a smoke test when you're done.
All three are designed for a developer to sit down, read the docs, and stand up an account. Fine if that's how you work. Slow if you've been letting an agent do the wiring lately — and you still have to track the model frontier yourself.
Fastest to start, slowest to finish. You own model selection (and re-selection every quarter), prompt tuning, HEIC → JPEG, orientation fix, OCR fallback, page chunking, retries, and the Tuesday-invoice ticket.
OCR is fine. Layout parsing is fine. But most predate the multimodal jump, so you're stitching them to an LLM yourself — and reading IAM docs before you read a file.
Shaped around the agent doing the work. One approval link, a wallet, a skill in your repo. The engine underneath is whatever frontier vision model is currently winning.
One approval link. GitHub sign-in. Wallet top-up. The key returns to your agent and lands in your .env. Your job in the loop: click approve.
No annual commit, no per-seat fee, no “talk to sales.” Five dollars in the wallet is around a hundred pages. Run out, top up, keep going.
A provider abstraction over the current best multimodal models — Anthropic by default, more behind feature flags. When a better model ships next quarter, your extractions get better with no code change.
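To make "default provider, more behind feature flags" concrete, here is a minimal sketch of how that kind of selection could work. It is not Docex's implementation: the flag name, the `provider-b` entry, the `rank` field, and `selectProvider` are all hypothetical, invented for illustration.

```typescript
// Hypothetical sketch. None of these names come from the Docex SDK.
interface ProviderEntry {
  name: string;
  // Feature flag that must be on for this provider to be eligible.
  // Undefined means the provider is always eligible.
  flag?: string;
  // Higher rank wins when several providers are eligible.
  rank: number;
}

const PROVIDERS: readonly ProviderEntry[] = [
  { name: "anthropic", rank: 1 }, // the default, per the copy above
  { name: "provider-b", flag: "enable-provider-b", rank: 2 }, // placeholder
];

// Pick the highest-ranked provider whose flag (if any) is switched on.
function selectProvider(
  flags: ReadonlySet<string>,
  providers: readonly ProviderEntry[] = PROVIDERS,
): string {
  const eligible = providers.filter(
    (p) => p.flag === undefined || flags.has(p.flag),
  );
  if (eligible.length === 0) {
    throw new Error("no eligible provider");
  }
  return eligible.reduce((best, p) => (p.rank > best.rank ? p : best)).name;
}
```

With no flags set, `selectProvider(new Set<string>())` falls back to the default. The calling code never names a model, which is the point: swapping the engine is a registry change, not an application change.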
PDF, JPEG, PNG, HEIC. Trade licenses, invoices, IDs, receipts photographed at a weird angle. Same SDK call, same response shape. Describe the fields; get them back.
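If you want to reject unsupported uploads before spending a call, the four formats can be told apart by their leading bytes. The sketch below is an optional client-side pre-check, not something Docex requires or exposes; `sniffDocKind` and its helpers are hypothetical names, and the HEIC brand list is deliberately short rather than exhaustive.

```typescript
type DocKind = "pdf" | "jpeg" | "png" | "heic" | "unknown";

// Convert an ASCII string to its byte values.
function ascii(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) out.push(s.charCodeAt(i));
  return out;
}

// True when `bytes` contains `sig` starting at `offset`.
function hasBytesAt(
  bytes: Uint8Array,
  offset: number,
  sig: readonly number[],
): boolean {
  if (bytes.length < offset + sig.length) return false;
  return sig.every((b, i) => bytes[offset + i] === b);
}

// Major brands commonly seen on HEIC stills. Not exhaustive.
const HEIC_BRANDS = ["heic", "heix"];

function sniffDocKind(bytes: Uint8Array): DocKind {
  if (hasBytesAt(bytes, 0, ascii("%PDF-"))) return "pdf";
  if (hasBytesAt(bytes, 0, [0xff, 0xd8, 0xff])) return "jpeg";
  if (hasBytesAt(bytes, 0, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) {
    return "png";
  }
  // ISO base media files carry an "ftyp" box at offset 4,
  // followed by the major brand at offset 8.
  if (
    hasBytesAt(bytes, 4, ascii("ftyp")) &&
    HEIC_BRANDS.some((brand) => hasBytesAt(bytes, 8, ascii(brand)))
  ) {
    return "heic";
  }
  return "unknown";
}
```

Sniffing bytes rather than trusting the file extension matters for phone uploads, where a `.jpg` name can sit on top of HEIC content.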
The agent receives the API key on the other side and writes it into your project. No portal tour. No "where do I find my key" thread.
Crooked iPhone HEIC, blurry scan, 12-page PDF — same call. Docex handles orientation, glare, OCR fallback, schema validation, retries, and provenance.
await docex.extract({
  file: "./uploaded-license.heic",
  prompt: "Pull legal name, license no., expiry.",
});
// → 200 OK · 2.4s · $0.05
{
  "legal_name": "ACME LOGISTICS L.L.C",
  "license_no": "1019388",
  "expires_on": "2026-03-13"
}
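Docex validates the schema on its side, but you may still want a typed boundary in your own code before the fields reach a database. A minimal sketch, assuming the response looks like the sample above: `LicenseFields` and `isLicenseFields` are hypothetical names, and the `YYYY-MM-DD` date check is inferred from that one sample.

```typescript
// Shape copied from the sample response. The interface name is hypothetical.
interface LicenseFields {
  legal_name: string;
  license_no: string;
  expires_on: string; // assumed ISO date, YYYY-MM-DD
}

// Runtime type guard: narrows `unknown` to LicenseFields when all
// three fields are present, non-empty, and the date looks ISO-formatted.
function isLicenseFields(value: unknown): value is LicenseFields {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const legalName = record["legal_name"];
  const licenseNo = record["license_no"];
  const expiresOn = record["expires_on"];
  return (
    typeof legalName === "string" &&
    legalName.length > 0 &&
    typeof licenseNo === "string" &&
    licenseNo.length > 0 &&
    typeof expiresOn === "string" &&
    /^\d{4}-\d{2}-\d{2}$/.test(expiresOn)
  );
}
```

Pass the parsed response through the guard and branch on the result; anything that fails goes to a review queue instead of your onboarding table.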
Drop in five dollars. About a hundred pages. Run out, top up, keep going. No card on file until you want one there.
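The arithmetic behind "about a hundred pages" is easy to check. This sketch assumes a flat $0.05 per page, which is what the sample responses on this page show; real per-page cost may vary, and both function names are hypothetical. Amounts are handled in whole cents to avoid floating-point drift.

```typescript
// Assumed flat rate, taken from the $0.05 shown in the sample responses.
const ASSUMED_CENTS_PER_PAGE = 5;

// How many pages a wallet balance covers at a flat per-page rate.
function pagesForWallet(
  walletUsd: number,
  centsPerPage: number = ASSUMED_CENTS_PER_PAGE,
): number {
  if (!(walletUsd >= 0) || !(centsPerPage > 0)) {
    throw new RangeError("walletUsd must be >= 0 and centsPerPage > 0");
  }
  const walletCents = Math.round(walletUsd * 100);
  return Math.floor(walletCents / centsPerPage);
}

// How many top-ups of a given size it takes to cover a page count.
function topUpsNeeded(
  pages: number,
  topUpUsd: number = 5,
  centsPerPage: number = ASSUMED_CENTS_PER_PAGE,
): number {
  if (!(pages >= 0) || !(topUpUsd > 0) || !(centsPerPage > 0)) {
    throw new RangeError("pages must be >= 0; topUpUsd and centsPerPage > 0");
  }
  const topUpCents = Math.round(topUpUsd * 100);
  return Math.ceil((pages * centsPerPage) / topUpCents);
}
```

At the assumed rate, `pagesForWallet(5)` returns 100 and a 250-page backlog needs `topUpsNeeded(250)`, which is 3 top-ups of $5.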
See what your agent extracted, the prompt it ran, the credits it burned, the failures and why. No demo banner. No upsell modal.
People shipping extraction as a feature, not a research project. Already coding with an agent in the loop. Onboarding, KYC, claims, expense capture, invoice ingestion, contract review.
The bottleneck moved from "can the agent write the code" to "can the agent get through your signup, billing, and API key UI." Most dev tools weren't designed for that. Docex was.
Frontier models now read crooked phone photos, scanned 40-page leases, and hand-annotated invoices better than the stack you'd have spent six months assembling two years ago. The hard part isn't reading the file. It's staying on the current model. That's the layer Docex owns.
$ docex setup \
    --use-case "extract trade licenses" \
    --top-up 5
→ opening approval link…
✓ wallet funded · $5.00
✓ key written to .env.local
✓ skill installed for claude-code
$ docex extract ./uploaded-license.heic \
    --prompt "Pull legal name, no., expiry"
→ 1 page · anthropic · sonnet-4.5
✓ 200 OK · 2.41s · $0.05
{
  "legal_name": "ACME LOGISTICS L.L.C",
  "license_no": "1019388",
  "expires_on": "2026-03-13"
}
Approve the link, top up $5, watch the smoke test pass. Frontier vision, agent-installed, pay-as-you-go. Five minutes from paste to live.