PDF intelligence for AI coding agents

Stop sending PDFs as base64 blobs. GigaText extracts structured markdown locally, with automatic OCR, which saves tokens and improves accuracy.

$ pip install gigatext

Workflow shift

From binary payloads to local extraction

GigaText is downloadable software that runs locally, before any model call. This page covers how to install it, how to run it, and what output you get.

THE PROBLEM

How AI agents read PDFs today

PDF → base64 → entire binary sent to the API
20 MB limit, 20 pages per read
No OCR, no table detection
Tokens wasted on raw encoding
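
The encoding overhead in the list above is easy to check: base64 maps every 3 input bytes to 4 output characters, so the payload grows by about a third before any tokenization happens. A minimal sketch using only the Python standard library; the byte string is a stand-in for a real PDF, and `base64_overhead` is a hypothetical helper name, not part of GigaText:

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the size ratio of the base64 encoding to the raw bytes."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# Stand-in for a PDF's binary content; any non-empty bytes behave the same way.
fake_pdf = bytes(range(256)) * 12  # 3,072 bytes
ratio = base64_overhead(fake_pdf)
print(f"{len(fake_pdf)} raw bytes -> base64 is {ratio:.2f}x the size")
```

This measures characters, not tokens. How many tokens the encoded string costs depends on the model's tokenizer.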

THE SOLUTION

GigaText processes locally

PDF → structured markdown → context window
No size limit, full document
Hybrid OCR auto-detects bad text
~50% fewer tokens, better accuracy

~50% fewer tokens

48% faster OCR

100% local processing

Install

Three ways to use GigaText

CLI

gigatext read doc.pdf

Install the package from PyPI and turn any local PDF into structured markdown from the terminal.
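
If you want to call the CLI from a script, one option is a thin wrapper around `subprocess`. This is a sketch under assumptions: only the `gigatext read <file>` invocation comes from this page; the helper names are hypothetical, and it assumes the command prints its result to stdout, which this page does not confirm.

```python
import shutil
import subprocess
from pathlib import Path


def build_read_command(pdf_path: str) -> list[str]:
    """Build the argv for `gigatext read <pdf>` as shown on this page."""
    return ["gigatext", "read", str(Path(pdf_path))]


def read_pdf(pdf_path: str) -> str:
    """Run the CLI and return whatever it prints to stdout.

    Assumption: the extracted markdown (or a summary of it) goes to stdout.
    """
    if shutil.which("gigatext") is None:
        raise RuntimeError("gigatext is not on PATH; run `pip install gigatext` first")
    result = subprocess.run(
        build_read_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    print(build_read_command("doc.pdf"))
```

Keeping command construction separate from execution means the wrapper can be checked on machines where GigaText is not installed.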

Claude Code skill

~/.claude/skills/

Drop GigaText into your local agent workflow so PDF extraction stays on-device before context is sent upstream.

MCP server

gigatext serve

Run GigaText as a local server so editors and agent frameworks can call document extraction tools directly.

Software execution

Run GigaText from the terminal

$ gigatext read annual-report.pdf
GigaText v0.1.0 / PDF intelligence for AI agents

Analyzing 42 pages...
Pages with OCR needed: 3 (hybrid OCR applied)
Tables detected: 12
Output: markdown (28,340 tokens vs ~61,200 base64)

Saved ~54% tokens
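
The savings figure follows directly from the two token counts in the sample run above. A short worked check in Python (`token_savings` is a hypothetical helper, not a GigaText function):

```python
def token_savings(markdown_tokens: int, base64_tokens: int) -> float:
    """Fraction of tokens saved by sending markdown instead of base64."""
    return 1 - markdown_tokens / base64_tokens


# Figures taken from the sample run above.
savings = token_savings(28_340, 61_200)
print(f"Saved ~{savings:.0%} tokens")  # prints "Saved ~54% tokens"
```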

Hybrid OCR

Hybrid OCR: only where needed

Real PDFs mix digital text, scanned images, and corrupted encodings. GigaText analyzes each page and applies OCR only to the regions that need it. This preserves the original text quality while fixing the broken parts.

Text in images
Bad Unicode
Vector text
Bad OCR layers
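
To make the idea of selective OCR concrete, here is a rough sketch of how a tool might decide which pages have an unusable text layer. It is not GigaText's actual detection logic, which this page does not describe. The function name, thresholds, and sample strings are all assumptions, and it works per page rather than per region to stay short.

```python
def needs_ocr(page_text: str, min_chars: int = 20, max_bad_ratio: float = 0.1) -> bool:
    """Guess whether a page's embedded text layer is unusable (illustrative heuristic)."""
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return True  # little or no text layer: probably a scanned image
    # Count Unicode replacement characters and stray control characters.
    bad = sum(
        1
        for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\t\r")
    )
    return bad / len(stripped) > max_bad_ratio


# Hypothetical extracted text for three pages.
pages = [
    "Revenue grew 12% year over year, driven by subscriptions.",  # clean digital text
    "",  # scanned page with no text layer
    "\ufffd" * 15 + " annual report",  # corrupted encoding
]
ocr_pages = [i for i, text in enumerate(pages, start=1) if needs_ocr(text)]
print(ocr_pages)  # prints [2, 3]
```

Pages that pass the check keep their original embedded text, so OCR errors are never introduced where the source was already clean.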

Coming soon (in development)

Layout-aware chunking

Splits documents by semantic structure, not arbitrary character counts. Headers, tables, and paragraphs stay intact. Ready for vector stores and RAG indexing.
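
Because this feature is still in development, the following is only an illustration of the general technique, not GigaText's implementation. It splits markdown at header lines so that each chunk carries its heading and any table beneath it stays whole. The function name and sample document are hypothetical.

```python
import re

# A markdown ATX header: one to six '#' characters followed by a space.
HEADER = re.compile(r"^#{1,6} ")


def chunk_by_headers(markdown: str) -> list[str]:
    """Split markdown into chunks that each begin at a header line."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at every header, never in the middle of a table or paragraph.
        if HEADER.match(line) and current:
            chunks.append(current)
            current = []
        current.append(line)
    if current:
        chunks.append(current)
    joined = ["\n".join(lines).strip() for lines in chunks]
    return [chunk for chunk in joined if chunk]


doc = """# Report

Intro paragraph.

## Results

| Quarter | Revenue |
| --- | --- |
| Q1 | 10 |

Closing note.
"""
chunks = chunk_by_headers(doc)
print(len(chunks))  # prints 2
```

A real chunker would also need a size cap for very long sections. This sketch leaves that out to keep the structural idea visible.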

Use cases

Where GigaText fits

AI coding agents

Claude Code, Codex, and Cursor read PDFs as structured text instead of binary blobs.

RAG pipelines

Extract and index PDFs with layout-aware markdown for vector stores.

Document Q&A

Feed clean markdown to any LLM for accurate question answering with fewer hallucinations.

Batch processing

Process thousands of invoices, contracts, and reports with consistent markdown output.