DOWNLOADABLE SOFTWARE
GigaText

PDF intelligence for AI coding agents

Stop sending PDFs as base64 blobs. GigaText extracts structured markdown locally, with automatic OCR, which saves tokens and improves accuracy.

From binary payloads to local extraction

GigaText is downloadable software that runs locally, before any model call. This page shows how to install it, how to run it, and what output you get.

THE PROBLEM

How AI agents read PDFs today

  • PDF → base64 → entire binary sent to the API
  • 20 MB limit, 20 pages per read
  • No OCR, no table detection
  • Tokens wasted on raw encoding
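
To make the encoding cost concrete: base64 represents every 3 bytes of input as 4 ASCII characters, so a payload grows by about a third before it is ever tokenized. A minimal Python sketch using only the standard library; the `base64_overhead` helper is a hypothetical name for illustration and is not part of GigaText:

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the base64-encoded size divided by the raw size."""
    if not raw:
        raise ValueError("raw must not be empty")
    return len(base64.b64encode(raw)) / len(raw)


# Stand-in for a 3 MB PDF: the content does not matter, only the length.
payload = bytes(3_000_000)
print(f"base64 payload is {base64_overhead(payload):.2f}x the raw size")
# prints "base64 payload is 1.33x the raw size"
```

The ratio is a property of base64 itself, independent of what the PDF contains.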

THE SOLUTION

GigaText processes locally

  • PDF → structured markdown → context window
  • No size limit, full document
  • Hybrid OCR auto-detects bad text
  • ~50% fewer tokens, better accuracy

~50% fewer tokens
48% faster OCR
100% local processing

Three ways to use GigaText

CLI
gigatext read doc.pdf

Install the package from PyPI and turn any local PDF into structured markdown from the terminal.

Claude Code skill
~/.claude/skills/

Drop GigaText into your local agent workflow so PDF extraction stays on-device before context is sent upstream.

MCP server
gigatext serve

Run GigaText as a local server so editors and agent frameworks can call document extraction tools directly.

Run GigaText from the terminal

$ gigatext read annual-report.pdf
GigaText v0.1.0 / PDF intelligence for AI agents

Analyzing 42 pages...
Pages with OCR needed: 3 (hybrid OCR applied)
Tables detected: 12
Output: markdown (28,340 tokens vs ~61,200 base64)

Saved ~54% tokens
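
The savings figure in the run above follows directly from the two token counts it reports. A short Python sketch of that arithmetic; `token_savings` is a hypothetical helper name, not a GigaText function:

```python
def token_savings(markdown_tokens: int, base64_tokens: int) -> float:
    """Fraction of tokens saved by sending markdown instead of base64."""
    if base64_tokens <= 0:
        raise ValueError("base64_tokens must be positive")
    return 1 - markdown_tokens / base64_tokens


# Token counts taken from the sample run above.
saving = token_savings(28_340, 61_200)
print(f"Saved ~{saving:.0%} tokens")  # prints "Saved ~54% tokens"
```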

Hybrid OCR: only where needed

Real PDFs mix digital text, scanned images, and corrupted encodings. GigaText analyzes each page and applies OCR only to the regions that need it, which preserves the quality of the original text while fixing the broken parts.
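
One way to picture the "only where needed" decision is a cheap check on the text layer already extracted from each page. The Python sketch below is an illustrative heuristic under stated assumptions, not GigaText's actual detection logic: the `needs_ocr` name and both thresholds are hypothetical, and it works per page rather than per region to stay short.

```python
import unicodedata


def _is_suspect(ch: str) -> bool:
    """True for characters that usually signal a broken text layer."""
    if ch == "\ufffd":  # Unicode replacement character
        return True
    if ch in "\n\r\t":  # ordinary whitespace controls are fine
        return False
    # Private-use, unassigned, and control code points
    return unicodedata.category(ch) in {"Co", "Cn", "Cc"}


def needs_ocr(page_text: str, min_chars: int = 20,
              max_bad_ratio: float = 0.1) -> bool:
    """Decide whether a page's extracted text looks too poor to trust.

    Both thresholds are assumptions chosen for illustration.
    """
    text = page_text.strip()
    if len(text) < min_chars:
        return True  # almost no text layer: probably a scanned page
    bad = sum(1 for ch in text if _is_suspect(ch))
    return bad / len(text) > max_bad_ratio


pages = [
    "Revenue grew 12% year over year in the third quarter.",  # clean text
    "",                                                       # scanned page
    "\ufffd" * 40,                                            # bad encoding
]
flagged = [i for i, text in enumerate(pages) if needs_ocr(text)]
print(flagged)  # prints [1, 2]
```

Pages that pass the check keep their original text untouched; only the flagged ones would be sent through OCR.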

  • Text in images
  • Bad Unicode
  • Vector text
  • Bad OCR layers
Coming soon (in development)

Layout-aware chunking

Splits documents by semantic structure, not arbitrary character counts. Headers, tables, and paragraphs stay intact. Ready for vector stores and RAG indexing.
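
Since this feature is still in development, the following is only a sketch of the general idea and not GigaText's implementation: split the extracted markdown at heading lines so that each chunk carries a whole section, with its paragraphs and tables unbroken. The `chunk_by_headings` name is hypothetical, and for brevity the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

# ATX-style markdown headings: one to six '#' followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks that each start at a heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A new heading closes the chunk collected so far.
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = (
    "# Report\n\nIntro paragraph.\n\n"
    "## Results\n\n"
    "| Year | Revenue |\n|------|---------|\n| 2023 | 10 |\n\n"
    "Closing note."
)
chunks = chunk_by_headings(doc)
print(len(chunks))  # prints 2
```

Because a split only ever happens at a heading, the table in the second section stays in one chunk together with the text around it.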

Where GigaText fits

AI coding agents

Claude Code, Codex, and Cursor read PDFs as structured text instead of binary blobs.

RAG pipelines

Extract and index PDFs with layout-aware markdown for vector stores.

Document Q&A

Feed clean markdown to any LLM for accurate question answering with fewer hallucinations.

Batch processing

Process thousands of invoices, contracts, and reports with consistent markdown output.

Powered by PyMuPDF4LLM