A two-stage pipeline that reads a collection of documents, extracts technical and academic terms using an LLM, and builds an interconnected Obsidian knowledge vault with synonym merging, descriptions, and cross-references.
Stage 1 — Classify & Extract (classifier.py):
- Reads PDFs, DOCX, PPTX, XLSX, images (via OCR), and plain text files
- Chunks large documents and sends each chunk to Claude Haiku for concept-level term extraction
- Deduplicates terms case-insensitively
- Outputs: terms_raw.json, analytics.json
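The chunking and deduplication steps above can be sketched roughly as follows. This is an illustrative sketch, not the code in classifier.py: the function names `chunk_text` and `dedupe_terms`, the 4-characters-per-token estimate, splitting on blank lines, and keeping the first-seen spelling are all assumptions.

```python
from typing import Iterable, List


def chunk_text(text: str, max_tokens: int = 3000, chars_per_token: int = 4) -> List[str]:
    """Split text into chunks of roughly max_tokens, preferring paragraph breaks.

    Token counts are approximated as characters / chars_per_token (an assumption).
    """
    budget = max_tokens * chars_per_token  # crude character budget per chunk
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        # Hard-split any single paragraph that exceeds the budget on its own.
        for start in range(0, len(para), budget):
            piece = para[start:start + budget]
            if current and size + len(piece) + 2 > budget:
                chunks.append("\n\n".join(current))
                current, size = [], 0
            current.append(piece)
            size += len(piece) + 2  # +2 accounts for the paragraph separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def dedupe_terms(terms: Iterable[str]) -> List[str]:
    """Drop case-insensitive duplicates, keeping the first spelling seen."""
    seen = set()
    unique: List[str] = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()  # casefold handles more cases than lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Each chunk would then be sent to the model separately, and the terms returned for all chunks passed through `dedupe_terms` before being written out.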
Stage 2 — Build (builder.py):
- Synonym merging: batches terms and identifies exact synonyms (e.g., "QM" and "Quantum Mechanics")
- Description generation: produces 1-3 sentence explanations with LaTeX and [[Obsidian links]]
- Connection forming: identifies meaningful cross-references within and across content fields
- Export: writes one .md file per term (organized by content field), plus a CSV summary
- Checkpointing at each stage for resumability
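One way batching and checkpointing could fit together is sketched below. It is a hypothetical illustration rather than the logic in builder.py: `run_batches_with_checkpoint`, the JSON checkpoint layout, and the `process_batch` callback (standing in for the LLM call) are assumed names, and the callback is assumed to return one entry per term it was given.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def batched(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batches_with_checkpoint(
    terms: List[str],
    process_batch: Callable[[List[str]], Dict[str, str]],
    checkpoint_path: Path,
    batch_size: int = 50,
) -> Dict[str, str]:
    """Process terms in batches, saving progress after each one.

    On a rerun, terms already present in the checkpoint are skipped.
    """
    results: Dict[str, str] = {}
    if checkpoint_path.exists():
        results = json.loads(checkpoint_path.read_text(encoding="utf-8"))

    pending = [t for t in terms if t not in results]
    for batch in batched(pending, batch_size):
        results.update(process_batch(batch))
        # Write to a temporary file and rename it, so an interrupted
        # write cannot leave a half-written checkpoint behind.
        tmp = checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
        tmp.replace(checkpoint_path)
    return results
```

With this shape, an interrupted run loses at most one batch of work, and each stage (merging, descriptions, connections) can keep its own checkpoint file.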
When run on a personal collection of 48 files spanning physics, math, engineering, and computer science coursework:
- 9,662 unique terms extracted across 12 content fields
- Zero files unreadable
See analytics.json for the full breakdown.
Selecting a single node reveals hundreds of cross-references radiating across the vault:
Each term gets a generated description with inline [[Obsidian links]] and source attribution:
At full zoom, individual nodes and hub terms are visible across the graph:
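For the per-term notes mentioned above (a description with [[Obsidian links]] plus source attribution, one .md file per term inside a content-field folder), a minimal rendering sketch might look like this. The note layout, the "Related" and "Sources" headings, and the helpers `safe_filename`, `render_note`, and `write_note` are assumptions for illustration; the actual files written by builder.py may be structured differently.

```python
import re
from pathlib import Path
from typing import List


def safe_filename(name: str) -> str:
    """Replace characters that are invalid in file names or special in Obsidian links."""
    return re.sub(r'[\\/:*?"<>|#^\[\]]', "-", name).strip()


def render_note(term: str, description: str, related: List[str], sources: List[str]) -> str:
    """Build the Markdown body for one term (hypothetical layout)."""
    lines = [f"# {term}", "", description, ""]
    if related:
        lines.append("## Related")
        # [[...]] is Obsidian's internal link syntax; it resolves by note name.
        lines += [f"- [[{safe_filename(r)}]]" for r in related]
        lines.append("")
    if sources:
        lines.append("## Sources")
        lines += [f"- {s}" for s in sources]
        lines.append("")
    return "\n".join(lines)


def write_note(vault: Path, field: str, term: str, body: str) -> Path:
    """Write the note to <vault>/<content field>/<term>.md and return its path."""
    folder = vault / safe_filename(field)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{safe_filename(term)}.md"
    path.write_text(body, encoding="utf-8")
    return path
```

Because Obsidian resolves links by note name, using the same `safe_filename` for both the file name and the link target keeps cross-references from breaking on terms that contain characters such as `/` or `#`.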
# Stage 1: Extract terms from a directory of documents
python classifier.py /path/to/documents
# Stage 2: Build the Obsidian vault from extracted terms
python builder.py
# (reads terms_raw.json from the same directory)
Requires an ANTHROPIC_API_KEY environment variable.
pip install -r requirements.txt
Documents (PDF, DOCX, PPTX, XLSX, images, text)
               │
               ▼
┌─────────────────────────────────┐
│  classifier.py                  │
│  - Read files (pdfplumber, etc.)│
│  - Chunk text (~3000 tokens)    │
│  - Claude Haiku classification  │
│  - Case-insensitive dedup       │
└──────────────┬──────────────────┘
               │ terms_raw.json
               ▼
┌─────────────────────────────────┐
│  builder.py                     │
│  - Synonym merging (batched)    │
│  - Description generation       │
│  - Connection forming           │
│  - Markdown + CSV export        │
└──────────────┬──────────────────┘
               │
               ▼
         Obsidian Vault
  (one .md per term, organized
     by content field, with
      [[cross-references]])
License: MIT



