Knowledge Graph Builder

A two-stage pipeline that reads a collection of documents, extracts technical and academic terms using an LLM, and builds an interconnected Obsidian knowledge vault with synonym merging, descriptions, and cross-references.

What it does

Stage 1 — Classify & Extract (classifier.py):

Reads PDFs, DOCX, PPTX, XLSX, images (via OCR), and plain text files
Chunks large documents and sends each chunk to Claude Haiku for concept-level term extraction
Deduplicates terms case-insensitively
Outputs: terms_raw.json, analytics.json

Stage 2 — Build (builder.py):

Synonym merging: batches terms and identifies exact synonyms (e.g., "QM" and "Quantum Mechanics")
Description generation: produces 1-3 sentence explanations with LaTeX and [[Obsidian links]]
Connection forming: identifies meaningful cross-references within and across content fields
Export: writes one .md file per term (organized by content field), plus a CSV summary
Checkpointing at each stage for resumability

Scale

When run on a personal collection of 48 files spanning physics, math, engineering, and computer science coursework:

9,662 unique terms extracted across 12 content fields
Zero files unreadable

See analytics.json for the full breakdown.

Selecting a single node reveals hundreds of cross-references radiating across the vault:

Each term gets a generated description with inline [[Obsidian links]] and source attribution:

At full zoom, individual nodes and hub terms are visible across the graph:

Usage

# Stage 1: Extract terms from a directory of documents
python classifier.py /path/to/documents

# Stage 2: Build the Obsidian vault from extracted terms
python builder.py
# (reads terms_raw.json from the same directory)

Requires an ANTHROPIC_API_KEY environment variable.

Requirements

pip install -r requirements.txt

Architecture

Documents (PDF, DOCX, PPTX, XLSX, images, text)
    │
    ▼
┌─────────────────────────────────┐
│  classifier.py                  │
│  - Read files (pdfplumber, etc.)│
│  - Chunk text (~3000 tokens)    │
│  - Claude Haiku classification  │
│  - Case-insensitive dedup       │
└──────────────┬──────────────────┘
               │  terms_raw.json
               ▼
┌─────────────────────────────────┐
│  builder.py                     │
│  - Synonym merging (batched)    │
│  - Description generation       │
│  - Connection forming           │
│  - Markdown + CSV export        │
└──────────────┬──────────────────┘
               │
               ▼
         Obsidian Vault
    (one .md per term, organized
     by content field, with
     [[cross-references]])

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
assets		assets
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analytics.json		analytics.json
builder.py		builder.py
classifier.py		classifier.py
models.py		models.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Knowledge Graph Builder

What it does

Scale

Usage

Requirements

Architecture

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Knowledge Graph Builder

What it does

Scale

Usage

Requirements

Architecture

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages