hschn58/Knowledge_Graph_Builder

Knowledge Graph Builder

A two-stage pipeline that reads a collection of documents, extracts technical and academic terms using an LLM, and builds an interconnected Obsidian knowledge vault with synonym merging, descriptions, and cross-references.

Obsidian graph view loading 9,662 interconnected terms

What it does

Stage 1 — Classify & Extract (classifier.py):

  • Reads PDFs, DOCX, PPTX, XLSX, images (via OCR), and plain text files
  • Chunks large documents and sends each chunk to Claude Haiku for concept-level term extraction
  • Deduplicates terms case-insensitively
  • Outputs: terms_raw.json, analytics.json
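The chunking and deduplication steps above can be sketched roughly as follows. This is a minimal illustration, not the code in classifier.py: the function names `chunk_text` and `dedupe_terms` are hypothetical, and whitespace-separated words are used as a crude stand-in for tokens.

```python
from typing import Iterable


def chunk_text(text: str, max_words: int = 2000) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Assumption: word count is only a rough proxy for the model's token count.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def dedupe_terms(terms: Iterable[str]) -> list[str]:
    """Drop case-insensitive duplicates, keeping the first spelling seen."""
    seen: set[str] = set()
    unique: list[str] = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()  # casefold handles more cases than lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    print(chunk_text("alpha beta gamma delta epsilon", max_words=2))
    print(dedupe_terms(["Entropy", "entropy", "Hilbert space", "ENTROPY "]))
```

Keeping the first spelling seen is one possible policy; a real pipeline might instead prefer the most frequent capitalization.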

Stage 2 — Build (builder.py):

  • Synonym merging: batches terms and identifies exact synonyms (e.g., "QM" and "Quantum Mechanics")
  • Description generation: produces 1-3 sentence explanations with LaTeX and [[Obsidian links]]
  • Connection forming: identifies meaningful cross-references within and across content fields
  • Export: writes one .md file per term (organized by content field), plus a CSV summary
  • Checkpointing at each stage for resumability
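One way to get the resumability described above is to cache each stage's result as JSON and skip the stage when the file already exists. The sketch below assumes that approach; `run_stage`, the checkpoint directory layout, and the file names are hypothetical and may differ from what builder.py does.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def run_stage(name: str, compute: Callable[[], dict], checkpoint_dir: Path) -> dict:
    """Return the saved result for a stage if present, else compute and save it."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"{name}.json"
    if path.exists():
        # A previous run finished this stage, so reuse its output.
        return json.loads(path.read_text(encoding="utf-8"))

    result = compute()

    # Write to a temporary file and rename it into place, so an interrupted
    # run never leaves a half-written checkpoint behind.
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(result, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)
    return result
```

Calling `run_stage("synonyms", merge_fn, Path("checkpoints"))` twice would run `merge_fn` only the first time; deleting the JSON file forces that stage to be recomputed.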

Scale

When run on a personal collection of 48 files spanning physics, math, engineering, and computer science coursework:

  • 9,662 unique terms extracted across 12 content fields
  • No files were unreadable

See analytics.json for the full breakdown.

Selecting a single node reveals hundreds of cross-references radiating across the vault:

Node connection highlight showing cross-references

Each term gets a generated description with inline [[Obsidian links]] and source attribution:

Generated note with description, links, and sources
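A note of this kind could be rendered and written along the following lines. This is a sketch under assumptions: the heading layout, the "Related" and "Sources" section names, and the helpers `render_note` and `write_note` are illustrative and not taken from builder.py.

```python
from pathlib import Path


def render_note(term: str, description: str, links: list[str], sources: list[str]) -> str:
    """Build the Markdown body for one term (layout is an assumption)."""
    lines = [f"# {term}", "", description, ""]
    if links:
        lines.append("## Related")
        lines.extend(f"- [[{link}]]" for link in links)
        lines.append("")
    if sources:
        lines.append("## Sources")
        lines.extend(f"- {source}" for source in sources)
        lines.append("")
    return "\n".join(lines)


def write_note(vault: Path, field: str, term: str, body: str) -> Path:
    """Write the note to <vault>/<field>/<term>.md and return its path."""
    # Replace characters that are not allowed in file names on common systems.
    safe_name = "".join("_" if ch in '\\/:*?"<>|' else ch for ch in term)
    folder = vault / field
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{safe_name}.md"
    path.write_text(body, encoding="utf-8")
    return path
```

Obsidian resolves `[[Second Law]]` to a note named `Second Law.md` anywhere in the vault, which is why one file per term is enough to produce the graph edges.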

At full zoom, individual nodes and hub terms are visible across the graph:

Zoomed graph view showing individual nodes

Usage

# Stage 1: Extract terms from a directory of documents
python classifier.py /path/to/documents

# Stage 2: Build the Obsidian vault from extracted terms
python builder.py
# (reads terms_raw.json from the same directory)

Requires an ANTHROPIC_API_KEY environment variable.
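A script can fail early with a clear message when the key is missing. The helper below is a hypothetical sketch (`get_api_key` is not a function from this repository); it only reads the environment variable named above.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return ANTHROPIC_API_KEY from the environment, or raise if it is unset."""
    source = os.environ if env is None else env
    key = source.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it before running the pipeline."
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.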

Requirements

pip install -r requirements.txt

Architecture

Documents (PDF, DOCX, PPTX, XLSX, images, text)
    │
    ▼
┌─────────────────────────────────┐
│  classifier.py                  │
│  - Read files (pdfplumber, etc.)│
│  - Chunk text (~3000 tokens)    │
│  - Claude Haiku classification  │
│  - Case-insensitive dedup       │
└──────────────┬──────────────────┘
               │  terms_raw.json
               ▼
┌─────────────────────────────────┐
│  builder.py                     │
│  - Synonym merging (batched)    │
│  - Description generation       │
│  - Connection forming           │
│  - Markdown + CSV export        │
└──────────────┬──────────────────┘
               │
               ▼
         Obsidian Vault
    (one .md per term, organized
     by content field, with
     [[cross-references]])
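The "Synonym merging (batched)" step in the diagram needs some way to combine synonym pairs found in different batches into consistent groups. One common technique for that is union-find, sketched below. This is an assumption about how such merging could work, not the method builder.py uses; `batched`, `merge_synonyms`, and the "longer spelling wins" rule are all illustrative, and the synonym pairs are assumed to come from the LLM.

```python
from typing import Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_synonyms(
    terms: list[str], synonym_pairs: list[tuple[str, str]]
) -> dict[str, list[str]]:
    """Group terms by canonical name using union-find over synonym pairs."""
    parent = {term: term for term in terms}

    def find(term: str) -> str:
        # Walk up to the root, halving the path as we go.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in synonym_pairs:
        if a not in parent or b not in parent:
            continue  # ignore pairs that mention unknown terms
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumed policy: keep the longer spelling as the canonical name,
            # e.g. "Quantum Mechanics" rather than "QM".
            keep, drop = (root_a, root_b) if len(root_a) >= len(root_b) else (root_b, root_a)
            parent[drop] = keep

    groups: dict[str, list[str]] = {}
    for term in terms:
        groups.setdefault(find(term), []).append(term)
    return groups
```

Because the groups are built transitively, a pair found in one batch ("QM", "Quantum Mechanics") and a pair found in another can end up under a single canonical term.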

License

MIT
