Graphify: Open-Source Tool Builds Knowledge Graphs from Code & Docs in One Command

Developer shipped Graphify, an open-source tool that builds queryable knowledge graphs from code, docs, and images in one command. It uses a two-pass pipeline with tree-sitter and Claude subagents, achieving 71.5x fewer tokens per query versus reading raw files.

GAla Smith & AI Research Desk·13h ago·5 min read·12 views·AI-Generated

Source: x.comvia @heygurisinghSingle Source

A developer has shipped Graphify, an open-source command-line tool that automatically builds a queryable knowledge graph from any folder containing code, documents, PDFs, screenshots, or whiteboard photos. The tool was built in 48 hours following a tweet by AI researcher Andrej Karpathy expressing a desire for LLM-powered knowledge graphs. It has gained over 6,300 GitHub stars in days.

The core claim is a 71.5x reduction in tokens per query compared to reading raw files, by querying a structured graph instead of performing brute-force retrieval over all documents.

What Graphify Does

Point the graphify command at a folder. It ingests everything and outputs:

A navigable, interactive graph (search, filter, click nodes).
An Obsidian vault with backlinked articles.
A Wikipedia-style wiki starting at index.md.
Plain English Q&A over the entire corpus.

You can ask questions like:

"What calls this function?"
"What connects these two concepts?"
"What are the most important nodes in this project?"

How It Works: A Two-Pass Pipeline

Graphify uses a deterministic two-stage process:

First Pass - AST Extraction: Uses tree-sitter to parse Abstract Syntax Trees (ASTs) across 19 programming languages (Python, Rust, Go, TypeScript, Swift, Zig, Elixir, etc.). This extracts code structure without LLMs.
Second Pass - Concept Extraction: Runs Claude subagents in parallel over documentation, papers, and images to pull out high-level concepts, design rationale, and semantic relationships.

The tool then applies Leiden community detection algorithms to cluster related nodes. A SHA256-based cache ensures re-runs only process changed files. It can optionally install a git hook to auto-rebuild the graph on every commit or branch switch.

Key Technical Claims

71.5x fewer tokens per query vs. reading raw files.
No vector database, embeddings, or configuration required.
Processes code, PDFs, screenshots, and whiteboard photos (including in other languages).
Deterministic builds via caching.

The Benchmark: Karpathy's Own Repos

The developer tested Graphify on a folder containing Andrej Karpathy's repositories (nanoGPT, minGPT, micrograd), the Attention Is All You Need paper, and the FlashAttention 1 & 2 papers.

The tool successfully identified that the Block class in nanoGPT and minGPT are linked across repositories, with the FlashAttention paper bridging into the CausalSelfAttention module in both codebases—a structural insight keyword search cannot provide.

Availability

Graphify is 100% open-source and available on GitHub.

gentic.news Analysis

This rapid build-and-ship cycle exemplifies the current pace of tooling in the AI-augmented developer ecosystem. The project directly responds to a stated need from a prominent figure (Karpathy) and validates it with a benchmark on his own work—a clever proof-of-concept. The technical approach is pragmatic: combining deterministic, low-cost parsing (tree-sitter) with high-cost, high-intelligence LLM analysis (Claude) only where necessary. This hybrid model is key to the claimed 71.5x token efficiency; the graph structure acts as a highly compressed, query-optimized index over the corpus.

The trend it taps into is the move beyond simple vector search (RAG) towards more structured, relational representations of knowledge—a theme we've seen in research like Microsoft's GraphRAG and tools like MemGPT. Graphify's focus on the developer's local environment (code, notes, papers) and its tight integration with git and Obsidian targets a high-value, immediate-use case. Its rapid adoption (6.3k stars) signals strong developer appetite for tools that reduce cognitive load and uncover latent connections in complex projects.

However, the long-term questions are about scale and maintenance. While brilliant for a personal or single-project codebase, how does the graph consistency hold up across massive, rapidly-changing monorepos? The reliance on a paid, external API (Claude) for the second pass also introduces cost and latency considerations for very large corpora. Nonetheless, as a v1.0, it's a significant prototype that proves the viability and utility of fully automated knowledge graph construction for software engineering.

Frequently Asked Questions

How does Graphify compare to a vector database (RAG)?

Graphify creates an explicit, queryable graph of entities and relationships, whereas a vector database (used in RAG) stores dense embeddings and retrieves chunks based on semantic similarity. Graphify's output is structured knowledge (e.g., "A calls B"), enabling relational queries like "What connects these two concepts?" that are difficult with pure vector search.

What models does Graphify use for the concept extraction pass?

According to the source, Graphify uses Claude subagents (likely via the Anthropic API) run in parallel to analyze documents, papers, and images to extract concepts and rationale. The first pass for code uses the deterministic tree-sitter library, not an LLM.

Is Graphify self-hostable or does it require an API?

The current implementation requires access to the Claude API for its second-pass concept extraction. The tree-sitter parsing is local and offline. Therefore, it is not fully self-hosted without modification, as it depends on a proprietary, cloud-based LLM service.

Can Graphify handle private or proprietary code securely?

Using the Claude API for processing means code and documents are sent to Anthropic's servers. Developers with strict data sovereignty or intellectual property requirements would need to modify the tool to use a locally-hosted LLM (like Llama 3 or a local Claude instance) for the second pass to keep all data on-premises.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

Graphify's emergence is a direct product of two converging trends: the maturation of code-understanding LLMs and the developer community's shift towards agentic workflows. It's less a research breakthrough and more a supremely effective integration of existing components—tree-sitter, Claude's API, and graph algorithms—into a single, opinionated pipeline. This 'glue code' is increasingly where real productivity gains are found. The 71.5x token efficiency claim is the most critical technical metric. It suggests the graph acts as a massive compression and indexing layer, turning an O(n) retrieval problem into something closer to O(log n). For practitioners, the lesson is that adding structure—even automatically inferred structure—dramatically reduces LLM context costs. This validates a core hypothesis behind knowledge graphs versus pure vector search: relationships are a higher-bandwidth representation of information than embeddings alone. Looking forward, the natural evolution is towards incremental and real-time graph updates, and perhaps a shift from Claude-as-a-service to smaller, specialized models that can run locally for the concept extraction pass. The tool also opens the door for more sophisticated graph-native queries and analyses, like detecting architectural drift or automatically generating dependency diagrams for new engineers onboarding to a codebase.

#open source #llm applications #software engineering #tools

Mentioned in this article

Claude AI Andrej Karpathy Graphify

Enjoyed this article?

Get the weekly AI intelligence briefing