Vector search finds similar text. Knowledge graphs find connected meaning. When your documents reference each other (contracts linking amendments, people connected to projects, policies tied to regulations), you need a graph to capture these relationships. Neo4j is one of the most widely used graph databases and a common choice for LLM knowledge graphs. Here's how to build one.
What is a Knowledge Graph?
A knowledge graph is a database that stores information as entities (nodes) and relationships (edges). Unlike relational databases with rigid tables, knowledge graphs naturally represent how things connect. For LLMs, knowledge graphs enable multi-hop reasoning: "Find the manager of the person who approved this contract" requires traversing relationships, not just matching keywords.
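To make the multi-hop idea concrete, here is a minimal in-memory sketch in plain Python. It is not how Neo4j stores or queries data; the node names and the `add_edge` and `traverse` helpers are invented for illustration only.

```python
# Toy graph: (source node, relationship type) -> list of target nodes.
# All names are made-up illustration data.
edges: dict[tuple[str, str], list[str]] = {}


def add_edge(source: str, rel: str, target: str) -> None:
    edges.setdefault((source, rel), []).append(target)


add_edge("CONTRACT-001", "APPROVED_BY", "Alice")
add_edge("Alice", "REPORTS_TO", "Bob")
add_edge("Bob", "REPORTS_TO", "Carol")


def traverse(start: str, *rels: str) -> list[str]:
    """Follow a fixed sequence of relationship types, one hop per type."""
    frontier = [start]
    for rel in rels:
        frontier = [t for node in frontier for t in edges.get((node, rel), [])]
    return frontier


# "Find the manager of the person who approved this contract" is two hops.
print(traverse("CONTRACT-001", "APPROVED_BY", "REPORTS_TO"))  # ['Bob']
```

A keyword or vector match on "CONTRACT-001" alone would never surface "Bob", because nothing in the contract text has to mention him; only the chain of relationships does.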
1. Why Neo4j for LLMs
Neo4j Advantages
| Feature | Benefit for LLMs |
|---|---|
| Native graph storage | Fast traversal, no JOINs |
| Cypher query language | Intuitive relationship queries |
| Built-in visualization | Debug and explore data |
| LLM integrations | LangChain, LlamaIndex connectors |
| Managed cloud (Aura) | No DevOps overhead |
Alternatives Comparison
| Database | Strengths | Weaknesses |
|---|---|---|
| Neo4j | Best tooling, largest community | Higher cost at scale |
| Amazon Neptune | AWS native, managed | Less intuitive query language |
| TigerGraph | Fastest at extreme scale | Steeper learning curve |
| Memgraph | In-memory, very fast | Smaller ecosystem |
For most LLM applications, Neo4j Aura (managed cloud) is the best starting point.
2. Designing Your Graph Schema
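Good schema design accounts for a large share of GraphRAG success, because retrieval can only traverse the relationships your schema actually models.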
Schema Design Principles
1. Start with Questions
What will users ask?
"Who approved this contract?"
→ Need: Contract, Person, APPROVED_BY relationship
"What policies reference this regulation?"
→ Need: Policy, Regulation, REFERENCES relationship
"Show all documents related to Project Atlas"
→ Need: Document, Project, RELATED_TO relationship
2. Keep Node Types Focused
❌ Bad: Generic "Entity" node for everything
✅ Good: Specific types (Person, Contract, Project, Policy)
3. Relationships Should Have Meaning
❌ Bad: -[:RELATED]- (too vague)
✅ Good: -[:APPROVED_BY]-, -[:REFERENCES]-, -[:WORKS_FOR]-
Example Schema: Contract Analysis
// Node Types
(:Contract {id, title, effective_date, value, status})
(:Party {name, type, jurisdiction})
(:Person {name, title, email, department})
(:Clause {id, type, text, section_number})
(:Amendment {id, date, description})
(:Document {id, content, created_at})
// Relationship Types
(:Contract)-[:BETWEEN {role: "buyer"|"seller"}]->(:Party)
(:Contract)-[:APPROVED_BY {date}]->(:Person)
(:Contract)-[:CONTAINS]->(:Clause)
(:Contract)-[:AMENDED_BY]->(:Amendment)
(:Clause)-[:REFERENCES]->(:Clause)
(:Person)-[:WORKS_FOR]->(:Party)
(:Person)-[:REPORTS_TO]->(:Person)
(:Document)-[:MENTIONS]->(:Contract)
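One way to keep the graph faithful to this schema is to encode the allowed relationship triples in application code and reject anything else before it is written. The sketch below is an assumption about how you might do that; `ALLOWED_TRIPLES` and `is_valid_triple` are hypothetical names, not part of Neo4j.

```python
# Allowed (source label, relationship type, target label) triples,
# copied from the example contract-analysis schema.
ALLOWED_TRIPLES = {
    ("Contract", "BETWEEN", "Party"),
    ("Contract", "APPROVED_BY", "Person"),
    ("Contract", "CONTAINS", "Clause"),
    ("Contract", "AMENDED_BY", "Amendment"),
    ("Clause", "REFERENCES", "Clause"),
    ("Person", "WORKS_FOR", "Party"),
    ("Person", "REPORTS_TO", "Person"),
    ("Document", "MENTIONS", "Contract"),
}


def is_valid_triple(source_label: str, rel_type: str, target_label: str) -> bool:
    """Return True only if the schema defines this exact, directed triple."""
    return (source_label, rel_type, target_label) in ALLOWED_TRIPLES


print(is_valid_triple("Contract", "APPROVED_BY", "Person"))  # True
print(is_valid_triple("Person", "APPROVED_BY", "Contract"))  # False (wrong direction)
```

Because the check is direction-sensitive, it also catches reversed relationships, which are easy to produce when relationships come from automated extraction.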
3. Setting Up Neo4j
Option A: Neo4j Aura (Recommended)
# 1. Create account at https://neo4j.com/cloud/aura/
# 2. Create a new database (Free tier available)
# 3. Note connection details:
# - URI: neo4j+s://xxxxx.databases.neo4j.io
# - Username: neo4j
# - Password: (generated)
Option B: Local Docker
docker run \
  --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your-password \
  -e NEO4J_PLUGINS='["apoc"]' \
  neo4j:5.15.0
Python Connection
from neo4j import GraphDatabase


class KnowledgeGraph:
    """Thin wrapper around the Neo4j Python driver."""

    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def query(self, cypher: str, params: dict | None = None) -> list[dict]:
        # Open a short-lived session and return each record as a plain dict
        with self.driver.session() as session:
            result = session.run(cypher, params or {})
            return [record.data() for record in result]


# Initialize
graph = KnowledgeGraph(
    uri="neo4j+s://xxxxx.databases.neo4j.io",
    user="neo4j",
    password="your-password",
)
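In practice you will probably not want the password hardcoded in source. A minimal sketch of loading the connection details from environment variables follows; the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are an assumed convention, and `load_neo4j_config` is a hypothetical helper.

```python
import os


def load_neo4j_config(env=None) -> dict:
    """Read connection settings from environment variables.

    The variable names are an assumed convention, not something the
    Neo4j driver itself requires.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "uri": env["NEO4J_URI"],
        "user": env.get("NEO4J_USERNAME", "neo4j"),  # default Aura/Docker username
        "password": env["NEO4J_PASSWORD"],
    }


# Usage with the KnowledgeGraph class from this section:
#   graph = KnowledgeGraph(**load_neo4j_config())
```

The returned keys match the `uri`, `user`, and `password` parameters of `KnowledgeGraph.__init__`, so the dictionary can be unpacked directly into the constructor.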
4. Entity Extraction with LLMs
The quality of your graph depends on extraction quality.
LLM-Based Extraction
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EXTRACTION_PROMPT = """Extract entities and relationships from this document.

Document:
{document}

Return JSON with exactly this structure:
{{
  "entities": [
    {{"name": "entity name", "type": "Person|Contract|Party|Clause|Project", "properties": {{}}}}
  ],
  "relationships": [
    {{"source": "entity name", "target": "entity name", "type": "APPROVED_BY|REFERENCES|WORKS_FOR|CONTAINS", "properties": {{}}}}
  ]
}}

JSON:"""


def extract_entities(document: str) -> dict:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=2000,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(document=document)}],
    )
    text = response.content[0].text
    # The model may wrap its answer in a ```json fence; keep only the fenced body
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
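Since graph quality depends on extraction quality, it can help to validate the model's reply before anything reaches the database. The sketch below is one possible approach, not part of the Anthropic SDK: `parse_extraction` is a hypothetical helper that tolerates surrounding prose, drops entities with unknown types, and drops relationships whose endpoints were not extracted. The allowed type sets are copied from the extraction prompt.

```python
import json

# Type names copied from the extraction prompt.
ALLOWED_ENTITY_TYPES = {"Person", "Contract", "Party", "Clause", "Project"}
ALLOWED_REL_TYPES = {"APPROVED_BY", "REFERENCES", "WORKS_FOR", "CONTAINS"}


def _as_list(value) -> list:
    return value if isinstance(value, list) else []


def parse_extraction(text: str) -> dict:
    """Parse the outermost JSON object in a model reply and keep only valid items."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return {"entities": [], "relationships": []}
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {"entities": [], "relationships": []}

    entities = [
        e for e in _as_list(data.get("entities"))
        if isinstance(e, dict)
        and isinstance(e.get("name"), str) and e["name"].strip()
        and isinstance(e.get("type"), str) and e["type"] in ALLOWED_ENTITY_TYPES
    ]
    names = {e["name"] for e in entities}
    # Keep a relationship only if its type is known and both ends were extracted
    relationships = [
        r for r in _as_list(data.get("relationships"))
        if isinstance(r, dict)
        and isinstance(r.get("type"), str) and r["type"] in ALLOWED_REL_TYPES
        and isinstance(r.get("source"), str) and r["source"] in names
        and isinstance(r.get("target"), str) and r["target"] in names
    ]
    return {"entities": entities, "relationships": relationships}
```

Returning an empty result instead of raising is a design choice for batch ingestion, where one malformed reply should not stop the run; you may prefer to log or retry instead.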
5. Ingesting Documents
Full Ingestion Pipeline
class GraphIngestion:
    def __init__(self, graph: KnowledgeGraph):
        self.graph = graph

    def ingest_document(self, doc_id: str, content: str):
        extracted = extract_entities(content)

        # Create document node (content truncated to 5,000 characters)
        self.graph.query("""
            MERGE (d:Document {id: $id})
            SET d.content = $content, d.created_at = datetime()
        """, {"id": doc_id, "content": content[:5000]})

        # Create entity nodes and relationships.
        # _create_entity and _create_relationship are helper methods that
        # must be defined on this class; they are not shown here.
        for entity in extracted["entities"]:
            self._create_entity(entity, doc_id)
        for rel in extracted["relationships"]:
            self._create_relationship(rel)
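The pipeline calls `_create_entity` and `_create_relationship` without showing them. A minimal sketch of the Cypher they might run is below. Several things here are assumptions: entities are merged on `name`, each entity is linked to its source document with `MENTIONS`, and the functions `entity_statement` and `relationship_statement` are hypothetical. Cypher cannot take labels or relationship types as query parameters, so they are interpolated into the query text only after an allowlist check.

```python
# Type names copied from the extraction prompt.
ENTITY_LABELS = {"Person", "Contract", "Party", "Clause", "Project"}
REL_TYPES = {"APPROVED_BY", "REFERENCES", "WORKS_FOR", "CONTAINS"}


def entity_statement(entity: dict, doc_id: str) -> tuple[str, dict]:
    """Build (cypher, params) that upserts one entity and links it to its document."""
    label = entity["type"]
    if label not in ENTITY_LABELS:  # never interpolate an unchecked label
        raise ValueError(f"Unexpected entity type: {label!r}")
    cypher = (
        f"MERGE (e:{label} {{name: $name}}) "
        "SET e += $properties "
        "WITH e MATCH (d:Document {id: $doc_id}) "
        "MERGE (d)-[:MENTIONS]->(e)"
    )
    params = {
        "name": entity["name"],
        "properties": entity.get("properties") or {},
        "doc_id": doc_id,
    }
    return cypher, params


def relationship_statement(rel: dict) -> tuple[str, dict]:
    """Build (cypher, params) that upserts one relationship between named nodes."""
    rel_type = rel["type"]
    if rel_type not in REL_TYPES:
        raise ValueError(f"Unexpected relationship type: {rel_type!r}")
    # Matching without a label scans all nodes; fine for a sketch, slow at scale.
    cypher = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[r:{rel_type}]->(b) "
        "SET r += $properties"
    )
    params = {
        "source": rel["source"],
        "target": rel["target"],
        "properties": rel.get("properties") or {},
    }
    return cypher, params
```

Inside `GraphIngestion`, `_create_entity` could then be as short as `self.graph.query(*entity_statement(entity, doc_id))`, and `_create_relationship` the same with `relationship_statement(rel)`. Using `MERGE` rather than `CREATE` keeps re-ingestion of the same document from duplicating nodes.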
6. Essential Cypher Queries
Basic Traversal
// Find all contracts approved by a person
MATCH (p:Person {name: "John Smith"})<-[:APPROVED_BY]-(c:Contract)
RETURN c.title, c.effective_date
// Find the approval chain for a contract
MATCH (c:Contract {id: "CONTRACT-001"})-[:APPROVED_BY]->(approver:Person)
OPTIONAL MATCH path = (approver)-[:REPORTS_TO*1..3]->(manager:Person)
RETURN approver.name, [node in nodes(path) | node.name] as chain
Multi-Hop Queries
// Find contracts that reference a specific clause from another contract
MATCH (c1:Contract)-[:CONTAINS]->(clause1:Clause)-[:REFERENCES]->(clause2:Clause)<-[:CONTAINS]-(c2:Contract)
WHERE c1 <> c2
RETURN c1.title as source_contract, c2.title as referenced_contract
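When you run these queries from application code, it is generally safer to pass values as query parameters than to embed literals such as "John Smith" in the Cypher text. The sketch below assumes a `graph` object exposing a `query(cypher, params)` method like the `KnowledgeGraph` wrapper from the setup section; the function names are hypothetical.

```python
# Parameterized versions of the traversal queries; $name and $contract_id
# are filled in by the driver rather than by string formatting.
APPROVED_CONTRACTS = """
MATCH (p:Person {name: $name})<-[:APPROVED_BY]-(c:Contract)
RETURN c.title AS title, c.effective_date AS effective_date
"""

APPROVAL_CHAIN = """
MATCH (c:Contract {id: $contract_id})-[:APPROVED_BY]->(approver:Person)
OPTIONAL MATCH path = (approver)-[:REPORTS_TO*1..3]->(manager:Person)
RETURN approver.name AS approver, [node IN nodes(path) | node.name] AS chain
"""


def contracts_approved_by(graph, name: str) -> list[dict]:
    """Contracts approved by the named person."""
    return graph.query(APPROVED_CONTRACTS, {"name": name})


def approval_chain(graph, contract_id: str) -> list[dict]:
    """Approver plus up to three levels of REPORTS_TO above them."""
    return graph.query(APPROVAL_CHAIN, {"contract_id": contract_id})
```

Parameters avoid Cypher injection through user-supplied names and let Neo4j reuse the cached query plan across calls.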
7. Integrating with Your LLM
LangChain Integration
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_anthropic import ChatAnthropic

graph = Neo4jGraph(url="neo4j+s://xxxxx.databases.neo4j.io", username="neo4j", password="your-password")
llm = ChatAnthropic(model="claude-3-sonnet-20240229")

# The chain asks the LLM to write Cypher, runs it, and summarizes the rows.
# Recent langchain releases also require allow_dangerous_requests=True here.
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True)

response = chain.invoke({"query": "Who approved the contract with Acme Corp?"})
print(response["result"])
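Because this kind of integration executes Cypher written by a model, you may want a guard in front of any generated query you run yourself. The following is a deliberately conservative sketch, not a LangChain feature: `is_read_only` is a hypothetical helper that rejects any query containing a write keyword. It can produce false positives (for example, a string literal containing the word "set") and it does not inspect procedure calls, so a database user with read-only permissions remains the stronger safeguard.

```python
import re

# Cypher clauses that modify data or schema. A keyword match anywhere in
# the query, including inside string literals, is treated as a write.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True if no write clause keyword appears in the query text."""
    return WRITE_CLAUSES.search(cypher) is None


print(is_read_only("MATCH (c:Contract) RETURN c.title"))  # True
print(is_read_only("MATCH (n) DETACH DELETE n"))          # False
```

A typical use would be to check generated Cypher with `is_read_only` and refuse to execute it when the check fails, rather than trying to repair the query.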
8. Performance Optimization
Essential Indexes
CREATE CONSTRAINT contract_id FOR (c:Contract) REQUIRE c.id IS UNIQUE;
CREATE CONSTRAINT person_name FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE INDEX contract_date FOR (c:Contract) ON (c.effective_date);
CREATE FULLTEXT INDEX clause_text FOR (c:Clause) ON EACH [c.text];
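If you create these from application startup code, adding `IF NOT EXISTS` makes the statements safe to re-run. The sketch below assumes Neo4j 5 syntax and a `graph` object with a `query(cypher, params)` method like the `KnowledgeGraph` wrapper from the setup section; `ensure_schema` and `search_clauses` are hypothetical helpers.

```python
# Same constraints and indexes as above, made idempotent with IF NOT EXISTS.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT contract_id IF NOT EXISTS FOR (c:Contract) REQUIRE c.id IS UNIQUE",
    "CREATE CONSTRAINT person_name IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE",
    "CREATE INDEX contract_date IF NOT EXISTS FOR (c:Contract) ON (c.effective_date)",
    "CREATE FULLTEXT INDEX clause_text IF NOT EXISTS FOR (c:Clause) ON EACH [c.text]",
]

# Query the full-text index by name; results come back ranked by score.
FULLTEXT_SEARCH = """
CALL db.index.fulltext.queryNodes("clause_text", $search_text)
YIELD node, score
RETURN node.id AS id, node.text AS text, score
ORDER BY score DESC
LIMIT $limit
"""


def ensure_schema(graph) -> int:
    """Run every schema statement once; returns how many were sent."""
    for statement in SCHEMA_STATEMENTS:
        graph.query(statement)
    return len(SCHEMA_STATEMENTS)


def search_clauses(graph, search_text: str, limit: int = 5) -> list[dict]:
    """Keyword search over Clause.text using the clause_text index."""
    return graph.query(FULLTEXT_SEARCH, {"search_text": search_text, "limit": limit})
```

Note that a uniqueness constraint also creates a backing index, so `Contract.id` and `Person.name` lookups are indexed without a separate `CREATE INDEX`.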
9. Common Mistakes
Mistake 1: Overly Generic Schema
❌ Bad: (:Entity {type: "Person", name: "John"})
✅ Good: (:Person {name: "John"})
Mistake 2: Missing Relationship Direction
❌ Bad: (a)-[:RELATED]-(b)
✅ Good: (contract)-[:APPROVED_BY]->(person)
Mistake 3: Storing Full Documents in Graph
❌ Bad: (:Document {content: "... 50,000 characters ..."})
✅ Good: (:Document {id: "doc-001", snippet: "...", s3_key: "documents/doc-001.pdf"})
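A small sketch of building the "good" property set: the node keeps an identifier, a short snippet, and a pointer to where the full file lives. The 500-character default and the `document_properties` helper are assumptions for illustration; the property names follow the example above.

```python
def document_properties(
    doc_id: str, content: str, s3_key: str, snippet_chars: int = 500
) -> dict:
    """Properties for a Document node: a pointer and a snippet, not the full text."""
    return {
        "id": doc_id,
        "snippet": content[:snippet_chars],  # enough for previews and debugging
        "s3_key": s3_key,                    # full document stays in object storage
    }


props = document_properties("doc-001", "A" * 50_000, "documents/doc-001.pdf")
print(len(props["snippet"]))  # 500
```

The resulting dictionary can be passed as a single map parameter, for example with `MERGE (d:Document {id: $props.id}) SET d += $props`.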
Next Steps
- GraphRAG Implementation Guide: Complete architecture with Neo4j
- Hybrid Search Implementation: Combine graph with vector search
- RAG vs GraphRAG: When to use each approach
Need help building knowledge graphs?
At Cognilium, we built Legal Lens AI with 4.2M nodes and 8.7M relationships.
Muhammad Mudassir
Founder & CEO, Cognilium AI
