How many entities can Neo4j handle?

Neo4j scales to billions of nodes and relationships. Neo4j Aura (managed cloud) handles most enterprise workloads easily. Our Legal Lens AI graph has 4.2M nodes and 8.7M relationships with sub-second query times.

Should I use Neo4j Aura or self-host?

Start with Neo4j Aura (managed cloud) unless you have specific compliance requirements that call for self-hosting. Aura handles backups, scaling, and security. A free tier is available for development.

How accurate is LLM-based entity extraction?

With well-crafted prompts and few-shot examples, Claude 3 Haiku achieves 85-90% accuracy on structured documents. Use Claude 3 Sonnet for complex documents. Always validate extraction before graph ingestion.

Can I use a different graph database?

Yes. Amazon Neptune works well for AWS-native deployments. TigerGraph is faster for very large graphs. The concepts in this guide apply to any graph database; only the query syntax differs.

How do I handle entity resolution (same entity, different names)?

Implement entity resolution during ingestion. Use fuzzy matching or LLM-based matching to identify duplicates, then store one canonical name plus its aliases. Example: "John Smith", "J. Smith", and "Mr. Smith" all resolve to a single Person node with an aliases property.
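
A minimal sketch of the fuzzy-matching route, using Python's standard-library difflib. The helper names (normalize, resolve_alias), the list of honorifics, and the 0.6 threshold are illustrative assumptions rather than part of any production pipeline; tune them against your own data.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical list of honorifics to drop before comparing names
TITLES = {"mr", "mrs", "ms", "dr"}

def normalize(name: str) -> str:
    """Lowercase the name and drop punctuation and honorifics."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return " ".join(t for t in tokens if t and t not in TITLES)

def resolve_alias(name: str, canonical_names: list, threshold: float = 0.6) -> Optional[str]:
    """Return the closest canonical name, or None if nothing is similar enough."""
    target = normalize(name)
    best_name, best_score = None, 0.0
    for candidate in canonical_names:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None
```

When resolve_alias returns a canonical name, append the incoming spelling to that node's aliases property; when it returns None, create a new node. Borderline scores are good candidates for an LLM-based or human review step.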

Building Knowledge Graphs for LLMs with Neo4j (Guide)

Vector search finds similar text. Knowledge graphs find connected meaning. When your documents reference each other (contracts linking to amendments, people connected to projects, policies tied to regulations), you need a graph to capture these relationships. Neo4j is a leading choice for LLM knowledge graphs. Here's how to build one.

What is a Knowledge Graph?

A knowledge graph is a database that stores information as entities (nodes) and relationships (edges). Unlike relational databases with rigid tables, knowledge graphs naturally represent how things connect. For LLMs, knowledge graphs enable multi-hop reasoning: "Find the manager of the person who approved this contract" requires traversing relationships, not just matching keywords.
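
To make the multi-hop idea concrete, here is a toy in-memory sketch. It is not how Neo4j stores data; the EDGES dictionary, the traverse helper, and the names Alice and Bob are hypothetical and exist only to show that the answer comes from following two relationships in sequence.

```python
# Toy graph: (source node, relationship type) -> target node
EDGES = {
    ("CONTRACT-001", "APPROVED_BY"): "Alice",
    ("Alice", "REPORTS_TO"): "Bob",
}

def traverse(start, relationships):
    """Follow each relationship in order; return None if any hop is missing."""
    node = start
    for rel in relationships:
        node = EDGES.get((node, rel))
        if node is None:
            return None
    return node

# "Find the manager of the person who approved this contract" is two hops
manager = traverse("CONTRACT-001", ["APPROVED_BY", "REPORTS_TO"])
```

Keyword or vector matching on "CONTRACT-001" alone would never surface Bob, because no single document needs to mention both.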

1. Why Neo4j for LLMs

Neo4j Advantages

Feature	Benefit for LLMs
Native graph storage	Fast traversal, no JOINs
Cypher query language	Intuitive relationship queries
Built-in visualization	Debug and explore data
LLM integrations	LangChain, LlamaIndex connectors
Managed cloud (Aura)	No DevOps overhead

Alternatives Comparison

Database	Strengths	Weaknesses
Neo4j	Best tooling, largest community	Higher cost at scale
Amazon Neptune	AWS native, managed	Less intuitive query language
TigerGraph	Fastest at extreme scale	Steeper learning curve
Memgraph	In-memory, very fast	Smaller ecosystem

For most LLM applications, Neo4j Aura (managed cloud) is the best starting point.

2. Designing Your Graph Schema

Good schema design is 50% of GraphRAG success.

Schema Design Principles

1. Start with Questions

What will users ask?

"Who approved this contract?"
→ Need: Contract, Person, APPROVED_BY relationship

"What policies reference this regulation?"
→ Need: Policy, Regulation, REFERENCES relationship

"Show all documents related to Project Atlas"
→ Need: Document, Project, RELATED_TO relationship

2. Keep Node Types Focused

❌ Bad: Generic "Entity" node for everything
✅ Good: Specific types (Person, Contract, Project, Policy)

3. Relationships Should Have Meaning

❌ Bad: -[:RELATED]- (too vague)
✅ Good: -[:APPROVED_BY]-, -[:REFERENCES]-, -[:WORKS_FOR]-

Example Schema: Contract Analysis

// Node Types
(:Contract {id, title, effective_date, value, status})
(:Party {name, type, jurisdiction})
(:Person {name, title, email, department})
(:Clause {id, type, text, section_number})
(:Amendment {id, date, description})
(:Document {id, content, created_at})

// Relationship Types
(:Contract)-[:BETWEEN {role: "buyer"|"seller"}]->(:Party)
(:Contract)-[:APPROVED_BY {date}]->(:Person)
(:Contract)-[:CONTAINS]->(:Clause)
(:Contract)-[:AMENDED_BY]->(:Amendment)
(:Clause)-[:REFERENCES]->(:Clause)
(:Person)-[:WORKS_FOR]->(:Party)
(:Person)-[:REPORTS_TO]->(:Person)
(:Document)-[:MENTIONS]->(:Contract)
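
One way to keep ingestion code aligned with this schema is to mirror it in Python and reject anything that does not fit. The sketch below is an assumption about how you might do that; ALLOWED_TRIPLES and is_allowed are illustrative names, and the triples are copied from the relationship types listed above.

```python
# (source label, relationship type, target label) triples from the example schema
ALLOWED_TRIPLES = {
    ("Contract", "BETWEEN", "Party"),
    ("Contract", "APPROVED_BY", "Person"),
    ("Contract", "CONTAINS", "Clause"),
    ("Contract", "AMENDED_BY", "Amendment"),
    ("Clause", "REFERENCES", "Clause"),
    ("Person", "WORKS_FOR", "Party"),
    ("Person", "REPORTS_TO", "Person"),
    ("Document", "MENTIONS", "Contract"),
}

def is_allowed(source_label: str, rel_type: str, target_label: str) -> bool:
    """Check a proposed relationship, including its direction, against the schema."""
    return (source_label, rel_type, target_label) in ALLOWED_TRIPLES
```

Because the check is direction-sensitive, a reversed edge such as Person to Contract via APPROVED_BY is rejected before it reaches the database.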

3. Setting Up Neo4j

Option A: Neo4j Aura (Recommended)

# 1. Create account at https://neo4j.com/cloud/aura/
# 2. Create a new database (Free tier available)
# 3. Note connection details:
#    - URI: neo4j+s://xxxxx.databases.neo4j.io
#    - Username: neo4j
#    - Password: (generated)

Option B: Local Docker

docker run \
    --name neo4j \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/your-password \
    -e NEO4J_PLUGINS='["apoc"]' \
    neo4j:5.15.0

Python Connection

from neo4j import GraphDatabase

class KnowledgeGraph:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
    
    def close(self):
        self.driver.close()
    
    def query(self, cypher: str, params: dict = None):
        with self.driver.session() as session:
            result = session.run(cypher, params or {})
            return [record.data() for record in result]

# Initialize
graph = KnowledgeGraph(
    uri="neo4j+s://xxxxx.databases.neo4j.io",
    user="neo4j",
    password="your-password"
)
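
In practice you will probably not want the password hardcoded. A small sketch, assuming the credentials live in environment variables named NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD (these names and the load_neo4j_config helper are assumptions; use whatever your deployment defines):

```python
import os

def load_neo4j_config(env=None) -> dict:
    """Read connection settings from the environment and fail early if any are missing."""
    env = os.environ if env is None else env
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "uri": env["NEO4J_URI"],
        "user": env.get("NEO4J_USERNAME", "neo4j"),  # Aura's default username
        "password": env["NEO4J_PASSWORD"],
    }
```

The returned keys match the KnowledgeGraph constructor, so KnowledgeGraph(**load_neo4j_config()) builds the same object without secrets in source control.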

4. Entity Extraction with LLMs

The quality of your graph depends on extraction quality.

LLM-Based Extraction

from anthropic import Anthropic

client = Anthropic()

EXTRACTION_PROMPT = """Extract entities and relationships from this document.

Document:
{document}

Return JSON with exactly this structure:
{{
    "entities": [
        {{"name": "entity name", "type": "Person|Contract|Party|Clause|Project", "properties": {{}}}}
    ],
    "relationships": [
        {{"source": "entity name", "target": "entity name", "type": "APPROVED_BY|REFERENCES|WORKS_FOR|CONTAINS", "properties": {{}}}}
    ]
}}

JSON:"""

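def extract_entities(document: str) -> dict:
    import json  # local import keeps this snippet self-contained

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=2000,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(document=document)}]
    )
    text = response.content[0].text
    # The model sometimes wraps its answer in a ```json fence; strip it before parsing
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    # Raises json.JSONDecodeError if the model returned malformed JSON
    return json.loads(text.strip())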
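
Since extraction quality drives graph quality, it helps to check the parsed JSON before ingesting it. The sketch below is one possible validator, not a complete one; validate_extraction is a hypothetical helper, and the allowed types are taken from the lists in EXTRACTION_PROMPT.

```python
ALLOWED_ENTITY_TYPES = {"Person", "Contract", "Party", "Clause", "Project"}
ALLOWED_REL_TYPES = {"APPROVED_BY", "REFERENCES", "WORKS_FOR", "CONTAINS"}

def validate_extraction(extracted: dict) -> list:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = []
    names = set()
    for entity in extracted.get("entities", []):
        name = entity.get("name")
        if not name:
            problems.append("entity is missing a name")
            continue
        if entity.get("type") not in ALLOWED_ENTITY_TYPES:
            problems.append(f"{name}: unknown entity type {entity.get('type')!r}")
        names.add(name)
    for rel in extracted.get("relationships", []):
        if rel.get("type") not in ALLOWED_REL_TYPES:
            problems.append(f"unknown relationship type {rel.get('type')!r}")
        # Both ends must refer to an entity from the same extraction
        for end in ("source", "target"):
            if rel.get(end) not in names:
                problems.append(f"relationship {end} {rel.get(end)!r} is not an extracted entity")
    return problems
```

A non-empty result is a signal to retry the extraction, fall back to a stronger model, or route the document to review instead of writing questionable data to the graph.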

5. Ingesting Documents

Full Ingestion Pipeline

class GraphIngestion:
    def __init__(self, graph: KnowledgeGraph):
        self.graph = graph
    
    def ingest_document(self, doc_id: str, content: str):
        extracted = extract_entities(content)
        
        # Create document node
        self.graph.query("""
            MERGE (d:Document {id: $id})
            SET d.content = $content, d.created_at = datetime()
        """, {"id": doc_id, "content": content[:5000]})
        
        # Create entity nodes and relationships
        for entity in extracted["entities"]:
            self._create_entity(entity, doc_id)
        for rel in extracted["relationships"]:
            self._create_relationship(rel)
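
The pipeline above calls _create_entity and _create_relationship without showing them. The sketch below is one assumed way to write their core logic as standalone query builders; it is not the original implementation. Node labels and relationship types cannot be passed as ordinary $parameters, so they are checked against an allowlist before being interpolated. Merging on name and linking every entity to its document with MENTIONS are simplifying assumptions.

```python
ENTITY_LABELS = {"Person", "Contract", "Party", "Clause", "Project"}
REL_TYPES = {"APPROVED_BY", "REFERENCES", "WORKS_FOR", "CONTAINS"}

def build_entity_query(entity: dict, doc_id: str) -> tuple:
    """Build a MERGE statement for one extracted entity and link it to its document."""
    label = entity["type"]
    if label not in ENTITY_LABELS:
        raise ValueError(f"Unexpected entity type: {label!r}")
    cypher = (
        f"MERGE (e:{label} {{name: $name}}) "
        "SET e += $properties "
        "WITH e "
        "MATCH (d:Document {id: $doc_id}) "
        "MERGE (d)-[:MENTIONS]->(e)"
    )
    params = {
        "name": entity["name"],
        "properties": entity.get("properties", {}),
        "doc_id": doc_id,
    }
    return cypher, params

def build_relationship_query(rel: dict) -> tuple:
    """Build a MERGE statement for one extracted relationship."""
    rel_type = rel["type"]
    if rel_type not in REL_TYPES:
        raise ValueError(f"Unexpected relationship type: {rel_type!r}")
    # Matching without labels scans all nodes; add labels once your extraction supplies them
    cypher = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[r:{rel_type}]->(b) "
        "SET r += $properties"
    )
    params = {
        "source": rel["source"],
        "target": rel["target"],
        "properties": rel.get("properties", {}),
    }
    return cypher, params
```

Inside GraphIngestion, each helper method would then reduce to a single call such as self.graph.query(*build_entity_query(entity, doc_id)).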

6. Essential Cypher Queries

Basic Traversal

// Find all contracts approved by a person
MATCH (p:Person {name: "John Smith"})<-[:APPROVED_BY]-(c:Contract)
RETURN c.title, c.effective_date

// Find the approval chain for a contract
MATCH (c:Contract {id: "CONTRACT-001"})-[:APPROVED_BY]->(approver:Person)
OPTIONAL MATCH path = (approver)-[:REPORTS_TO*1..3]->(manager:Person)
RETURN approver.name, [node in nodes(path) | node.name] as chain
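
When these queries are driven by user input, pass values as query parameters instead of pasting them into the Cypher string. A brief sketch, assuming a graph object with the query(cypher, params) method of the KnowledgeGraph class from section 3; contracts_approved_by is a hypothetical helper name.

```python
APPROVED_CONTRACTS = """
MATCH (p:Person {name: $name})<-[:APPROVED_BY]-(c:Contract)
RETURN c.title AS title, c.effective_date AS effective_date
"""

def contracts_approved_by(graph, name: str) -> list:
    """Run the traversal with the person's name supplied as a parameter."""
    return graph.query(APPROVED_CONTRACTS, {"name": name})
```

Parameters avoid Cypher injection and let Neo4j reuse the cached query plan across different names.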

Multi-Hop Queries

// Find contracts that reference a specific clause from another contract
MATCH (c1:Contract)-[:CONTAINS]->(clause1:Clause)-[:REFERENCES]->(clause2:Clause)<-[:CONTAINS]-(c2:Contract)
WHERE c1 <> c2
RETURN c1.title as source_contract, c2.title as referenced_contract

7. Integrating with Your LLM

LangChain Integration

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_anthropic import ChatAnthropic

graph = Neo4jGraph(url="neo4j+s://xxxxx.databases.neo4j.io", username="neo4j", password="your-password")
llm = ChatAnthropic(model="claude-3-sonnet-20240229")

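# Note: recent langchain-community releases refuse to build this chain unless you also
# pass allow_dangerous_requests=True, because the LLM-generated Cypher runs directly
# against your database. Use credentials with narrowly scoped permissions.
chain = GraphCypherQAChain.from_llm(llm=llm, graph=graph, verbose=True)
response = chain.invoke({"query": "Who approved the contract with Acme Corp?"})
# The generated answer is returned under response["result"]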
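
If you prefer to run your own Cypher and hand the results to the model yourself, the rows need to be flattened into prompt text. A minimal sketch of that step; records_to_context and the max_rows cap are illustrative assumptions, and the input is the list of dictionaries that KnowledgeGraph.query returns.

```python
def records_to_context(records: list, max_rows: int = 20) -> str:
    """Render query rows as a bulleted block that can be pasted into a prompt."""
    lines = []
    for record in records[:max_rows]:
        fields = ", ".join(f"{key}: {value}" for key, value in record.items())
        lines.append(f"- {fields}")
    return "\n".join(lines)
```

Capping the number of rows keeps a broad traversal from overflowing the model's context window.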

8. Performance Optimization

Essential Indexes

CREATE CONSTRAINT contract_id FOR (c:Contract) REQUIRE c.id IS UNIQUE;
CREATE CONSTRAINT person_name FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE INDEX contract_date FOR (c:Contract) ON (c.effective_date);
CREATE FULLTEXT INDEX clause_text FOR (c:Clause) ON EACH [c.text];
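
Running these statements a second time fails because the constraints and indexes already exist. A sketch of an idempotent setup step, assuming Neo4j 5's IF NOT EXISTS clause and a graph object with the query method of the KnowledgeGraph class from section 3; apply_schema is a hypothetical helper.

```python
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT contract_id IF NOT EXISTS FOR (c:Contract) REQUIRE c.id IS UNIQUE",
    "CREATE CONSTRAINT person_name IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE",
    "CREATE INDEX contract_date IF NOT EXISTS FOR (c:Contract) ON (c.effective_date)",
    "CREATE FULLTEXT INDEX clause_text IF NOT EXISTS FOR (c:Clause) ON EACH [c.text]",
]

def apply_schema(graph) -> int:
    """Run every schema statement once; safe to call on each deployment."""
    for statement in SCHEMA_STATEMENTS:
        graph.query(statement)
    return len(SCHEMA_STATEMENTS)
```

Calling apply_schema(graph) before the first ingestion also means every MERGE on Contract.id or Person.name is backed by an index from the start.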

9. Common Mistakes

Mistake 1: Overly Generic Schema

❌ Bad: (:Entity {type: "Person", name: "John"})
✅ Good: (:Person {name: "John"})

Mistake 2: Missing Relationship Direction

❌ Bad: (a)-[:RELATED]-(b)
✅ Good: (contract)-[:APPROVED_BY]->(person)

Mistake 3: Storing Full Documents in Graph

❌ Bad: (:Document {content: "... 50,000 characters ..."})
✅ Good: (:Document {id: "doc-001", snippet: "...", s3_key: "documents/doc-001.pdf"})
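
A short sketch of the "good" pattern: keep a snippet and a pointer on the node, and leave the full text in object storage. The document_properties helper and the 500-character snippet length are assumptions; the property names follow the example above.

```python
def document_properties(doc_id: str, content: str, s3_key: str, snippet_chars: int = 500) -> dict:
    """Build the properties for a Document node without embedding the full text."""
    return {
        "id": doc_id,
        "snippet": content[:snippet_chars],  # short preview only
        "s3_key": s3_key,                    # where the full document lives
    }
```

The resulting dictionary can be passed as a parameter to a query such as MERGE (d:Document {id: $props.id}) SET d += $props.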

Next Steps

GraphRAG Implementation Guide: Complete architecture with Neo4j
Hybrid Search Implementation: Combine graph with vector search
RAG vs GraphRAG: When to use each approach

Need help building knowledge graphs?

At Cognilium, we built Legal Lens AI with 4.2M nodes and 8.7M relationships. Let's discuss your graph.

Muhammad Mudassir
