How to improve results with prompting
In this guide we’ll go over prompting strategies to improve graph database query generation. We’ll largely focus on methods for getting relevant database-specific information in your prompt.
The GraphCypherQAChain
used in this guide will execute Cypher statements against the provided database.
For production, make sure that the database connection uses credentials that are narrowly-scoped to only include necessary permissions.
Failure to do so may result in data corruption or loss, since the calling code may attempt commands that would result in deletion, mutation of data if appropriately prompted or reading sensitive data if such data is present in the database.
Setup
Install dependencies
- npm
- yarn
- pnpm
npm i langchain @langchain/community @langchain/openai @langchain/core neo4j-driver
yarn add langchain @langchain/community @langchain/openai @langchain/core neo4j-driver
pnpm add langchain @langchain/community @langchain/openai @langchain/core neo4j-driver
Set environment variables
We’ll use OpenAI in this example:
OPENAI_API_KEY=your-api-key
# Optional, use LangSmith for best-in-class observability
LANGSMITH_API_KEY=your-api-key
LANGSMITH_TRACING=true
# Reduce tracing latency if you are not in a serverless environment
# LANGCHAIN_CALLBACKS_BACKGROUND=true
Next, we need to define Neo4j credentials. Follow these installation steps to set up a Neo4j database.
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="password"
The below example will create a connection with a Neo4j database and will populate it with example data about movies and their actors.
const url = process.env.NEO4J_URI;
const username = process.env.NEO4J_USER;
const password = process.env.NEO4J_PASSWORD;
import "neo4j-driver";
import { Neo4jGraph } from "@langchain/community/graphs/neo4j_graph";
const graph = await Neo4jGraph.initialize({ url, username, password });
// Import movie information
const moviesQuery = `LOAD CSV WITH HEADERS FROM
'https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/movies/movies_small.csv'
AS row
MERGE (m:Movie {id:row.movieId})
SET m.released = date(row.released),
m.title = row.title,
m.imdbRating = toFloat(row.imdbRating)
FOREACH (director in split(row.director, '|') |
MERGE (p:Person {name:trim(director)})
MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor in split(row.actors, '|') |
MERGE (p:Person {name:trim(actor)})
MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre in split(row.genres, '|') |
MERGE (g:Genre {name:trim(genre)})
MERGE (m)-[:IN_GENRE]->(g))`;
await graph.query(moviesQuery);
Schema refreshed successfully.
[]
Filtering graph schema
At times, you may need to focus on a specific subset of the graph schema while generating Cypher statements. Let’s say we are dealing with the following graph schema:
await graph.refreshSchema();
console.log(graph.getSchema());
Node properties are the following:
Movie {imdbRating: FLOAT, id: STRING, released: DATE, title: STRING}, Person {name: STRING}, Genre {name: STRING}, Chunk {embedding: LIST, id: STRING, text: STRING}
Relationship properties are the following:
The relationships are the following:
(:Movie)-[:IN_GENRE]->(:Genre), (:Person)-[:DIRECTED]->(:Movie), (:Person)-[:ACTED_IN]->(:Movie)
Few-shot examples
Including examples of natural language questions being converted to valid Cypher queries against our database in the prompt will often improve model performance, especially for complex queries.
Let’s say we have the following examples:
const examples = [
{
question: "How many artists are there?",
query: "MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)",
},
{
question: "Which actors played in the movie Casino?",
query: "MATCH (m:Movie {{title: 'Casino'}})<-[:ACTED_IN]-(a) RETURN a.name",
},
{
question: "How many movies has Tom Hanks acted in?",
query:
"MATCH (a:Person {{name: 'Tom Hanks'}})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
},
{
question: "List all the genres of the movie Schindler's List",
query:
"MATCH (m:Movie {{title: 'Schindler\\'s List'}})-[:IN_GENRE]->(g:Genre) RETURN g.name",
},
{
question:
"Which actors have worked in movies from both the comedy and action genres?",
query:
"MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name",
},
{
question:
"Which directors have made movies with at least three different actors named 'John'?",
query:
"MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) WHERE a.name STARTS WITH 'John' WITH d, COUNT(DISTINCT a) AS JohnsCount WHERE JohnsCount >= 3 RETURN d.name",
},
{
question: "Identify movies where directors also played a role in the film.",
query:
"MATCH (p:Person)-[:DIRECTED]->(m:Movie), (p)-[:ACTED_IN]->(m) RETURN m.title, p.name",
},
{
question:
"Find the actor with the highest number of movies in the database.",
query:
"MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name, COUNT(m) AS movieCount ORDER BY movieCount DESC LIMIT 1",
},
];
We can create a few-shot prompt with them like so:
import { FewShotPromptTemplate, PromptTemplate } from "@langchain/core/prompts";
const examplePrompt = PromptTemplate.fromTemplate(
"User input: {question}\nCypher query: {query}"
);
const prompt = new FewShotPromptTemplate({
examples: examples.slice(0, 5),
examplePrompt,
prefix:
"You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
suffix: "User input: {question}\nCypher query: ",
inputVariables: ["question", "schema"],
});
console.log(
await prompt.format({
question: "How many artists are there?",
schema: "foo",
})
);
You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
Here is the schema information
foo.
Below are a number of examples of questions and their corresponding Cypher queries.
User input: How many artists are there?
Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)
User input: Which actors played in the movie Casino?
Cypher query: MATCH (m:Movie {title: 'Casino'})<-[:ACTED_IN]-(a) RETURN a.name
User input: How many movies has Tom Hanks acted in?
Cypher query: MATCH (a:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN count(m)
User input: List all the genres of the movie Schindler's List
Cypher query: MATCH (m:Movie {title: 'Schindler\'s List'})-[:IN_GENRE]->(g:Genre) RETURN g.name
User input: Which actors have worked in movies from both the comedy and action genres?
Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name
User input: How many artists are there?
Cypher query:
Dynamic few-shot examples
If we have enough examples, we may want to only include the most relevant ones in the prompt, either because they don’t fit in the model’s context window or because the long tail of examples distracts the model. And specifically, given any input we want to include the examples most relevant to that input.
We can do just this using an ExampleSelector. In this case we’ll use a SemanticSimilarityExampleSelector, which will store the examples in the vector database of our choosing. At runtime it will perform a similarity search between the input and our examples, and return the most semantically similar ones:
import { OpenAIEmbeddings } from "@langchain/openai";
import { SemanticSimilarityExampleSelector } from "@langchain/core/example_selectors";
import { Neo4jVectorStore } from "@langchain/community/vectorstores/neo4j_vector";
const exampleSelector = await SemanticSimilarityExampleSelector.fromExamples(
examples,
new OpenAIEmbeddings(),
Neo4jVectorStore,
{
k: 5,
inputKeys: ["question"],
preDeleteCollection: true,
url,
username,
password,
}
);
await exampleSelector.selectExamples({
question: "how many artists are there?",
});
[
{
query: "MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)",
question: "How many artists are there?"
},
{
query: "MATCH (a:Person {{name: 'Tom Hanks'}})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
question: "How many movies has Tom Hanks acted in?"
},
{
query: "MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE"... 84 more characters,
question: "Which actors have worked in movies from both the comedy and action genres?"
},
{
query: "MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) WHERE a.name STARTS WITH 'John' WITH"... 71 more characters,
question: "Which directors have made movies with at least three different actors named 'John'?"
},
{
query: "MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name, COUNT(m) AS movieCount ORDER BY movieCount DES"... 9 more characters,
question: "Find the actor with the highest number of movies in the database."
}
]
To use it, we can pass the ExampleSelector directly in to our FewShotPromptTemplate:
const promptWithExampleSelector = new FewShotPromptTemplate({
exampleSelector,
examplePrompt,
prefix:
"You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.\n\nHere is the schema information\n{schema}.\n\nBelow are a number of examples of questions and their corresponding Cypher queries.",
suffix: "User input: {question}\nCypher query: ",
inputVariables: ["question", "schema"],
});
console.log(
await promptWithExampleSelector.format({
question: "how many artists are there?",
schema: "foo",
})
);
You are a Neo4j expert. Given an input question, create a syntactically correct Cypher query to run.
Here is the schema information
foo.
Below are a number of examples of questions and their corresponding Cypher queries.
User input: How many artists are there?
Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie) RETURN count(DISTINCT a)
User input: How many movies has Tom Hanks acted in?
Cypher query: MATCH (a:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN count(m)
User input: Which actors have worked in movies from both the comedy and action genres?
Cypher query: MATCH (a:Person)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g1:Genre), (a)-[:ACTED_IN]->(:Movie)-[:IN_GENRE]->(g2:Genre) WHERE g1.name = 'Comedy' AND g2.name = 'Action' RETURN DISTINCT a.name
User input: Which directors have made movies with at least three different actors named 'John'?
Cypher query: MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) WHERE a.name STARTS WITH 'John' WITH d, COUNT(DISTINCT a) AS JohnsCount WHERE JohnsCount >= 3 RETURN d.name
User input: Find the actor with the highest number of movies in the database.
Cypher query: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) RETURN a.name, COUNT(m) AS movieCount ORDER BY movieCount DESC LIMIT 1
User input: how many artists are there?
Cypher query:
import { ChatOpenAI } from "@langchain/openai";
import { GraphCypherQAChain } from "langchain/chains/graph_qa/cypher";
const llm = new ChatOpenAI({
model: "gpt-3.5-turbo",
temperature: 0,
});
const chain = GraphCypherQAChain.fromLLM({
graph,
llm,
cypherPrompt: promptWithExampleSelector,
});
await chain.invoke({
query: "How many actors are in the graph?",
});
{ result: "There are 967 actors in the graph." }