
Contextual chunk headers

Consider a scenario where you want to store a large, arbitrary collection of documents in a vector store and perform Q&A tasks on them. Simply splitting documents with overlapping text may not give an LLM enough context to determine whether multiple chunks reference the same information, or how it should resolve information from contradictory sources.

Tagging each document with metadata is one solution if you know what to filter against, but you may not know ahead of time exactly what kinds of queries your vector store will need to handle. Including additional contextual information directly in each chunk, in the form of a header, can help it cope with arbitrary queries.
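
Conceptually, a contextual chunk header is nothing more than a short identifying string prepended to each chunk's text before it is embedded. A minimal sketch of the idea (the header format here is arbitrary; use whatever identifies your source documents):

const chunkHeader = `DOCUMENT NAME: Jim Interview\n\n---\n\n`;
const chunkText = `My favorite color is blue.`;

// Both the embedding and the retrieved context now carry the source:
const contextualizedChunk = chunkHeader + chunkText;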

Here's a complete example:

First, install the required integration packages:

npm install @langchain/openai @langchain/community

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { CharacterTextSplitter } from "langchain/text_splitter";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// Character-level splitting; each short document below fits in a single chunk.
const splitter = new CharacterTextSplitter({
  chunkSize: 1536,
  chunkOverlap: 200,
});

// Prefix every chunk from this document with an identifying header;
// appendChunkOverlapHeader additionally marks chunks that begin with
// overlapping text carried over from the previous chunk.
const jimDocs = await splitter.createDocuments(
  [`My favorite color is blue.`],
  [],
  {
    chunkHeader: `DOCUMENT NAME: Jim Interview\n\n---\n\n`,
    appendChunkOverlapHeader: true,
  }
);

const pamDocs = await splitter.createDocuments(
  [`My favorite color is red.`],
  [],
  {
    chunkHeader: `DOCUMENT NAME: Pam Interview\n\n---\n\n`,
    appendChunkOverlapHeader: true,
  }
);
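
// Sanity check: each chunk's pageContent now starts with its header, e.g.
// "DOCUMENT NAME: Jim Interview\n\n---\n\nMy favorite color is blue."
console.log(jimDocs[0].pageContent);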

// Embed and index both sets of headed chunks in a single vector store.
const vectorstore = await HNSWLib.fromDocuments(
  jimDocs.concat(pamDocs),
  new OpenAIEmbeddings()
);

const llm = new ChatOpenAI({
  model: "gpt-3.5-turbo-1106",
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

// "Stuff" the retrieved chunks, headers included, into the prompt's {context}.
const combineDocsChain = await createStuffDocumentsChain({
  llm,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorstore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({
  input: "What is Pam's favorite color?",
});

console.log(JSON.stringify(res, null, 2));

/*
  {
    "input": "What is Pam's favorite color?",
    "chat_history": [],
    "context": [
      {
        "pageContent": "DOCUMENT NAME: Pam Interview\n\n---\n\nMy favorite color is red.",
        "metadata": {
          "loc": {
            "lines": {
              "from": 1,
              "to": 1
            }
          }
        }
      },
      {
        "pageContent": "DOCUMENT NAME: Jim Interview\n\n---\n\nMy favorite color is blue.",
        "metadata": {
          "loc": {
            "lines": {
              "from": 1,
              "to": 1
            }
          }
        }
      }
    ],
    "answer": "Pam's favorite color is red."
  }
*/
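
Because each retrieved chunk names its source document, the same chain can keep the two interviews straight. For instance, reusing the chain defined above (exact model output may vary):

const res2 = await chain.invoke({
  input: "What is Jim's favorite color?",
});

console.log(res2.answer);
// e.g. "Jim's favorite color is blue."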

