
Contextual chunk headers

Consider a scenario where you want to store a large, arbitrary collection of documents in a vector store and perform Q&A tasks on them. Simply splitting documents with overlapping text may not give an LLM enough context to determine whether multiple chunks reference the same information, or how it should resolve information from contradictory sources.

Tagging each document with metadata is one solution if you know what to filter against, but you may not know ahead of time exactly what kinds of queries your vector store will need to handle. Including additional contextual information directly in each chunk, in the form of a header, can help it cope with arbitrary queries.
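
Conceptually, a contextual chunk header is nothing more than a short identifying string prepended to each chunk's text before it is embedded. A minimal sketch of the idea (the header format here is arbitrary; use whatever identifies your source documents):

const chunkHeader = `DOCUMENT NAME: Jim Interview\n\n---\n\n`;
const chunkText = `My favorite color is blue.`;

// Both the embedding and the retrieved context now carry the source:
const contextualizedChunk = chunkHeader + chunkText;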

Here's a complete example:

First, install the required integration packages:

npm install @langchain/openai @langchain/community

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { CharacterTextSplitter } from "langchain/text_splitter";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// Character-level splitting; each short document below fits in a single chunk.
const splitter = new CharacterTextSplitter({
  chunkSize: 1536,
  chunkOverlap: 200,
});

// Prefix every chunk from this document with an identifying header;
// appendChunkOverlapHeader additionally marks chunks that begin with
// overlapping text carried over from the previous chunk.
const jimDocs = await splitter.createDocuments(
  [`My favorite color is blue.`],
  [],
  {
    chunkHeader: `DOCUMENT NAME: Jim Interview\n\n---\n\n`,
    appendChunkOverlapHeader: true,
  }
);

const pamDocs = await splitter.createDocuments(
  [`My favorite color is red.`],
  [],
  {
    chunkHeader: `DOCUMENT NAME: Pam Interview\n\n---\n\n`,
    appendChunkOverlapHeader: true,
  }
);
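
// Sanity check: each chunk's pageContent now starts with its header, e.g.
// "DOCUMENT NAME: Jim Interview\n\n---\n\nMy favorite color is blue."
console.log(jimDocs[0].pageContent);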

// Embed and index both sets of headed chunks in a single vector store.
const vectorstore = await HNSWLib.fromDocuments(
  jimDocs.concat(pamDocs),
  new OpenAIEmbeddings()
);

const llm = new ChatOpenAI({
  model: "gpt-3.5-turbo-1106",
  temperature: 0,
});

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's questions based on the below context:\n\n{context}",
  ],
  ["human", "{input}"],
]);

// "Stuff" the retrieved chunks, headers included, into the prompt's {context}.
const combineDocsChain = await createStuffDocumentsChain({
  llm,
  prompt: questionAnsweringPrompt,
});

const chain = await createRetrievalChain({
  retriever: vectorstore.asRetriever(),
  combineDocsChain,
});

const res = await chain.invoke({
  input: "What is Pam's favorite color?",
});

console.log(JSON.stringify(res, null, 2));

/*
  {
    "input": "What is Pam's favorite color?",
    "chat_history": [],
    "context": [
      {
        "pageContent": "DOCUMENT NAME: Pam Interview\n\n---\n\nMy favorite color is red.",
        "metadata": {
          "loc": {
            "lines": {
              "from": 1,
              "to": 1
            }
          }
        }
      },
      {
        "pageContent": "DOCUMENT NAME: Jim Interview\n\n---\n\nMy favorite color is blue.",
        "metadata": {
          "loc": {
            "lines": {
              "from": 1,
              "to": 1
            }
          }
        }
      }
    ],
    "answer": "Pam's favorite color is red."
  }
*/
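
Because each retrieved chunk names its source document, the same chain can keep the two interviews straight. For instance, reusing the chain defined above (exact model output may vary):

const res2 = await chain.invoke({
  input: "What is Jim's favorite color?",
});

console.log(res2.answer);
// e.g. "Jim's favorite color is blue."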

