Build a PDF ingestion and Question/Answering system

Prerequisites

This guide assumes familiarity with the following concepts:

PDF files often hold crucial unstructured data unavailable from other sources. They can be quite lengthy, and unlike plain text files, cannot generally be fed directly into the prompt of a language model.

In this tutorial, you’ll create a system that can answer questions about PDF files. More specifically, you’ll use a Document Loader to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

This tutorial glosses over some concepts that are covered in more depth in our RAG tutorial, so you may want to go through that one first if you haven’t already.

Let’s dive in!

Loading documents

First, you’ll need to choose a PDF to load. We’ll use Nike’s annual Form 10-K report filed with the SEC. It’s over 100 pages long and contains some crucial data mixed with longer explanatory text. Feel free to use a PDF of your choosing instead.

Once you’ve chosen your PDF, the next step is to load it into a format that an LLM can more easily handle, since LLMs generally require text inputs. LangChain has a few different built-in document loaders for this purpose which you can experiment with. Below, we’ll use one powered by the pdf-parse package that reads from a filepath:

import "pdf-parse"; // Peer dep
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("../../data/nke-10k-2023.pdf");

const docs = await loader.load();

console.log(docs.length);
107
console.log(docs[0].pageContent.slice(0, 100));
console.log(docs[0].metadata);
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K

{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 1 }
}

So what just happened?

  • The loader reads the PDF at the specified path into memory.
  • It then extracts text data using the pdf-parse package.
  • Finally, it creates a LangChain Document for each page of the PDF with the page’s content and some metadata about where in the document the text came from.
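
To make the last step concrete, here is a minimal sketch of turning per-page text into document objects. It does not use pdf-parse or the real LangChain Document class: the `PageDocument` interface and the `pagesToDocuments` helper are hypothetical names introduced only for illustration, and the metadata is reduced to the `source` and `loc.pageNumber` fields seen in the output above.

```typescript
// Hypothetical, simplified shape; the real Document class carries more metadata.
interface PageDocument {
  pageContent: string;
  metadata: { source: string; loc: { pageNumber: number } };
}

// Wrap each page's extracted text in an object that records where it came from.
function pagesToDocuments(source: string, pages: string[]): PageDocument[] {
  return pages.map((text, index) => ({
    pageContent: text,
    // Page numbers are 1-indexed, matching the loader output shown above.
    metadata: { source, loc: { pageNumber: index + 1 } },
  }));
}

// Placeholder page text, not taken from the Nike filing.
const exampleDocs = pagesToDocuments("example.pdf", [
  "First page text",
  "Second page text",
]);

console.log(exampleDocs.length);
console.log(exampleDocs[1].metadata.loc.pageNumber);
```

Keeping the page number next to the text is what later makes it possible to cite the page an answer came from.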

LangChain has many other document loaders for other data sources, or you can create a custom document loader.

Question answering with RAG

Next, you’ll prepare the loaded documents for later retrieval. Using a text splitter, you’ll split your loaded documents into smaller documents that can more easily fit into an LLM’s context window, then load them into a vector store. You can then create a retriever from the vector store for use in our RAG chain:

Pick your chat model:

Install dependencies

yarn add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o" });

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});

const splits = await textSplitter.splitDocuments(docs);

const vectorstore = await MemoryVectorStore.fromDocuments(
splits,
new OpenAIEmbeddings()
);

const retriever = vectorstore.asRetriever();
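
To build intuition for what `chunkSize` and `chunkOverlap` control, the sketch below splits a string into fixed-size windows that share some characters with their neighbors. This is a deliberate simplification and not the algorithm RecursiveCharacterTextSplitter uses (which also tries to break on natural separators); `splitWithOverlap` is a hypothetical helper written for this illustration.

```typescript
// Simplified fixed-window splitter, for illustration only.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Tiny values so the overlap is easy to see.
const sampleChunks = splitWithOverlap("abcdefghij", 4, 2);
console.log(sampleChunks); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last two characters of the previous one. With the tutorial's settings (1000 and 200), that overlap helps keep a sentence that straddles a chunk boundary intact in at least one chunk.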

Finally, you’ll use some built-in helpers to construct the final ragChain:

import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { ChatPromptTemplate } from "@langchain/core/prompts";

const systemTemplate = [
`You are an assistant for question-answering tasks. `,
`Use the following pieces of retrieved context to answer `,
`the question. If you don't know the answer, say that you `,
`don't know. Use three sentences maximum and keep the `,
`answer concise.`,
`\n\n`,
`{context}`,
].join("");

const prompt = ChatPromptTemplate.fromMessages([
["system", systemTemplate],
["human", "{input}"],
]);

const questionAnswerChain = await createStuffDocumentsChain({ llm, prompt });
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: questionAnswerChain,
});

const results = await ragChain.invoke({
input: "What was Nike's revenue in 2023?",
});

console.log(results);
{
input: "What was Nike's revenue in 2023?",
chat_history: [],
context: [
Document {
pageContent: 'Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we\n' +
'believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving\n' +
'speed and responsiveness as we serve consumers globally.\n' +
'FINANCIAL HIGHLIGHTS\n' +
'•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively\n' +
'•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for\n' +
'fiscal 2023\n' +
'•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign\n' +
'currency exchange rates, partially offset by strategic pricing actions',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
'•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
'2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
'•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
"increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale\n" +
'equivalent basis.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'EUROPE, MIDDLE EAST & AFRICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$8,260 $7,388 12 %25 %$6,970 6 %9 %\n' +
'Apparel4,566 4,527 1 %14 %3,996 13 %16 %\n' +
'Equipment592 564 5 %18 %490 15 %17 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$8,522 $8,377 2 %15 %$7,812 7 %10 %\n' +
'Sales through NIKE Direct4,896 4,102 19 %33 %3,644 13 %15 %\n' +
'TOTAL REVENUES$13,418 $12,479 8 %21 %$11,456 9 %12 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$3,531 $3,293 7 %$2,435 35 % \n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•EMEA revenues increased 21% on a currency-neutral basis, due to higher revenues in Men's, the Jordan Brand, Women's and Kids'. NIKE Direct revenues\n" +
'increased 33%, driven primarily by strong digital sales growth of 43% and comparable store sales growth of 22%.',
metadata: [Object]
},
Document {
pageContent: 'Table of Contents\n' +
'NORTH AMERICA\n' +
'(Dollars in millions)\n' +
'FISCAL 2023FISCAL 2022% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGESFISCAL 2021% CHANGE\n' +
'% CHANGE\n' +
'EXCLUDING\n' +
'CURRENCY\n' +
'CHANGES\n' +
'Revenues by:\n' +
'Footwear$14,897 $12,228 22 %22 %$11,644 5 %5 %\n' +
'Apparel5,947 5,492 8 %9 %5,028 9 %9 %\n' +
'Equipment764 633 21 %21 %507 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'Revenues by: \n' +
'Sales to Wholesale Customers$11,273 $9,621 17 %18 %$10,186 -6 %-6 %\n' +
'Sales through NIKE Direct10,335 8,732 18 %18 %6,993 25 %25 %\n' +
'TOTAL REVENUES$21,608 $18,353 18 %18 %$17,179 7 %7 %\n' +
'EARNINGS BEFORE INTEREST AND TAXES$5,454 $5,114 7 %$5,089 0 %\n' +
'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
"•North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the Jordan Brand. NIKE Direct revenues\n" +
'increased 18%, driven by strong digital sales growth of 23%, comparable store sales growth of 9% and the addition of new stores.',
metadata: [Object]
}
],
answer: 'According to the financial highlights, Nike, Inc. achieved record revenues of $51.2 billion in fiscal 2023, which increased 10% on a reported basis and 16% on a currency-neutral basis compared to fiscal 2022.'
}

You can see that you get both a final answer in the answer key of the results object, and the context the LLM used to generate an answer.
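
As a rough mental model of how that context reaches the LLM, the sketch below "stuffs" retrieved chunks into the `{context}` placeholder of a system template. It is not the implementation of createStuffDocumentsChain; `stuffContext`, the `RetrievedDoc` interface, and the blank-line separator are assumptions made for illustration.

```typescript
// Hypothetical minimal document shape for this sketch.
interface RetrievedDoc {
  pageContent: string;
}

// Join the retrieved chunks and substitute them for the {context} placeholder.
function stuffContext(
  template: string,
  docs: RetrievedDoc[],
  separator: string = "\n\n" // assumed separator between chunks
): string {
  const context = docs.map((doc) => doc.pageContent).join(separator);
  // A replacer function avoids special handling of "$" sequences in the inserted text.
  return template.replace("{context}", () => context);
}

// Placeholder chunks, shortened from the kind of text retrieved above.
const filledPrompt = stuffContext("Answer using this context:\n\n{context}", [
  { pageContent: "Revenues were $51.2 billion." },
  { pageContent: "Gross margin was 43.5%." },
]);

console.log(filledPrompt);
```

Because every retrieved chunk is placed into a single prompt, the number and size of chunks must fit within the model's context window.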

Examining the values under the context key further, you can see that they are documents that each contain a chunk of the ingested page content. Usefully, these documents also preserve the original metadata from when you first loaded them:

console.log(results.context[0].pageContent);
Enterprise Resource Planning Platform, data and analytics, demand sensing, insight gathering, and other areas to create an end-to-end technology foundation, which we
believe will further accelerate our digital transformation. We believe this unified approach will accelerate growth and unlock more efficiency for our business, while driving
speed and responsiveness as we serve consumers globally.
FINANCIAL HIGHLIGHTS
•In fiscal 2023, NIKE, Inc. achieved record Revenues of $51.2 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively
•NIKE Direct revenues grew 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023, and represented approximately 44% of total NIKE Brand revenues for
fiscal 2023
•Gross margin for the fiscal year decreased 250 basis points to 43.5% primarily driven by higher product costs, higher markdowns and unfavorable changes in foreign
currency exchange rates, partially offset by strategic pricing actions
console.log(results.context[0].metadata);
{
source: '../../data/nke-10k-2023.pdf',
pdf: {
version: '1.10.100',
info: {
PDFFormatVersion: '1.4',
IsAcroFormPresent: false,
IsXFAPresent: false,
Title: '0000320187-23-000039',
Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
Keywords: '0000320187-23-000039; ; 10-K',
Creator: 'EDGAR Filing HTML Converter',
Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
CreationDate: "D:20230720162200-04'00'",
ModDate: "D:20230720162208-04'00'"
},
metadata: null,
totalPages: 107
},
loc: { pageNumber: 31, lines: { from: 14, to: 22 } }
}

This particular chunk came from page 31 in the original PDF. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.
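
One possible way to surface those page numbers is sketched below. It assumes the metadata shape printed above (`metadata.loc.pageNumber`); the `SourceDoc` interface and the `citedPages` and `formatAnswer` helpers are hypothetical names, and the sample documents are placeholders rather than real retrieval results.

```typescript
// Hypothetical shape mirroring the metadata fields shown above.
interface SourceDoc {
  pageContent: string;
  metadata: { source?: string; loc?: { pageNumber?: number } };
}

// Collect the distinct page numbers of the retrieved chunks, in ascending order.
function citedPages(docs: SourceDoc[]): number[] {
  const pages = new Set<number>();
  for (const doc of docs) {
    const page = doc.metadata.loc?.pageNumber;
    if (typeof page === "number") pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

// Append a short source note to the answer when page numbers are available.
function formatAnswer(answer: string, docs: SourceDoc[]): string {
  const pages = citedPages(docs);
  return pages.length === 0
    ? answer
    : `${answer} (Sources: pages ${pages.join(", ")})`;
}

// Placeholder context; in the tutorial this would be results.context.
const sampleContext: SourceDoc[] = [
  { pageContent: "chunk A", metadata: { loc: { pageNumber: 31 } } },
  { pageContent: "chunk B", metadata: { loc: { pageNumber: 12 } } },
  { pageContent: "chunk C", metadata: { loc: { pageNumber: 31 } } },
];

console.log(formatAnswer("Revenue was $51.2 billion.", sampleContext));
// Revenue was $51.2 billion. (Sources: pages 12, 31)
```

In the tutorial's flow, you could pass `results.answer` and `results.context` to a helper like this, provided the loader's metadata matches the assumed shape.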

For a deeper dive into RAG, see this more focused tutorial or our how-to guides.

Next steps

You’ve now seen how to load documents from a PDF file with a Document Loader and some techniques you can use to prepare that loaded data for RAG.

For more on document loaders, you can check out:

For more on RAG, see:

