
Handle Files

Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.

The general strategy is to use a LangChain document loader or other method to parse files into a text format that can be fed into LLMs.

LangChain features a large number of document loader integrations.

Let's go over an example of loading and extracting data from a PDF. First, we install required dependencies:

yarn add @langchain/openai zod

import { PDFLoader } from "langchain/document_loaders/fs/pdf";
// Only required in a Deno notebook environment to load the peer dep.
import "pdf-parse";

const loader = new PDFLoader("./test/data/bitcoin.pdf");

const docs = await loader.load();
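By default, `PDFLoader` returns one document per page, so `docs[0]` holds only the first page. If the people you care about may appear on later pages, one option is to combine the page texts before extraction. The following is a minimal sketch, assuming each document exposes a `pageContent` string; `PageLike`, `joinPages`, and the sample data are hypothetical names used only for illustration.

```typescript
// Hypothetical minimal shape; a LangChain Document also carries metadata.
interface PageLike {
  pageContent: string;
}

// Join the text of every page, skipping pages that are empty after trimming.
function joinPages(pages: PageLike[], separator = "\n\n"): string {
  return pages
    .map((page) => page.pageContent.trim())
    .filter((text) => text.length > 0)
    .join(separator);
}

// Stand-in data for illustration; in practice, pass the `docs` array from the loader.
const samplePages: PageLike[] = [
  { pageContent: "Page one text. " },
  { pageContent: "   " },
  { pageContent: "Page two text." },
];
const fullText = joinPages(samplePages);
```

Keep in mind that a long PDF joined this way may exceed the model's context window, in which case processing pages individually is likely the safer choice.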

Now that we've loaded a PDF document, let's try extracting mentioned people. We can define a schema like this:

import { z } from "zod";

const personSchema = z
  .object({
    name: z.optional(z.string()).describe("The name of the person"),
    hair_color: z
      .optional(z.string())
      .describe("The color of the person's hair, if known"),
    height_in_meters: z
      .optional(z.string())
      .describe("Height measured in meters"),
    email: z.optional(z.string()).describe("The person's email, if present"),
  })
  .describe("Information about a person.");

const peopleSchema = z.object({
  people: z.array(personSchema),
});
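For orientation, a value that matches `peopleSchema` has roughly the shape sketched below. This example uses plain TypeScript types instead of Zod so that it stands alone; `Person`, `People`, `isPerson`, and `isPeople` are hypothetical names. In a real project, `z.infer<typeof peopleSchema>` and `peopleSchema.safeParse` would typically serve the same purpose.

```typescript
// Hand-written mirror of the schema: every attribute is an optional string.
interface Person {
  name?: string;
  hair_color?: string;
  height_in_meters?: string;
  email?: string;
}

interface People {
  people: Person[];
}

const PERSON_KEYS = ["name", "hair_color", "height_in_meters", "email"] as const;

// True when every known attribute is either absent or a string.
function isPerson(value: unknown): value is Person {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return PERSON_KEYS.every(
    (key) => record[key] === undefined || typeof record[key] === "string"
  );
}

// True when the value has a `people` array whose items all pass isPerson.
function isPeople(value: unknown): value is People {
  if (typeof value !== "object" || value === null) return false;
  const people = (value as Record<string, unknown>).people;
  return Array.isArray(people) && people.every(isPerson);
}

// Illustrative value with two of the four optional attributes filled in.
const example: unknown = {
  people: [{ name: "Satoshi Nakamoto", email: "satoshin@gmx.com" }],
};
const exampleIsValid = isPeople(example);
```

Because every attribute is optional, a person object with only some fields set still counts as valid, which lets the model omit values it cannot find.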

And then initialize our extraction chain like this:

import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";

const SYSTEM_PROMPT_TEMPLATE = `You are an expert extraction algorithm.
Only extract relevant information from the text.
If you do not know the value of an attribute asked to extract, you may omit the attribute's value.`;

const prompt = ChatPromptTemplate.fromMessages([
  ["system", SYSTEM_PROMPT_TEMPLATE],
  ["human", "{text}"],
]);

const llm = new ChatOpenAI({
  model: "gpt-4-0125-preview",
  temperature: 0,
});

const extractionRunnable = prompt.pipe(
  llm.withStructuredOutput(peopleSchema, { name: "people" })
);

Now, let's try invoking it!

await extractionRunnable.invoke({ text: docs[0].pageContent });
{ people: [ { name: "Satoshi Nakamoto", email: "satoshin@gmx.com" } ] }
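The call above only looks at the first page. One way to cover the whole document is to run the chain once per page and merge the results. The following is a minimal sketch, assuming a hypothetical `extract` function that wraps `extractionRunnable.invoke`; the de-duplication rule (lowercased name plus email) is an illustrative choice, not something LangChain provides.

```typescript
interface Person {
  name?: string;
  hair_color?: string;
  height_in_meters?: string;
  email?: string;
}

// Hypothetical wrapper type around the extraction chain.
type Extract = (text: string) => Promise<{ people: Person[] }>;

// Run the extractor on each page in order and merge people that share a key.
async function extractFromPages(
  pages: string[],
  extract: Extract
): Promise<Person[]> {
  const seen = new Map<string, Person>();
  for (const text of pages) {
    const { people } = await extract(text);
    for (const person of people) {
      // Assumed identity rule: same lowercased name and email means same person.
      const key = `${person.name?.toLowerCase() ?? ""}|${person.email?.toLowerCase() ?? ""}`;
      // Later pages can fill in attributes missing from earlier ones.
      seen.set(key, { ...seen.get(key), ...person });
    }
  }
  return Array.from(seen.values());
}
```

With the chain defined earlier, this could be called as `extractFromPages(docs.map((d) => d.pageContent), (text) => extractionRunnable.invoke({ text }))`. Pages are processed sequentially here to keep the sketch simple; running them concurrently is possible but subject to provider rate limits.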
