Build an Extraction Chain

Prerequisites

This guide assumes familiarity with the following concepts:

In this tutorial, we will build a chain to extract structured information from unstructured text.

info

This tutorial only works with models that support function/tool calling.

Setup

Installation

To install LangChain run:

npm i langchain @langchain/core

yarn add langchain @langchain/core

pnpm add langchain @langchain/core

For more details, see our Installation guide.

LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

After you sign up at the link above, make sure to set your environment variables to start logging traces:

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

# Reduce tracing latency if you are not in a serverless environment
# export LANGCHAIN_CALLBACKS_BACKGROUND=true

The Schema

First, we need to describe what information we want to extract from the text.

We’ll use Zod to define an example schema that extracts personal information.

npm i zod @langchain/core

yarn add zod @langchain/core

pnpm add zod @langchain/core

import { z } from "zod";

const personSchema = z.object({
  name: z.nullish(z.string()).describe("The name of the person"),
  hair_color: z
    .nullish(z.string())
    .describe("The color of the person's hair if known"),
  height_in_meters: z.nullish(z.string()).describe("Height measured in meters"),
});

There are two best practices when defining a schema:

Document the attributes and the schema itself: this information is sent to the LLM and is used to improve the quality of information extraction.
Do not force the LLM to make up information! Above we wrapped each attribute in z.nullish(), which allows the LLM to output null or undefined if it doesn’t know the answer.

info

For best performance, document the schema well and make sure the model isn’t forced to return results if there’s no information to be extracted in the text.
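
To see why nullable attributes matter, here is a minimal sketch that validates a simulated model output against two schemas. It assumes Zod is installed and uses the chained .nullish() form, which accepts null or undefined in the same way. The strictPersonSchema variant and the partial object are hypothetical, made up here only for contrast.

```typescript
import { z } from "zod";

// Same fields as the schema above, with a description on the schema itself.
const personSchema = z
  .object({
    name: z.string().nullish().describe("The name of the person"),
    hair_color: z
      .string()
      .nullish()
      .describe("The color of the person's hair if known"),
    height_in_meters: z.string().nullish().describe("Height measured in meters"),
  })
  .describe("Information about a person");

// Hypothetical strict variant: every attribute is required.
const strictPersonSchema = z.object({
  name: z.string(),
  hair_color: z.string(),
  height_in_meters: z.string(),
});

// Simulated model output in which only the name was found in the text.
const partial = { name: "Alan Smith", hair_color: null };

const lenient = personSchema.safeParse(partial);
const strict = strictPersonSchema.safeParse(partial);

console.log(lenient.success); // true: null and missing values are allowed
console.log(strict.success); // false: the model would have to invent values
```

With the strict variant, the only way for a model to produce a valid object is to fill in every attribute, which is what encourages made-up values.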

The Extractor

Let’s create an information extractor using the schema we defined above.

import { ChatPromptTemplate } from "@langchain/core/prompts";

// Define a custom prompt to provide instructions and any additional context.
// 1) You can add examples into the prompt template to improve extraction quality
// 2) Introduce additional parameters to take context into account (e.g., include metadata
//    about the document from which the text was extracted.)
const promptTemplate = ChatPromptTemplate.fromMessages([
  [
    "system",
    `You are an expert extraction algorithm.
Only extract relevant information from the text.
If you do not know the value of an attribute asked to extract,
return null for the attribute's value.`,
  ],
  // Please see the how-to about improving performance with
  // reference examples.
  // ["placeholder", "{examples}"],
  ["human", "{text}"],
]);

We need to use a model that supports function/tool calling.

Please review the documentation for a list of models that can be used with this API.

Pick your chat model:

Groq

Install dependencies

tip

See this section for general instructions on installing integration packages.

npm i @langchain/groq

yarn add @langchain/groq 

pnpm add @langchain/groq 

Add environment variables

GROQ_API_KEY=your-api-key

Instantiate the model

import { ChatGroq } from "@langchain/groq";

const llm = new ChatGroq({
  model: "llama-3.3-70b-versatile",
  temperature: 0
});

OpenAI

Install dependencies

npm i @langchain/openai

yarn add @langchain/openai 

pnpm add @langchain/openai 

Add environment variables

OPENAI_API_KEY=your-api-key

Instantiate the model

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0
});

Anthropic

Install dependencies

npm i @langchain/anthropic

yarn add @langchain/anthropic 

pnpm add @langchain/anthropic 

Add environment variables

ANTHROPIC_API_KEY=your-api-key

Instantiate the model

import { ChatAnthropic } from "@langchain/anthropic";

const llm = new ChatAnthropic({
  model: "claude-3-5-sonnet-20240620",
  temperature: 0
});

Google Generative AI

Install dependencies

npm i @langchain/google-genai

yarn add @langchain/google-genai 

pnpm add @langchain/google-genai 

Add environment variables

GOOGLE_API_KEY=your-api-key

Instantiate the model

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

const llm = new ChatGoogleGenerativeAI({
  model: "gemini-2.0-flash",
  temperature: 0
});

Fireworks

Install dependencies

npm i @langchain/community

yarn add @langchain/community 

pnpm add @langchain/community 

Add environment variables

FIREWORKS_API_KEY=your-api-key

Instantiate the model

import { ChatFireworks } from "@langchain/community/chat_models/fireworks";

const llm = new ChatFireworks({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  temperature: 0
});

Mistral AI

Install dependencies

npm i @langchain/mistralai

yarn add @langchain/mistralai 

pnpm add @langchain/mistralai 

Add environment variables

MISTRAL_API_KEY=your-api-key

Instantiate the model

import { ChatMistralAI } from "@langchain/mistralai";

const llm = new ChatMistralAI({
  model: "mistral-large-latest",
  temperature: 0
});

Google Vertex AI

Install dependencies

npm i @langchain/google-vertexai

yarn add @langchain/google-vertexai 

pnpm add @langchain/google-vertexai 

Add environment variables

GOOGLE_APPLICATION_CREDENTIALS=credentials.json

Instantiate the model

import { ChatVertexAI } from "@langchain/google-vertexai";

const llm = new ChatVertexAI({
  model: "gemini-1.5-flash",
  temperature: 0
});

We enable structured output by creating a new object with the .withStructuredOutput method:

const structured_llm = llm.withStructuredOutput(personSchema);

We can then invoke it normally:

const prompt = await promptTemplate.invoke({
  text: "Alan Smith is 6 feet tall and has blond hair.",
});
await structured_llm.invoke(prompt);

{ name: 'Alan Smith', hair_color: 'blond', height_in_meters: '1.83' }

info

Extraction is Generative 🤯

LLMs are generative models, so they can do some pretty cool things like correctly extract the height of the person in meters even though it was provided in feet!
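
Because the model performs the unit conversion itself, it can be worth sanity-checking converted values. The sketch below is plain TypeScript with no LangChain dependency. The helper names feetToMeters and heightIsPlausible, and the 0.01 m tolerance, are assumptions made for illustration and are not part of any library.

```typescript
// One foot is exactly 0.3048 meters by definition.
const METERS_PER_FOOT = 0.3048;

function feetToMeters(feet: number): number {
  return feet * METERS_PER_FOOT;
}

// Hypothetical helper: checks that an extracted height (a string, as in the
// schema above) is close to the height stated in the source text.
function heightIsPlausible(
  extracted: string,
  statedFeet: number,
  toleranceMeters = 0.01
): boolean {
  const value = Number.parseFloat(extracted);
  return (
    Number.isFinite(value) &&
    Math.abs(value - feetToMeters(statedFeet)) <= toleranceMeters
  );
}

console.log(feetToMeters(6).toFixed(2)); // "1.83"
console.log(heightIsPlausible("1.83", 6)); // true
console.log(heightIsPlausible("1.90", 6)); // false
```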

We can see the LangSmith trace here.

Even though we defined our schema with the variable name personSchema, Zod is unable to infer this name and therefore does not pass it along to the model. To help give the LLM more clues as to what your provided schema represents, you can also give the schema you pass to withStructuredOutput() a name:

const structured_llm2 = llm.withStructuredOutput(personSchema, {
  name: "person",
});

const prompt2 = await promptTemplate.invoke({
  text: "Alan Smith is 6 feet tall and has blond hair.",
});
await structured_llm2.invoke(prompt2);

{ name: 'Alan Smith', hair_color: 'blond', height_in_meters: '1.83' }

This can improve performance in many cases.

Multiple Entities

In most cases, you should be extracting a list of entities rather than a single entity.

This can be easily achieved using Zod by nesting models inside one another.

import { z } from "zod";

const person = z.object({
  name: z.nullish(z.string()).describe("The name of the person"),
  hair_color: z
    .nullish(z.string())
    .describe("The color of the person's hair if known"),
  height_in_meters: z.nullish(z.number()).describe("Height measured in meters"),
});

const dataSchema = z.object({
  people: z.array(person).describe("Extracted data about people"),
});

info

Extraction might not be perfect here. Read on to see how to use reference examples to improve the quality of extraction, and see the guidelines section!

const structured_llm3 = llm.withStructuredOutput(dataSchema);
const prompt3 = await promptTemplate.invoke({
  text: "My name is Jeff, my hair is black and i am 6 feet tall. Anna has the same color hair as me.",
});
await structured_llm3.invoke(prompt3);

{
  people: [
    { name: 'Jeff', hair_color: 'black', height_in_meters: 1.83 },
    { name: 'Anna', hair_color: 'black', height_in_meters: null }
  ]
}

tip

When the schema accommodates the extraction of multiple entities, it also allows the model to extract no entities if no relevant information is in the text by providing an empty list.

This is usually a good thing! It allows specifying required attributes on an entity without necessarily forcing the model to detect this entity.
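
The point about required attributes can be checked directly with Zod, without calling a model. In this minimal sketch, requiredPerson and peopleSchema are hypothetical variants of the schemas above in which name is required. The two input objects stand in for model output.

```typescript
import { z } from "zod";

// Hypothetical stricter entity: name is required, hair color is still optional.
const requiredPerson = z.object({
  name: z.string().describe("The name of the person"),
  hair_color: z
    .string()
    .nullish()
    .describe("The color of the person's hair if known"),
});

const peopleSchema = z.object({
  people: z.array(requiredPerson).describe("Extracted data about people"),
});

// No people mentioned in the text: an empty list is still valid output.
const noPeople = peopleSchema.safeParse({ people: [] });

// An entity without the required name is rejected.
const namelessPerson = peopleSchema.safeParse({
  people: [{ hair_color: "black" }],
});

console.log(noPeople.success); // true
console.log(namelessPerson.success); // false
```

The list wrapper gives the model a valid way to report that nothing was found, so the required name only applies to entities it actually returns.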

We can see the LangSmith trace here.

Next steps

Now that you understand the basics of extraction with LangChain, you’re ready to proceed to the rest of the how-to guides:

Add Examples: Learn how to use reference examples to improve performance.
Handle Long Text: What should you do if the text does not fit into the context window of the LLM?
Use a Parsing Approach: Use a prompt-based approach to extract with models that do not support tool/function calling.
