
How to stream responses from an LLM

All LLMs implement the Runnable interface, which comes with default implementations of the standard runnable methods (e.g. invoke, batch, stream, and streamEvents).

The default streaming implementation provides an AsyncGenerator that yields a single value: the final output from the underlying LLM provider.

The ability to stream the output token-by-token depends on whether the provider has implemented proper streaming support.

See which integrations support token-by-token streaming here.

:::{.callout-note}

The default implementation does not support token-by-token streaming, but it ensures that the model can be swapped in for any other model, since every model supports the same standard interface; the sketch below illustrates this.

:::
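
For example, the following minimal sketch streams from two different LLM classes with identical code. It assumes you also have the @langchain/cohere package installed and a COHERE_API_KEY set in your environment; the Cohere class is used here purely as an illustrative second provider, and any other LLM integration would work the same way:

```typescript
import { OpenAI } from "@langchain/openai";
import { Cohere } from "@langchain/cohere";

// Both classes implement the same Runnable interface, so the streaming
// loop below works unchanged for either one. A provider without
// token-by-token streaming still works; it simply yields the final
// output as a single chunk.
// Assumes OPENAI_API_KEY and COHERE_API_KEY are set in the environment.
const models = [new OpenAI({ maxTokens: 25 }), new Cohere()];

for (const model of models) {
  const stream = await model.stream("Tell me a joke.");
  for await (const chunk of stream) {
    console.log(chunk);
  }
}
```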

Using .stream()

The easiest way to stream is to use the .stream() method. This returns a readable stream that you can also iterate over:

```bash
npm install @langchain/openai @langchain/core
```
```typescript
import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
});

const stream = await model.stream("Tell me a joke.");

for await (const chunk of stream) {
  console.log(chunk);
}

/*


Q
:
What
did
the
fish
say
when
it
hit
the
wall
?


A
:
Dam
!
*/
```

API Reference: OpenAI from `@langchain/openai`

For models that do not support streaming, the entire response will be returned as a single chunk.
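
Because of this, you can write the consuming code the same way whether or not the provider streams token-by-token. Here is a minimal sketch, reusing the same model as above, that concatenates the streamed chunks back into the full completion:

```typescript
import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
});

// Accumulate the streamed chunks into a single string. If the provider
// does not support token-by-token streaming, the loop simply runs once
// with the entire output.
let output = "";
for await (const chunk of await model.stream("Tell me a joke.")) {
  output += chunk;
}
console.log(output);
```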

Using a callback handler

You can also use a CallbackHandler like so:

```typescript
import { OpenAI } from "@langchain/openai";

// To enable streaming, we pass in `streaming: true` to the LLM constructor.
// Additionally, we pass in a handler for the `handleLLMNewToken` event.
const model = new OpenAI({
  maxTokens: 25,
  streaming: true,
});

const response = await model.invoke("Tell me a joke.", {
  callbacks: [
    {
      handleLLMNewToken(token: string) {
        console.log({ token });
      },
    },
  ],
});
console.log(response);
/*
{ token: '\n' }
{ token: '\n' }
{ token: 'Q' }
{ token: ':' }
{ token: ' Why' }
{ token: ' did' }
{ token: ' the' }
{ token: ' chicken' }
{ token: ' cross' }
{ token: ' the' }
{ token: ' playground' }
{ token: '?' }
{ token: '\n' }
{ token: 'A' }
{ token: ':' }
{ token: ' To' }
{ token: ' get' }
{ token: ' to' }
{ token: ' the' }
{ token: ' other' }
{ token: ' slide' }
{ token: '.' }


Q: Why did the chicken cross the playground?
A: To get to the other slide.
*/
```

API Reference: OpenAI from `@langchain/openai`

We still have access to the final LLMResult if we use generate(). However, tokenUsage may not currently be supported for all model providers when streaming.
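
As a rough sketch of that, here is the callback example above rewritten with .generate(), which takes an array of prompts and resolves to an LLMResult once the stream has finished. Note that llmOutput.tokenUsage may be empty or undefined when streaming, depending on the provider:

```typescript
import { OpenAI } from "@langchain/openai";

const model = new OpenAI({
  maxTokens: 25,
  streaming: true,
});

// .generate() resolves to an LLMResult after streaming completes.
const llmResult = await model.generate(["Tell me a joke."], {
  callbacks: [
    {
      handleLLMNewToken(token: string) {
        console.log({ token });
      },
    },
  ],
});

// The full text of the completion is available on the LLMResult.
console.log(llmResult.generations[0][0].text);

// Token usage may be missing when streaming, depending on the provider.
console.log(llmResult.llmOutput?.tokenUsage);
```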

