Split code and markup

CodeTextSplitter allows you to split your code and markup with support for multiple languages.

LangChain supports a variety of markup and programming language-specific text splitters that split your text based on language-specific syntax. This results in more semantically self-contained chunks that are more useful to a vector store or other retriever. Popular languages like JavaScript, Python, Go, and Rust are supported, as well as LaTeX, HTML, and Markdown.
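
For example, here is a minimal sketch of feeding language-aware chunks into an in-memory vector store for retrieval. The MemoryVectorStore and OpenAIEmbeddings classes (and their import paths) are illustrative choices here, not something the splitter requires; they depend on your LangChain version and on an OpenAI API key being configured.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Illustrative choices; adjust these imports to your LangChain version.
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});

// Split source code into language-aware Document chunks.
const docs = await splitter.createDocuments([
  `function helloWorld() {\n  console.log("Hello, World!");\n}\nhelloWorld();`,
]);

// Index the chunks and look up the most relevant one for a query.
// (OpenAIEmbeddings assumes OPENAI_API_KEY is set in the environment.)
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings()
);
const results = await vectorStore.similaritySearch("hello world", 1);
console.log(results);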

Usage

Initialize a standard RecursiveCharacterTextSplitter with the fromLanguage factory method. Below are some examples for various languages.
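
The examples below use createDocuments, which returns Document objects with metadata. If you only need the raw chunk strings, the same splitter also exposes a splitText method; a minimal sketch, using the same "js" options as the first example below:

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});

// splitText returns an array of plain strings rather than Document objects.
const chunks = await splitter.splitText(`function helloWorld() {
  console.log("Hello, World!");
}
helloWorld();`);

console.log(chunks);
// Roughly the same chunks as the pageContent values shown in the JavaScript example below.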

JavaScript

import {
  SupportedTextSplitterLanguages,
  RecursiveCharacterTextSplitter,
} from "langchain/text_splitter";

console.log(SupportedTextSplitterLanguages); // Array of supported languages

/*
  [
    'cpp',      'go',
    'java',     'js',
    'php',      'proto',
    'python',   'rst',
    'ruby',     'rust',
    'scala',    'swift',
    'markdown', 'latex',
    'html'
  ]
*/

const jsCode = `function helloWorld() {
  console.log("Hello, World!");
}
// Call the function
helloWorld();`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 32,
  chunkOverlap: 0,
});
const jsOutput = await splitter.createDocuments([jsCode]);

console.log(jsOutput);

/*
  [
    Document {
      pageContent: 'function helloWorld() {',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'console.log("Hello, World!");',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '}\n// Call the function',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'helloWorld();',
      metadata: { loc: [Object] }
    }
  ]
*/

API Reference: SupportedTextSplitterLanguages and RecursiveCharacterTextSplitter from langchain/text_splitter

Markdown

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = `
---
sidebar_position: 1
---
# Document transformers

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example
is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain
has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

## Text splitters

When you want to deal with long pieces of text, it is necessary to split up that text into chunks.
As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text.
This notebook showcases several ways to do that.

At a high level, text splitters work as following:

1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

1. How the text is split
2. How the chunk size is measured

## Get started with text splitters

import GetStarted from "@snippets/modules/data_connection/document_transformers/get_started.mdx"

<GetStarted/>
`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 500,
  chunkOverlap: 0,
});
const output = await splitter.createDocuments([text]);

console.log(output);

/*
  [
    Document {
      pageContent: '---\n' +
        'sidebar_position: 1\n' +
        '---\n' +
        '# Document transformers\n' +
        '\n' +
        "Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example\n" +
        "is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain\n" +
        'has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '## Text splitters\n' +
        '\n' +
        'When you want to deal with long pieces of text, it is necessary to split up that text into chunks.\n' +
        'As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text.\n' +
        'This notebook showcases several ways to do that.\n' +
        '\n' +
        'At a high level, text splitters work as following:',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '1. Split the text up into small, semantically meaningful chunks (often sentences).\n' +
        '2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).\n' +
        '3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).\n' +
        '\n' +
        'That means there are two different axes along which you can customize your text splitter:',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '1. How the text is split\n2. How the chunk size is measured',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '## Get started with text splitters\n' +
        '\n' +
        'import GetStarted from "@snippets/modules/data_connection/document_transformers/get_started.mdx"\n' +
        '\n' +
        '<GetStarted/>',
      metadata: { loc: [Object] }
    }
  ]
*/

API Reference: RecursiveCharacterTextSplitter from langchain/text_splitter

Python

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const pythonCode = `def hello_world():
    print("Hello, World!")
# Call the function
hello_world()`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("python", {
  chunkSize: 32,
  chunkOverlap: 0,
});

const pythonOutput = await splitter.createDocuments([pythonCode]);

console.log(pythonOutput);

/*
  [
    Document {
      pageContent: 'def hello_world():',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'print("Hello, World!")',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '# Call the function',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'hello_world()',
      metadata: { loc: [Object] }
    }
  ]
*/

API Reference: RecursiveCharacterTextSplitter from langchain/text_splitter

HTML

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = `<!DOCTYPE html>
<html>
  <head>
    <title>🦜️🔗 LangChain</title>
    <style>
      body {
        font-family: Arial, sans-serif;
      }
      h1 {
        color: darkblue;
      }
    </style>
  </head>
  <body>
    <div>
      <h1>🦜️🔗 LangChain</h1>
      <p>⚡ Building applications with LLMs through composability ⚡</p>
    </div>
    <div>
      As an open source project in a rapidly developing field, we are extremely open to contributions.
    </div>
  </body>
</html>`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("html", {
  chunkSize: 175,
  chunkOverlap: 20,
});
const output = await splitter.createDocuments([text]);

console.log(output);

/*
  [
    Document {
      pageContent: '<!DOCTYPE html>\n<html>',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '<head>\n    <title>🦜️🔗 LangChain</title>',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '<style>\n' +
        '      body {\n' +
        '        font-family: Arial, sans-serif;\n' +
        '      }\n' +
        '      h1 {\n' +
        '        color: darkblue;\n' +
        '      }\n' +
        '    </style>\n' +
        '  </head>',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '<body>\n' +
        '    <div>\n' +
        '      <h1>🦜️🔗 LangChain</h1>\n' +
        '      <p>⚡ Building applications with LLMs through composability ⚡</p>\n' +
        '    </div>',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '<div>\n' +
        '      As an open source project in a rapidly developing field, we are extremely open to contributions.\n' +
        '    </div>\n' +
        '  </body>\n' +
        '</html>',
      metadata: { loc: [Object] }
    }
  ]
*/

API Reference: RecursiveCharacterTextSplitter from langchain/text_splitter

LaTeX

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const text = `\\begin{document}
\\title{🦜️🔗 LangChain}
⚡ Building applications with LLMs through composability ⚡

\\section{Quick Install}

\\begin{verbatim}
Hopefully this code block isn't split
yarn add langchain
\\end{verbatim}

As an open source project in a rapidly developing field, we are extremely open to contributions.

\\end{document}`;

const splitter = RecursiveCharacterTextSplitter.fromLanguage("latex", {
  chunkSize: 100,
  chunkOverlap: 0,
});
const output = await splitter.createDocuments([text]);

console.log(output);

/*
  [
    Document {
      pageContent: '\\begin{document}\n' +
        '\\title{🦜️🔗 LangChain}\n' +
        '⚡ Building applications with LLMs through composability ⚡',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '\\section{Quick Install}',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '\\begin{verbatim}\n' +
        "Hopefully this code block isn't split\n" +
        'yarn add langchain\n' +
        '\\end{verbatim}',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'As an open source project in a rapidly developing field, we are extremely open to contributions.',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '\\end{document}',
      metadata: { loc: [Object] }
    }
  ]
*/

API Reference: RecursiveCharacterTextSplitter from langchain/text_splitter

