How to load HTML
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.
This covers how to load HTML
documents into a LangChain
Document
objects that we can use downstream.
Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured. Head over to the integrations page to find integrations with additional services, such as FireCrawl.
Prerequisites
This guide assumes familiarity with the following concepts:
Installationβ
- npm
- yarn
- pnpm
npm i @langchain/community @langchain/core
yarn add @langchain/community @langchain/core
pnpm add @langchain/community @langchain/core
Setupβ
Although Unstructured has an open source offering, youβre still required to provide an API key to access the service. To get everything up and running, follow these two steps:
- Download & start the Docker container:
docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
- Get a free API key & API URL here, and set it in your environment (as per the Unstructured website, it may take up to an hour to allocate your API key & URL.):
export UNSTRUCTURED_API_KEY="..."
# Replace with your `Full URL` from the email
export UNSTRUCTURED_API_URL="https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general"
Loading HTML with Unstructuredβ
import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";
const filePath =
"../../../../libs/langchain-community/src/tools/fixtures/wordoftheday.html";
const loader = new UnstructuredLoader(filePath, {
apiKey: process.env.UNSTRUCTURED_API_KEY,
apiUrl: process.env.UNSTRUCTURED_API_URL,
});
const data = await loader.load();
console.log(data.slice(0, 5));
[
Document {
pageContent: 'Word of the Day',
metadata: {
category_depth: 0,
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: ': April 10, 2023',
metadata: {
emphasized_text_contents: [Array],
emphasized_text_tags: [Array],
languages: [Array],
parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'UncategorizedText'
}
},
Document {
pageContent: 'foible',
metadata: {
category_depth: 1,
languages: [Array],
parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: 'play',
metadata: {
category_depth: 0,
link_texts: [Array],
link_urls: [Array],
link_start_indexes: [Array],
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: 'noun',
metadata: {
category_depth: 0,
emphasized_text_contents: [Array],
emphasized_text_tags: [Array],
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
}
]