FireCrawlLoader
This notebook provides a quick overview for getting started with FireCrawlLoader. For detailed documentation of all FireCrawlLoader features and configurations, head to the API reference.
Overview
Integration details
Class | Package | Local | Serializable | PY support |
---|---|---|---|---|
FireCrawlLoader | @langchain/community | π (see details below) | beta | β |
Loader features
Source | Web Loader | Node Envs Only |
---|---|---|
FireCrawlLoader | β | β |
FireCrawl crawls and converts any website into LLM-ready data. It crawls all accessible sub-pages and gives you clean markdown and metadata for each. No sitemap required.
FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript. Built by the mendable.ai team.
This guide shows how to scrape and crawl entire websites and load them using the FireCrawlLoader in LangChain.
Setup
To access the FireCrawlLoader document loader, you'll need to install the @langchain/community integration and the @mendable/firecrawl-js@0.0.36 package. Then create a FireCrawl account and get an API key.
Credentials
Sign up and get your free FireCrawl API key to start. FireCrawl offers 300 free credits to get you started, and it's open-source in case you want to self-host.
Once you've done this, set the FIRECRAWL_API_KEY environment variable:
export FIRECRAWL_API_KEY="your-api-key"
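Because the loader falls back to the FIRECRAWL_API_KEY environment variable when no apiKey is passed, it can be useful to fail early if the variable is missing. The following is a minimal sketch; requireFirecrawlApiKey is a hypothetical helper and not part of LangChain or Firecrawl.

```typescript
// Hypothetical helper (not part of LangChain or Firecrawl): read the API key
// from an environment-like object and fail early if it is missing or blank.
type Env = Record<string, string | undefined>;

function requireFirecrawlApiKey(env: Env): string {
  const key = env["FIRECRAWL_API_KEY"]?.trim();
  if (!key) {
    throw new Error(
      "FIRECRAWL_API_KEY is not set; export it before creating the loader."
    );
  }
  return key;
}

// In Node.js you would typically call: requireFirecrawlApiKey(process.env)
```

Taking the environment as an argument, rather than reading process.env directly, keeps the check easy to test.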
If you want automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"
Installation
The LangChain FireCrawlLoader integration lives in the @langchain/community package:
npm i @langchain/community @langchain/core @mendable/firecrawl-js@0.0.36
yarn add @langchain/community @langchain/core @mendable/firecrawl-js@0.0.36
pnpm add @langchain/community @langchain/core @mendable/firecrawl-js@0.0.36
Instantiation
Here's an example of how to use the FireCrawlLoader to load web pages:
Firecrawl offers 3 modes: scrape, crawl, and map. In scrape mode, Firecrawl will only scrape the page you provide. In crawl mode, Firecrawl will crawl the entire website. In map mode, Firecrawl will return semantic links related to the website.
The formats parameter (scrapeOptions.formats for crawl mode) allows selection from "markdown", "html", or "rawHtml". However, the loaded Document will contain content in only one format, chosen in this order of priority: markdown, then html, and finally rawHtml.
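The priority order described above can be sketched as a simple fallback chain. This is an illustration only: ScrapedContent and pickPageContent are hypothetical names, and the snippet is not the loader's actual source code.

```typescript
// Illustration only: field names mirror the format names listed above.
interface ScrapedContent {
  markdown?: string;
  html?: string;
  rawHtml?: string;
}

function pickPageContent(content: ScrapedContent): string {
  // Prefer markdown, then html, then rawHtml. Missing or empty values fall
  // through to the next format; the final fallback is an empty string.
  return content.markdown || content.html || content.rawHtml || "";
}

// Only html and rawHtml are present here, so html is chosen.
const chosen = pickPageContent({
  html: "<p>Hello</p>",
  rawHtml: "<html><body><p>Hello</p></body></html>",
});
```

In practice this means requesting several formats does not give you several versions of the page content in one Document.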
Now we can instantiate our loader object and load documents:
import "@mendable/firecrawl-js";
import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://firecrawl.dev", // The URL to scrape
  apiKey: "...", // Optional, defaults to `FIRECRAWL_API_KEY` in your env.
  // The mode to run the crawler in: "scrape" for a single URL,
  // "crawl" for all accessible subpages, or "map" for related links.
  mode: "scrape",
  params: {
    // optional parameters based on Firecrawl API docs
    // For API documentation, visit https://docs.firecrawl.dev
  },
});
Load
const docs = await loader.load();
docs[0];
Document {
  pageContent: "Introducing [Smart Crawl!](https://www.firecrawl.dev/smart-crawl)\n" +
    " Join the waitlist to turn any web"... 18721 more characters,
  metadata: {
    title: "Home - Firecrawl",
    description: "Firecrawl crawls and converts any website into clean markdown.",
    keywords: "Firecrawl,Markdown,Data,Mendable,Langchain",
    robots: "follow, index",
    ogTitle: "Firecrawl",
    ogDescription: "Turn any website into LLM-ready data.",
    ogUrl: "https://www.firecrawl.dev/",
    ogImage: "https://www.firecrawl.dev/og.png?123",
    ogLocaleAlternate: [],
    ogSiteName: "Firecrawl",
    sourceURL: "https://firecrawl.dev",
    pageStatusCode: 500
  },
  id: undefined
}
console.log(docs[0].metadata);
{
  title: "Home - Firecrawl",
  description: "Firecrawl crawls and converts any website into clean markdown.",
  keywords: "Firecrawl,Markdown,Data,Mendable,Langchain",
  robots: "follow, index",
  ogTitle: "Firecrawl",
  ogDescription: "Turn any website into LLM-ready data.",
  ogUrl: "https://www.firecrawl.dev/",
  ogImage: "https://www.firecrawl.dev/og.png?123",
  ogLocaleAlternate: [],
  ogSiteName: "Firecrawl",
  sourceURL: "https://firecrawl.dev",
  pageStatusCode: 500
}
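The sample metadata above includes a pageStatusCode field, and in this run it reports 500 even though page content was returned. If that matters for your pipeline, you might separate documents by status before indexing them. The sketch below assumes a simplified LoadedDoc shape that models only the fields shown in the output; LoadedDoc and partitionByStatus are hypothetical names.

```typescript
// Local stand-in for the Document shape printed above (assumption: only the
// fields used here are modeled).
interface LoadedDoc {
  pageContent: string;
  metadata: { sourceURL?: string; title?: string; pageStatusCode?: number };
}

// Split documents into those with a 2xx status code and everything else
// (including documents that carry no status code at all).
function partitionByStatus(docs: LoadedDoc[]): {
  ok: LoadedDoc[];
  failed: LoadedDoc[];
} {
  const ok: LoadedDoc[] = [];
  const failed: LoadedDoc[] = [];
  for (const doc of docs) {
    const status = doc.metadata.pageStatusCode;
    const isOk = status !== undefined && status >= 200 && status < 300;
    (isOk ? ok : failed).push(doc);
  }
  return { ok, failed };
}

// Example with hand-written data shaped like the output above.
const sampleDocs: LoadedDoc[] = [
  { pageContent: "A", metadata: { sourceURL: "https://firecrawl.dev", pageStatusCode: 500 } },
  { pageContent: "B", metadata: { sourceURL: "https://example.com", pageStatusCode: 200 } },
];
const { ok, failed } = partitionByStatus(sampleDocs);
```

With real results you would pass the array returned by loader.load(), assuming its documents expose the same metadata field.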
Additional Parameters
For params, you can pass any of the parameters described in the Firecrawl documentation.
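As one example, this guide notes that the formats option sits at the top level of params in scrape mode but under scrapeOptions in crawl mode. A small helper can keep that placement in one spot. This is a sketch under that assumption; buildParams and the type names are hypothetical and not part of either library.

```typescript
type FirecrawlFormat = "markdown" | "html" | "rawHtml";
type FirecrawlMode = "scrape" | "crawl" | "map";

// Hypothetical helper: place `formats` where this guide says each mode
// expects it. Any other Firecrawl parameters would be added by the caller.
function buildParams(
  mode: FirecrawlMode,
  formats: FirecrawlFormat[]
): Record<string, unknown> {
  if (mode === "crawl") {
    return { scrapeOptions: { formats } };
  }
  if (mode === "scrape") {
    return { formats };
  }
  // map mode returns links; this guide does not describe formats for it.
  return {};
}

const crawlParams = buildParams("crawl", ["markdown", "html"]);
const scrapeParams = buildParams("scrape", ["markdown"]);
```

The returned object is intended to be passed as the params field of the FireCrawlLoader constructor alongside the matching mode.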
API reference
For detailed documentation of all FireCrawlLoader features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_firecrawl.FireCrawlLoader.html