Docx files
The DocxLoader allows you to extract text data from Microsoft Word documents. It supports both the modern .docx format and the legacy .doc format. Depending on the file type, additional dependencies are required.
Setup
To use DocxLoader, you'll need the @langchain/community integration along with either the mammoth or the word-extractor package:
- mammoth: For processing .docx files.
- word-extractor: For handling .doc files.
Installation
For .docx Files
- npm: npm install @langchain/community @langchain/core mammoth
- Yarn: yarn add @langchain/community @langchain/core mammoth
- pnpm: pnpm add @langchain/community @langchain/core mammoth
For .doc Files
- npm: npm install @langchain/community @langchain/core word-extractor
- Yarn: yarn add @langchain/community @langchain/core word-extractor
- pnpm: pnpm add @langchain/community @langchain/core word-extractor
Usage
Loading .docx Files
For .docx files, there is no need to explicitly specify any parameters when initializing the loader:
import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

// No options are needed for .docx files.
const loader = new DocxLoader(
  "src/document_loaders/tests/example_data/attention.docx"
);

// load() resolves to an array of Document objects.
const docs = await loader.load();
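Once the documents are loaded, a quick preview of each one can confirm that the text was extracted. The following is a minimal sketch, not part of the library: LoadedDoc is a stand-in for the Document shape (pageContent plus metadata), previewDocs is a hypothetical helper, and the sample data is made up so the snippet runs without any Word file.

```typescript
// Stand-in for the Document shape returned by load(): extracted text plus metadata.
// (Assumption: only these two fields are used here.)
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: one summary line per document, with whitespace collapsed
// and the text truncated to maxChars characters.
function previewDocs(docs: LoadedDoc[], maxChars = 80): string[] {
  return docs.map((doc, i) => {
    const text = doc.pageContent.replace(/\s+/g, " ").trim();
    const preview =
      text.length > maxChars ? `${text.slice(0, maxChars)}...` : text;
    return `#${i} (${text.length} chars): ${preview}`;
  });
}

// Made-up sample data standing in for the result of loader.load().
const sampleDocs: LoadedDoc[] = [
  {
    pageContent: "Attention  Is All\nYou Need",
    metadata: { source: "attention.docx" },
  },
];

console.log(previewDocs(sampleDocs).join("\n"));
```

In real use, the docs array returned by loader.load() would be passed in place of sampleDocs.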
Loading .doc Files
For .doc files, you must explicitly specify the type as doc when initializing the loader:
import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

// Legacy .doc files require the type option to be set to "doc".
const loader = new DocxLoader(
  "src/document_loaders/tests/example_data/attention.doc",
  {
    type: "doc",
  }
);

const docs = await loader.load();
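When a script handles both formats, the options can be derived from the file extension instead of being hard-coded. The sketch below is an illustration only: wordLoaderOptionsFor is a hypothetical helper, not part of @langchain/community, and it assumes that omitting the options for .docx files is equivalent to the default behaviour described above.

```typescript
// Hypothetical helper (not part of @langchain/community): choose the
// DocxLoader options from the file extension.
type WordLoaderOptions = { type: "doc" } | undefined;

function wordLoaderOptionsFor(filePath: string): WordLoaderOptions {
  const lower = filePath.toLowerCase();
  if (lower.endsWith(".doc")) {
    // Legacy format: the type must be given explicitly.
    return { type: "doc" };
  }
  if (lower.endsWith(".docx")) {
    // Modern format: no options needed, so the loader default applies.
    return undefined;
  }
  throw new Error(`Unsupported Word file extension: ${filePath}`);
}

// Usage sketch (requires @langchain/community plus mammoth or word-extractor):
// const filePath = "src/document_loaders/tests/example_data/attention.doc";
// const loader = new DocxLoader(filePath, wordLoaderOptionsFor(filePath));
// const docs = await loader.load();
```

Throwing on unknown extensions is a design choice for this sketch; it surfaces a wrong file type before the loader is ever constructed.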