Docx files

The DocxLoader extracts text from Microsoft Word documents. It supports both the modern .docx format and the legacy .doc format, and each format requires its own additional dependency.


Setup

To use DocxLoader, you'll need the @langchain/community integration package along with either the mammoth or the word-extractor package:

  • mammoth: For processing .docx files.
  • word-extractor: For handling .doc files.

Installation

For .docx Files

npm install @langchain/community @langchain/core mammoth

For .doc Files

npm install @langchain/community @langchain/core word-extractor

Usage

Loading .docx Files

For .docx files, there is no need to explicitly specify any parameters when initializing the loader:

import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

const loader = new DocxLoader(
  "src/document_loaders/tests/example_data/attention.docx"
);

const docs = await loader.load();
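
After loading, it is often useful to check what was extracted. The sketch below assumes each loaded document exposes a pageContent string and a metadata object, which is the usual shape of a LangChain document. The LoadedDoc interface and the previewDocs helper are hypothetical names introduced for illustration and are not part of the library.

```typescript
// Local stand-in for a loaded document (assumption: mirrors the usual
// LangChain document shape of a text body plus a metadata object).
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: returns a short, single-line preview of each document.
function previewDocs(docs: LoadedDoc[], maxChars = 80): string[] {
  return docs.map((doc) => {
    // Collapse runs of whitespace so the preview fits on one line.
    const text = doc.pageContent.replace(/\s+/g, " ").trim();
    return text.length > maxChars ? `${text.slice(0, maxChars)}...` : text;
  });
}
```

Calling previewDocs(docs) on the array returned by loader.load() would give one preview string per document, which can be logged to confirm that the file was parsed as expected.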

Loading .doc Files

For .doc files, you must explicitly specify the type as doc when initializing the loader:

import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

const loader = new DocxLoader(
  "src/document_loaders/tests/example_data/attention.doc",
  {
    type: "doc",
  }
);

const docs = await loader.load();
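
When a project handles both formats, the loader options can be derived from the file extension instead of being hard-coded. The following is a minimal sketch under that assumption. The docxLoaderOptionsFor helper is a hypothetical name and not part of @langchain/community. It returns the type: "doc" option described above for .doc files and no options for .docx files.

```typescript
// Hypothetical helper (not part of LangChain): picks DocxLoader options
// based on the file extension.
function docxLoaderOptionsFor(filePath: string): { type: "doc" } | undefined {
  const lower = filePath.toLowerCase();
  // Check ".docx" first so it is never mistaken for another extension.
  if (lower.endsWith(".docx")) {
    // .docx needs no explicit options.
    return undefined;
  }
  if (lower.endsWith(".doc")) {
    // Legacy .doc files require the type to be set explicitly.
    return { type: "doc" };
  }
  throw new Error(`Unsupported Word file: ${filePath}`);
}
```

The result could then be passed as the second argument when constructing the loader, for example new DocxLoader(filePath, docxLoaderOptionsFor(filePath)). This assumes the constructor treats an undefined second argument the same as an omitted one.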
