GitHub

This example goes over how to load data from a GitHub repository. You can set the GITHUB_ACCESS_TOKEN environment variable to a GitHub access token to increase the rate limit and access private repositories.

Setup

The GitHub loader requires the ignore npm package as a peer dependency. Install it like this:

npm:

npm install ignore

Yarn:

yarn add ignore

pnpm:

pnpm add ignore

Usage

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.com/langchain-ai/langchainjs",
    {
      branch: "main",
      recursive: false,
      unknown: "warn",
      maxConcurrency: 5, // Defaults to 2
    }
  );
  const docs = await loader.load();
  console.log({ docs });
};

API Reference:

GithubRepoLoader from langchain/document_loaders/web/github

The loader will ignore binary files like images.

Using .gitignore Syntax

To ignore specific files, pass an ignorePaths array to the constructor. Each entry is a pattern in .gitignore syntax:

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.com/langchain-ai/langchainjs",
    { branch: "main", recursive: false, unknown: "warn", ignorePaths: ["*.md"] }
  );
  const docs = await loader.load();
  console.log({ docs });
  // Will not include any .md files
};

Using a Different GitHub Instance

You may want to target a different GitHub instance than github.com, e.g. if you have a GitHub Enterprise instance for your company. For this you need two additional parameters:

baseUrl - the base URL of your GitHub instance, so the githubUrl matches <baseUrl>/<owner>/<repo>/...
apiUrl - the URL of the API endpoint of your GitHub instance

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.your.company/org/repo-name",
    {
      baseUrl: "https://github.your.company",
      apiUrl: "https://github.your.company/api/v3",
      // Read the token from the environment instead of hardcoding it.
      accessToken: process.env.GITHUB_ACCESS_TOKEN,
      branch: "main",
      recursive: true,
      unknown: "warn",
    }
  );
  const docs = await loader.load();
  console.log({ docs });
};
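
The relationship between githubUrl and baseUrl described above can be checked before constructing the loader. The following is a minimal sketch under stated assumptions: parseRepoUrl and RepoCoordinates are hypothetical names introduced here for illustration, they are not part of LangChain, and the loader may validate URLs differently.

```typescript
// Hypothetical helper, not part of LangChain: checks that a repository URL
// has the shape <baseUrl>/<owner>/<repo>/... and extracts owner and repo.
interface RepoCoordinates {
  owner: string;
  repo: string;
}

function parseRepoUrl(githubUrl: string, baseUrl: string): RepoCoordinates {
  // Normalize trailing slashes so "https://host/" and "https://host" compare equal.
  const base = baseUrl.replace(/\/+$/, "");
  if (!githubUrl.startsWith(`${base}/`)) {
    throw new Error(`"${githubUrl}" is not located under baseUrl "${base}"`);
  }
  // Everything after the base URL should start with <owner>/<repo>.
  const [owner, repo] = githubUrl
    .slice(base.length + 1)
    .split("/")
    .filter((segment) => segment.length > 0);
  if (!owner || !repo) {
    throw new Error(`"${githubUrl}" does not contain <owner>/<repo>`);
  }
  return { owner, repo };
}

const coords = parseRepoUrl(
  "https://github.your.company/org/repo-name",
  "https://github.your.company"
);
console.log(coords); // logs the owner "org" and the repo "repo-name"
```

A check like this fails fast on a mismatched baseUrl, for example when a github.com URL is combined with an Enterprise base URL.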

Dealing with Submodules

If your repository has submodules, decide whether the loader should follow them. You control this with the boolean processSubmodules parameter; by default, submodules are not processed. Processing submodules only works when the recursive parameter is also set to true.

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.com/langchain-ai/langchainjs",
    {
      branch: "main",
      recursive: true,
      processSubmodules: true,
      unknown: "warn",
    }
  );
  const docs = await loader.load();
  console.log({ docs });
};

Note that the loader will not follow submodules hosted on a different GitHub instance than the current repository.
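
Because processSubmodules depends on recursive, it can help to check the combination before creating the loader. The sketch below is an assumption-laden illustration: checkSubmoduleOptions and SubmoduleOptions are hypothetical names that are not part of LangChain, and the sketch does not describe how the loader itself reacts to an invalid combination.

```typescript
// Hypothetical pre-flight check, not part of LangChain. The option names
// mirror the loader parameters described above.
interface SubmoduleOptions {
  recursive?: boolean;
  processSubmodules?: boolean;
}

function checkSubmoduleOptions(options: SubmoduleOptions): string[] {
  const problems: string[] = [];
  // Submodule processing is documented to work only when recursive is true.
  if (options.processSubmodules === true && options.recursive !== true) {
    problems.push('"processSubmodules" requires "recursive" to be true');
  }
  return problems;
}

// Reports one problem: processSubmodules is set without recursive.
console.log(checkSubmoduleOptions({ recursive: false, processSubmodules: true }));
// Reports no problems.
console.log(checkSubmoduleOptions({ recursive: true, processSubmodules: true }));
```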

Streaming Large Repositories

To process large repositories in a memory-efficient manner, use the loadAsStream method, which asynchronously streams documents from the entire GitHub repository.

import { GithubRepoLoader } from "langchain/document_loaders/web/github";

export const run = async () => {
  const loader = new GithubRepoLoader(
    "https://github.com/langchain-ai/langchainjs",
    {
      branch: "main",
      recursive: false,
      unknown: "warn",
      maxConcurrency: 3, // Defaults to 2
    }
  );

  const docs = [];
  for await (const doc of loader.loadAsStream()) {
    docs.push(doc);
  }

  console.log({ docs });
};
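
Collecting every streamed document into one array, as the example above does, still holds the whole repository in memory. One possible alternative is to handle the stream in small batches. This is a minimal sketch, not a LangChain API: inBatches is a hypothetical helper, and fakeStream is a stand-in for loader.loadAsStream() so the code runs without network access. The pageContent field on each item is assumed here for illustration.

```typescript
// Hypothetical helper, not part of LangChain: groups items from any async
// iterable into arrays of at most `size` items.
async function* inBatches<T>(
  source: AsyncIterable<T>,
  size: number
): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  let batch: T[] = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = []; // Start a fresh array so the yielded batch can be released.
    }
  }
  // Emit the final, possibly smaller batch.
  if (batch.length > 0) {
    yield batch;
  }
}

// Stand-in for loader.loadAsStream(), used here so the sketch runs offline.
async function* fakeStream(): AsyncGenerator<{ pageContent: string }> {
  for (let i = 0; i < 5; i += 1) {
    yield { pageContent: `file ${i}` };
  }
}

async function main(): Promise<void> {
  for await (const batch of inBatches(fakeStream(), 2)) {
    // Replace this with real work, e.g. embedding or writing to a store.
    console.log(batch.map((doc) => doc.pageContent));
  }
}

void main();
```

With a real loader, inBatches(loader.loadAsStream(), 2) would take the place of the stand-in, so only one batch of documents is referenced by this code at a time.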
