Similarity Score Threshold

A common problem with similarity search is that you have to supply a k value, which determines how many results come back. But what if you don't know the right k in advance? What if you want the system to return every relevant result?

For a real-world scenario, imagine a very long document, written by a product manager, that describes a product. It might cover 10, 15, 20, 100, or more features. How do you choose a k value that makes the system return every relevant result for the question "What are all the features that product X has?"

To solve this problem, LangChain offers a feature called Recursive Similarity Search. With it, you can run a similarity search without relying solely on the k value: the system returns every result that meets the minimum similarity score you specify.

You can use Recursive Similarity Search by using a vector store as a retriever.
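The idea of growing k until the threshold stops being met can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `fakeIndex`, `searchWithScore`, and `thresholdSearch` are hypothetical names, and the scores are made up so the example runs without an embeddings provider.

```typescript
// Hypothetical stand-in for a vector store: [text, score] pairs, best first.
// The scores are invented for illustration only.
type Scored = [string, number];

const fakeIndex: Scored[] = [
  ["Buildings are made out of building materials", 0.95],
  ["Buildings are made out of wood", 0.94],
  ["Buildings are made out of brick", 0.93],
  ["Buildings are made out of stone", 0.92],
  ["Buildings are made out of atoms", 0.91],
  ["Cars are made out of metal", 0.8],
  ["Cars are made out of plastic", 0.79],
];

// Returns the top-k entries, mimicking a "search with score" call.
function searchWithScore(k: number): Scored[] {
  return fakeIndex.slice(0, k);
}

interface ThresholdOptions {
  minSimilarityScore: number; // keep results scoring at least this much
  maxK: number; // never fetch more than this many results
  kIncrement: number; // how much k grows on each round
}

function thresholdSearch(opts: ThresholdOptions): {
  results: string[];
  fetches: number[];
} {
  if (opts.kIncrement <= 0) throw new Error("kIncrement must be positive");
  const fetches: number[] = [];
  let k = 0;
  let passing: Scored[] = [];
  do {
    // Grow k, but never beyond maxK.
    k = Math.min(k + opts.kIncrement, opts.maxK);
    fetches.push(k);
    passing = searchWithScore(k).filter(
      ([, score]) => score >= opts.minSimilarityScore
    );
    // Keep going only while every fetched result passed the threshold,
    // since that means more passing results may exist beyond k.
  } while (passing.length >= k && k < opts.maxK);
  return { results: passing.map(([text]) => text), fetches };
}

const demo = thresholdSearch({
  minSimilarityScore: 0.9,
  maxK: 100,
  kIncrement: 2,
});
console.log(demo.fetches); // [ 2, 4, 6 ]
console.log(demo.results.length); // 5
```

With these made-up scores, the third fetch asks for six results but only five pass the threshold, so the loop stops there and returns those five.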

Usage

Tip: See this section for general instructions on installing integration packages.

npm install @langchain/openai

yarn add @langchain/openai

pnpm add @langchain/openai

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
import { ScoreThresholdRetriever } from "langchain/retrievers/score_threshold";

const vectorStore = await MemoryVectorStore.fromTexts(
  [
    "Buildings are made out of brick",
    "Buildings are made out of wood",
    "Buildings are made out of stone",
    "Buildings are made out of atoms",
    "Buildings are made out of building materials",
    "Cars are made out of metal",
    "Cars are made out of plastic",
  ],
  // One metadata object per text above
  [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }, { id: 6 }, { id: 7 }],
  new OpenAIEmbeddings()
);

const retriever = ScoreThresholdRetriever.fromVectorStore(vectorStore, {
  minSimilarityScore: 0.9, // Only return results with at least this similarity score
  maxK: 100, // The maximum K value to use. Set it based on your chunk size so you don't run out of tokens
  kIncrement: 2, // How much to increase K by each time. It'll fetch N results, then N + kIncrement, then N + kIncrement * 2, etc.
});

const result = await retriever.invoke("What are buildings made out of?");

console.log(result);

/*
  [
    Document {
      pageContent: 'Buildings are made out of building materials',
      metadata: { id: 5 }
    },
    Document {
      pageContent: 'Buildings are made out of wood',
      metadata: { id: 2 }
    },
    Document {
      pageContent: 'Buildings are made out of brick',
      metadata: { id: 1 }
    },
    Document {
      pageContent: 'Buildings are made out of stone',
      metadata: { id: 3 }
    },
    Document {
      pageContent: 'Buildings are made out of atoms',
      metadata: { id: 4 }
    }
  ]
*/
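To choose a sensible minSimilarityScore, it helps to see what the score measures. The sketch below assumes the store reports cosine similarity between the query embedding and each document embedding; other vector stores may use a different metric or scale, so check the one you use. The two-dimensional vectors are toy values, not real embeddings, and `cosineSimilarity` is a hypothetical helper written for this illustration.

```typescript
// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings" (assumed values for illustration only).
const queryVector = [1, 0];
const nearVector = [0.9, 0.1]; // points almost the same way as the query
const farVector = [0.1, 0.9]; // points mostly in another direction

const threshold = 0.9; // plays the role of minSimilarityScore
const nearScore = cosineSimilarity(queryVector, nearVector);
const farScore = cosineSimilarity(queryVector, farVector);

console.log(nearScore >= threshold); // true
console.log(farScore >= threshold); // false
```

Under this assumption, a threshold of 0.9 keeps only vectors pointing in nearly the same direction as the query, which is why the car sentences drop out of the result above while all five building sentences remain.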

API Reference:

MemoryVectorStore from langchain/vectorstores/memory
OpenAIEmbeddings from @langchain/openai
ScoreThresholdRetriever from langchain/retrievers/score_threshold
