
Audio/Video Structured Extraction

Google's Gemini API offers support for audio and video input, along with function calling. By pairing these features, we can extract structured data from audio or video input.

In the following examples, we'll demonstrate how to read and send MP3 and MP4 files to the Gemini API, and receive structured output as a response.

Setup

These examples use the Gemini API, so you'll need a Google Vertex AI credentials file (or a stringified credentials file if you're running in a web environment):

GOOGLE_APPLICATION_CREDENTIALS="credentials.json"
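If you're running in a web environment, you can instead provide the stringified contents of the credentials file via an environment variable. A minimal sketch, assuming the @langchain/google-vertexai-web entrypoint and its GOOGLE_VERTEX_AI_WEB_CREDENTIALS variable:

GOOGLE_VERTEX_AI_WEB_CREDENTIALS={"type":"service_account","project_id":"YOUR_PROJECT",...}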

Next, install the @langchain/google-vertexai and @langchain/core packages:

npm install @langchain/google-vertexai @langchain/core
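If you're targeting a web environment, you would install the web variant of the integration instead (assuming @langchain/google-vertexai-web suits your setup):

npm install @langchain/google-vertexai-web @langchain/core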

Video

This example uses a LangChain YouTube video on datasets and testing in LangSmith, sped up to 1.5x. The video is converted to base64 and sent to Gemini along with a prompt asking for a structured list of tasks I can do to improve my knowledge of datasets and testing in LangSmith.

We define a tool schema for this using Zod, and pass it to the model via the withStructuredOutput method.

import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { HumanMessage } from "@langchain/core/messages";
import fs from "fs";
import { z } from "zod";

function fileToBase64(filePath: string): string {
  return fs.readFileSync(filePath, "base64");
}

const lanceLsEvalsVideo = "lance_ls_eval_video.mp4";
const lanceInBase64 = fileToBase64(lanceLsEvalsVideo);

const tool = z.object({
  tasks: z.array(z.string()).describe("A list of tasks."),
});

const model = new ChatVertexAI({
  model: "gemini-1.5-pro-preview-0409",
  temperature: 0,
}).withStructuredOutput(tool, {
  name: "tasks_list_tool",
});

const prompt = ChatPromptTemplate.fromMessages([
  new MessagesPlaceholder("video"),
]);

const chain = prompt.pipe(model);
const response = await chain.invoke({
  video: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "video/mp4",
        data: lanceInBase64,
      },
      {
        type: "text",
        text: `The following video is an overview of how to build datasets in LangSmith.
Given the following video, come up with three tasks I should do to further improve my knowledge around using datasets in LangSmith.
Only reference features that were outlined or described in the video.

Rules:
Use the "tasks_list_tool" to return a list of tasks.
Your tasks should be tailored for an engineer who is looking to improve their knowledge around using datasets and evaluations, specifically with LangSmith.`,
      },
    ],
  }),
});

console.log("response", response);
/*
response {
  tasks: [
    'Explore the LangSmith SDK documentation for in-depth understanding of dataset creation, manipulation, and versioning functionalities.',
    'Experiment with different dataset types like Key-Value, Chat, and LLM to understand their structures and use cases.',
    'Try uploading a CSV file containing question-answer pairs to LangSmith and create a new dataset from it.'
  ]
}
*/
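
Because we passed a Zod schema to withStructuredOutput, the response is a plain object matching that schema rather than a chat message, so the extracted tasks can be used directly. A minimal sketch continuing the example above:

// response matches the Zod schema: { tasks: string[] }
response.tasks.forEach((task, i) => {
  console.log(`${i + 1}. ${task}`);
});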


Audio

The next example loads an audio (MP3) file containing Mozart's Requiem in D Minor and prompts Gemini to return a single array of strings, with each string being an instrument from the song.

Here, we'll also use the withStructuredOutput method to get structured output from the model.

import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { HumanMessage } from "@langchain/core/messages";
import fs from "fs";
import { z } from "zod";

function fileToBase64(filePath: string): string {
  return fs.readFileSync(filePath, "base64");
}

const mozartMp3File = "Mozart_Requiem_D_minor.mp3";
const mozartInBase64 = fileToBase64(mozartMp3File);

const tool = z.object({
  instruments: z
    .array(z.string())
    .describe("A list of instruments found in the audio."),
});

const model = new ChatVertexAI({
  model: "gemini-1.5-pro-preview-0409",
  temperature: 0,
}).withStructuredOutput(tool, {
  name: "instruments_list_tool",
});

const prompt = ChatPromptTemplate.fromMessages([
  new MessagesPlaceholder("audio"),
]);

const chain = prompt.pipe(model);
const response = await chain.invoke({
  audio: new HumanMessage({
    content: [
      {
        type: "media",
        mimeType: "audio/mp3",
        data: mozartInBase64,
      },
      {
        type: "text",
        text: `The following audio is a song by Mozart. Respond with a list of instruments you hear in the song.

Rules:
Use the "instruments_list_tool" to return a list of instruments.`,
      },
    ],
  }),
});

console.log("response", response);
/*
response {
  instruments: [
    'violin', 'viola',
    'cello', 'double bass',
    'flute', 'oboe',
    'clarinet', 'bassoon',
    'horn', 'trumpet',
    'timpani'
  ]
}
*/


From a quick Google search, we see the song was composed using the following instruments:

The Requiem is scored for 2 basset horns in F, 2 bassoons, 2 trumpets in D, 3 trombones (alto, tenor, and bass),
timpani (2 drums), violins, viola, and basso continuo (cello, double bass, and organ).

Gemini did pretty well here! Even though music isn't its primary focus, it was able to identify several of the instruments used in the song, and it didn't hallucinate any!

