Aug 26, 2025 - 20 min read
Extract audio from video and generate transcripts with timestamps and speaker labels

A Node.js script that uses Google's Gemini 2.5 Flash to transcribe audio from video files with timestamps and speaker identification, with full TypeScript type safety using Zod schemas.

Patrick the AI Engineer

Introduction

You've got a two-hour meeting recording and need to know who said what and when. Gemini 2.5 Flash handles audio transcription, speaker identification, and timestamps in one API call.

We're going to build a Node.js script that takes a video file, extracts the audio with FFmpeg, and sends it to Gemini for transcription. The model figures out speakers without any training data or voice samples.

First, install the dependencies:

npm init -y
npm install ai @ai-sdk/google dotenv zod
npm install --save-dev @types/node typescript tsx

You'll need FFmpeg installed too (brew install ffmpeg on macOS). Create a .env file with your Google AI API key from Google AI Studio:

GOOGLE_GENERATIVE_AI_API_KEY=your_api_key_here
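
The google() provider from @ai-sdk/google reads GOOGLE_GENERATIVE_AI_API_KEY from the environment automatically. If you want the script to fail fast when the key is missing, a small guard like this (my own addition, not part of the original script) can sit at the top of the file:

import dotenv from "dotenv";
dotenv.config();

// Fail early with a clear message instead of a confusing API error later.
if (!process.env.GOOGLE_GENERATIVE_AI_API_KEY) {
  throw new Error(
    "GOOGLE_GENERATIVE_AI_API_KEY is not set. Add it to your .env file."
  );
}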

Let's start with the audio extraction. We're using FFmpeg to strip audio from video:

import { execSync } from "child_process";

async function extractAudio(
  inputFile: string,
  outputFile: string
) {
  execSync(
    `ffmpeg -i "${inputFile}" -vn -acodec mp3 ` +
    `-ab 192k -ar 44100 -y "${outputFile}"`
  );
}

The -vn flag strips video completely. We're encoding to MP3 at 192kbps with a 44.1kHz sample rate. This gives us clean audio that Gemini handles well.
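
If you want to confirm the extraction before spending any tokens, ffprobe (which ships alongside FFmpeg) can report the codec and sample rate of the output file. This check is optional and my own addition:

import { execSync } from "child_process";

// Optional sanity check: print the codec, sample rate and bit rate of the extracted audio.
function inspectAudio(file: string) {
  const info = execSync(
    `ffprobe -v error -show_entries stream=codec_name,sample_rate,bit_rate ` +
    `-of default=noprint_wrappers=1 "${file}"`
  ).toString();
  console.log(info);
}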

Now we can send the audio to Gemini. Start with the imports:

import fs from "fs";
import { google } from "@ai-sdk/google";
import { generateText } from "ai";

The transcription function uses generateText with a multimodal message. We're mixing text instructions with audio data:

async function transcribeAudio() {
  const result = await generateText({
    model: google("gemini-2.5-flash"),
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Generate a transcript...",
          },
          {
            type: "file",
            data: fs.readFileSync("./meeting.mp3"),
            mediaType: "audio/mpeg",
          },
        ],
      },
    ],
  });
}

We're sending two pieces of content: the instruction and the audio file. The mediaType tells Gemini this is MP3 audio.
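
This walkthrough assumes MP3 output, but if you extract to another container the mediaType has to match. A small lookup keeps that in one place; the mediaTypeFor helper below is hypothetical and not part of the original script:

// Hypothetical helper: map a file extension to the media type sent to Gemini.
// Only the formats listed here are handled; extend as needed.
function mediaTypeFor(file: string): string {
  const ext = file.split(".").pop()?.toLowerCase();
  switch (ext) {
    case "mp3":
      return "audio/mpeg";
    case "wav":
      return "audio/wav";
    case "ogg":
      return "audio/ogg";
    case "flac":
      return "audio/flac";
    default:
      throw new Error(`Unsupported audio format: ${ext}`);
  }
}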

The prompt matters more than you'd think. Here's what works:

const prompt = 
  "Generate a transcript of the audio given to you. " +
  "Return timestamps [hh:mm:ss], speaker labels " +
  "and the transcripts wherever possible";

Being specific about the timestamp format ([hh:mm:ss]) guides the output structure. The phrase "wherever possible" gives the model flexibility when timestamps aren't clear or there's only one speaker.

Let's add the prompt to our function and return the result:

async function transcribeAudio() {
  const result = await generateText({
    model: google("gemini-2.5-flash"),
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Generate a transcript of the audio given to you. " +
                  "Return timestamps [hh:mm:ss], speaker labels " +
                  "and the transcripts wherever possible",
          },
          {
            type: "file",
            data: fs.readFileSync("./meeting.mp3"),
            mediaType: "audio/mpeg",
          },
        ],
      },
    ],
  });

  return result.text;
}

The transcript comes back as formatted text in result.text. Gemini structures it according to your prompt instructions.
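
Because result.text is a plain string at this point, saving it for review is a one-liner. Here's a minimal way to call the function and persist the output; the transcript.txt filename is my own choice:

async function run() {
  // Write the raw transcript string to disk for human review.
  const transcript = await transcribeAudio();
  fs.writeFileSync("./transcript.txt", transcript);
  console.log("Transcript written to ./transcript.txt");
}

run().catch(console.error);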

The String Formatting Problem

Here's the catch: the transcript returns as an unstructured string. Even with specific prompt instructions, the model chooses different formats across runs. You can't reliably parse the output.

Run the same audio file three times and you might get three completely different formats. Here's what I've seen from identical prompts:

Format 1: Bracketed timestamps

[00:00:26] Speaker 1: As a result of Lieutenant Frank Drebin's heroic work on New Year's Eve, I am happy to announce that police squad is back with a renewed commitment to accountability and justice.
[00:13:50] Speaker 1: And in that spirit, we will not ignore Lieutenant Drebin's questionable actions on the days leading up to the event.

Format 2: JSON structure

[
  {
    "timestamp": "00:00:23",
    "speaker": "Speaker 1",
    "text": "As a result of Lieutenant Frank Dbin's heroic work on New Year's Eve, I am happy to announce that police squad is back with a renewed commitment to accountability and justice."
  },
  {
    "timestamp": "00:13:30",
    "speaker": "Speaker 1",
    "text": "And in that spirit, we will not ignore Lieutenant Dbin's questionable actions on the days leading up to the event."
  }
]

Format 3: Millisecond ranges

[ 0m0s117ms - 0m4s777ms ] As a result of Lieutenant Frank Drebin's heroic work on New Year's Eve,
[ 0m5s67ms - 0m12s307ms ] I am happy to announce that police squad is back with a renewed commitment to accountability and justice.
[ 0m13s607ms - 0m21s307ms ] And in that spirit, we will not ignore Lieutenant Drebin's questionable actions on the days leading up to the event.

Same audio, same prompt, different formats. Writing a parser that handles all three (and future variations) becomes impossible. You'd need regex patterns for every format the model might invent.

The model's temperature setting makes this worse. Higher temperatures increase creativity and randomness, meaning more format variations. Lower temperatures (closer to 0) produce more deterministic outputs but don't eliminate format inconsistency entirely. Even at temperature 0, the model might still switch between formats based on subtle differences in the audio or its internal state.
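
If you want to experiment with this yourself, the AI SDK accepts a temperature option alongside the model. Here's a sketch of the same call pinned to temperature 0; the transcribeDeterministic name is mine:

// Same transcription call as before, with temperature pinned to 0.
// Lower temperature reduces format drift in the raw-text approach but does not eliminate it.
async function transcribeDeterministic(audioFile: string) {
  const result = await generateText({
    model: google("gemini-2.5-flash"),
    temperature: 0,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Generate a transcript of the audio given to you. " +
                  "Return timestamps [hh:mm:ss], speaker labels " +
                  "and the transcripts wherever possible",
          },
          {
            type: "file",
            data: fs.readFileSync(audioFile),
            mediaType: "audio/mpeg",
          },
        ],
      },
    ],
  });

  return result.text;
}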

If you need structured data, you have two options: use the AI SDK's structured output mode with Zod schemas to force consistent formatting, or keep it simple and work with the raw string. For basic transcription tasks where humans read the output, the string works fine.

Getting Type-Safe Structured Output

The solution to inconsistent formatting is structured output with Zod. Instead of hoping the model formats text correctly, you define a schema that enforces the structure. The AI SDK validates the output against your schema and gives you full TypeScript type safety.

Zod is already in our dependencies from the setup step, so the only change is switching from generateText to generateObject.

Define a schema that describes your transcript structure:

import { generateObject } from "ai";
import { z } from "zod";

const schema = z.array(
  z.object({
    text: z.string().describe("The transcript of the audio"),
    timestamp: z.object({
      start: z.string().describe("The start time of the timestamp as [hh:mm:ss]"),
      end: z.string().describe("The end time of the timestamp as [hh:mm:ss]"),
    }).describe("The timestamp of the transcript"),
    speaker: z.string().describe("The speaker(s) of the transcript"),
  })
);

type TranscriptSchema = z.infer<typeof schema>;

The schema defines an array of transcript segments. Each segment has text, timestamps with start/end times, and a speaker label. The .describe() calls guide the model on what to generate for each field.
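
To see what the inferred type buys you, here's a hand-written segment that satisfies the schema. This snippet is illustration only; schema.parse throws a ZodError if the shape ever drifts:

// Illustration only: a segment shaped exactly like the schema expects.
const example: TranscriptSchema = [
  {
    text: "Welcome back, everyone.",
    timestamp: { start: "00:00:01", end: "00:00:03" },
    speaker: "Speaker 1",
  },
];

schema.parse(example); // passes; a malformed segment would throw here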

Now update the transcription function to use generateObject:

async function transcribeAudio(audioFile: string) {
  const result = await generateObject({
    model: google("gemini-2.5-flash"),
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Generate a transcript of the audio given to you. Return timestamps [hh:mm:ss], speaker labels and the transcripts wherever possible",
          },
          {
            type: "file",
            data: fs.readFileSync(audioFile),
            mediaType: "audio/mpeg",
          },
        ],
      },
    ],
    schema: schema,
  });

  console.log(result.object as TranscriptSchema);
}

The only changes are swapping generateText for generateObject and adding the schema parameter. The result now comes back as result.object instead of result.text.

Here's what the output looks like:

[
  {
    text: "As a result of Lieutenant Frank Drebin's heroic work on New Year's Eve, I am happy to announce that police squad is back with a renewed commitment to accountability and justice.",
    timestamp: { start: '00:00:27', end: '00:12:47' },
    speaker: 'Speaker 1'
  },
  {
    text: "And in that spirit, we will not ignore Lieutenant Drebin's questionable actions on the days leading up to the event. And right now, Frank Drebin is being subject to a rigorous and thorough internal investigation.",
    timestamp: { start: '00:13:30', end: '00:29:13' },
    speaker: 'Speaker 1'
  },
  {
    text: 'Thank you.',
    timestamp: { start: '00:30:3', end: '00:30:843' },
    speaker: 'Speaker 1'
  }
]

Same audio, same schema, consistent structure every time. TypeScript knows the exact shape of this data. You get autocomplete on result.object[0].text, result.object[0].timestamp.start, and result.object[0].speaker. No parsing, no regex, no string manipulation.
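
Downstream processing becomes plain array work. For example, here's a sketch of a formatter that renders the typed segments as readable lines; it's my own addition, not part of the original script:

// Sketch: render typed transcript segments as "[start - end] Speaker: text" lines.
function formatTranscript(segments: TranscriptSchema): string {
  return segments
    .map((s) => `[${s.timestamp.start} - ${s.timestamp.end}] ${s.speaker}: ${s.text}`)
    .join("\n");
}

// Usage: console.log(formatTranscript(result.object));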

The structured output approach costs slightly more in output tokens because the model needs to generate valid JSON. But the trade-off is worth it if you're building anything that processes transcripts programmatically.

Understanding Token Usage and Costs

Audio files consume massive amounts of tokens. A 30-second video used 4,451 input tokens and 1,514 output tokens in our test. Scale that to a 2-hour meeting and you're looking at hundreds of thousands of tokens.

Gemini 2.5 Flash pricing (as of 2025):

  • Input: $0.30 per 1M tokens
  • Output: $2.50 per 1M tokens

For our 30-second test:

  • Input cost: 4,451 tokens × $0.30 / 1,000,000 = $0.0013
  • Output cost: 1,514 tokens × $2.50 / 1,000,000 = $0.0038
  • Total: ~$0.005 per transcription

A 10-minute recording might use 50,000-100,000 input tokens ($0.015-$0.030) plus output tokens. A 1-hour recording could cost $0.20-$0.50. These are ballpark estimates; actual token usage varies based on audio complexity, speech density, and background noise.

The result.usage object shows helpful debugging info:

{
  inputTokens: 4451,
  outputTokens: 1514,
  totalTokens: 6029,
  reasoningTokens: 64,
  cachedInputTokens: 3779
}

The cachedInputTokens field is particularly useful. If you're transcribing multiple files in quick succession, Gemini caches parts of the model context and reuses them. Cached tokens are billed at a much lower rate than fresh input tokens, which can significantly reduce costs for batch processing.
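
If you want to log a rough cost next to the usage object, the arithmetic from the pricing list above fits in a few lines. This estimator is my own addition and ignores the cached-token discount, so treat it as an upper bound:

// Rough cost estimate using the Gemini 2.5 Flash prices listed above.
// Cached input tokens are counted at the full input rate, so this is an upper bound.
const INPUT_PRICE_PER_M = 0.3; // USD per 1M input tokens
const OUTPUT_PRICE_PER_M = 2.5; // USD per 1M output tokens

function estimateCostUSD(usage: { inputTokens?: number; outputTokens?: number }): number {
  const inputCost = ((usage.inputTokens ?? 0) / 1_000_000) * INPUT_PRICE_PER_M;
  const outputCost = ((usage.outputTokens ?? 0) / 1_000_000) * OUTPUT_PRICE_PER_M;
  return inputCost + outputCost;
}

// Usage: console.log(`~$${estimateCostUSD(result.usage).toFixed(4)} for this call`);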

Now we can wire everything together. The main function orchestrates the pipeline:

async function main() {
  await extractAudio("./meeting.mp4", "./meeting.mp3");
  await transcribeAudio("./meeting.mp3");
}

main().catch(console.error);

We extract audio first, then transcribe it. The await ensures each step completes before the next starts. If you tried to transcribe before extraction finished, you'd get a file-not-found error.

Let's add error handling to the extraction function:

async function extractAudio(inputFile: string, outputFile: string) {
  try {
    console.log(`Extracting audio from ${inputFile}...`);
    execSync(
      `ffmpeg -i "${inputFile}" -vn -acodec mp3 -ab 192k -ar 44100 -y "${outputFile}"`,
      { stdio: "inherit" }
    );
    console.log(`Audio extraction completed: ${outputFile}`);
  } catch (error) {
    console.error("Error extracting audio:", error);
    throw error;
  }
}

The console logs show progress. Setting stdio: "inherit" pipes FFmpeg's output to your terminal so you can see what's happening. Without it, the command runs silently.

That's the complete pipeline. FFmpeg strips audio from video, and Gemini transcribes it with timestamps and speaker labels. The entire script fits comfortably in under a hundred lines.

A few things to keep in mind: audio transcription isn't free. A 5-minute recording costs a few cents, but a 2-hour meeting could cost a dollar or more. Test with short clips first.

The accuracy depends on audio quality. Background noise, overlapping speech, and heavy accents make the job harder. If you're building a production system, consider preprocessing audio with noise reduction before sending it to Gemini.
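
FFmpeg can do light cleanup during extraction. As an example, the variant below adds a high-pass filter to cut low-frequency rumble and FFmpeg's afftdn denoiser; treat the filter chain as a starting point to tune, not a recommended configuration:

import { execSync } from "child_process";

// Variant of extractAudio with light cleanup: an 80 Hz high-pass filter plus afftdn denoising.
// Tune or drop these filters depending on your source audio.
function extractCleanAudio(inputFile: string, outputFile: string) {
  execSync(
    `ffmpeg -i "${inputFile}" -vn -af "highpass=f=80,afftdn" ` +
      `-acodec mp3 -ab 192k -ar 44100 -y "${outputFile}"`,
    { stdio: "inherit" }
  );
}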

Speaker identification works best with distinct voices. Two people with similar vocal characteristics might get labeled inconsistently. The model uses voice patterns and speaking styles to figure out who's talking, not voice prints.

Complete Implementation

Here's the full code for the audio transcription pipeline. Choose the plain text version for simple use cases, or the structured version when you need type-safe, parseable output:

import dotenv from "dotenv";
dotenv.config();

import fs from "fs";
import { execSync } from "child_process";
import { google } from "@ai-sdk/google";
import { generateObject } from "ai";
import { z } from "zod";

async function extractAudio(inputFile: string, outputFile: string) {
  try {
    console.log(`Extracting audio from ${inputFile} to ${outputFile}...`);
    execSync(
      `ffmpeg -i "${inputFile}" -vn -acodec mp3 -ab 192k -ar 44100 -y "${outputFile}"`,
      {
        stdio: "inherit",
      },
    );
    console.log(`Audio extraction completed: ${outputFile}`);
  } catch (error) {
    console.error("Error extracting audio:", error);
    throw error;
  }
}

const schema = z.array(
  z.object({
    text: z.string().describe("The transcript of the audio"),
    timestamp: z.object({
      start: z.string().describe("The start time of the timestamp as [hh:mm:ss]"),
      end: z.string().describe("The end time of the timestamp as [hh:mm:ss]"),
    }).describe("The timestamp of the transcript"),
    speaker: z.string().describe("The speaker(s) of the transcript"),
  })
);

type TranscriptSchema = z.infer<typeof schema>;

async function transcribeAudio(audioFile: string) {
  console.log(`Transcribing ${audioFile}...`);
  const result = await generateObject({
    model: google("gemini-2.5-flash"),
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Generate a transcript of the audio given to you. Return timestamps [hh:mm:ss], speaker labels and the transcripts wherever possible",
          },
          {
            type: "file",
            data: fs.readFileSync(audioFile),
            mediaType: "audio/mpeg",
          },
        ],
      },
    ],
    schema: schema,
  });

  console.log("\n=== Transcript ===\n");
  console.log(result.object as TranscriptSchema);
  console.log("\n=== Token Usage ===");
  console.dir(result.usage, { depth: null });
}

async function main() {
  const videoFile = process.argv[2] || "./video.mp4";
  const audioFile = videoFile.replace(/\.(mp4|mov|avi|mkv)$/i, ".mp3");

  if (!fs.existsSync(videoFile)) {
    console.error(`Error: Video file "${videoFile}" not found.`);
    console.log("Usage: tsx transcribe.ts [path-to-video.mp4]");
    process.exit(1);
  }

  await extractAudio(videoFile, audioFile);
  await transcribeAudio(audioFile);
}

main().catch(console.error);