
Build Karaoke-Style Video Captions in the Browser with Whisper
Create word-level, karaoke-style captions entirely in the browser with Whisper running on WebGPU or WASM, then burn them into your videos with Mediabunny
Patrick the AI Engineer
You know those videos where each word lights up as it's spoken? That karaoke effect makes content more engaging and accessible. I wanted to build this for my own videos, but I didn't want to send audio files to a server every time. So I built a fully client-side solution that runs Whisper in the browser and burns the captions directly into the video file.
This tutorial shows you how to transcribe videos with word-level timing using Whisper (running on WebGPU or WebAssembly), and then export a new video with captions burned in using Mediabunny's Conversion API. Everything runs locally—no backend, no API costs, no data leaving your browser.
Getting Started
We'll need two main packages: @huggingface/transformers for running Whisper, and mediabunny for video processing.
npm install @huggingface/transformers mediabunny
The Whisper model weights download on first run and get cached in IndexedDB. The whisper-base model is around 140MB, so the initial load takes a minute or two.
Loading Whisper with WebGPU
Let's start by loading the speech recognition pipeline. We'll check if WebGPU is available and configure the model accordingly.
import { pipeline } from '@huggingface/transformers'
const ASR_MODEL = 'onnx-community/whisper-base_timestamped'
const prefersWebGPU = 'gpu' in navigator
const device = prefersWebGPU ? 'webgpu' : 'wasm'
We're checking for navigator.gpu to detect WebGPU support. Chrome and Edge have it, but Safari doesn't yet (as of early 2025).
const options: any = device === 'webgpu'
? { device: 'webgpu', dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' } }
: { device: 'wasm', dtype: 'q8' }
const transcriber = await pipeline('automatic-speech-recognition', ASR_MODEL, options)
The dtype settings control quantization. We use q4 (4-bit quantization) for the decoder on WebGPU and q8 (8-bit) for WASM. Lower precision means smaller memory footprint with minimal accuracy loss.
if (device === 'webgpu') {
await transcriber(new Float32Array(16_000), { language: 'en' })
}
This warm-up pass with one second of silent audio pre-compiles the WebGPU shaders. Without it, the first real transcription would pause for several seconds while shaders compile. It's one of those things you only learn from experience.
Extracting Audio from Video
Mediabunny handles video demuxing and audio extraction. It takes any video format and outputs the audio resampled to 16kHz, which is what Whisper expects.
import { Input, ALL_FORMATS, BlobSource, Conversion, WavOutputFormat,
BufferTarget, Output, Mp4OutputFormat } from 'mediabunny'
async function extractAudio(file: File): Promise<Float32Array> {
const input = new Input({
source: new BlobSource(file),
formats: ALL_FORMATS
})
const output = new Output({
format: new WavOutputFormat(),
target: new BufferTarget()
})
We're creating an input from the video file blob and an output that writes a WAV file to an in-memory buffer. The ALL_FORMATS option lets Mediabunny detect and demux any container format it supports; the actual decoding runs on the browser's WebCodecs support.
const conversion = await Conversion.init({
input,
output,
audio: { sampleRate: 16000 }
})
await conversion.execute()
The conversion runs entirely in the browser using WebCodecs. We're downsampling to 16kHz because that's what Whisper was trained on.
const wavBuffer: ArrayBuffer = (output.target as any).buffer
const audioContext = new AudioContext({ sampleRate: 16000 })
const audioBuffer = await audioContext.decodeAudioData(wavBuffer)
const mono = new Float32Array(audioBuffer.length)
audioBuffer.copyFromChannel(mono, 0)
return mono
}
We decode the WAV buffer into a Float32Array using the Web Audio API. Whisper needs mono audio as a raw float array, so we extract just the first channel.
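One caveat: copying only channel 0 discards anything panned to the other channels in a stereo mix. If that matters for your footage, a small alternative (my own sketch, not part of the original code) averages all channels into mono:
// Average every channel into a single mono track instead of taking channel 0.
function downmixToMono(buffer: AudioBuffer): Float32Array {
  const mono = new Float32Array(buffer.length)
  const channels = buffer.numberOfChannels
  for (let ch = 0; ch < channels; ch++) {
    const data = buffer.getChannelData(ch)
    for (let i = 0; i < buffer.length; i++) {
      mono[i] += data[i] / channels
    }
  }
  return mono
}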
Transcribing in Chunks
You can't just throw a 10-minute audio file at Whisper and expect instant results. We process it in 30-second chunks to keep memory reasonable and show progress.
async function transcribe(transcriber: any, audio: Float32Array) {
const SAMPLE_RATE = 16000
const CHUNK_SECONDS = 30
const SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS
const result: any = { text: '', chunks: [] }
We're going to accumulate the full transcript text and all the word-level chunks as we process each 30-second window.
for (let start = 0; start < audio.length; start += SAMPLES_PER_CHUNK) {
const end = Math.min(start + SAMPLES_PER_CHUNK, audio.length)
const chunk = audio.subarray(start, end)
const offset = start / SAMPLE_RATE
The offset is critical here. Each chunk's timestamps are relative to that chunk, so we need to add the offset to make them relative to the full audio timeline.
const output: any = await transcriber(chunk, {
language: 'en',
return_timestamps: 'word',
chunk_length_s: CHUNK_SECONDS
})
The return_timestamps: 'word' option is what gives us per-word timing instead of just sentence-level timestamps. That's the difference between static subtitles and karaoke-style highlighting.
if (output?.text) {
result.text += (result.text ? ' ' : '') + output.text
}
for (const c of output?.chunks || []) {
const [s, e] = c.timestamp
result.chunks.push({
text: c.text,
timestamp: [s + offset, e + offset]
})
}
}
return result
}
We're stitching the chunks back together, adjusting each word's timestamp by adding the chunk offset. This gives us a single continuous timeline for the entire video.
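For reference, the stitched result looks like this (the values below are illustrative, not real output):
// Illustrative shape only; times are seconds on the full video timeline.
const example = {
  text: 'Welcome to the channel. Today we are building captions.',
  chunks: [
    { text: ' Welcome', timestamp: [0.0, 0.42] },
    { text: ' to', timestamp: [0.42, 0.58] },
    { text: ' the', timestamp: [0.58, 0.7] }
    // ...one entry per word, offsets already applied
  ]
}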
Rendering Live Captions
For the live preview, we'll render words as HTML spans and highlight them as the video plays.
const videoEl = document.getElementById('video') as HTMLVideoElement
const wordsEl = document.getElementById('words') as HTMLDivElement
function renderWords(result: any) {
wordsEl.innerHTML = ''
for (const chunk of result.chunks) {
const span = document.createElement('span')
span.textContent = chunk.text
span.dataset.start = String(chunk.timestamp[0])
span.dataset.end = String(chunk.timestamp[1])
wordsEl.appendChild(span)
}
}
Each word becomes a <span> with its timing data stored in data- attributes. Simple DOM manipulation.
function updateHighlight() {
const currentTime = videoEl.currentTime
for (const span of wordsEl.children) {
const el = span as HTMLElement
const start = Number(el.dataset.start)
const end = Number(el.dataset.end)
const isActive = currentTime >= start && currentTime <= end
el.style.background = isActive ? '#3498db' : ''
el.style.color = isActive ? '#fff' : ''
}
}
videoEl.addEventListener('timeupdate', updateHighlight)
On every timeupdate event (which fires a few times per second during playback), we check which word's timestamp range contains the current playback time and style it accordingly. It's not the most efficient approach, but for a few hundred words it's fine.
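If you ever need it to scale, one option (my own sketch, not from the original code) is to remember the last active word and only scan forward from it, resetting when the user seeks:
// Track the last active word and walk forward from it on each timeupdate.
let lastActiveIndex = 0

videoEl.addEventListener('seeking', () => {
  // Seeks can jump backwards, so clear everything and start over.
  lastActiveIndex = 0
  for (const child of wordsEl.children) {
    const el = child as HTMLElement
    el.style.background = ''
    el.style.color = ''
  }
})

function updateHighlightIncremental() {
  const t = videoEl.currentTime
  const spans = wordsEl.children
  for (let i = lastActiveIndex; i < spans.length; i++) {
    const el = spans[i] as HTMLElement
    const start = Number(el.dataset.start)
    if (t < start) break // later words can't be active yet
    const isActive = t <= Number(el.dataset.end)
    el.style.background = isActive ? '#3498db' : ''
    el.style.color = isActive ? '#fff' : ''
    if (isActive) lastActiveIndex = i
  }
}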
Burning Captions into Video
The real magic is exporting a new video with captions burned in. Mediabunny's new Conversion API makes this surprisingly straightforward.
async function exportVideoWithCaptions(
videoFile: File,
transcript: any
) {
const input = new Input({
source: new BlobSource(videoFile),
formats: ALL_FORMATS
})
const totalDuration = await input.computeDuration()
We create an input from the original video file and get its duration for progress reporting.
const target = new BufferTarget()
const output = new Output({
format: new Mp4OutputFormat({ fastStart: 'in-memory' }),
target
})
We're outputting to an in-memory buffer. The fastStart option puts the MP4 metadata at the beginning of the file, which means it can start playing before the entire file downloads.
let ctx: OffscreenCanvasRenderingContext2D | null = null
const conversion = await Conversion.init({
input,
output,
video: {
process: (sample) => {
if (!ctx) {
const canvas = new OffscreenCanvas(
sample.displayWidth,
sample.displayHeight
)
ctx = canvas.getContext('2d')!
}
The video.process callback runs for every frame. We lazily create an OffscreenCanvas sized to match the video dimensions on the first frame.
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height)
sample.draw(ctx, 0, 0)
We draw the original frame to the canvas. The sample.draw() method handles all the complexity of decoding and rotation.
const timestamp = sample.timestamp + sample.duration / 2
drawCaptionsOverlay(ctx, ctx.canvas.width, ctx.canvas.height, timestamp, transcript)
return ctx.canvas
}
}
})
We call our caption drawing function with the frame's midpoint timestamp, then return the canvas. Mediabunny encodes whatever we return as the output frame. This is where the captions get burned into the video.
await conversion.execute()
const arrayBuffer: ArrayBuffer = (target as any).buffer
const blob = new Blob([arrayBuffer], { type: 'video/mp4' })
return URL.createObjectURL(blob)
}
Once the conversion finishes, we have an MP4 file in memory. We wrap it in a Blob and create an object URL for download or preview.
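One small addition worth considering (my own sketch): trigger the save programmatically and revoke the object URL after a delay, since the exported MP4 sits in memory until the URL is released.
// Hypothetical helper: start the download, then free the object URL later so
// the browser can release the in-memory MP4 buffer.
function downloadExport(url: string, filename = 'captions.mp4') {
  const a = document.createElement('a')
  a.href = url
  a.download = filename
  a.click()
  setTimeout(() => URL.revokeObjectURL(url), 10_000)
}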
Drawing Captions on Canvas
The caption overlay logic is where we decide what words to show and how to style them.
function drawCaptionsOverlay(
ctx: CanvasRenderingContext2D | OffscreenCanvasRenderingContext2D,
width: number,
height: number,
timestamp: number,
transcript: any
) {
const sentence = findSentenceAtTimestamp(transcript, timestamp)
if (!sentence || sentence.words.length === 0) return
We group words into sentences based on punctuation and gaps in speech. If no sentence is active at this timestamp, we don't draw anything.
const fontSize = Math.max(18, Math.round(height * 0.04))
const gap = Math.max(12, Math.round(width * 0.02))
ctx.font = `${fontSize}px ui-sans-serif, system-ui, sans-serif`
ctx.textBaseline = 'middle'
ctx.shadowColor = 'rgba(0,0,0,0.6)'
ctx.shadowBlur = 6
We scale the font size and spacing based on video dimensions so captions look good on any resolution. The shadow gives them readability even on complex backgrounds.
const visibleWords = getVisibleWordsForSentence(sentence, timestamp, 3)
We only show 3 words at a time—the active word and its neighbors. Showing more crowds the frame. This function handles the windowing logic to keep the visible words moving smoothly through the sentence; a sketch of it appears just after the drawing code below.
const widths = visibleWords.map((w: any) => ctx.measureText(w.text).width)
const totalWidth = widths.reduce((a: number, b: number) => a + b, 0) + gap * (visibleWords.length - 1)
let x = (width - totalWidth) / 2
const y = height - Math.max(56, Math.round(height * 0.12))
for (let i = 0; i < visibleWords.length; i++) {
const word = visibleWords[i]
const isActive = timestamp >= word.timestamp[0] && timestamp <= word.timestamp[1]
// Pill background: blue when active, translucent dark otherwise
ctx.fillStyle = isActive ? '#3498db' : 'rgba(0,0,0,0.35)'
// (rounded-rectangle path omitted here; see the complete component below)
ctx.fillStyle = '#ffffff'
ctx.fillText(word.text.toUpperCase(), x, y)
x += widths[i] + gap
}
}
We draw each word with a pill-shaped background. Active words get a blue background, inactive words get a subtle dark background. Uppercase makes them more prominent.
The rounded rectangle drawing code is tedious canvas arc math that I'm not showing here, but you get the idea. We're compositing text over the video frame, styled to look like professional captions.
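If you only target recent browsers, the built-in roundRect() can stand in for the hand-written path math. A sketch, assuming roundRect is available (Chrome 99+, Safari 16+, Firefox 112+):
// Draw a pill-shaped background using the built-in roundRect().
function drawPill(
  ctx: CanvasRenderingContext2D | OffscreenCanvasRenderingContext2D,
  x: number, y: number, w: number, h: number, radius: number, fill: string
) {
  ctx.fillStyle = fill
  ctx.beginPath()
  ctx.roundRect(x, y, w, h, radius)
  ctx.fill()
}
The windowing helper referenced earlier can look like this (the same logic appears in the complete component at the end of the post):
// Slide a fixed-size window of words through the sentence, anchored on the
// word being spoken right now.
function getVisibleWordsForSentence(sentence: any, ts: number, maxWords = 3) {
  if (!sentence || sentence.words.length <= maxWords) return sentence?.words || []
  // Index of the word whose range contains the current time.
  const activeIdx = sentence.words.findIndex(
    (w: any) => ts >= w.timestamp[0] && ts <= w.timestamp[1]
  )
  // Between words? Fall back to the next word that hasn't ended yet.
  const idx = activeIdx >= 0
    ? activeIdx
    : sentence.words.findIndex((w: any) => ts < w.timestamp[1])
  // Snap to the start of the group of maxWords containing that index.
  const segmentStart = Math.max(
    0,
    Math.floor((idx >= 0 ? idx : sentence.words.length - 1) / maxWords) * maxWords
  )
  return sentence.words.slice(segmentStart, segmentStart + maxWords)
}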
Grouping Words into Sentences
The sentence grouping logic helps us show contextually appropriate windows of words.
function groupWordsIntoSentences(chunks: any[]) {
const END_PUNCT_REGEX = /[.!?]$/
const MAX_GAP_SECONDS = 0.8
const sentences = []
let currentWords: any[] = []
We'll end a sentence if we hit punctuation or if there's a long gap between words.
for (let i = 0; i < chunks.length; i++) {
const word = chunks[i]
currentWords.push(word)
const next = chunks[i + 1]
const endsByPunctuation = END_PUNCT_REGEX.test(word.text)
const gapSeconds = next ? next.timestamp[0] - word.timestamp[1] : 0
const endsByGap = next ? gapSeconds > MAX_GAP_SECONDS : true
We check if the word ends with ., !, or ?, or if the gap to the next word is more than 0.8 seconds. Those are natural sentence boundaries.
if (endsByPunctuation || endsByGap || !next) {
sentences.push({
words: currentWords.slice(),
start: currentWords[0].timestamp[0],
end: currentWords[currentWords.length - 1].timestamp[1]
})
currentWords = []
}
}
return sentences
}
Each sentence object contains its words and a start/end timestamp range. This makes it easy to find which sentence is active at any given time.
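A matching lookup helper (a sketch that fits the call signature used earlier; the full component caches the grouped sentences in a computed instead of regrouping on every frame):
// Return the sentence whose [start, end] range contains the timestamp,
// or null if nothing is being spoken at that moment.
function findSentenceAtTimestamp(transcript: any, ts: number) {
  const sentences = groupWordsIntoSentences(transcript.chunks)
  return sentences.find((s: any) => ts >= s.start && ts <= s.end) || null
}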
Adapting for Vue
The Vue version uses reactivity to manage state instead of manual DOM updates. The core transcription logic stays the same.
<script setup lang="ts">
import { ref, reactive, onMounted } from 'vue'
import { pipeline, type AutomaticSpeechRecognitionPipeline } from '@huggingface/transformers'
const isLoading = ref(false)
const isLoaded = ref(false)
const loadingProgress = ref('')
const videoSrc = ref('')
const results = reactive({ transcript: null as any })
let transcriber: AutomaticSpeechRecognitionPipeline | null = null
All our state is reactive refs. When we update loadingProgress or results.transcript, Vue automatically updates the UI.
async function loadModels() {
isLoading.value = true
const prefersWebGPU = 'gpu' in navigator
const device = prefersWebGPU ? 'webgpu' : 'wasm'
const progress = (p: any) => {
if (p.status === 'downloading') {
loadingProgress.value = `Downloading ${p.name}: ${Math.round(p.progress || 0)}%`
}
}
const options: any = device === 'webgpu'
? { device, dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' }, progress_callback: progress }
: { device, dtype: 'q8', progress_callback: progress }
transcriber = await pipeline('automatic-speech-recognition', ASR_MODEL, options)
if (device === 'webgpu') {
await transcriber(new Float32Array(16_000), { language: 'en' })
}
isLoaded.value = true
isLoading.value = false
}
onMounted(() => loadModels())
We load the model on component mount and use the progress callback to update the loading status reactively. This is where Vue shines—no manual textContent updates.
<template>
<div>
<div v-if="isLoading">
<p>{{ loadingProgress }}</p>
</div>
<input v-if="isLoaded" type="file" accept="video/*"
@change="handleVideoUpload" />
<video v-if="videoSrc" ref="videoElement" :src="videoSrc" controls
@timeupdate="currentTime = videoElement!.currentTime" />
<div v-if="results.transcript">
<span v-for="(chunk, i) in results.transcript.chunks" :key="i"
:class="{ active: isWordActive(chunk) }">
{{ chunk.text }}
</span>
</div>
</div>
</template>
We loop over chunks with v-for, bind the highlight class conditionally based on currentTime, and let Vue handle re-rendering. The timeupdate event updates currentTime, which triggers the class binding to re-evaluate.
function isWordActive(chunk: any): boolean {
const [start, end] = chunk.timestamp
return currentTime.value >= start && currentTime.value <= end
}
This is cleaner than the vanilla approach where we manually looped through DOM elements every frame.
Export Button Integration
In the Vue component, we add a button to trigger the export.
<template>
<button
:disabled="!results.transcript || isExporting"
@click="exportVideoWithCaptions">
<span v-if="!isExporting">Export MP4 with Captions</span>
<span v-else>Exporting… {{ exportProgress }}%</span>
</button>
<a v-if="exportedVideoUrl"
:href="exportedVideoUrl"
download="captions.mp4">
Download MP4
</a>
</template>
The button is disabled until we have a transcript and while exporting is in progress. We show real-time export progress as a percentage.
const isExporting = ref(false)
const exportProgress = ref(0)
const exportedVideoUrl = ref<string | null>(null)
async function exportVideoWithCaptions() {
if (!videoFile.value || !results.transcript) return
isExporting.value = true
exportProgress.value = 0
try {
// ... conversion code from before ...
const conversion = await Conversion.init({
input,
output,
video: {
process: (sample) => {
// ... drawing logic ...
exportProgress.value = Math.round((sample.timestamp / totalDuration) * 100)
return ctx.canvas
}
}
})
await conversion.execute()
exportedVideoUrl.value = URL.createObjectURL(blob)
} finally {
isExporting.value = false
}
}
We update exportProgress inside the frame processing callback. Since it's a reactive ref, the UI updates automatically.
Practical Considerations
The first-load experience is the biggest UX challenge. Downloading 140MB of model weights takes time. Clear progress indicators help, but you can't make the download faster.
Transcription speed depends on WebGPU availability. On my M1 Mac, a 5-minute video transcribes in about 30 seconds with WebGPU. On WASM, it's closer to 2-3 minutes. Video export takes similar time since it's processing every frame.
Safari doesn't support WebGPU yet (as of early 2025), so it falls back to WASM automatically. You might want to show a notice recommending Chrome or Edge for better performance.
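A minimal version of that notice, keyed off the same navigator.gpu check used for device selection (the wording and the statusEl element are my own assumptions):
// Show a hint when WebGPU isn't available and transcription will use WASM.
if (!('gpu' in navigator)) {
  statusEl.textContent =
    'WebGPU not detected: transcription will run on WASM and be slower. Chrome or Edge is recommended.'
}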
Error handling should account for model loading failures, unsupported video codecs, and memory issues on lower-end devices. Wrapping the transcription and export in try-catch blocks and showing user-friendly messages helps.
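As a sketch of what that could look like (the error message wording is mine; extractAudio and transcribe are the functions from earlier, and statusEl is whatever element you use for status text):
// Wrap the pipeline steps so failures surface in the UI instead of the console.
async function transcribeSafely(file: File) {
  try {
    statusEl.textContent = 'Extracting audio...'
    const audio = await extractAudio(file)
    statusEl.textContent = 'Transcribing...'
    return await transcribe(transcriber, audio)
  } catch (err) {
    console.error(err)
    statusEl.textContent =
      'Transcription failed. The video codec may be unsupported, or the device ran out of memory.'
    return null
  }
}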
One thing I learned: OffscreenCanvas isn't available in older browsers. You need a fallback that creates a regular canvas element if typeof OffscreenCanvas === 'undefined'. The component handles this, but it's worth noting.
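The fallback is easy to wrap in a small helper (this mirrors what the complete component below does inline):
// Prefer OffscreenCanvas; fall back to a detached <canvas> element.
function createCanvas(width: number, height: number) {
  if (typeof OffscreenCanvas !== 'undefined') {
    return new OffscreenCanvas(width, height)
  }
  const canvas = document.createElement('canvas')
  canvas.width = width
  canvas.height = height
  return canvas
}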
Wrapping Up
You've got a fully client-side karaoke caption system that runs in the browser and exports videos with captions burned in. No servers, no API keys, no monthly costs. The Mediabunny Conversion API makes video processing straightforward, and Whisper gives you accurate word-level timing.
The complete Vue component, for reference:
<script setup lang="ts">
import { ref, reactive, computed, onMounted } from 'vue'
import { pipeline, type AutomaticSpeechRecognitionPipeline } from '@huggingface/transformers'
import { Input, ALL_FORMATS, BlobSource, Conversion, WavOutputFormat,
BufferTarget, Output, Mp4OutputFormat } from 'mediabunny'
const ASR_MODEL = 'onnx-community/whisper-base_timestamped'
// State
const isLoading = ref(false)
const isLoaded = ref(false)
const loadingProgress = ref('')
const isTranscribing = ref(false)
const transcriptionProgress = ref(0)
const videoFile = ref<File | null>(null)
const videoSrc = ref('')
const currentTime = ref(0)
const videoElement = ref<HTMLVideoElement | null>(null)
const isExporting = ref(false)
const exportProgress = ref(0)
const exportedVideoUrl = ref<string | null>(null)
const results = reactive({ transcript: null as any })
let transcriber: AutomaticSpeechRecognitionPipeline | null = null
// Load model
async function loadModels() {
isLoading.value = true
const prefersWebGPU = 'gpu' in navigator
const device = prefersWebGPU ? 'webgpu' : 'wasm'
const progress = (p: any) => {
if (p.status === 'downloading') {
loadingProgress.value = `Downloading ${p.name}: ${Math.round(p.progress || 0)}%`
} else if (p.status === 'loading') {
loadingProgress.value = `Loading ${p.name}...`
}
}
const options: any = device === 'webgpu'
? { device, dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' }, progress_callback: progress }
: { device, dtype: 'q8', progress_callback: progress }
transcriber = await pipeline('automatic-speech-recognition', ASR_MODEL, options)
if (device === 'webgpu') {
loadingProgress.value = 'Warming up model...'
await transcriber(new Float32Array(16_000), { language: 'en' })
}
isLoaded.value = true
isLoading.value = false
}
// Extract audio from video
async function extractAudio(file: File): Promise<Float32Array> {
const input = new Input({ source: new BlobSource(file), formats: ALL_FORMATS })
const output = new Output({ format: new WavOutputFormat(), target: new BufferTarget() })
const conversion = await Conversion.init({ input, output, audio: { sampleRate: 16000 } })
await conversion.execute()
const wavBuffer: ArrayBuffer = (output.target as any).buffer
const audioContext = new AudioContext({ sampleRate: 16000 })
const audioBuffer = await audioContext.decodeAudioData(wavBuffer)
const mono = new Float32Array(audioBuffer.length)
audioBuffer.copyFromChannel(mono, 0)
return mono
}
// Transcribe audio in chunks
async function processAudio(audio: Float32Array) {
if (!transcriber) return
isTranscribing.value = true
const SAMPLE_RATE = 16000
const CHUNK_SECONDS = 30
const SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS
const acc: any = { text: '', chunks: [] }
for (let start = 0; start < audio.length; start += SAMPLES_PER_CHUNK) {
const end = Math.min(start + SAMPLES_PER_CHUNK, audio.length)
const chunk = audio.subarray(start, end)
const offset = start / SAMPLE_RATE
const r: any = await transcriber(chunk, {
language: 'en',
return_timestamps: 'word',
chunk_length_s: CHUNK_SECONDS
})
if (r?.text) acc.text += (acc.text ? ' ' : '') + r.text
for (const c of r?.chunks || []) {
const [s, e] = c.timestamp
acc.chunks.push({ text: c.text, timestamp: [s + offset, e + offset] })
}
transcriptionProgress.value = Math.round((end / audio.length) * 100)
results.transcript = { ...acc }
}
isTranscribing.value = false
}
// Process video file
async function processVideoFile(file: File) {
videoFile.value = file
videoSrc.value = URL.createObjectURL(file)
const audio = await extractAudio(file)
await processAudio(audio)
}
// Group words into sentences
function groupIntoSentences(chunks: any[]) {
const END_PUNCT_REGEX = /[.!?]$/
const MAX_GAP_SECONDS = 0.8
const sentences = []
let currentWords: any[] = []
for (let i = 0; i < chunks.length; i++) {
const word = chunks[i]
currentWords.push(word)
const next = chunks[i + 1]
const endsByPunctuation = END_PUNCT_REGEX.test(word.text)
const gapSeconds = next ? next.timestamp[0] - word.timestamp[1] : 0
const endsByGap = next ? gapSeconds > MAX_GAP_SECONDS : true
if (endsByPunctuation || endsByGap || !next) {
sentences.push({
words: currentWords.slice(),
start: currentWords[0].timestamp[0],
end: currentWords[currentWords.length - 1].timestamp[1]
})
currentWords = []
}
}
return sentences
}
const sentences = computed(() => {
if (!results.transcript?.chunks?.length) return []
return groupIntoSentences(results.transcript.chunks)
})
// Find sentence at timestamp
function findSentenceAtTimestamp(ts: number) {
return sentences.value.find(s => ts >= s.start && ts <= s.end) || null
}
// Get visible words for sentence
function getVisibleWordsForSentence(sentence: any, ts: number, maxWords = 3) {
if (!sentence || sentence.words.length <= maxWords) return sentence?.words || []
const activeIdx = sentence.words.findIndex((w: any) => ts >= w.timestamp[0] && ts <= w.timestamp[1])
const idx = activeIdx >= 0 ? activeIdx :
sentence.words.findIndex((w: any) => ts < w.timestamp[1])
const segmentStart = Math.max(0, Math.floor((idx >= 0 ? idx : sentence.words.length - 1) / maxWords) * maxWords)
return sentence.words.slice(segmentStart, segmentStart + maxWords)
}
// Draw captions overlay
function drawCaptionsOverlay(
ctx: CanvasRenderingContext2D | OffscreenCanvasRenderingContext2D,
width: number,
height: number,
timestamp: number
) {
const sentence = findSentenceAtTimestamp(timestamp)
if (!sentence?.words?.length) return
const fontSize = Math.max(18, Math.round(height * 0.04))
const gap = Math.max(12, Math.round(width * 0.02))
const paddingX = Math.max(8, Math.round(width * 0.01))
const paddingY = Math.max(6, Math.round(height * 0.008))
ctx.save()
ctx.font = `${fontSize}px ui-sans-serif, system-ui, sans-serif`
ctx.textBaseline = 'middle'
ctx.shadowColor = 'rgba(0,0,0,0.6)'
ctx.shadowBlur = 6
const visibleWords = getVisibleWordsForSentence(sentence, timestamp, 3)
const measurements = visibleWords.map((w: any) => ({
text: w.text,
width: ctx.measureText(w.text).width,
chunk: w
}))
const totalWidth = measurements.reduce((a, b) => a + b.width, 0) + gap * (measurements.length - 1)
let x = (width - totalWidth) / 2
const y = height - Math.max(56, Math.round(height * 0.12))
for (const m of measurements) {
const isActive = timestamp >= m.chunk.timestamp[0] && timestamp <= m.chunk.timestamp[1]
// Draw background pill
const bgX = x - paddingX * 0.5
const bgW = m.width + paddingX
const bgH = fontSize + paddingY * 0.7
ctx.fillStyle = isActive ? '#3498db' : 'rgba(0,0,0,0.35)'
ctx.beginPath()
const r = Math.min(10, Math.round(fontSize * 0.35))
ctx.moveTo(bgX + r, y)
ctx.lineTo(bgX + bgW - r, y)
ctx.quadraticCurveTo(bgX + bgW, y, bgX + bgW, y + r)
ctx.lineTo(bgX + bgW, y + bgH - r)
ctx.quadraticCurveTo(bgX + bgW, y + bgH, bgX + bgW - r, y + bgH)
ctx.lineTo(bgX + r, y + bgH)
ctx.quadraticCurveTo(bgX, y + bgH, bgX, y + bgH - r)
ctx.lineTo(bgX, y + r)
ctx.quadraticCurveTo(bgX, y, bgX + r, y)
ctx.closePath()
ctx.fill()
// Draw text
ctx.fillStyle = '#ffffff'
ctx.fillText(m.text.toUpperCase(), x, y + bgH / 2)
x += m.width + gap
}
ctx.restore()
}
// Export video with captions
async function exportVideoWithCaptions() {
if (!videoFile.value || !results.transcript) return
isExporting.value = true
exportProgress.value = 0
exportedVideoUrl.value = null
try {
const input = new Input({ source: new BlobSource(videoFile.value), formats: ALL_FORMATS })
const totalDuration = await input.computeDuration()
const target = new BufferTarget()
const output = new Output({ format: new Mp4OutputFormat({ fastStart: 'in-memory' }), target })
let ctx: CanvasRenderingContext2D | OffscreenCanvasRenderingContext2D | null = null
const conversion = await Conversion.init({
input,
output,
video: {
process: (sample) => {
if (!ctx) {
const canvas = typeof OffscreenCanvas !== 'undefined'
? new OffscreenCanvas(sample.displayWidth, sample.displayHeight)
: (() => {
const c = document.createElement('canvas')
c.width = sample.displayWidth
c.height = sample.displayHeight
return c
})()
ctx = canvas.getContext('2d', { alpha: false })!
}
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height)
sample.draw(ctx, 0, 0)
const timestamp = sample.timestamp + sample.duration / 2
drawCaptionsOverlay(ctx, ctx.canvas.width, ctx.canvas.height, timestamp)
exportProgress.value = Math.round((sample.timestamp / totalDuration) * 100)
return ctx.canvas
}
}
})
await conversion.execute()
const arrayBuffer: ArrayBuffer = (target as any).buffer
const blob = new Blob([arrayBuffer], { type: 'video/mp4' })
exportedVideoUrl.value = URL.createObjectURL(blob)
} finally {
isExporting.value = false
}
}
// Check if word is active
function isWordActive(chunk: any): boolean {
return currentTime.value >= chunk.timestamp[0] && currentTime.value <= chunk.timestamp[1]
}
onMounted(() => loadModels())
</script>
<template>
<div>
<div v-if="isLoading">
<p>{{ loadingProgress }}</p>
</div>
<input v-if="isLoaded" type="file" accept="video/*"
@change="(e: any) => e.target.files?.[0] && processVideoFile(e.target.files[0])" />
<div v-if="isTranscribing">
<p>Transcribing... {{ transcriptionProgress }}%</p>
</div>
<video v-if="videoSrc" ref="videoElement" :src="videoSrc" controls playsinline
@timeupdate="currentTime = videoElement!.currentTime" />
<div v-if="results.transcript">
<span v-for="(chunk, i) in results.transcript.chunks" :key="i"
:class="{ active: isWordActive(chunk) }">
{{ chunk.text }}
</span>
</div>
<button v-if="results.transcript"
:disabled="isExporting"
@click="exportVideoWithCaptions">
{{ isExporting ? `Exporting… ${exportProgress}%` : 'Export MP4 with Captions' }}
</button>
<a v-if="exportedVideoUrl"
:href="exportedVideoUrl"
download="captions.mp4">
Download MP4
</a>
</div>
</template>
<style scoped>
span {
padding: 4px 6px;
margin: 2px;
display: inline-block;
text-transform: uppercase;
}
span.active {
background: #3498db;
color: #fff;
font-weight: 600;
}
</style>
The equivalent vanilla TypeScript version (main.ts):
import { pipeline } from '@huggingface/transformers'
import { Input, ALL_FORMATS, BlobSource, Conversion, WavOutputFormat,
BufferTarget, Output, Mp4OutputFormat } from 'mediabunny'
const ASR_MODEL = 'onnx-community/whisper-base_timestamped'
const videoEl = document.getElementById('video') as HTMLVideoElement
const inputEl = document.getElementById('video-input') as HTMLInputElement
const statusEl = document.getElementById('status')!
const wordsEl = document.getElementById('words') as HTMLDivElement
const exportBtn = document.getElementById('export') as HTMLButtonElement
const downloadLink = document.getElementById('download') as HTMLAnchorElement
let results: any = null
let videoFile: File | null = null
async function loadASR() {
statusEl.textContent = 'Loading model...'
const prefersWebGPU = 'gpu' in navigator
const options: any = prefersWebGPU
? { device: 'webgpu', dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' } }
: { device: 'wasm', dtype: 'q8' }
const asr = await pipeline('automatic-speech-recognition', ASR_MODEL, options)
if (prefersWebGPU) await asr(new Float32Array(16_000), { language: 'en' })
statusEl.textContent = 'Ready'
return asr
}
async function extractAudio(file: File): Promise<Float32Array> {
const input = new Input({ source: new BlobSource(file), formats: ALL_FORMATS })
const output = new Output({ format: new WavOutputFormat(), target: new BufferTarget() })
const conversion = await Conversion.init({ input, output, audio: { sampleRate: 16000 } })
await conversion.execute()
const wavBuffer: ArrayBuffer = (output.target as any).buffer
const audioContext = new AudioContext({ sampleRate: 16000 })
const audioBuffer = await audioContext.decodeAudioData(wavBuffer)
const mono = new Float32Array(audioBuffer.length)
audioBuffer.copyFromChannel(mono, 0)
return mono
}
async function transcribe(asr: any, audio: Float32Array) {
const SAMPLE_RATE = 16000
const CHUNK_SECONDS = 30
const SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS
const result: any = { text: '', chunks: [] }
for (let start = 0; start < audio.length; start += SAMPLES_PER_CHUNK) {
const end = Math.min(start + SAMPLES_PER_CHUNK, audio.length)
const chunk = audio.subarray(start, end)
const offset = start / SAMPLE_RATE
const output: any = await asr(chunk, {
language: 'en',
return_timestamps: 'word',
chunk_length_s: CHUNK_SECONDS
})
if (output?.text) result.text += (result.text ? ' ' : '') + output.text
for (const c of output?.chunks || []) {
const [s, e] = c.timestamp
result.chunks.push({ text: c.text, timestamp: [s + offset, e + offset] })
}
}
return result
}
function groupIntoSentences(chunks: any[]) {
const END_PUNCT_REGEX = /[.!?]$/
const MAX_GAP_SECONDS = 0.8
const sentences = []
let currentWords: any[] = []
for (let i = 0; i < chunks.length; i++) {
const word = chunks[i]
currentWords.push(word)
const next = chunks[i + 1]
const endsByPunctuation = END_PUNCT_REGEX.test(word.text)
const gapSeconds = next ? next.timestamp[0] - word.timestamp[1] : 0
const endsByGap = next ? gapSeconds > MAX_GAP_SECONDS : true
if (endsByPunctuation || endsByGap || !next) {
sentences.push({
words: currentWords.slice(),
start: currentWords[0].timestamp[0],
end: currentWords[currentWords.length - 1].timestamp[1]
})
currentWords = []
}
}
return sentences
}
function renderWords(result: any) {
wordsEl.innerHTML = ''
for (const chunk of result.chunks) {
const span = document.createElement('span')
span.textContent = chunk.text
span.dataset.start = String(chunk.timestamp[0])
span.dataset.end = String(chunk.timestamp[1])
span.style.cssText = 'padding:4px 6px; margin:2px; display:inline-block; text-transform:uppercase;'
wordsEl.appendChild(span)
}
}
function updateHighlight() {
const currentTime = videoEl.currentTime
for (const span of wordsEl.children) {
const el = span as HTMLElement
const start = Number(el.dataset.start)
const end = Number(el.dataset.end)
const isActive = currentTime >= start && currentTime <= end
el.style.background = isActive ? '#3498db' : ''
el.style.color = isActive ? '#fff' : ''
}
}
async function exportVideo() {
if (!videoFile || !results) return
exportBtn.disabled = true
exportBtn.textContent = 'Exporting...'
const input = new Input({ source: new BlobSource(videoFile), formats: ALL_FORMATS })
const totalDuration = await input.computeDuration()
const target = new BufferTarget()
const output = new Output({ format: new Mp4OutputFormat({ fastStart: 'in-memory' }), target })
const sentences = groupIntoSentences(results.chunks)
let ctx: CanvasRenderingContext2D | null = null
const conversion = await Conversion.init({
input,
output,
video: {
process: (sample) => {
if (!ctx) {
const canvas = new OffscreenCanvas(sample.displayWidth, sample.displayHeight)
ctx = canvas.getContext('2d')!
}
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height)
sample.draw(ctx, 0, 0)
const timestamp = sample.timestamp + sample.duration / 2
const sentence = sentences.find(s => timestamp >= s.start && timestamp <= s.end)
if (sentence) {
const fontSize = Math.max(18, Math.round(ctx.canvas.height * 0.04))
const gap = Math.max(12, Math.round(ctx.canvas.width * 0.02))
ctx.font = `${fontSize}px sans-serif`
ctx.textBaseline = 'middle'
ctx.shadowColor = 'rgba(0,0,0,0.6)'
ctx.shadowBlur = 6
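// Simplified for the vanilla demo: show the sentence's first three words at a
// fixed left offset instead of the centered, windowed layout from the Vue version.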
const visibleWords = sentence.words.slice(0, 3)
let x = 50
const y = ctx.canvas.height - 80
for (const word of visibleWords) {
const isActive = timestamp >= word.timestamp[0] && timestamp <= word.timestamp[1]
ctx.fillStyle = isActive ? '#3498db' : 'rgba(0,0,0,0.35)'
const width = ctx.measureText(word.text).width
ctx.fillRect(x - 5, y - fontSize / 2 - 3, width + 10, fontSize + 6)
ctx.fillStyle = '#fff'
ctx.fillText(word.text.toUpperCase(), x, y)
x += width + gap
}
}
const progress = Math.round((sample.timestamp / totalDuration) * 100)
exportBtn.textContent = `Exporting... ${progress}%`
return ctx.canvas
}
}
})
await conversion.execute()
const arrayBuffer: ArrayBuffer = (target as any).buffer
const blob = new Blob([arrayBuffer], { type: 'video/mp4' })
const url = URL.createObjectURL(blob)
downloadLink.href = url
downloadLink.style.display = 'inline'
exportBtn.textContent = 'Export Complete'
}
;(async () => {
const asr = await loadASR()
inputEl.addEventListener('change', async (e) => {
const file = (e.target as HTMLInputElement).files?.[0]
if (!file) return
videoFile = file
videoEl.src = URL.createObjectURL(file)
statusEl.textContent = 'Extracting audio...'
const audio = await extractAudio(file)
statusEl.textContent = 'Transcribing...'
results = await transcribe(asr, audio)
renderWords(results)
statusEl.textContent = 'Done!'
exportBtn.style.display = 'inline'
})
videoEl.addEventListener('timeupdate', updateHighlight)
exportBtn.addEventListener('click', exportVideo)
})()
And the HTML page for the vanilla version:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>Karaoke Captions</title>
</head>
<body>
<input type="file" id="video-input" accept="video/*" />
<video id="video" controls playsinline style="width:100%;max-width:800px;"></video>
<div id="status"></div>
<div id="words"></div>
<button id="export" style="display:none;">Export MP4 with Captions</button>
<a id="download" download="captions.mp4" style="display:none;">Download MP4</a>
<script type="module" src="/src/main.ts"></script>
</body>
</html>
