Sep 4, 2025 - 30 min read
Real-Time Video Captioning in the Browser with Vision Language Models


Learn how to build an accessible video captioning system using Transformers.js and FastVLM that runs entirely in the browser with WebGPU—no server required.

Patrick the AI Engineer


Try the Interactive Playground

This tutorial is accompanied by an interactive playground. Test the code, experiment with different parameters, and see the results in real-time.


Introduction

You're on a video call and someone without audio joins. Or you're building a live streaming platform and need to provide real-time captions. Maybe you're just trying to make your web app more accessible. The problem is the same: you need to describe what's happening in a video feed, right now, without the latency and cost of sending frames to a cloud API.

I've been exploring vision language models that run directly in the browser, and it turns out we can build surprisingly capable real-time captioning systems with no backend at all. We'll use Transformers.js to run FastVLM—a compact vision-language model—on the user's GPU via WebGPU. The result is a captioning system that's private (nothing leaves the browser), fast (no network latency), and free (no API costs).

This isn't a perfect replacement for professional captioning services, but it's remarkably good for accessibility features, interactive demos, and any scenario where you want immediate feedback about what's in a video stream.

Vision Language Models for the Browser

A vision language model (VLM) is essentially a model that can "see" images and talk about them. It takes an image and a text prompt as input, then generates a text response. Think of it like having a conversation with someone who's looking at the same picture you are.

The FastVLM model we'll use here is a 0.5B parameter model, which means it's small enough to download and run in a browser (around 300MB quantized). It won't match GPT-4V or Gemini for nuanced visual understanding, but it's surprisingly capable for real-time captioning. You can ask it "Describe what you see" or "What color is the shirt?" and get coherent answers in under a second.

The magic ingredient is WebGPU, a browser API that gives JavaScript access to your computer's GPU. Without it, running even a small vision model in the browser would be painfully slow. With it, we can process video frames fast enough for a live captioning experience. Chrome and Edge have stable WebGPU support now, and Firefox is catching up.

One thing I learned the hard way: you need to use quantized models for in-browser inference. The original FastVLM is too large to be practical. The ONNX version from Hugging Face applies quantization (using q4 and fp16 precision), which shrinks the model by about 4x with minimal quality loss. This trade-off is what makes real-time browser inference possible.

License Note: FastVLM uses Apple's Model License for Research, which means you can only use it for research purposes. If you're building a production app, check the license carefully or look for alternative models.

Implementation: Vanilla TypeScript

Let's start by building the core captioning logic in plain TypeScript. We'll break this into three parts: loading the model, capturing video frames, and running inference.

Loading the Model

First, we need to load both the processor (handles image preprocessing) and the model itself from Hugging Face.

import { AutoProcessor, AutoModelForImageTextToText } from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'

let processor: LlavaProcessor | null = null
let model: PreTrainedModel | null = null

async function loadModel(onProgress?: (msg: string) => void) {
  onProgress?.('Loading processor...')
  processor = await AutoProcessor.from_pretrained(
    'onnx-community/FastVLM-0.5B-ONNX'
  )
  
  onProgress?.('Loading model...')
  model = await AutoModelForImageTextToText.from_pretrained(
    'onnx-community/FastVLM-0.5B-ONNX',
    {
      dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'q4',
        decoder_model_merged: 'q4'
      },
      device: 'webgpu'
    }
  )
}

The dtype configuration tells Transformers.js which precision to use for each part of the model. Lower precision (like q4) means smaller size and faster inference, but slightly less accuracy. For real-time captioning, this trade-off is worth it. The device: 'webgpu' line is critical—it tells the library to use the GPU instead of the CPU.

Capturing Video Frames

To process video, we need to extract still frames from the video element. We'll use a canvas as an intermediate step.

import { RawImage } from '@huggingface/transformers'

function captureFrame(video: HTMLVideoElement): RawImage {
  const canvas = document.createElement('canvas')
  canvas.width = video.videoWidth
  canvas.height = video.videoHeight
  
  const ctx = canvas.getContext('2d')!
  ctx.drawImage(video, 0, 0)
  
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
  return new RawImage(imageData.data, imageData.width, imageData.height, 4)
}

This creates an off-screen canvas, draws the current video frame onto it, and extracts the raw pixel data as a RawImage object that Transformers.js can work with.

Running Inference with Streaming

Now for the interesting part: running the model and streaming the output token by token.

import { TextStreamer } from '@huggingface/transformers'

async function generateCaption(
  video: HTMLVideoElement,
  prompt: string,
  onToken: (token: string) => void
): Promise<string> {
  if (!processor || !model) {
    throw new Error('Model not loaded')
  }

  const frame = captureFrame(video)
  
  // Format the prompt with chat template
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: `<image>\n${prompt}` }
  ]
  const formattedPrompt = processor.apply_chat_template(messages, {
    add_generation_prompt: true
  })

The model expects input in a specific chat format. The <image> token tells it where to inject the visual information, and we append our text prompt after it.

  // Process inputs
  const inputs = await processor(frame, formattedPrompt)
  
  // Set up streaming
  let result = ''
  const streamer = new TextStreamer(processor.tokenizer!, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function: (token: string) => {
      result += token
      onToken(token)
    }
  })

The TextStreamer handles the streaming output. As the model generates each token, the callback fires, which lets us update the UI immediately rather than waiting for the complete caption.

  // Generate
  await model.generate({
    ...inputs,
    max_new_tokens: 512,
    do_sample: false,
    streamer,
    repetition_penalty: 1.2
  })
  
  return result.trim()
}

We're using do_sample: false for deterministic output and adding a slight repetition_penalty to discourage the model from repeating itself. These settings work well for descriptive captions.

Setting Up the Loop

To caption continuously, we need a loop that captures frames at regular intervals.

async function startCaptioning(
  video: HTMLVideoElement,
  prompt: string,
  onUpdate: (caption: string) => void,
  signal: AbortSignal
) {
  while (!signal.aborted) {
    if (video.readyState >= 2 && !video.paused) {
      let currentCaption = ''
      
      await generateCaption(video, prompt, (token) => {
        currentCaption += token
        onUpdate(currentCaption)
      })
    }
    
    await new Promise(resolve => setTimeout(resolve, 2000))
  }
}

This captures a frame every 2 seconds, runs inference, and updates the caption as tokens stream in. The AbortSignal lets us cleanly stop the loop when needed.
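
To see how these pieces fit together, here's a minimal sketch that wires the loop up to a webcam feed. The element ids (#webcam, #caption, #stop) are illustrative assumptions, not part of the code above.

async function main() {
  const video = document.querySelector<HTMLVideoElement>('#webcam')!
  const captionEl = document.querySelector<HTMLElement>('#caption')!

  // Attach the user's camera to the video element
  const stream = await navigator.mediaDevices.getUserMedia({ video: true })
  video.srcObject = stream
  await video.play()

  await loadModel((msg) => console.log(msg))

  // The AbortController is our handle for stopping the loop later
  const controller = new AbortController()
  void startCaptioning(
    video,
    'Describe what you see in one sentence.',
    (caption) => { captionEl.textContent = caption },
    controller.signal
  )

  // Hypothetical stop button
  document.querySelector('#stop')?.addEventListener('click', () => controller.abort())
}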

Implementation: Vue.js

Vue makes this pattern much easier to manage, especially the state transitions and reactive updates. Let's rebuild this as a composable and component.

The VLM Composable

We'll wrap the model loading and inference logic in a reusable composable.

// useVLM.ts
import { ref, shallowRef } from 'vue'
import { AutoProcessor, AutoModelForImageTextToText, RawImage, TextStreamer } from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'

export function useVLM() {
  const processor = shallowRef<LlavaProcessor | null>(null)
  const model = shallowRef<PreTrainedModel | null>(null)
  const isLoaded = ref(false)
  const isLoading = ref(false)
  const error = ref<string | null>(null)

  async function load(onProgress?: (msg: string) => void) {
    if (isLoaded.value) return
    
    isLoading.value = true
    error.value = null
    
    try {
      onProgress?.('Loading processor...')
      processor.value = await AutoProcessor.from_pretrained(
        'onnx-community/FastVLM-0.5B-ONNX'
      )

Using shallowRef for the processor and model is important—these are complex objects we don't want Vue to deeply observe.

      onProgress?.('Loading model...')
      model.value = await AutoModelForImageTextToText.from_pretrained(
        'onnx-community/FastVLM-0.5B-ONNX',
        {
          dtype: {
            embed_tokens: 'fp16',
            vision_encoder: 'q4',
            decoder_model_merged: 'q4'
          },
          device: 'webgpu'
        }
      )
      
      isLoaded.value = true
    } catch (e) {
      error.value = e instanceof Error ? e.message : String(e)
      throw e
    } finally {
      isLoading.value = false
    }
  }

The loading states (isLoading, isLoaded, error) are reactive, so our UI can automatically respond to changes.

  async function runInference(
    video: HTMLVideoElement,
    prompt: string,
    onToken?: (token: string) => void
  ): Promise<string> {
    if (!processor.value || !model.value) {
      throw new Error('Model not loaded')
    }

    const canvas = document.createElement('canvas')
    canvas.width = video.videoWidth
    canvas.height = video.videoHeight
    
    const ctx = canvas.getContext('2d')!
    ctx.drawImage(video, 0, 0)
    
    const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
    const frame = new RawImage(imageData.data, imageData.width, imageData.height, 4)

I'm keeping the canvas creation inside runInference for now. You could optimize this by reusing a single canvas (there's a small sketch of that after the composable), but for clarity, this works fine.

    const messages = [
      { role: 'system', content: 'You are a helpful visual assistant.' },
      { role: 'user', content: `<image>\n${prompt}` }
    ]
    
    const formattedPrompt = processor.value.apply_chat_template(messages, {
      add_generation_prompt: true
    })
    
    const inputs = await processor.value(frame, formattedPrompt)
    
    let result = ''
    const streamer = new TextStreamer(processor.value.tokenizer!, {
      skip_prompt: true,
      skip_special_tokens: true,
      callback_function: (token: string) => {
        result += token
        onToken?.(token)
      }
    })
    
    await model.value.generate({
      ...inputs,
      max_new_tokens: 512,
      do_sample: false,
      streamer,
      repetition_penalty: 1.2
    })
    
    return result.trim()
  }

  return { isLoaded, isLoading, error, load, runInference }
}
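
As mentioned above, the per-call canvas allocation is easy to avoid if it ever shows up in profiling. Here's a small optional sketch that reuses one off-screen canvas across frames; RawImage is already imported in the composable, and the willReadFrequently hint is an extra assumption on my part, not something the code above uses.

// Optional optimization sketch: reuse one off-screen canvas across frames
let sharedCanvas: HTMLCanvasElement | null = null

function captureFrameReusing(video: HTMLVideoElement): RawImage {
  if (!sharedCanvas) {
    sharedCanvas = document.createElement('canvas')
  }
  sharedCanvas.width = video.videoWidth
  sharedCanvas.height = video.videoHeight

  // willReadFrequently hints that we'll read pixels back from this canvas often
  const ctx = sharedCanvas.getContext('2d', { willReadFrequently: true })!
  ctx.drawImage(video, 0, 0)

  const imageData = ctx.getImageData(0, 0, sharedCanvas.width, sharedCanvas.height)
  return new RawImage(imageData.data, imageData.width, imageData.height, 4)
}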

The Captioning Loop Composable

Vue's watch makes it easy to build a reactive captioning loop that responds to user input.

// useCaptioningLoop.ts
import { watch, onUnmounted, type Ref } from 'vue'

export function useCaptioningLoop(
  video: Ref<HTMLVideoElement | null>,
  isRunning: Ref<boolean>,
  prompt: Ref<string>,
  vlm: ReturnType<typeof useVLM>,
  onCaptionUpdate: (caption: string) => void,
  onError: (error: string) => void
) {
  let abortController: AbortController | null = null

  watch([isRunning, prompt, () => vlm.isLoaded.value], 
    ([running, currentPrompt, loaded]) => {
      // Stop any existing loop
      if (abortController) {
        abortController.abort()
      }
      
      if (!running || !loaded) return
      
      abortController = new AbortController()
      const { signal } = abortController

Watching isRunning, prompt, and vlm.isLoaded means the loop automatically restarts whenever any of these change. Change the prompt? The loop resets with the new question.

      const loop = async () => {
        while (!signal.aborted) {
          if (video.value && video.value.readyState >= 2) {
            onCaptionUpdate('') // Clear previous caption
            
            try {
              await vlm.runInference(
                video.value,
                currentPrompt,
                onCaptionUpdate
              )
            } catch (e) {
              if (!signal.aborted) {
                onError(e instanceof Error ? e.message : String(e))
              }
            }
          }
          
          if (signal.aborted) break
          await new Promise(resolve => setTimeout(resolve, 2000))
        }
      }
      
      setTimeout(loop, 0)
    },
    { immediate: true }
  )

  onUnmounted(() => {
    if (abortController) {
      abortController.abort()
    }
  })
}

The cleanup happens automatically when the component unmounts. This is much cleaner than manually managing intervals and cleanup in vanilla TypeScript.

The Component

Now we can assemble everything into a Vue component with proper state management.

<script setup lang="ts">
import { ref, computed } from 'vue'
import { useVLM } from './useVLM'
import { useCaptioningLoop } from './useCaptioningLoop'

const videoRef = ref<HTMLVideoElement | null>(null)
const caption = ref('')
const currentPrompt = ref('Describe what you see in one sentence.')
const isLoopRunning = ref(false)
const captioningError = ref<string | null>(null)

const vlm = useVLM()

const handleCaptionUpdate = (token: string) => {
  if (token === '') {
    caption.value = ''
  } else {
    caption.value += token
  }
  captioningError.value = null
}

The component's state is straightforward. We're building up the caption token by token as they arrive.

const handleError = (error: string) => {
  captioningError.value = error
}

useCaptioningLoop(
  videoRef,
  isLoopRunning,
  currentPrompt,
  vlm,
  handleCaptionUpdate,
  handleError
)

async function startCaptioning() {
  await vlm.load((msg) => {
    console.log(msg)
  })
  isLoopRunning.value = true
}
</script>

Starting the captioning process is just two steps: load the model (if needed), then start the loop. Vue's reactivity handles the rest.
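
For completeness, here's a minimal template to pair with that script block. The markup and attributes are an illustrative sketch, not a prescribed layout; note the explicit .value on the nested refs returned by useVLM, which aren't auto-unwrapped inside the template.

<template>
  <div>
    <!-- The video element the loop captures frames from (getUserMedia setup omitted) -->
    <video ref="videoRef" autoplay muted playsinline></video>

    <!-- Changing the prompt automatically restarts the loop via the watcher -->
    <input v-model="currentPrompt" />
    <button @click="startCaptioning" :disabled="vlm.isLoading.value">
      {{ vlm.isLoading.value ? 'Loading model...' : 'Start captioning' }}
    </button>

    <p v-if="captioningError">{{ captioningError }}</p>
    <p v-else>{{ caption }}</p>
  </div>
</template>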

Practical Considerations

Performance and User Experience

The first time someone visits your app, they'll wait 30-60 seconds while the model downloads. On subsequent visits, it'll be cached and load much faster. I show a loading progress bar with specific status messages ("Loading processor...", "Loading model...") to keep users informed. Without this feedback, people will think your app is broken.
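
If you want an actual percentage rather than coarse status messages, from_pretrained accepts a progress_callback option. The exact shape of the events it receives can differ between Transformers.js versions, so treat the field names below as assumptions to verify against your installed version. Inside the composable's load function it might look like this:

// Sketch: surface download progress to the UI (event fields are assumptions to verify)
model.value = await AutoModelForImageTextToText.from_pretrained(
  'onnx-community/FastVLM-0.5B-ONNX',
  {
    dtype: { embed_tokens: 'fp16', vision_encoder: 'q4', decoder_model_merged: 'q4' },
    device: 'webgpu',
    progress_callback: (event: any) => {
      if (event.status === 'progress') {
        onProgress?.(`${event.file}: ${Math.round(event.progress)}%`)
      }
    }
  }
)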

Inference speed depends heavily on the user's GPU. On my M1 MacBook, each caption takes about 800ms to generate. On an older Windows laptop with integrated graphics, it's closer to 2-3 seconds. You need to handle this variability. I capture frames every 2 seconds, which feels responsive on most hardware without overwhelming slower GPUs.

One thing I tried initially was processing every frame (30 or 60 fps). This completely overloaded the model and caused the browser to freeze. Batching frames or using a frame skip strategy might work, but for accessibility captions, processing every few seconds is actually fine—you don't need frame-perfect synchronization.
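
One middle ground that also helps with the hardware variability mentioned above: time each inference and scale the wait accordingly, so slower GPUs automatically caption less often. Here's a sketch on top of the vanilla generateCaption; the multiplier and the 1-second floor are arbitrary choices.

// Sketch: adapt the captioning interval to measured inference time
async function adaptiveLoop(
  video: HTMLVideoElement,
  prompt: string,
  onUpdate: (caption: string) => void,
  signal: AbortSignal
) {
  while (!signal.aborted) {
    if (video.readyState >= 2 && !video.paused) {
      const start = performance.now()
      let currentCaption = ''
      await generateCaption(video, prompt, (token) => {
        currentCaption += token
        onUpdate(currentCaption)
      })
      const elapsed = performance.now() - start

      // Wait at least 1s, or roughly twice as long as inference took on slower GPUs
      const delay = Math.max(1000, elapsed * 2)
      await new Promise(resolve => setTimeout(resolve, delay))
    } else {
      await new Promise(resolve => setTimeout(resolve, 500))
    }
  }
}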

Cost and Privacy

There's no API cost here, which is the main advantage. If you were using GPT-4V to caption frames at $0.01 per image, processing even a 5-minute video call at 1 frame per second would cost $3 (300 frames × $0.01). That adds up fast. With in-browser inference, your only cost is the initial download (about 300MB), which is a one-time hit.

The privacy angle is meaningful for certain use cases. If you're building a telehealth app or anything involving sensitive visual content, keeping all processing local is a huge win. Nothing gets sent to OpenAI or Anthropic—the video never leaves the user's machine.

Error Handling

WebGPU isn't universally supported yet. I check for navigator.gpu before loading the model and show a clear error message if it's missing. Safari doesn't support WebGPU at all as of early 2025, so you'll need a fallback for those users (maybe a message suggesting they use Chrome).
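
A minimal guard before loading the model might look like the following; the error messages are examples, and the cast is only there because WebGPU types may not be in your TypeScript lib config by default.

// Guard against browsers without WebGPU before attempting to load the model
async function ensureWebGPU(): Promise<void> {
  if (!('gpu' in navigator)) {
    throw new Error('WebGPU is not available in this browser. Try a recent version of Chrome or Edge.')
  }
  // Some browsers expose navigator.gpu but return no adapter (e.g. blocklisted GPUs)
  const adapter = await (navigator as any).gpu.requestAdapter()
  if (!adapter) {
    throw new Error('WebGPU is present but no GPU adapter is available.')
  }
}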

Camera permissions can also fail in various ways. I differentiate between "permission denied" (user clicked no), "no camera found" (hardware issue), and "not secure context" (not using HTTPS). Each gets its own error message because the fix is different.
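
Here's roughly how that branching looks when requesting the camera; the messages are examples you'd adapt to your UI.

// Map getUserMedia failures to actionable error messages
async function getCameraStream(): Promise<MediaStream> {
  if (!window.isSecureContext) {
    throw new Error('Camera access requires HTTPS (or localhost).')
  }
  try {
    return await navigator.mediaDevices.getUserMedia({ video: true })
  } catch (e) {
    if (e instanceof DOMException) {
      if (e.name === 'NotAllowedError') {
        throw new Error('Camera permission was denied. Allow camera access and reload.')
      }
      if (e.name === 'NotFoundError') {
        throw new Error('No camera was found on this device.')
      }
    }
    throw e
  }
}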

Wrapping Up

We've built a real-time video captioning system that runs entirely in the browser using vision language models and WebGPU. The vanilla TypeScript implementation shows the core pattern: load a model, capture frames, run inference with streaming, and repeat. The Vue version wraps this in reactive composables that handle state and cleanup automatically.

This approach won't replace professional captioning services, but it's remarkably effective for accessibility features, interactive demos, or any scenario where you need immediate visual feedback without the latency and cost of cloud APIs. The fact that it runs locally means it's private, fast (once loaded), and free.

If you're building on this, consider adding support for multiple languages (FastVLM supports them), implementing a confidence threshold to hide low-quality captions, or letting users provide feedback to improve the prompts. The core pattern we've built here is flexible enough to handle these extensions.

Full Code Examples

import { ref, shallowRef } from 'vue'
import { 
  AutoProcessor, 
  AutoModelForImageTextToText, 
  RawImage, 
  TextStreamer 
} from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'

export function useVLM() {
  const processor = shallowRef<LlavaProcessor | null>(null)
  const model = shallowRef<PreTrainedModel | null>(null)
  const isLoaded = ref(false)
  const isLoading = ref(false)
  const error = ref<string | null>(null)

  async function load(onProgress?: (msg: string) => void) {
    if (isLoaded.value) return
    
    isLoading.value = true
    error.value = null
    
    try {
      onProgress?.('Loading processor...')
      processor.value = await AutoProcessor.from_pretrained(
        'onnx-community/FastVLM-0.5B-ONNX'
      )
      
      onProgress?.('Loading model...')
      model.value = await AutoModelForImageTextToText.from_pretrained(
        'onnx-community/FastVLM-0.5B-ONNX',
        {
          dtype: {
            embed_tokens: 'fp16',
            vision_encoder: 'q4',
            decoder_model_merged: 'q4'
          },
          device: 'webgpu'
        }
      )
      
      isLoaded.value = true
    } catch (e) {
      error.value = e instanceof Error ? e.message : String(e)
      throw e
    } finally {
      isLoading.value = false
    }
  }

  async function runInference(
    video: HTMLVideoElement,
    prompt: string,
    onToken?: (token: string) => void
  ): Promise<string> {
    if (!processor.value || !model.value) {
      throw new Error('Model not loaded')
    }

    const canvas = document.createElement('canvas')
    canvas.width = video.videoWidth
    canvas.height = video.videoHeight
    
    const ctx = canvas.getContext('2d')!
    ctx.drawImage(video, 0, 0)
    
    const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
    const frame = new RawImage(
      imageData.data, 
      imageData.width, 
      imageData.height, 
      4
    )

    const messages = [
      { 
        role: 'system', 
        content: 'You are a helpful visual assistant.' 
      },
      { role: 'user', content: `<image>\n${prompt}` }
    ]
    
    const formattedPrompt = processor.value.apply_chat_template(messages, {
      add_generation_prompt: true
    })
    
    const inputs = await processor.value(frame, formattedPrompt)
    
    let result = ''
    const streamer = new TextStreamer(processor.value.tokenizer!, {
      skip_prompt: true,
      skip_special_tokens: true,
      callback_function: (token: string) => {
        result += token
        onToken?.(token)
      }
    })
    
    await model.value.generate({
      ...inputs,
      max_new_tokens: 512,
      do_sample: false,
      streamer,
      repetition_penalty: 1.2
    })
    
    return result.trim()
  }

  return { isLoaded, isLoading, error, load, runInference }
}