
Real-Time Video Captioning in the Browser with Vision Language Models
Learn how to build an accessible video captioning system using Transformers.js and FastVLM that runs entirely in the browser with WebGPU—no server required.
Patrick the AI Engineer
Introduction
You're on a video call and someone without audio joins. Or you're building a live streaming platform and need to provide real-time captions. Maybe you're just trying to make your web app more accessible. The problem is the same: you need to describe what's happening in a video feed, right now, without the latency and cost of sending frames to a cloud API.
I've been exploring vision language models that run directly in the browser, and it turns out we can build surprisingly capable real-time captioning systems with no backend at all. We'll use Transformers.js to run FastVLM—a compact vision-language model—on the user's GPU via WebGPU. The result is a captioning system that's private (nothing leaves the browser), fast (no network latency), and free (no API costs).
This isn't a perfect replacement for professional captioning services, but it's remarkably good for accessibility features, interactive demos, and any scenario where you want immediate feedback about what's in a video stream.
Vision Language Models for the Browser
A vision language model (VLM) is essentially a model that can "see" images and talk about them. It takes an image and a text prompt as input, then generates a text response. Think of it like having a conversation with someone who's looking at the same picture you are.
The FastVLM model we'll use here is a 0.5B parameter model, which means it's small enough to download and run in a browser (around 300MB quantized). It won't match GPT-4V or Gemini for nuanced visual understanding, but it's surprisingly capable for real-time captioning. You can ask it "Describe what you see" or "What color is the shirt?" and get coherent answers in under a second.
The magic ingredient is WebGPU, a browser API that gives JavaScript access to your computer's GPU. Without it, running even a small vision model in the browser would be painfully slow. With it, we can process video frames fast enough for a live captioning experience. Chrome and Edge have stable WebGPU support now, and Firefox is catching up.
One thing I learned the hard way: you need to use quantized models for in-browser inference. The original FastVLM is too large to be practical. The ONNX version from Hugging Face applies quantization (using q4 and fp16 precision), which shrinks the model by about 4x with minimal quality loss. This trade-off is what makes real-time browser inference possible.
Implementation: Vanilla TypeScript
Let's start by building the core captioning logic in plain TypeScript. We'll break this into three parts: loading the model, capturing video frames, and running inference.
Loading the Model
First, we need to load both the processor (handles image preprocessing) and the model itself from Hugging Face.
import { AutoProcessor, AutoModelForImageTextToText } from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'
let processor: LlavaProcessor | null = null
let model: PreTrainedModel | null = null
async function loadModel(onProgress?: (msg: string) => void) {
onProgress?.('Loading processor...')
processor = await AutoProcessor.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX'
)
onProgress?.('Loading model...')
model = await AutoModelForImageTextToText.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX',
{
dtype: {
embed_tokens: 'fp16',
vision_encoder: 'q4',
decoder_model_merged: 'q4'
},
device: 'webgpu'
}
)
}
The dtype configuration tells Transformers.js which precision to use for each part of the model. Lower precision (like q4) means smaller size and faster inference, but slightly less accuracy. For real-time captioning, this trade-off is worth it. The device: 'webgpu' line is critical—it tells the library to use the GPU instead of the CPU.
Capturing Video Frames
To process video, we need to extract still frames from the video element. We'll use a canvas as an intermediate step.
import { RawImage } from '@huggingface/transformers'
function captureFrame(video: HTMLVideoElement): RawImage {
const canvas = document.createElement('canvas')
canvas.width = video.videoWidth
canvas.height = video.videoHeight
const ctx = canvas.getContext('2d')!
ctx.drawImage(video, 0, 0)
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
return new RawImage(imageData.data, imageData.width, imageData.height, 4)
}
This creates an off-screen canvas, draws the current video frame onto it, and extracts the raw pixel data as a RawImage object that Transformers.js can work with.
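If you're capturing a frame every couple of seconds, allocating a fresh canvas on each capture is wasteful. A small variation of my own that reuses a single module-level canvas (the willReadFrequently hint is optional but helps browsers that would otherwise keep the backing store on the GPU):
const captureCanvas = document.createElement('canvas')
// Hint that we'll call getImageData() often, so the backing store stays CPU-readable.
const captureCtx = captureCanvas.getContext('2d', { willReadFrequently: true })!

function captureFrameReusing(video: HTMLVideoElement): RawImage {
  captureCanvas.width = video.videoWidth
  captureCanvas.height = video.videoHeight
  captureCtx.drawImage(video, 0, 0)
  const imageData = captureCtx.getImageData(0, 0, captureCanvas.width, captureCanvas.height)
  return new RawImage(imageData.data, imageData.width, imageData.height, 4)
}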
Running Inference with Streaming
Now for the interesting part: running the model and streaming the output token by token.
import { TextStreamer } from '@huggingface/transformers'
async function generateCaption(
video: HTMLVideoElement,
prompt: string,
onToken: (token: string) => void
): Promise<string> {
if (!processor || !model) {
throw new Error('Model not loaded')
}
const frame = captureFrame(video)
// Format the prompt with chat template
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const formattedPrompt = processor.apply_chat_template(messages, {
add_generation_prompt: true
})
The model expects input in a specific chat format. The <image> token tells it where to inject the visual information, and we append our text prompt after it.
// Process inputs
const inputs = await processor(frame, formattedPrompt)
// Set up streaming
let result = ''
const streamer = new TextStreamer(processor.tokenizer!, {
skip_prompt: true,
skip_special_tokens: true,
callback_function: (token: string) => {
result += token
onToken(token)
}
})
The TextStreamer handles the streaming output. As the model generates each token, the callback fires, which lets us update the UI immediately rather than waiting for the complete caption.
// Generate
await model.generate({
...inputs,
max_new_tokens: 512,
do_sample: false,
streamer,
repetition_penalty: 1.2
})
return result.trim()
}
We're using do_sample: false for deterministic output and adding a slight repetition_penalty to discourage the model from repeating itself. These settings work well for descriptive captions.
Setting Up the Loop
To caption continuously, we need a loop that captures frames at regular intervals.
async function startCaptioning(
video: HTMLVideoElement,
prompt: string,
onUpdate: (caption: string) => void,
signal: AbortSignal
) {
while (!signal.aborted) {
if (video.readyState >= 2 && !video.paused) {
let currentCaption = ''
await generateCaption(video, prompt, (token) => {
currentCaption += token
onUpdate(currentCaption)
})
}
await new Promise(resolve => setTimeout(resolve, 2000))
}
}
Each pass captures a frame, runs inference while streaming tokens into the caption, then waits 2 seconds before the next pass (so the effective cadence is inference time plus the delay). The AbortSignal lets us cleanly stop the loop when needed.
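Wiring it up is just a matter of loading the model once and handing the loop an AbortController; aborting the signal stops captioning cleanly. A short usage sketch (the #caption element is a placeholder for wherever you render the text):
const video = document.querySelector<HTMLVideoElement>('video')!
const captionEl = document.querySelector('#caption')!  // hypothetical output element
const controller = new AbortController()

await loadModel((msg) => console.log(msg))

// Fire and forget: the loop runs until the signal is aborted.
startCaptioning(video, 'Describe what you see in one sentence.', (caption) => {
  captionEl.textContent = caption
}, controller.signal)

// Later, e.g. from a "Stop" button handler:
// controller.abort()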
Implementation: Vue.js
Vue makes this pattern much easier to manage, especially the state transitions and reactive updates. Let's rebuild this as a composable and component.
The VLM Composable
We'll wrap the model loading and inference logic in a reusable composable.
// useVLM.ts
import { ref, shallowRef } from 'vue'
import { AutoProcessor, AutoModelForImageTextToText, RawImage, TextStreamer } from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'
export function useVLM() {
const processor = shallowRef<LlavaProcessor | null>(null)
const model = shallowRef<PreTrainedModel | null>(null)
const isLoaded = ref(false)
const isLoading = ref(false)
const error = ref<string | null>(null)
async function load(onProgress?: (msg: string) => void) {
if (isLoaded.value) return
isLoading.value = true
error.value = null
try {
onProgress?.('Loading processor...')
processor.value = await AutoProcessor.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX'
)
Using shallowRef for the processor and model is important—these are complex objects we don't want Vue to deeply observe.
onProgress?.('Loading model...')
model.value = await AutoModelForImageTextToText.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX',
{
dtype: {
embed_tokens: 'fp16',
vision_encoder: 'q4',
decoder_model_merged: 'q4'
},
device: 'webgpu'
}
)
isLoaded.value = true
} catch (e) {
error.value = e instanceof Error ? e.message : String(e)
throw e
} finally {
isLoading.value = false
}
}
The loading states (isLoading, isLoaded, error) are reactive, so our UI can automatically respond to changes.
async function runInference(
video: HTMLVideoElement,
prompt: string,
onToken?: (token: string) => void
): Promise<string> {
if (!processor.value || !model.value) {
throw new Error('Model not loaded')
}
const canvas = document.createElement('canvas')
canvas.width = video.videoWidth
canvas.height = video.videoHeight
const ctx = canvas.getContext('2d')!
ctx.drawImage(video, 0, 0)
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
const frame = new RawImage(imageData.data, imageData.width, imageData.height, 4)
I'm keeping the canvas creation inside runInference for now. You could optimize this by reusing a single canvas (as sketched in the vanilla section), but for clarity, this works fine.
const messages = [
{ role: 'system', content: 'You are a helpful visual assistant.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const formattedPrompt = processor.value.apply_chat_template(messages, {
add_generation_prompt: true
})
const inputs = await processor.value(frame, formattedPrompt)
let result = ''
const streamer = new TextStreamer(processor.value.tokenizer!, {
skip_prompt: true,
skip_special_tokens: true,
callback_function: (token: string) => {
result += token
onToken?.(token)
}
})
await model.value.generate({
...inputs,
max_new_tokens: 512,
do_sample: false,
streamer,
repetition_penalty: 1.2
})
return result.trim()
}
return { isLoaded, isLoading, error, load, runInference }
}
The Captioning Loop Composable
Vue's watch makes it easy to build a reactive captioning loop that responds to user input.
// useCaptioningLoop.ts
import { watch, onUnmounted, type Ref } from 'vue'
import type { useVLM } from './useVLM'
export function useCaptioningLoop(
video: Ref<HTMLVideoElement | null>,
isRunning: Ref<boolean>,
prompt: Ref<string>,
vlm: ReturnType<typeof useVLM>,
onCaptionUpdate: (caption: string) => void,
onError: (error: string) => void
) {
let abortController: AbortController | null = null
watch([isRunning, prompt, () => vlm.isLoaded.value],
([running, currentPrompt, loaded]) => {
// Stop any existing loop
if (abortController) {
abortController.abort()
}
if (!running || !loaded) return
abortController = new AbortController()
const { signal } = abortController
Watching isRunning, prompt, and vlm.isLoaded means the loop automatically restarts whenever any of these change. Change the prompt? The loop resets with the new question.
const loop = async () => {
while (!signal.aborted) {
if (video.value && video.value.readyState >= 2) {
onCaptionUpdate('') // Clear previous caption
try {
await vlm.runInference(
video.value,
currentPrompt,
onCaptionUpdate
)
} catch (e) {
if (!signal.aborted) {
onError(e instanceof Error ? e.message : String(e))
}
}
}
if (signal.aborted) break
await new Promise(resolve => setTimeout(resolve, 2000))
}
}
setTimeout(loop, 0)
},
{ immediate: true }
)
onUnmounted(() => {
if (abortController) {
abortController.abort()
}
})
}
The cleanup happens automatically when the component unmounts. This is much cleaner than manually managing intervals and cleanup in vanilla TypeScript.
The Component
Now we can assemble everything into a Vue component with proper state management.
<script setup lang="ts">
import { ref, computed } from 'vue'
import { useVLM } from './useVLM'
import { useCaptioningLoop } from './useCaptioningLoop'
const videoRef = ref<HTMLVideoElement | null>(null)
const caption = ref('')
const currentPrompt = ref('Describe what you see in one sentence.')
const isLoopRunning = ref(false)
const captioningError = ref<string | null>(null)
const vlm = useVLM()
const handleCaptionUpdate = (token: string) => {
if (token === '') {
caption.value = ''
} else {
caption.value += token
}
captioningError.value = null
}
The component's state is straightforward. We're building up the caption token by token as they arrive.
const handleError = (error: string) => {
captioningError.value = error
}
useCaptioningLoop(
videoRef,
isLoopRunning,
currentPrompt,
vlm,
handleCaptionUpdate,
handleError
)
async function startCaptioning() {
await vlm.load((msg) => {
console.log(msg)
})
isLoopRunning.value = true
}
</script>
Starting the captioning process is just two steps: load the model (if needed), then start the loop. Vue's reactivity handles the rest.
Practical Considerations
Performance and User Experience
The first time someone visits your app, they'll wait 30-60 seconds while the model downloads. On subsequent visits, it'll be cached and load much faster. I show a loading progress bar with specific status messages ("Loading processor...", "Loading model...") to keep users informed. Without this feedback, people will think your app is broken.
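Beyond coarse status strings, Transformers.js can report byte-level download progress through the progress_callback option on from_pretrained, which makes a real progress bar possible. A sketch; the exact fields on the progress payload may differ by version, so treat the names below as an assumption and log the object to confirm:
import { AutoModelForImageTextToText } from '@huggingface/transformers'

const model = await AutoModelForImageTextToText.from_pretrained(
  'onnx-community/FastVLM-0.5B-ONNX',
  {
    dtype: { embed_tokens: 'fp16', vision_encoder: 'q4', decoder_model_merged: 'q4' },
    device: 'webgpu',
    // Called repeatedly while weight files download.
    progress_callback: (p: any) => {
      if (p.status === 'progress') {
        console.log(`${p.file}: ${Math.round(p.progress)}%`)
      }
    }
  }
)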
Inference speed depends heavily on the user's GPU. On my M1 MacBook, each caption takes about 800ms to generate. On an older Windows laptop with integrated graphics, it's closer to 2-3 seconds. You need to handle this variability. I capture frames every 2 seconds, which feels responsive on most hardware without overwhelming slower GPUs.
One thing I tried initially was processing every frame (30 or 60 fps). This completely overloaded the model and caused the browser to freeze. Batching frames or using a frame skip strategy might work, but for accessibility captions, processing every few seconds is actually fine—you don't need frame-perfect synchronization.
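If you'd rather adapt to slower GPUs than hard-code a two-second pause, one option is to time each inference and scale the wait accordingly. A rough sketch built on the vanilla generateCaption helper; the one-second floor and the "wait about as long as inference took" rule are arbitrary choices, not tuned values:
async function adaptiveCaptionLoop(
  video: HTMLVideoElement,
  prompt: string,
  onUpdate: (caption: string) => void,
  signal: AbortSignal
) {
  while (!signal.aborted) {
    const start = performance.now()
    let current = ''
    await generateCaption(video, prompt, (token) => {
      current += token
      onUpdate(current)
    })
    // Pause at least one second, or roughly as long as inference took,
    // so slower GPUs aren't hit back-to-back.
    const elapsed = performance.now() - start
    const delay = Math.max(1000, elapsed)
    await new Promise(resolve => setTimeout(resolve, delay))
  }
}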
Cost and Privacy
There's no API cost here, which is the main advantage. If you were using GPT-4V to caption frames at $0.01 per image, processing even a 5-minute video call at 1 frame per second would cost $3. That adds up fast. With in-browser inference, your only cost is the initial download (about 300MB), which is a one-time hit.
The privacy angle is meaningful for certain use cases. If you're building a telehealth app or anything involving sensitive visual content, keeping all processing local is a huge win. Nothing gets sent to OpenAI or Anthropic—the video never leaves the user's machine.
Error Handling
WebGPU isn't universally supported yet. I check for navigator.gpu before loading the model and show a clear error message if it's missing. Safari hadn't shipped stable WebGPU support as of early 2025, so you'll need a fallback for those users (maybe a message suggesting they use Chrome).
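A bare typeof navigator.gpu check catches a missing API, but requestAdapter() can still resolve to null (no usable GPU, blocklisted driver). A slightly stronger check, sketched with an any cast so it compiles without extra type packages:
async function hasUsableWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu  // typed properly if you install @webgpu/types
  if (!gpu) return false
  // requestAdapter() resolves to null when no suitable adapter is available.
  const adapter = await gpu.requestAdapter()
  return adapter !== null
}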
Camera permissions can also fail in various ways. I differentiate between "permission denied" (user clicked no), "no camera found" (hardware issue), and "not secure context" (not using HTTPS). Each gets its own error message because the fix is different.
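The distinction maps cleanly onto the DOMException names getUserMedia throws. A sketch of that mapping; showError is a hypothetical UI helper, not part of the component above:
declare function showError(message: string): void  // hypothetical UI helper

async function openCamera(): Promise<MediaStream | null> {
  if (!window.isSecureContext) {
    showError('Camera access requires HTTPS (or localhost).')
    return null
  }
  try {
    return await navigator.mediaDevices.getUserMedia({ video: { facingMode: 'user' } })
  } catch (err) {
    const name = err instanceof DOMException ? err.name : ''
    if (name === 'NotAllowedError') {
      showError('Camera permission was denied. Enable it in your browser settings.')
    } else if (name === 'NotFoundError') {
      showError('No camera was found on this device.')
    } else {
      showError('Could not start the camera.')
    }
    return null
  }
}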
Wrapping Up
We've built a real-time video captioning system that runs entirely in the browser using vision language models and WebGPU. The vanilla TypeScript implementation shows the core pattern: load a model, capture frames, run inference with streaming, and repeat. The Vue version wraps this in reactive composables that handle state and cleanup automatically.
This approach won't replace professional captioning services, but it's remarkably effective for accessibility features, interactive demos, or any scenario where you need immediate visual feedback without the latency and cost of cloud APIs. The fact that it runs locally means it's private, fast (once loaded), and free.
If you're building on this, consider adding support for multiple languages (FastVLM supports them), implementing a confidence threshold to hide low-quality captions, or letting users provide feedback to improve the prompts. The core pattern we've built here is flexible enough to handle these extensions.
Full Code Examples
import { ref, shallowRef } from 'vue'
import {
AutoProcessor,
AutoModelForImageTextToText,
RawImage,
TextStreamer
} from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'
export function useVLM() {
const processor = shallowRef<LlavaProcessor | null>(null)
const model = shallowRef<PreTrainedModel | null>(null)
const isLoaded = ref(false)
const isLoading = ref(false)
const error = ref<string | null>(null)
async function load(onProgress?: (msg: string) => void) {
if (isLoaded.value) return
isLoading.value = true
error.value = null
try {
onProgress?.('Loading processor...')
processor.value = await AutoProcessor.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX'
)
onProgress?.('Loading model...')
model.value = await AutoModelForImageTextToText.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX',
{
dtype: {
embed_tokens: 'fp16',
vision_encoder: 'q4',
decoder_model_merged: 'q4'
},
device: 'webgpu'
}
)
isLoaded.value = true
} catch (e) {
error.value = e instanceof Error ? e.message : String(e)
throw e
} finally {
isLoading.value = false
}
}
async function runInference(
video: HTMLVideoElement,
prompt: string,
onToken?: (token: string) => void
): Promise<string> {
if (!processor.value || !model.value) {
throw new Error('Model not loaded')
}
const canvas = document.createElement('canvas')
canvas.width = video.videoWidth
canvas.height = video.videoHeight
const ctx = canvas.getContext('2d')!
ctx.drawImage(video, 0, 0)
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
const frame = new RawImage(
imageData.data,
imageData.width,
imageData.height,
4
)
const messages = [
{
role: 'system',
content: 'You are a helpful visual assistant.'
},
{ role: 'user', content: `<image>\n${prompt}` }
]
const formattedPrompt = processor.value.apply_chat_template(messages, {
add_generation_prompt: true
})
const inputs = await processor.value(frame, formattedPrompt)
let result = ''
const streamer = new TextStreamer(processor.value.tokenizer!, {
skip_prompt: true,
skip_special_tokens: true,
callback_function: (token: string) => {
result += token
onToken?.(token)
}
})
await model.value.generate({
...inputs,
max_new_tokens: 512,
do_sample: false,
streamer,
repetition_penalty: 1.2
})
return result.trim()
}
return { isLoaded, isLoading, error, load, runInference }
}
import { watch, onUnmounted, type Ref } from 'vue'
import type { useVLM } from './useVLM'
export function useCaptioningLoop(
video: Ref<HTMLVideoElement | null>,
isRunning: Ref<boolean>,
prompt: Ref<string>,
vlm: ReturnType<typeof useVLM>,
onCaptionUpdate: (caption: string) => void,
onError: (error: string) => void
) {
let abortController: AbortController | null = null
watch(
[isRunning, prompt, () => vlm.isLoaded.value],
([running, currentPrompt, loaded]) => {
if (abortController) {
abortController.abort()
}
if (!running || !loaded) return
abortController = new AbortController()
const { signal } = abortController
const loop = async () => {
while (!signal.aborted) {
if (video.value && video.value.readyState >= 2) {
onCaptionUpdate('')
try {
await vlm.runInference(
video.value,
currentPrompt,
onCaptionUpdate
)
} catch (e) {
if (!signal.aborted) {
const msg = e instanceof Error ? e.message : String(e)
onError(msg)
}
}
}
if (signal.aborted) break
await new Promise(resolve => setTimeout(resolve, 2000))
}
}
setTimeout(loop, 0)
},
{ immediate: true }
)
onUnmounted(() => {
if (abortController) {
abortController.abort()
}
})
}
<template>
<div class="video-container">
<video
ref="videoRef"
autoplay
playsinline
muted
class="video-background"
/>
<div v-if="screen === 'permission'" class="permission-dialog">
<button @click="requestWebcamAccess">
Enable Camera
</button>
</div>
<div v-else-if="screen === 'loading'" class="loading-screen">
<p>{{ loadingMessage }}</p>
<div class="progress-bar" :style="{ width: `${loadingProgress}%` }" />
</div>
<div v-else-if="screen === 'captioning'" class="captioning-view">
<div class="controls">
<button @click="isLoopRunning = !isLoopRunning">
{{ isLoopRunning ? 'Pause' : 'Resume' }}
</button>
</div>
<div class="caption-box">
<input
v-model="currentPrompt"
placeholder="Ask a question about the video..."
class="prompt-input"
/>
<div class="caption-text">
{{ captioningError || caption || 'Starting...' }}
</div>
</div>
</div>
</div>
</template>
<script setup lang="ts">
import { ref, watch } from 'vue'
import { useVLM } from './useVLM'
import { useCaptioningLoop } from './useCaptioningLoop'
const videoRef = ref<HTMLVideoElement | null>(null)
const screen = ref<'permission' | 'loading' | 'captioning'>('permission')
const caption = ref('')
const currentPrompt = ref('Describe what you see in one sentence.')
const isLoopRunning = ref(true)
const captioningError = ref<string | null>(null)
const loadingMessage = ref('')
const loadingProgress = ref(0)
const vlm = useVLM()
async function requestWebcamAccess() {
try {
const stream = await navigator.mediaDevices.getUserMedia({
video: { facingMode: 'user' }
})
if (videoRef.value) {
videoRef.value.srcObject = stream
}
screen.value = 'loading'
await loadModel()
} catch (err) {
console.error('Camera access denied:', err)
}
}
async function loadModel() {
if (typeof navigator.gpu === 'undefined') {
loadingMessage.value = 'WebGPU not supported. Try Chrome or Edge.'
return
}
try {
await vlm.load((msg) => {
loadingMessage.value = msg
if (msg.includes('processor')) loadingProgress.value = 20
else if (msg.includes('model')) loadingProgress.value = 50
})
loadingProgress.value = 100
screen.value = 'captioning'
} catch (error) {
loadingMessage.value = 'Failed to load model'
console.error(error)
}
}
const handleCaptionUpdate = (token: string) => {
if (token === '') {
caption.value = ''
} else {
caption.value += token
}
captioningError.value = null
}
const handleError = (error: string) => {
captioningError.value = error
}
useCaptioningLoop(
videoRef,
isLoopRunning,
currentPrompt,
vlm,
handleCaptionUpdate,
handleError
)
</script>
<style scoped>
.video-container {
position: relative;
width: 100%;
height: 100vh;
background: #000;
}
.video-background {
width: 100%;
height: 100%;
object-fit: cover;
}
.permission-dialog,
.loading-screen,
.captioning-view {
position: absolute;
inset: 0;
display: flex;
align-items: center;
justify-content: center;
color: white;
}
.loading-screen {
flex-direction: column;
gap: 1rem;
}
.progress-bar {
height: 4px;
background: #3b82f6;
transition: width 0.3s;
}
.captioning-view {
flex-direction: column;
align-items: stretch;
justify-content: space-between;
padding: 1rem;
}
.controls {
display: flex;
justify-content: flex-end;
}
.caption-box {
background: rgba(0, 0, 0, 0.8);
padding: 1rem;
border-radius: 0.5rem;
}
.prompt-input {
width: 100%;
padding: 0.5rem;
margin-bottom: 1rem;
background: rgba(255, 255, 255, 0.1);
border: 1px solid rgba(255, 255, 255, 0.2);
color: white;
border-radius: 0.25rem;
}
.caption-text {
min-height: 3rem;
line-height: 1.5;
}
</style>
import {
AutoProcessor,
AutoModelForImageTextToText,
RawImage,
TextStreamer
} from '@huggingface/transformers'
import type { LlavaProcessor, PreTrainedModel } from '@huggingface/transformers'
let processor: LlavaProcessor | null = null
let model: PreTrainedModel | null = null
async function loadModel(onProgress?: (msg: string) => void) {
onProgress?.('Loading processor...')
processor = await AutoProcessor.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX'
)
onProgress?.('Loading model...')
model = await AutoModelForImageTextToText.from_pretrained(
'onnx-community/FastVLM-0.5B-ONNX',
{
dtype: {
embed_tokens: 'fp16',
vision_encoder: 'q4',
decoder_model_merged: 'q4'
},
device: 'webgpu'
}
)
}
function captureFrame(video: HTMLVideoElement): RawImage {
const canvas = document.createElement('canvas')
canvas.width = video.videoWidth
canvas.height = video.videoHeight
const ctx = canvas.getContext('2d')!
ctx.drawImage(video, 0, 0)
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
return new RawImage(imageData.data, imageData.width, imageData.height, 4)
}
async function generateCaption(
video: HTMLVideoElement,
prompt: string,
onToken: (token: string) => void
): Promise<string> {
if (!processor || !model) {
throw new Error('Model not loaded')
}
const frame = captureFrame(video)
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: `<image>\n${prompt}` }
]
const formattedPrompt = processor.apply_chat_template(messages, {
add_generation_prompt: true
})
const inputs = await processor(frame, formattedPrompt)
let result = ''
const streamer = new TextStreamer(processor.tokenizer!, {
skip_prompt: true,
skip_special_tokens: true,
callback_function: (token: string) => {
result += token
onToken(token)
}
})
await model.generate({
...inputs,
max_new_tokens: 512,
do_sample: false,
streamer,
repetition_penalty: 1.2
})
return result.trim()
}
async function startCaptioning(
video: HTMLVideoElement,
prompt: string,
onUpdate: (caption: string) => void,
signal: AbortSignal
) {
while (!signal.aborted) {
if (video.readyState >= 2 && !video.paused) {
let currentCaption = ''
await generateCaption(video, prompt, (token) => {
currentCaption += token
onUpdate(currentCaption)
})
}
await new Promise(resolve => setTimeout(resolve, 2000))
}
}
// Usage
const video = document.querySelector('video')!
const controller = new AbortController()
await loadModel((msg) => console.log(msg))
await startCaptioning(
video,
'Describe what you see',
(caption) => console.log(caption),
controller.signal
)
