Aug 19, 2025 - 18 min read
Placing text behind video subjects with AI that runs entirely in your browser

Build a client-side video editor that uses the RMBG V1.4 model through Transformers.js to create layered text effects. No server required.

Patrick the AI Engineer

This tutorial is accompanied by an interactive playground. Test the code, experiment with different parameters, and see the results in real-time.

You've seen those effects where text appears behind a person in an image. We're going to do that with video, and we're going to run the entire AI model in your browser. No server, no API calls, no uploads.

The interesting part is that we can now run sophisticated computer vision models directly in JavaScript. The RMBG V1.4 model removes backgrounds and generates smooth alpha mattes. We'll use Transformers.js to run it with WebGPU acceleration, then layer video frames to create the effect.

Let's start with the basic setup. The only runtime dependency is Transformers.js; TypeScript and Vite are just dev tooling.

npm init -y
npm i @huggingface/transformers
npm i --save-dev typescript vite

The first thing we need is to load the AI model. This happens once when your app starts.

import { AutoModel, AutoProcessor, RawImage } from "@huggingface/transformers";

const model = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
  config: { model_type: "custom" },
  device: "webgpu",
  dtype: "fp32",
});

We're requesting WebGPU and 32-bit floats. That's what gives us GPU acceleration. Without WebGPU this runs about 5x slower through WebAssembly, but it still works.

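If you'd rather fall back than fail on browsers without WebGPU, you can pick the device up front. Here's a minimal sketch, assuming the slower WebAssembly backend is an acceptable fallback.

// Prefer WebGPU when the browser exposes it, otherwise fall back to WebAssembly.
const device = "gpu" in navigator ? "webgpu" : "wasm";

const model = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
  config: { model_type: "custom" },
  device,
  dtype: "fp32",
});
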
Now we add the processor that prepares images for the model.

import { AutoModel, AutoProcessor, RawImage } from "@huggingface/transformers";

const model = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
  config: { model_type: "custom" },
  device: "webgpu",
  dtype: "fp32",
});

const processor = await AutoProcessor.from_pretrained("briaai/RMBG-1.4", {
  config: {
    do_normalize: true,
    image_mean: [0.5, 0.5, 0.5],
    image_std: [1, 1, 1],
    size: { width: 1024, height: 1024 },
  },
});

These normalization values match what the model was trained on. Change them and you get garbage output.

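To see what those settings do, here's the per-channel arithmetic the processor applies: pixels are rescaled from 0-255 down to 0-1 (the full listing at the end of this post sets rescale_factor to 1/255), then the mean is subtracted and the result divided by the standard deviation.

// (pixel * rescale - mean) / std, using the values from the processor config above.
const rescale = 1 / 255;
const mean = 0.5;
const std = 1;

const normalize = (value: number) => (value * rescale - mean) / std;

normalize(0);   // -0.5 — pure black
normalize(255); //  0.5 — pure white
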
Let's write the background removal function. It takes a canvas and returns a new canvas with the background removed.

async function removeBackground(canvas: HTMLCanvasElement) {
  const image = RawImage.fromCanvas(canvas);
  const { pixel_values } = await processor(image);
  
  return image;
}

This converts the canvas to a format the model understands and preprocesses it. The processor handles resizing and normalization automatically.

Now we run the model and get the alpha matte.

async function removeBackground(canvas: HTMLCanvasElement) {
  const image = RawImage.fromCanvas(canvas);
  const { pixel_values } = await processor(image);
  
  const { output } = await model({ input: pixel_values });
  const mask = await RawImage.fromTensor(
    output[0].mul(255).to("uint8")
  ).resize(image.width, image.height);
  
  return mask;
}

The model outputs a tensor we convert to an image. We multiply by 255 because the model outputs values between 0 and 1, and we need 0 to 255 for pixel data.

Now we apply the mask to create transparency.

async function removeBackground(canvas: HTMLCanvasElement) {
  const image = RawImage.fromCanvas(canvas);
  const { pixel_values } = await processor(image);
  
  const { output } = await model({ input: pixel_values });
  const mask = await RawImage.fromTensor(
    output[0].mul(255).to("uint8")
  ).resize(image.width, image.height);
  
  const result = document.createElement("canvas");
  result.width = canvas.width;
  result.height = canvas.height;
  const ctx = result.getContext("2d");
  
  ctx.drawImage(canvas, 0, 0);
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  for (let i = 0; i < mask.data.length; i++) {
    imageData.data[4 * i + 3] = mask.data[i];
  }
  
  ctx.putImageData(imageData, 0, 0);
  return result;
}

We're directly manipulating the alpha channel in the pixel array. Every 4th byte is transparency, and we're replacing it with the mask value.

That's background removal working. Now we need to process video frames. Let's write a function that extracts frames from a video element.

async function extractFrames(
  video: HTMLVideoElement,
  fps: number
) {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  
  return [];
}

The willReadFrequently flag tells the browser we'll be reading pixels a lot. This makes getImageData() calls faster.

Now we seek through the video and capture each frame.

async function extractFrames(
  video: HTMLVideoElement,
  fps: number
) {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  
  const frames: HTMLCanvasElement[] = [];
  const totalFrames = Math.floor(video.duration * fps);
  
  for (let i = 0; i < totalFrames; i++) {
    video.currentTime = i / fps;
    await new Promise(resolve => video.onseeked = resolve);
    
    ctx.drawImage(video, 0, 0);
    frames.push(canvas);
  }
  
  return frames;
}

We're setting currentTime and waiting for the seeked event. This ensures we capture the actual frame at that timestamp, not whatever's in the buffer. (One caveat: as written, every entry in frames points at the same canvas object, so they'd all end up showing the last frame. The next step fixes that by turning each frame into its own processed canvas.)

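One more subtlety: assigning onseeked after changing currentTime works in practice, but attaching the listener before triggering the seek removes any chance of missing a fast seek. A small helper like this keeps the loop tidy (the seekTo name is just for illustration).

function seekTo(video: HTMLVideoElement, time: number) {
  return new Promise<void>(resolve => {
    // Register the listener before starting the seek so the event can't slip past us.
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = time;
  });
}

// Inside the loop: await seekTo(video, i / fps);
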
But we're not processing the frames yet. Let's add that.

async function extractFrames(
  video: HTMLVideoElement,
  fps: number,
  text: string
) {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  
  const frames: HTMLCanvasElement[] = [];
  const totalFrames = Math.floor(video.duration * fps);
  
  for (let i = 0; i < totalFrames; i++) {
    video.currentTime = i / fps;
    await new Promise(resolve => video.onseeked = resolve);
    
    ctx.drawImage(video, 0, 0);
    const processed = await processFrame(canvas, text);
    frames.push(processed);
  }
  
  return frames;
}

Each frame goes through processFrame before we store it. This is where we create the layered effect.

The frame processing needs to composite three layers: the original frame, the text, and the foreground subject with its background removed.

async function processFrame(
  canvas: HTMLCanvasElement,
  text: string
) {
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  const original = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  const foreground = await removeBackground(canvas);
  
  return foreground;
}

We grab the original frame data first because we'll need it for the base layer.

Now let's create the text overlay layer.

async function processFrame(
  canvas: HTMLCanvasElement,
  text: string
) {
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  const original = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  const foreground = await removeBackground(canvas);
  const textLayer = createTextLayer(canvas.width, canvas.height, text);
  
  return foreground;
}

The text layer is just a canvas with text drawn on it. We'll write that function in a moment.

Now we composite all three layers together.

async function processFrame(
  canvas: HTMLCanvasElement,
  text: string
) {
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  const original = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  const foreground = await removeBackground(canvas);
  const textLayer = createTextLayer(canvas.width, canvas.height, text);
  
  const composite = document.createElement("canvas");
  composite.width = canvas.width;
  composite.height = canvas.height;
  const compositeCtx = composite.getContext("2d");
  
  compositeCtx.putImageData(original, 0, 0);
  compositeCtx.drawImage(textLayer, 0, 0);
  compositeCtx.drawImage(foreground, 0, 0);
  
  return composite;
}

The draw order creates the effect. Original on bottom, text in the middle, foreground subject on top. That makes the text appear behind the person.

The text layer function is straightforward canvas drawing.

function createTextLayer(
  width: number,
  height: number,
  text: string
) {
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext("2d");
  
  const fontSize = Math.min(width, height) / 4;
  ctx.font = `bold ${fontSize}px Arial`;
  ctx.fillStyle = "rgba(255, 255, 255, 0.9)";
  ctx.textAlign = "center";
  ctx.textBaseline = "middle";
  ctx.fillText(text, width / 2, height / 2);
  
  return canvas;
}

We size the font based on the video dimensions so it scales properly.

Now we have an array of processed frames. We need to encode them back into a video. The MediaRecorder API handles this.

async function reassembleVideo(
  frames: HTMLCanvasElement[],
  fps: number
) {
  const canvas = document.createElement("canvas");
  canvas.width = frames[0].width;
  canvas.height = frames[0].height;
  
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9"
  });
}

We create a video stream from a canvas. Whatever we draw on this canvas gets recorded.

Let's set up the recorder to collect chunks.

async function reassembleVideo(
  frames: HTMLCanvasElement[],
  fps: number
) {
  const canvas = document.createElement("canvas");
  canvas.width = frames[0].width;
  canvas.height = frames[0].height;
  
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9"
  });
  
  const chunks: Blob[] = [];
  
  return new Promise<Blob>(resolve => {
    recorder.ondataavailable = e => {
      if (e.data.size > 0) chunks.push(e.data);
    };
    
    recorder.onstop = () => {
      resolve(new Blob(chunks, { type: "video/webm" }));
    };
  });
}

We collect video chunks as they're encoded and combine them when recording stops.

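One detail for the next step: with no argument, start() makes ondataavailable fire only when the recorder stops (or when you call requestData()). If you'd rather receive chunks while encoding is still running, MediaRecorder.start() also accepts a timeslice in milliseconds; for this use case either behavior works.

recorder.start(1000); // emit a chunk roughly every second instead of a single blob at stop
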
Now we need to draw frames to the canvas at the right interval.

async function reassembleVideo(
  frames: HTMLCanvasElement[],
  fps: number
) {
  const canvas = document.createElement("canvas");
  canvas.width = frames[0].width;
  canvas.height = frames[0].height;
  const ctx = canvas.getContext("2d");
  
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9"
  });
  
  const chunks: Blob[] = [];
  
  return new Promise<Blob>(resolve => {
    recorder.ondataavailable = e => {
      if (e.data.size > 0) chunks.push(e.data);
    };
    
    recorder.onstop = () => {
      resolve(new Blob(chunks, { type: "video/webm" }));
    };
    
    recorder.start();
    
    let i = 0;
    const interval = 1000 / fps;
    const draw = () => {
      if (i < frames.length) {
        ctx.drawImage(frames[i++], 0, 0);
        setTimeout(draw, interval);
      } else {
        recorder.stop();
      }
    };
    draw();
  });
}

We draw frames at the correct interval using setTimeout. This isn't perfectly frame-accurate but it's close enough.

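If you want tighter control over which frames end up in the recording, captureStream(0) creates a track that only captures when you ask it to. Here's a sketch of that variation of reassembleVideo's capture setup and draw loop; canvas, ctx, frames, and fps are the same variables as above, and the chunk collection and onstop handler stay unchanged.

// With a frameRate of 0, nothing is captured until we call requestFrame().
const stream = canvas.captureStream(0);
const track = stream.getVideoTracks()[0] as CanvasCaptureMediaStreamTrack;
const recorder = new MediaRecorder(stream, { mimeType: "video/webm;codecs=vp9" });

recorder.start();

let i = 0;
const draw = () => {
  if (i < frames.length) {
    ctx.drawImage(frames[i++], 0, 0);
    track.requestFrame(); // push exactly this frame into the recording
    setTimeout(draw, 1000 / fps);
  } else {
    recorder.stop();
  }
};
draw();
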
Now let's tie everything together with a main function.

async function processVideo(videoUrl: string, text: string) {
  const video = document.createElement("video");
  video.src = videoUrl;
  video.muted = true;
  
  await new Promise(resolve => video.onloadedmetadata = resolve);
  
  const fps = 30;
  const frames = await extractFrames(video, fps, text);
  const videoBlob = await reassembleVideo(frames, fps);
  
  return videoBlob;
}

This loads the video, processes all frames, and returns a blob you can download.

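To hand the finished blob to the user, the usual object-URL-and-anchor trick is all you need. A small helper sketch; the downloadBlob name and the file path in the usage comment are just examples.

function downloadBlob(blob: Blob, filename: string) {
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = filename;
  a.click();
  // Revoke after the download has had a chance to start.
  setTimeout(() => URL.revokeObjectURL(url), 1000);
}

// downloadBlob(await processVideo("/clip.mp4", "IMAGINATION"), "text-behind.webm");
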
That's the core implementation working. But there's a memory problem. A 10-second video at 30 fps means 300 canvases in memory. Each 720p canvas holds about 3.7 MB of uncompressed RGBA pixel data (1280 × 720 × 4 bytes). That's over a gigabyte just for frames.

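The arithmetic behind that estimate, if you want to plug in your own resolution, frame rate, and duration:

// Uncompressed RGBA canvas data is 4 bytes per pixel.
const width = 1280, height = 720;          // 720p
const bytesPerFrame = width * height * 4;  // ≈ 3.7 MB per frame
const frameCount = 10 * 30;                // 10 seconds at 30 fps
console.log((bytesPerFrame * frameCount / 1e9).toFixed(2), "GB"); // ≈ 1.11
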
You can handle this by processing and encoding frames as you go instead of storing them all. Here's a streaming approach.

async function processVideoStreaming(videoUrl: string, text: string) {
  const video = document.createElement("video");
  video.src = videoUrl;
  video.muted = true;
  
  await new Promise(resolve => video.onloadedmetadata = resolve);
  
  const fps = 30;
  const totalFrames = Math.floor(video.duration * fps);
  
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d");
  
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9"
  });
  
  const chunks: Blob[] = [];
  recorder.ondataavailable = e => {
    if (e.data.size > 0) chunks.push(e.data);
  };
  
  recorder.start();
  
  for (let i = 0; i < totalFrames; i++) {
    video.currentTime = i / fps;
    await new Promise(resolve => video.onseeked = resolve);
    
    const frameCanvas = document.createElement("canvas");
    frameCanvas.width = video.videoWidth;
    frameCanvas.height = video.videoHeight;
    const frameCtx = frameCanvas.getContext("2d", { willReadFrequently: true });
    frameCtx.drawImage(video, 0, 0);
    
    const processed = await processFrame(frameCanvas, text);
    ctx.drawImage(processed, 0, 0);
    
    await new Promise(resolve => setTimeout(resolve, 1000 / fps));
  }
  
  recorder.stop();
  
  return new Promise<Blob>(resolve => {
    recorder.onstop = () => {
      resolve(new Blob(chunks, { type: "video/webm" }));
    };
  });
}

This processes and draws frames directly to the recording canvas instead of storing them, so memory usage stays constant. The trade-off is that MediaRecorder captures in real time: if a frame takes longer than 1/fps to process, the recorded clip comes out longer (and slower) than the source, so you may need to retime the output afterwards.

Now let's look at performance. WebGPU makes this usable. On my M1 MacBook Pro, frame processing dropped from 4.76 seconds to 851 milliseconds per frame when I switched from WebAssembly to WebGPU. For a 10-second clip at 30 fps, that's 300 frames: roughly 24 minutes of processing versus about 4.

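If you want to reproduce that kind of measurement on your own hardware, wrapping processFrame with performance.now() is enough. A quick sketch; note the first call is usually slower while the WebGPU pipeline warms up.

async function timeFrame(canvas: HTMLCanvasElement, text: string) {
  const start = performance.now();
  const result = await processFrame(canvas, text);
  console.log(`processFrame: ${(performance.now() - start).toFixed(0)} ms`);
  return result;
}
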
You should detect WebGPU support before trying to use it.

async function checkWebGPU() {
  if (!navigator.gpu) {
    throw new Error("WebGPU not supported. Use a compatible browser.");
  }
  
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No WebGPU adapter found.");
  }
  
  return true;
}

Call this before loading the model so users know upfront if it'll work.

The model download is about 176 MB. It caches after first load, but that initial download takes time. Show a progress indicator.

const model = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
  config: { model_type: "custom" },
  device: "webgpu",
  dtype: "fp32",
  progress_callback: (progress) => {
    if (progress.status === "progress") {
      const percent = (progress.loaded / progress.total) * 100;
      console.log(`Downloading model: ${percent.toFixed(1)}%`);
    }
  },
});

This callback fires during download so you can update your UI.

Adding it to a Vue component

Let's adapt this for Vue. We'll create a composable to manage the model state.

// useBackgroundRemoval.ts
import { ref } from 'vue';
import {
  AutoModel,
  AutoProcessor,
  type PreTrainedModel,
  type Processor,
} from '@huggingface/transformers';

export function useBackgroundRemoval() {
  const model = ref<PreTrainedModel | null>(null);
  const processor = ref<Processor | null>(null);
  const isLoading = ref(false);
  const progress = ref(0);
  
  async function loadModel() {
    isLoading.value = true;
    
    model.value = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
      config: { model_type: "custom" },
      device: "webgpu",
      dtype: "fp32",
      progress_callback: (p) => {
        if (p.status === "progress") {
          progress.value = (p.loaded / p.total) * 100;
        }
      },
    });
    
    processor.value = await AutoProcessor.from_pretrained("briaai/RMBG-1.4", {
      config: {
        do_normalize: true,
        image_mean: [0.5, 0.5, 0.5],
        image_std: [1, 1, 1],
        size: { width: 1024, height: 1024 },
      },
    });
    
    isLoading.value = false;
  }
  
  return { model, processor, isLoading, progress, loadModel };
}

This gives you reactive state for the model loading process.

Now use it in a component.

<script setup lang="ts">
import { ref, onMounted } from 'vue';
import { useBackgroundRemoval } from './useBackgroundRemoval';

const { model, processor, isLoading, progress, loadModel } = useBackgroundRemoval();
const videoFile = ref<File | null>(null);
const processedVideo = ref<Blob | null>(null);
const processedVideoUrl = ref<string | null>(null);
const isProcessing = ref(false);

onMounted(() => {
  loadModel();
});

async function handleFileUpload(event: Event) {
  const file = (event.target as HTMLInputElement).files?.[0];
  if (file) videoFile.value = file;
}

async function process() {
  if (!videoFile.value || !model.value || !processor.value) return;
  
  isProcessing.value = true;
  const url = URL.createObjectURL(videoFile.value);
  processedVideo.value = await processVideoStreaming(url, "IMAGINATION");
  // Build the object URL here rather than in the template, so it isn't recreated on every render.
  processedVideoUrl.value = URL.createObjectURL(processedVideo.value);
  isProcessing.value = false;
}
</script>

<template>
  <div>
    <div v-if="isLoading">
      Loading model: {{ progress.toFixed(1) }}%
    </div>
    
    <input 
      v-else
      type="file" 
      accept="video/*" 
      @change="handleFileUpload"
    />
    
    <button 
      v-if="videoFile && !isProcessing"
      @click="process"
    >
      Process Video
    </button>
    
    <div v-if="isProcessing">
      Processing...
    </div>
    
    <video
      v-if="processedVideoUrl"
      :src="processedVideoUrl"
      controls
    />
  </div>
</template>

The component manages UI state while the composable handles the AI model.

One more thing to consider: quality. The model works best with clear foreground/background separation. Videos with messy backgrounds or motion blur give mixed results. You might want to add a preview feature so users can see a single processed frame before committing to the full video.

async function generatePreview(videoUrl: string, text: string) {
  const video = document.createElement("video");
  video.src = videoUrl;
  video.muted = true;
  
  await new Promise(resolve => video.onloadedmetadata = resolve);
  
  video.currentTime = 0;
  await new Promise(resolve => video.onseeked = resolve);
  
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  ctx.drawImage(video, 0, 0);
  
  return await processFrame(canvas, text);
}

This processes just the first frame so users can see if the effect looks good before processing the entire video.

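Wiring the preview into a page is just a matter of putting the returned canvas somewhere visible. A minimal sketch, assuming your markup has a container with the id preview; the video URL is a placeholder.

const previewCanvas = await generatePreview("/clip.mp4", "IMAGINATION");
document.querySelector("#preview")?.replaceChildren(previewCanvas);
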
We've built a browser-based video editor that runs AI models locally. The key pieces are Transformers.js for model inference, WebGPU for performance, and canvas APIs for compositing layers. The streaming approach keeps memory usage reasonable, and the preview feature helps users avoid wasting time on videos that won't work well.

Full code examples

import {
  AutoModel,
  AutoProcessor,
  env,
  RawImage,
  PreTrainedModel,
  Processor,
} from "@huggingface/transformers";

env.allowLocalModels = false;

if (env.backends.onnx.wasm) {
  env.backends.onnx.wasm.proxy = true;
}

let model: PreTrainedModel | null = null;
let processor: Processor | null = null;

// Load model
async function loadModel() {
  model = await AutoModel.from_pretrained("briaai/RMBG-1.4", {
    config: { model_type: "custom" } as any,
    device: "webgpu",
    dtype: "fp32",
  });

  processor = await AutoProcessor.from_pretrained("briaai/RMBG-1.4", {
    config: {
      do_normalize: true,
      image_mean: [0.5, 0.5, 0.5],
      image_std: [1, 1, 1],
      size: { width: 1024, height: 1024 },
      rescale_factor: 0.00392156862745098,
    },
  });
}

// Remove background using AI
async function removeBackground(
  canvas: HTMLCanvasElement
): Promise<HTMLCanvasElement> {
  const image = RawImage.fromCanvas(canvas);
  const { pixel_values } = await processor!(image);
  const { output } = await model!({ input: pixel_values });
  
  const mask = await RawImage.fromTensor(
    output[0].mul(255).to("uint8")
  ).resize(image.width, image.height);
  
  const result = document.createElement("canvas");
  result.width = canvas.width;
  result.height = canvas.height;
  const ctx = result.getContext("2d", { willReadFrequently: true })!;
  
  ctx.drawImage(canvas, 0, 0);
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  for (let i = 0; i < mask.data.length; i++) {
    imageData.data[4 * i + 3] = mask.data[i];
  }
  
  ctx.putImageData(imageData, 0, 0);
  return result;
}

// Create text overlay
function createTextLayer(
  width: number,
  height: number,
  text: string
): HTMLCanvasElement {
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const ctx = canvas.getContext("2d")!;
  
  const fontSize = Math.min(width, height) / 4;
  ctx.font = `bold ${fontSize}px Arial`;
  ctx.fillStyle = "rgba(255, 255, 255, 0.9)";
  ctx.strokeStyle = "rgba(0, 0, 0, 0.5)";
  ctx.lineWidth = 4;
  ctx.textAlign = "center";
  ctx.textBaseline = "middle";
  
  ctx.strokeText(text, width / 2, height / 2);
  ctx.fillText(text, width / 2, height / 2);
  
  return canvas;
}

// Process single frame
async function processFrame(
  canvas: HTMLCanvasElement,
  text: string
): Promise<HTMLCanvasElement> {
  const ctx = canvas.getContext("2d", { willReadFrequently: true })!;
  const original = ctx.getImageData(0, 0, canvas.width, canvas.height);
  
  const foreground = await removeBackground(canvas);
  const textLayer = createTextLayer(canvas.width, canvas.height, text);
  
  const composite = document.createElement("canvas");
  composite.width = canvas.width;
  composite.height = canvas.height;
  const compositeCtx = composite.getContext("2d")!;
  
  compositeCtx.putImageData(original, 0, 0);
  compositeCtx.drawImage(textLayer, 0, 0);
  compositeCtx.drawImage(foreground, 0, 0);
  
  return composite;
}

// Process video with streaming approach
async function processVideoStreaming(
  videoUrl: string,
  text: string
): Promise<Blob> {
  const video = document.createElement("video");
  video.src = videoUrl;
  video.muted = true;
  
  await new Promise(resolve => video.onloadedmetadata = resolve);
  
  const fps = 30;
  const totalFrames = Math.floor(video.duration * fps);
  
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;
  
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, {
    mimeType: "video/webm;codecs=vp9",
  });
  
  const chunks: Blob[] = [];
  recorder.ondataavailable = e => {
    if (e.data.size > 0) chunks.push(e.data);
  };
  
  recorder.start();
  
  for (let i = 0; i < totalFrames; i++) {
    video.currentTime = i / fps;
    await new Promise(resolve => video.onseeked = resolve);
    
    const frameCanvas = document.createElement("canvas");
    frameCanvas.width = video.videoWidth;
    frameCanvas.height = video.videoHeight;
    const frameCtx = frameCanvas.getContext("2d", { willReadFrequently: true })!;
    frameCtx.drawImage(video, 0, 0);
    
    const processed = await processFrame(frameCanvas, text);
    ctx.drawImage(processed, 0, 0);
    
    await new Promise(resolve => setTimeout(resolve, 1000 / fps));
  }
  
  recorder.stop();
  
  return new Promise<Blob>(resolve => {
    recorder.onstop = () => {
      resolve(new Blob(chunks, { type: "video/webm" }));
    };
  });
}

// Generate preview of first frame
async function generatePreview(videoUrl: string, text: string) {
  const video = document.createElement("video");
  video.src = videoUrl;
  video.muted = true;
  
  await new Promise(resolve => video.onloadedmetadata = resolve);
  
  video.currentTime = 0;
  await new Promise(resolve => video.onseeked = resolve);
  
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d", { willReadFrequently: true })!;
  ctx.drawImage(video, 0, 0);
  
  return await processFrame(canvas, text);
}

export { loadModel, processVideoStreaming, generatePreview };