
How to Build a Video Scene Detector that Runs in Your Browser
Build a client-side scene detector using MediaBunny.js and statistical analysis. Process videos frame-by-frame without sending data to a server.
Patrick the AI Engineer
You're building a video editing app and need to help users find scene changes in hour-long videos. Sending everything to your server is expensive and slow. What if the browser could do the work instead?
We're going to build a scene detector that runs entirely client-side using MediaBunny.js, a TypeScript media toolkit built on the browser's WebCodecs API. The video never leaves the user's device, there's no upload time, and you don't pay for server compute.
The basic idea is simple: compare each frame to the previous one. When they're very different, you've found a scene change. Let's start by comparing two frames.
function calculateMAFD(frame1: ImageData, frame2: ImageData): number {
const data1 = frame1.data
const data2 = frame2.data
let diff = 0
for (let i = 0; i < data1.length; i += 4) {
const gray1 = (299 * data1[i] + 587 * data1[i + 1] + 114 * data1[i + 2]) / 1000
const gray2 = (299 * data2[i] + 587 * data2[i + 1] + 114 * data2[i + 2]) / 1000
diff += Math.abs(gray1 - gray2)
}
return diff / (frame1.width * frame1.height)
}
This loops over every pixel in RGBA order (four values per pixel), converts each pixel to grayscale using the standard luma weights (0.299, 0.587, 0.114, scaled by 1000 so the coefficients stay integers), and sums the absolute differences. The sum is normalized by the pixel count, giving a single number, the Mean Absolute Frame Difference (MAFD), that tells us how different the two frames are.
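A quick sanity check, assuming two same-sized canvases already hold consecutive frames (ctxA and ctxB are placeholder names for this example):

const frameA = ctxA.getImageData(0, 0, ctxA.canvas.width, ctxA.canvas.height)
const frameB = ctxB.getImageData(0, 0, ctxB.canvas.width, ctxB.canvas.height)

// Identical frames score 0; values grow with motion, on a 0-255 grayscale scale
console.log(calculateMAFD(frameA, frameA)) // 0
console.log(calculateMAFD(frameA, frameB)) // small for a static shot, large across a cut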
But here's the problem: what's a "big" difference? A video of someone talking has small frame-to-frame changes. An action movie has constant motion. If you use a fixed threshold, you'll miss scenes in one video and get false positives in the other.
The solution is to track recent differences and look for statistical outliers. Let's add mean and standard deviation calculations.
function calculateMean(values: number[]): number {
return values.reduce((sum, val) => sum + val, 0) / values.length
}
function calculateStdDev(values: number[], mean: number): number {
const sqDiffs = values.map(val => (val - mean) ** 2)
return Math.sqrt(sqDiffs.reduce((sum, val) => sum + val, 0) / values.length)
}
Nothing fancy here, just standard statistics. Now let's use these to detect scene changes.
We'll keep a sliding window of the last 60 frame differences (6 seconds at 10fps). When a new difference lands more than 3 standard deviations away from the window's mean, that's a scene change.
const WINDOW_SIZE = 60
const mafdValues: number[] = []
let previousFrame: ImageData | null = null
function processFrame(currentFrame: ImageData) {
if (!previousFrame) {
previousFrame = currentFrame
return false
}
const diff = calculateMAFD(previousFrame, currentFrame)
previousFrame = currentFrame
if (mafdValues.length < WINDOW_SIZE) {
mafdValues.push(diff)
return false
}
const mean = calculateMean(mafdValues)
const stdDev = calculateStdDev(mafdValues, mean)
const isSceneChange = stdDev > 0.1 && Math.abs(diff - mean) / stdDev > 3
if (isSceneChange) {
mafdValues.length = 0
} else {
mafdValues.push(diff)
mafdValues.shift()
}
return isSceneChange
}
We wait until the window holds 60 frame differences before detecting anything. This builds a baseline for what's "normal" in this video. The stdDev > 0.1 check prevents false positives in static videos where tiny changes look like outliers.
When we detect a scene change, we reset the window. The new scene might have completely different motion characteristics, so we need a fresh baseline.
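To make the outlier test concrete, here it is with made-up numbers (a window averaging 2.0 with a standard deviation of 0.4):

const mean = 2.0
const stdDev = 0.4

const quietDiff = 1.9 // |1.9 - 2.0| / 0.4 = 0.25 -> ordinary frame
const cutDiff = 4.0   // |4.0 - 2.0| / 0.4 = 5.0  -> well past the 3-sigma threshold

console.log(stdDev > 0.1 && Math.abs(quietDiff - mean) / stdDev > 3) // false
console.log(stdDev > 0.1 && Math.abs(cutDiff - mean) / stdDev > 3)   // true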
Now let's actually process a video. MediaBunny.js makes this straightforward.
import { Input, BlobSource, VideoSampleSink, ALL_FORMATS } from 'mediabunny'
async function detectScenes(videoFile: File) {
const input = new Input({
source: new BlobSource(videoFile),
formats: ALL_FORMATS
})
const videoTrack = await input.getPrimaryVideoTrack()
if (!videoTrack) throw new Error('No video track found')
const sink = new VideoSampleSink(videoTrack)
const duration = await input.computeDuration()
We load the video from a File object and get its video track. The VideoSampleSink lets us pull out individual frames at specific timestamps.
Let's set up a canvas for processing. Here's the key performance trick: we'll downscale frames to 160px wide.
const fps = 10
const totalFrames = Math.floor(duration * fps)
const scenes: { timestamp: number; image: string }[] = []
const canvas = document.createElement('canvas')
const ctx = canvas.getContext('2d')!
canvas.width = 160
canvas.height = Math.round((videoTrack.displayHeight * 160) / videoTrack.displayWidth)
A 160x90 frame has roughly 1% of the pixels of a 1920x1080 frame, so the per-frame pixel work drops by about two orders of magnitude. We're only looking for big visual changes, not fine details, so this works perfectly.
Now we loop through the video at 10fps and run our detection algorithm on each frame.
for (let i = 0; i < totalFrames; i++) {
const timestamp = i / fps
const sample = await sink.getSample(timestamp)
if (sample) {
sample.draw(ctx, 0, 0, canvas.width, canvas.height)
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
if (processFrame(imageData)) {
scenes.push({
timestamp,
image: canvas.toDataURL('image/jpeg')
})
}
sample.close()
}
}
return scenes
}
For each frame, we draw it to our small canvas, extract the pixel data, and check if it's a scene change. The sample.close() call is crucial—it frees the frame's memory so we don't leak.
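One defensive variation, shown as a sketch rather than a requirement: wrap the per-frame work in try/finally so the sample is released even if something inside the loop throws.

const sample = await sink.getSample(timestamp)
if (sample) {
  try {
    sample.draw(ctx, 0, 0, canvas.width, canvas.height)
    const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
    if (processFrame(imageData)) {
      scenes.push({ timestamp, image: canvas.toDataURL('image/jpeg') })
    }
  } finally {
    sample.close() // always free the frame buffer, even on an error
  }
}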
That's the vanilla TypeScript version. Now let's see how Vue makes this nicer to work with.
Adding Vue Reactivity
Vue gives us automatic UI updates as we process the video. Let's start with the reactive state.
<script setup lang="ts">
import { ref, computed } from 'vue'
import { Input, BlobSource, VideoSampleSink, ALL_FORMATS } from 'mediabunny'
const videoFile = ref<File | null>(null)
const frames = ref<{ src: string; timestamp: number }[]>([])
const status = ref<'idle' | 'loading' | 'detecting'>('idle')
const processedFrames = ref(0)
const totalFrames = ref(0)
As we process frames, processedFrames increments and the UI updates automatically. Let's use that for a progress button.
const buttonText = computed(() => {
if (status.value === 'loading') return 'Loading Video...'
if (status.value === 'detecting') {
const pct = Math.round((processedFrames.value / totalFrames.value) * 100)
return `Detecting... ${pct}%`
}
return 'Detect Scenes'
})
The button text updates reactively as we process the video. Now let's adapt our detection logic.
const detectScenes = async () => {
if (!videoFile.value || status.value !== 'idle') return
try {
status.value = 'loading'
frames.value = []
const input = new Input({
source: new BlobSource(videoFile.value),
formats: ALL_FORMATS
})
const videoTrack = await input.getPrimaryVideoTrack()
if (!videoTrack) return
status.value = 'detecting'
// ...
The core algorithm is the same, but we'll update processedFrames inside the loop to give users real-time feedback.
Here's a nice touch: we'll use two canvases—one small for processing, one full-size for thumbnails.
const processingCanvas = document.createElement('canvas')
const pCtx = processingCanvas.getContext('2d', { willReadFrequently: true })
const thumbnailCanvas = document.createElement('canvas')
const tCtx = thumbnailCanvas.getContext('2d')
processingCanvas.width = 160
thumbnailCanvas.width = videoTrack.displayWidth
This lets us do fast comparisons on small frames while storing nice-looking thumbnails for display. The canvas heights are set from the first decoded frame (as in the full code below), so both keep the video's aspect ratio.
Inside the frame loop, we update progress and push detected scenes reactively.
for (let i = 0; i < totalFrames.value; i++) {
processedFrames.value = i + 1
// ... get sample, process frame ...
if (isSceneChange) {
sample.draw(tCtx, 0, 0)
frames.value.push({
src: thumbnailCanvas.toDataURL('image/jpeg'),
timestamp
})
}
}
As we detect scenes, they appear in the UI immediately. Users don't have to wait for the entire video to finish processing.
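The display side is just a v-for over frames; here's a trimmed version of the template from the full code at the end of the post:

<div class="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 gap-4">
  <div v-for="(frame, index) in frames" :key="index">
    <img :src="frame.src" alt="Scene Frame" class="w-full rounded-lg">
    <UButton size="xs" @click="seekToTimestamp(frame.timestamp)">
      {{ frame.timestamp.toFixed(2) }}s
    </UButton>
  </div>
</div>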
Performance Notes
On my laptop, this processes about 30 frames per second. A 10-minute video takes roughly 3 minutes to analyze. That's acceptable for an in-browser tool.
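The arithmetic checks out: 10 minutes sampled at 10fps is 600 × 10 = 6,000 frames, and at roughly 30 frames per second of processing that's about 200 seconds, a little over 3 minutes.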
The downscaling is critical. I initially tried full-resolution frames and it was unusably slow—2-3 frames per second. Downscaling to 160px wide made it 10x faster with no real impact on accuracy.
Memory can be an issue with longer videos. If you detect 100 scenes, that's 100 JPEG thumbnails in RAM. Consider limiting display count or using lower quality JPEGs. The sample.close() call is essential—without it, you'll leak frame buffers and eventually crash the tab.
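Two easy levers, sketched here with a hypothetical MAX_SCENES constant: toDataURL takes an optional quality argument for image/jpeg, and you can simply stop storing thumbnails past a cap.

// Lower-quality JPEGs shrink each thumbnail; 0.7 is an arbitrary starting point
const src = thumbnailCanvas.toDataURL('image/jpeg', 0.7)

// Cap how many thumbnails are held in memory (MAX_SCENES is a made-up name)
const MAX_SCENES = 200
if (frames.value.length < MAX_SCENES) {
  frames.value.push({ src, timestamp })
}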
The 3 standard deviation threshold works well for most content, but you might want to make it adjustable. Action-heavy videos might need 4 or 5 to avoid false positives. A slider would let users tune it.
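A sketch of what that could look like in the Vue version, using a sceneChangeThreshold ref of my own naming:

const sceneChangeThreshold = ref(3) // default; action-heavy footage may want 4-5

// inside the detection check, replacing the hard-coded 3:
if (stdDev > 0.1 && Math.abs(diff - mean) / stdDev > sceneChangeThreshold.value)
  isSceneChange = true

Bind it to a range input (for example v-model.number="sceneChangeThreshold" on an <input type="range" min="2" max="6" step="0.5">) and users can tune sensitivity per video.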
One caveat: the algorithm needs 6 seconds to build its baseline. Early transitions won't be detected. If your video starts with a fast-cut montage, you'll miss those. You could reduce the window size, but that makes detection less stable.
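If you do want to trade stability for a faster warm-up, the window size is the knob; a sketch with numbers I picked for illustration:

const fps = 10
const baselineSeconds = 3 // half the original 6-second warm-up
const WINDOW_SIZE = baselineSeconds * fps // 30 differences instead of 60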
Wrapping Up
You've built a scene detector that runs entirely in the browser. Users can process hour-long videos on their own device, and you don't pay for compute or storage. The statistical approach adapts automatically to different video types—action movies and talking-head interviews both work without manual tuning.
Demo
(Interactive demo: drag & drop an .mp4 or .mov video, or click to select one, and detected scenes appear as clickable thumbnails.)
Full Code Examples
import { Input, BlobSource, VideoSampleSink, ALL_FORMATS } from 'mediabunny'
interface Scene {
timestamp: number
imageData: string
}
// Calculate Mean Absolute Frame Difference
function calculateMAFD(frame1: ImageData, frame2: ImageData): number {
const data1 = frame1.data
const data2 = frame2.data
let diff = 0
const len = data1.length
for (let i = 0; i < len; i += 4) {
// Convert to grayscale using integer math for performance
const gray1 = (299 * data1[i] + 587 * data1[i + 1] + 114 * data1[i + 2]) / 1000
const gray2 = (299 * data2[i] + 587 * data2[i + 1] + 114 * data2[i + 2]) / 1000
diff += Math.abs(gray1 - gray2)
}
return diff / (frame1.width * frame1.height)
}
function calculateMean(data: number[]): number {
if (data.length === 0) return 0
return data.reduce((sum, val) => sum + val, 0) / data.length
}
function calculateStdDev(data: number[], mean: number): number {
if (data.length === 0) return 0
const sqDiff = data.map(value => (value - mean) ** 2)
const avgSqDiff = calculateMean(sqDiff)
return Math.sqrt(avgSqDiff)
}
async function detectScenes(videoFile: File): Promise<Scene[]> {
const input = new Input({
source: new BlobSource(videoFile),
formats: ALL_FORMATS
})
const videoTrack = await input.getPrimaryVideoTrack()
if (!videoTrack) throw new Error('No video track found')
const sink = new VideoSampleSink(videoTrack)
const duration = await input.computeDuration()
const fps = 10
const totalFrames = Math.floor(duration * fps)
const WINDOW_SIZE = 60 // 6 seconds at 10fps
const DOWNSCALE_WIDTH = 160
const scenes: Scene[] = []
const mafdValues: number[] = []
let previousImageData: ImageData | null = null
let downscaleHeight = 0
// Create canvases
const processingCanvas = document.createElement('canvas')
const pCtx = processingCanvas.getContext('2d', { willReadFrequently: true })!
const thumbnailCanvas = document.createElement('canvas')
const tCtx = thumbnailCanvas.getContext('2d')!
for (let i = 0; i < totalFrames; i++) {
const timestamp = i / fps
const sample = await sink.getSample(timestamp)
if (sample) {
// Set canvas sizes on first frame
if (downscaleHeight === 0) {
downscaleHeight = Math.round(
(sample.displayHeight * DOWNSCALE_WIDTH) / sample.displayWidth
)
processingCanvas.width = DOWNSCALE_WIDTH
processingCanvas.height = downscaleHeight
thumbnailCanvas.width = sample.displayWidth
thumbnailCanvas.height = sample.displayHeight
}
// Draw to small canvas for processing
sample.draw(pCtx, 0, 0, processingCanvas.width, processingCanvas.height)
const currentImageData = pCtx.getImageData(
0, 0, processingCanvas.width, processingCanvas.height
)
// Always add first frame
if (i === 0) {
sample.draw(tCtx, 0, 0)
scenes.push({
imageData: thumbnailCanvas.toDataURL('image/jpeg'),
timestamp
})
}
if (previousImageData) {
const diff = calculateMAFD(previousImageData, currentImageData)
let isSceneChange = false
// Check for scene change if window is full
if (mafdValues.length >= WINDOW_SIZE) {
const mean = calculateMean(mafdValues)
const stdDev = calculateStdDev(mafdValues, mean)
// Detect statistical outliers
if (stdDev > 0.1 && Math.abs(diff - mean) / stdDev > 3) {
isSceneChange = true
}
}
if (isSceneChange) {
sample.draw(tCtx, 0, 0)
scenes.push({
imageData: thumbnailCanvas.toDataURL('image/jpeg'),
timestamp
})
mafdValues.length = 0 // Reset window
} else {
mafdValues.push(diff)
if (mafdValues.length > WINDOW_SIZE) {
mafdValues.shift()
}
}
}
previousImageData = currentImageData
sample.close()
}
}
return scenes
}
// Usage
const fileInput = document.querySelector('input[type="file"]') as HTMLInputElement
fileInput.addEventListener('change', async (e) => {
const file = (e.target as HTMLInputElement).files?.[0]
if (file) {
const scenes = await detectScenes(file)
console.log(`Detected ${scenes.length} scenes`)
}
})
<script setup lang="ts">
import { ref, computed, onUnmounted } from 'vue'
import { Input, BlobSource, VideoSampleSink, ALL_FORMATS } from 'mediabunny'
interface Frame {
src: string
timestamp: number
}
// Helper functions for statistical calculations
function calculateMean(data: number[]): number {
if (data.length === 0)
return 0
const sum = data.reduce((acc, value) => acc + value, 0)
return sum / data.length
}
function calculateStdDev(data: number[], mean: number): number {
if (data.length === 0)
return 0
const sqDiff = data.map(value => (value - mean) ** 2)
const avgSqDiff = calculateMean(sqDiff)
return Math.sqrt(avgSqDiff)
}
/**
* Calculates the Mean Absolute Frame Difference (MAFD) between two frames.
* This is a measure of how different two frames are.
*/
function calculateMAFD(frame1: ImageData, frame2: ImageData): number {
const data1 = frame1.data
const data2 = frame2.data
let diff = 0
const len = data1.length
for (let i = 0; i < len; i += 4) {
// Convert pixels to grayscale using integer arithmetic for performance.
// The coefficients are scaled by 1000 to avoid floating-point math.
const gray1 = (299 * data1[i] + 587 * data1[i + 1] + 114 * data1[i + 2]) / 1000
const gray2 = (299 * data2[i] + 587 * data2[i + 1] + 114 * data2[i + 2]) / 1000
diff += Math.abs(gray1 - gray2)
}
return diff / (frame1.width * frame1.height)
}
const videoFile = ref<File | null>(null)
const frames = ref<Frame[]>([])
const status = ref<'idle' | 'loading' | 'detecting'>('idle')
const fileInput = ref<HTMLInputElement | null>(null)
const processedFrames = ref(0)
const totalFrames = ref(0)
const videoUrl = ref<string | null>(null)
const videoPlayer = ref<HTMLVideoElement | null>(null)
const buttonText = computed(() => {
switch (status.value) {
case 'loading':
return 'Loading Video...'
case 'detecting':
if (totalFrames.value > 0) {
const percentage = Math.round(
(processedFrames.value / totalFrames.value) * 100
)
return `Detecting scenes... (${processedFrames.value}/${totalFrames.value}) ${percentage}%`
}
return 'Detecting Scenes...'
default:
return 'Detect Scenes'
}
})
const handleFileChange = (event: Event | DragEvent) => {
let file: File | null = null
if (event instanceof DragEvent && event.dataTransfer) {
file = event.dataTransfer.files[0] ?? null
} else if (event.target instanceof HTMLInputElement && event.target.files) {
file = event.target.files[0] ?? null
}
if (file) {
if (videoUrl.value) {
URL.revokeObjectURL(videoUrl.value)
}
videoFile.value = file
videoUrl.value = URL.createObjectURL(file)
frames.value = []
processedFrames.value = 0
totalFrames.value = 0
}
}
const openFilePicker = () => {
fileInput.value?.click()
}
onUnmounted(() => {
if (videoUrl.value) {
URL.revokeObjectURL(videoUrl.value)
}
})
const seekToTimestamp = (timestamp: number) => {
if (videoPlayer.value) {
videoPlayer.value.currentTime = timestamp
videoPlayer.value.play()
}
}
/**
* Detects scene changes in the video using a single-pass algorithm based on Mean Absolute Frame Difference (MAFD)
* and a dynamic threshold calculated from a sliding window of recent frame differences.
* This approach is inspired by the paper:
*
* Y. A. Salih, L. E. George, "Dynamic Scene Change Detection in Video Coding",
* International Journal of Engineering (IJE) TRANSACTIONS B: Applications Vol. 33, No. 5, (May 2020) 966-974
* https://www.researchgate.net/publication/341281580_Dynamic_Scene_Change_Detection_in_Video_Coding
*
* The algorithm works as follows:
* 1. Process the video frame by frame.
* 2. For each frame, calculate the MAFD from the previous frame.
* 3. Maintain a sliding window of the most recent MAFD values.
* 4. Calculate the mean and standard deviation of the values in the window.
* 5. If the current MAFD is a statistical outlier (e.g., > 3 standard deviations from the mean),
* it's considered a scene change.
* 6. The frame is immediately extracted and added to the UI for a progressive user experience.
* 7. After a scene change, the sliding window is cleared to adapt to the new scene's content.
*/
const extractFrames = async () => {
if (!videoFile.value || status.value !== 'idle') {
return
}
try {
status.value = 'loading'
frames.value = []
processedFrames.value = 0
totalFrames.value = 0
const input = new Input({
source: new BlobSource(videoFile.value),
formats: ALL_FORMATS
})
const videoTrack = await input.getPrimaryVideoTrack()
if (!videoTrack) {
console.error('No video track found')
return
}
status.value = 'detecting'
const sink = new VideoSampleSink(videoTrack)
const duration = await input.computeDuration()
const framesPerSecond = 10 // Sample 10 frames per second for more granular analysis
totalFrames.value = Math.floor(duration * framesPerSecond)
const SLIDING_WINDOW_SIZE = 60 // 6 seconds at 10fps
const DOWNSCALE_WIDTH = 160 // Downscale frames for faster processing
let downscaleHeight = 0
// --- Single Pass: Detect scenes and extract frames progressively ---
const mafdValues: number[] = [] // our sliding window
let previousImageData: ImageData | null = null
const processingCanvas = document.createElement('canvas')
const pCtx = processingCanvas.getContext('2d', { willReadFrequently: true })
if (!pCtx)
return
const thumbnailCanvas = document.createElement('canvas')
const tCtx = thumbnailCanvas.getContext('2d')
if (!tCtx)
return
for (let i = 0; i < totalFrames.value; i++) {
processedFrames.value = i + 1
const timestamp = i / framesPerSecond
const sample = await sink.getSample(timestamp)
if (sample) {
if (downscaleHeight === 0) {
if (sample.displayWidth > 0) {
downscaleHeight = Math.round(
(sample.displayHeight * DOWNSCALE_WIDTH) / sample.displayWidth
)
processingCanvas.width = DOWNSCALE_WIDTH
processingCanvas.height = downscaleHeight
thumbnailCanvas.width = sample.displayWidth
thumbnailCanvas.height = sample.displayHeight
}
else {
sample.close()
continue
}
}
// Draw the sample to the small canvas for processing, effectively downscaling it
sample.draw(pCtx, 0, 0, processingCanvas.width, processingCanvas.height)
const currentImageData = pCtx.getImageData(
0,
0,
processingCanvas.width,
processingCanvas.height
)
const createThumbnail = () => {
sample.draw(tCtx, 0, 0)
return thumbnailCanvas.toDataURL('image/jpeg')
}
if (i === 0) {
// Always add the first frame
frames.value.push({
src: createThumbnail(),
timestamp
})
}
if (previousImageData) {
const diff = calculateMAFD(previousImageData, currentImageData)
let isSceneChange = false
// Check for scene change if we have enough data in our sliding window
if (mafdValues.length >= SLIDING_WINDOW_SIZE) {
const mean = calculateMean(mafdValues)
const stdDev = calculateStdDev(mafdValues, mean)
const sceneChangeThreshold = 3 // 3 std devs
// A scene change is detected if the difference is a statistical outlier.
// We check if stdDev is large enough to avoid false positives in low-motion videos.
if (stdDev > 0.1 && Math.abs(diff - mean) / stdDev > sceneChangeThreshold)
isSceneChange = true
}
if (isSceneChange) {
frames.value.push({
src: createThumbnail(),
timestamp
})
// Reset the window after a scene change to adapt to the new scene's characteristics
mafdValues.length = 0
}
else {
mafdValues.push(diff)
if (mafdValues.length > SLIDING_WINDOW_SIZE)
mafdValues.shift()
}
}
previousImageData = currentImageData
sample.close()
}
}
}
catch (error) {
console.error('Error extracting frames:', error)
}
finally {
status.value = 'idle'
processedFrames.value = 0
totalFrames.value = 0
}
}
</script>
<template>
<UCard>
<template #header>
<h2 class="text-lg font-semibold">
Video Scene Detector
</h2>
</template>
<div class="space-y-4">
<div
v-if="videoUrl"
class="video-container"
>
<video
ref="videoPlayer"
:src="videoUrl"
controls
class="w-full rounded-lg"
/>
</div>
<div
class="border-2 border-dashed border-gray-300 dark:border-gray-700 rounded-lg p-8 text-center cursor-pointer hover:border-primary-500 transition-colors"
@click="openFilePicker"
@dragover.prevent
@drop.prevent="handleFileChange"
>
<input
ref="fileInput"
type="file"
accept="video/mp4,video/quicktime"
class="hidden"
@change="handleFileChange"
>
<div v-if="!videoFile">
<p>Drag & drop a video file here, or click to select a file.</p>
<p class="text-sm text-gray-500">
Supports .mp4 and .mov
</p>
</div>
<div v-else>
<p>Selected file: {{ videoFile.name }}</p>
</div>
</div>
<UButton
:disabled="!videoFile || status !== 'idle'"
:loading="status !== 'idle'"
@click="extractFrames"
>
{{ buttonText }}
</UButton>
</div>
<template
v-if="frames.length > 0"
#footer
>
<div>
<h3 class="text-md font-semibold mb-2">
Detected Scenes
</h3>
<div
class="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-4"
>
<div
v-for="(frame, index) in frames"
:key="index"
class="aspect-w-16 aspect-h-9 relative"
>
<img
:src="frame.src"
alt="Scene Frame"
class="object-cover w-full h-full rounded-lg"
>
<UButton
size="xs"
class="absolute bottom-1 left-1"
@click="seekToTimestamp(frame.timestamp)"
>
{{ frame.timestamp.toFixed(2) }}s
</UButton>
</div>
</div>
</div>
</template>
</UCard>
</template>
