Real-Time Video Captioning in the Browser with Vision Language Models