Aug 21, 2025 - 15 min read
Create a voice-assisted CLI tool that allows you to talk to a computer

Learn how to create an intelligent assistant à la Star Trek that listens to your voice commands and responds with natural speech using local Whisper.cpp transcription, Claude AI, and ElevenLabs text-to-speech.

Patrick the AI Engineer

You're cooking carbonara and your hands are covered in flour. You need to know when to add the egg yolks. What if you could just say "Computer, when do I add the eggs?" and get an answer right away?

We're building a voice assistant that listens continuously, responds to a hotword, and answers questions using natural speech. The transcription runs locally using Whisper.cpp—no cloud latency, no sending voice data off your machine. Claude generates responses and ElevenLabs converts them to speech. The interesting part is managing state so the assistant doesn't respond to itself and streaming responses so you hear answers as they're generated.

Let's start with project setup. Install the dependencies:

npm init -y
npm install @ai-sdk/anthropic ai dotenv elevenlabs play-sound
npm install --save-dev @types/node typescript jiti

Set "type": "module" in package.json and add a dev script:

{
  "type": "module",
  "scripts": {
    "dev": "npx jiti ./src/index.ts"
  }
}

Whisper.cpp needs to be compiled locally. The streaming example captures microphone audio through SDL2, so install SDL2 first (brew install sdl2 on macOS, apt install libsdl2-dev on Debian/Ubuntu), then clone and build with SDL2 enabled:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DWHISPER_SDL2=ON
cmake --build build --config Release
./models/download-ggml-model.sh base.en

The base English model is about 140MB and fast enough for real-time conversation. From your project root, copy the compiled binary and model into the project (this assumes the whisper.cpp clone sits next to your project directory):

mkdir -p whisper.cpp/models
cp ../whisper.cpp/build/bin/whisper-stream whisper.cpp/
cp ../whisper.cpp/models/ggml-base.en.bin whisper.cpp/models/

Create a .env file with your API keys:

ANTHROPIC_API_KEY=your_anthropic_key
ELEVENLABS_API_KEY=your_elevenlabs_key

Now we'll start Whisper and capture its output. Create src/index.ts:

import { spawn } from 'child_process';

const whisper = spawn("./whisper.cpp/whisper-stream", [
  "-m", "./whisper.cpp/models/ggml-base.en.bin",
  "-t", "6",
  "--step", "0",
  "-vth", "0.6"
]);

console.log("Listening...");

This spawns Whisper in streaming mode. Setting --step to 0 puts the stream example into sliding-window mode, so it waits for a pause in speech and emits a full transcription instead of printing partial updates. The -vth 0.6 flag sets the voice activity threshold that filters background noise; it works well in a quiet kitchen, but raise it to 0.7 or 0.8 if there's a stand mixer running.
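
If you want to tune the threshold per room without editing code, one option is to read it from an environment variable. The VAD_THRESHOLD name here is my own convention, not something Whisper defines:

import { spawn } from 'child_process';

// Falls back to 0.6 when unset; override with e.g. VAD_THRESHOLD=0.8 npm run dev
const vadThreshold = process.env.VAD_THRESHOLD ?? "0.6";

const whisper = spawn("./whisper.cpp/whisper-stream", [
  "-m", "./whisper.cpp/models/ggml-base.en.bin",
  "-t", "6",
  "--step", "0",
  "-vth", vadThreshold
]);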

Listen for transcription output:

import { spawn } from 'child_process';

const whisper = spawn("./whisper.cpp/whisper-stream", [
  "-m", "./whisper.cpp/models/ggml-base.en.bin",
  "-t", "6",
  "--step", "0",
  "-vth", "0.6"
]);

whisper.stdout.on("data", (data) => {
  const text = data.toString().trim();
  if (text && text.length > 2) {
    console.log("Heard:", text);
  }
});

Whisper streams transcribed text as it processes audio. We're just logging it for now. Run this with npm run dev and say something—you'll see it transcribed in real time.

Whisper outputs include timestamps and noise markers. Let's filter them:

whisper.stdout.on("data", (data) => {
  const text = data.toString().trim();
  
  if (!text || text.length < 2) return;
  if (text.includes("[BLANK_AUDIO]")) return;
  if (text.includes("[ Silence ]")) return;
  
  console.log("Heard:", text);
});

We need hotword detection. The assistant should only respond when you say "Computer":

const HOTWORD = "computer";
let isListening = false;

whisper.stdout.on("data", (data) => {
  const text = data.toString().trim().toLowerCase();
  
  if (!text || text.length < 2) return;
  if (text.includes("[BLANK_AUDIO]")) return;
  
  if (!isListening && text.includes(HOTWORD)) {
    isListening = true;
    console.log("Ready for command...");
    return;
  }
  
  if (isListening) {
    handleCommand(text);
    isListening = false;
  }
});

When the hotword is detected, we flip isListening to true. The next transcription gets passed to handleCommand. After processing, we reset the flag.
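
One catch with this state machine: in sliding-window mode Whisper often delivers "Computer, when do I add the eggs?" as a single transcription, so the handler above arms the hotword and throws away the question. A hedged variation treats anything spoken after the hotword in the same line as the command (handleCommand is the function we define further down):

whisper.stdout.on("data", (data) => {
  const text = data.toString().trim().toLowerCase();

  if (!text || text.length < 2) return;
  if (text.includes("[blank_audio]")) return;

  if (!isListening && text.includes(HOTWORD)) {
    // Anything spoken after the hotword in the same breath is already a command.
    const afterHotword = text.split(HOTWORD)[1]?.replace(/^[\s,.!?]+/, "") ?? "";

    if (afterHotword.length > 2) {
      handleCommand(afterHotword);
    } else {
      isListening = true;
      console.log("Ready for command...");
    }
    return;
  }

  if (isListening) {
    handleCommand(text);
    isListening = false;
  }
});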

Now we'll connect to Claude. Create src/response.ts:

import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

export async function generateResponse(userInput: string) {
  const result = await streamText({
    model: anthropic("claude-3-5-haiku-20241022"),
    messages: [
      {
        role: "system",
        content: "You are a helpful assistant. Keep responses under 3 sentences."
      },
      { role: "user", content: userInput }
    ]
  });
  
  return result.textStream;
}

Claude Haiku is fast and costs about $0.005 per request. The system prompt keeps answers brief—long responses take longer to synthesize.
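
If you want to sanity-check the stream before wiring up speech, you can consume it directly and print tokens as they arrive. This is a throwaway test script, not part of the final assistant:

import 'dotenv/config';
import { generateResponse } from './response.js';

async function main() {
  // Print Claude's answer token by token to get a feel for streaming latency.
  const stream = await generateResponse("When do I add the eggs to carbonara?");

  for await (const chunk of stream) {
    process.stdout.write(chunk);
  }
  process.stdout.write("\n");
}

main();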

Add text-to-speech:

import { ElevenLabsClient } from 'elevenlabs';

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY
});

async function speakText(text: string) {
  const audioStream = await client.textToSpeech.convertAsStream(
    "AbGwSoGa8Pj2vDIi0RvL",
    { text, optimize_streaming_latency: 3 }
  );
  
  return audioStream;
}

The optimize_streaming_latency parameter trades some audio quality for a faster time to first byte. For conversation, speed matters more than fidelity. Even so, expect synthesis to add roughly 300ms before audio starts playing.

Now connect the pieces. Stream Claude's response and synthesize as sentences complete:

export async function generateAndSpeak(userInput: string) {
  const textStream = await generateResponse(userInput);
  let buffer = "";
  
  for await (const chunk of textStream) {
    buffer += chunk;
    
    if (buffer.endsWith(".") || buffer.endsWith("!") || buffer.endsWith("?")) {
      const audio = await speakText(buffer);
      await playAudio(audio);
      buffer = "";
    }
  }
  
  if (buffer) {
    const audio = await speakText(buffer);
    await playAudio(audio);
  }
}

We accumulate text until we hit a sentence boundary, then immediately send it to ElevenLabs. This gets audio playing while Claude is still generating the rest.
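
One edge case: a chunk can contain a sentence boundary in the middle ("...the pasta water. Then add..."), and the endsWith check won't notice it until more text arrives. If that matters to you, a hedged refinement is to split the buffer on sentence-ending punctuation and flush every complete sentence, keeping the unfinished remainder:

// A variant of the loop body: flush every complete sentence in the buffer,
// even when the boundary lands in the middle of a chunk. The final flush in
// generateAndSpeak still handles whatever is left at the end of the stream.
async function flushCompleteSentences(buffer: string): Promise<string> {
  // Split on ., ! or ? followed by whitespace, keeping the punctuation.
  const parts = buffer.split(/(?<=[.!?])\s+/);

  // The last part may be an unfinished sentence; hand it back to the caller.
  const remainder = parts.pop() ?? "";

  for (const sentence of parts) {
    const audio = await speakText(sentence);
    await playAudio(audio);
  }

  return remainder;
}

Inside the for await loop you would then replace the endsWith check with buffer = await flushCompleteSentences(buffer);.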

Playing the audio is straightforward:

import Player from 'play-sound';
import { writeFileSync, unlink } from 'fs';

const player = Player();

async function playAudio(audioStream: AsyncIterable<Buffer>) {
  const chunks: Buffer[] = [];
  
  // Collect the MP3 stream into memory before handing it to the player.
  for await (const chunk of audioStream) {
    chunks.push(chunk);
  }
  
  const buffer = Buffer.concat(chunks);
  const tempFile = `/tmp/speech-${Date.now()}.mp3`;
  writeFileSync(tempFile, buffer);
  
  return new Promise<void>((resolve) => {
    player.play(tempFile, () => {
      // Clean up the temp file once playback finishes.
      unlink(tempFile, () => resolve());
    });
  });
}

We collect the audio chunks, write them to a temp file, play it, and delete the file once playback finishes. ElevenLabs streams MP3 data, and play-sound shells out to whatever audio player it finds on the system (afplay on macOS, mpg123 or similar on Linux).

Here's the biggest gotcha: the microphone picks up the speaker output. If Whisper is running while audio plays, it transcribes the assistant's own voice. Add a flag to prevent this:

let isListening = false;
let isProcessing = false;

whisper.stdout.on("data", (data) => {
  if (isProcessing) return;  // Skip while speaking
  
  const text = data.toString().trim().toLowerCase();
  
  if (!text || text.length < 2) return;
  if (text.includes("[BLANK_AUDIO]")) return;
  
  if (!isListening && text.includes(HOTWORD)) {
    isListening = true;
    return;
  }
  
  if (isListening) {
    handleCommand(text);
    isListening = false;
  }
});

async function handleCommand(text: string) {
  isProcessing = true;
  await generateAndSpeak(text);
  isProcessing = false;
}

The isProcessing flag blocks transcription while the assistant speaks. This prevents feedback loops.

A better approach is stopping Whisper entirely during playback. With only the flag, the Whisper process keeps capturing and buffering microphone audio while the assistant speaks, and those buffered transcriptions can surface as soon as the flag clears:

let whisper: ChildProcess | null = null;

function startWhisper() {
  whisper = spawn("./whisper.cpp/whisper-stream", [
    "-m", "./whisper.cpp/models/ggml-base.en.bin",
    "-t", "6",
    "--step", "0",
    "-vth", "0.6"
  ]);
  
  whisper.stdout?.on("data", handleTranscription);
}

function stopWhisper() {
  whisper?.kill();
  whisper = null;
}

Now stop Whisper before speaking and restart after:

async function handleCommand(text: string) {
  stopWhisper();
  await generateAndSpeak(text);
  await new Promise(r => setTimeout(r, 500));  // Brief pause
  startWhisper();
}

The 500ms pause prevents Whisper from picking up the tail end of audio playback.

Here's the complete flow in src/index.ts:

import 'dotenv/config';
import { spawn, ChildProcess } from 'child_process';
import { generateAndSpeak } from './response.js';

const HOTWORD = "computer";
let whisper: ChildProcess | null = null;
let isListening = false;

function startWhisper() {
  whisper = spawn("./whisper.cpp/whisper-stream", [
    "-m", "./whisper.cpp/models/ggml-base.en.bin",
    "-t", "6",
    "--step", "0",
    "-vth", "0.6"
  ]);
  
  whisper.stdout?.on("data", async (data) => {
    const text = data.toString().trim().toLowerCase();
    
    if (!text || text.length < 2) return;
    if (text.includes("[blank_audio]")) return;
    
    if (!isListening && text.includes(HOTWORD)) {
      isListening = true;
      console.log("Listening for command...");
      return;
    }
    
    if (isListening) {
      console.log("You:", text);
      isListening = false;
      stopWhisper();
      await generateAndSpeak(text);
      await new Promise(r => setTimeout(r, 500));
      startWhisper();
    }
  });
}

function stopWhisper() {
  whisper?.kill();
  whisper = null;
}

console.log("Starting voice assistant...");
startWhisper();

process.on("SIGINT", () => {
  stopWhisper();
  process.exit(0);
});

Run it with npm run dev. Say "Computer" and then ask a question. The assistant responds with voice and returns to listening mode.

The streaming approach gets audio playing in 1-2 seconds. Most of that is Claude generating the first sentence. Waiting for the complete response before synthesizing would add 5-10 seconds.

Each interaction costs about $0.01 with Claude Haiku and ElevenLabs. For occasional questions that's fine. If you're building something that runs continuously, you'll want usage tracking and error handling.
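
As a starting point, a minimal usage tracker could count requests and the characters sent to ElevenLabs, since ElevenLabs bills by character. This is a sketch with names I'm making up (recordInteraction, logUsage), not an SDK feature:

// src/usage.ts - a tiny in-memory usage tracker
interface UsageStats {
  requests: number;
  ttsCharacters: number;
}

const stats: UsageStats = { requests: 0, ttsCharacters: 0 };

// Call this once per assistant reply with the text sent to text-to-speech.
export function recordInteraction(ttsText: string) {
  stats.requests += 1;
  stats.ttsCharacters += ttsText.length;
}

// Call this on SIGINT (or on a timer) to see what a session cost you.
export function logUsage() {
  console.log(`Usage: ${stats.requests} requests, ${stats.ttsCharacters} TTS characters`);
}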

Here's the complete implementation:

import 'dotenv/config';
import { spawn, ChildProcess } from 'child_process';
import { generateAndSpeak } from './response.js';

const HOTWORD = "computer";
let whisper: ChildProcess | null = null;
let isListening = false;

function startWhisper() {
  const proc = spawn("./whisper.cpp/whisper-stream", [
    "-m", "./whisper.cpp/models/ggml-base.en.bin",
    "-t", "6",
    "--step", "0",
    "-vth", "0.6"
  ]);
  whisper = proc;
  
  proc.stdout?.on("data", async (data) => {
    const text = data.toString().trim().toLowerCase();
    
    if (!text || text.length < 2) return;
    if (text.includes("[blank_audio]")) return;
    if (text.includes("[ silence ]")) return;
    
    if (!isListening && text.includes(HOTWORD)) {
      isListening = true;
      console.log("Listening for command...");
      return;
    }
    
    if (isListening) {
      console.log("You:", text);
      isListening = false;
      stopWhisper();
      
      try {
        await generateAndSpeak(text);
      } catch (err) {
        console.error("Error:", err);
      }
      
      await new Promise(r => setTimeout(r, 500));
      startWhisper();
    }
  });
  
  whisper.on("exit", () => {
    whisper = null;
  });
}

function stopWhisper() {
  whisper?.kill();
  whisper = null;
}

console.log("Starting voice assistant...");
startWhisper();

process.on("SIGINT", () => {
  stopWhisper();
  process.exit(0);
});