Build Your Own AI Voice Assistant in Under 100 Lines of Code
Create a private, customizable voice assistant using Whisper and TTS models.
Unlock the Future of AI Interaction
Voice is the future of AI interaction. While companies like OpenAI offer powerful speech-to-speech APIs, you can build your own local voice assistant that runs entirely on your laptop — no cloud required, no privacy concerns, and complete customization control.
In this practical guide, I’ll walk you through creating a responsive voice assistant using OpenAI’s Whisper model for speech recognition and neural TTS models for natural-sounding responses — all running locally on CPU hardware.
Voice assistants are becoming an everyday part of how people interact with computers, with industry estimates putting the number of devices in use worldwide in the billions. Building your own gives you unparalleled control and privacy.
Why Build a Local Voice Assistant?
There are compelling reasons to build and run your voice assistant locally:
- Complete privacy — Your conversations never leave your device
- No usage limits or API costs — Run it as much as you want
- Customization freedom — Fine-tune models for your specific use cases
- Low latency — No network round-trips mean faster responses
- Offline capability — Works without internet connection
The Core Components
Our voice assistant requires four key components:
- Audio recording — Capture voice from your microphone
- Speech-to-text — Convert spoken words to text
- Text processing — Generate intelligent responses
- Text-to-speech — Convert text responses to natural speech
Let’s see how to implement each part with minimal code.
Recording Audio Input
First, we need to capture audio from the microphone. The sounddevice library makes this straightforward:
import sounddevice as sd
import numpy as np
import wave

def record_audio(duration=5, sample_rate=16000):
    """Record audio from microphone for specified duration."""
    print("Recording...")
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype=np.int16,
    )
    sd.wait()  # Wait until recording is finished
    print("Done recording")

    # Save to WAV file
    with wave.open("input.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(audio.tobytes())

    return "input.wav"

Converting Speech to Text with Whisper
OpenAI’s Whisper is an incredibly accurate speech recognition model that can run locally. We’ll use whisper.cpp, a lightweight C++ implementation optimized for CPU:
import subprocess

def transcribe_audio(audio_file, model_path="models/ggml-base.en.bin"):
    """Transcribe audio file using whisper.cpp."""
    try:
        result = subprocess.run(
            [
                "./whisper",  # Path to whisper.cpp binary
                "-m", model_path,
                "-f", audio_file,
                "-l", "en",
                "-otxt",
            ],
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()
    except Exception as e:
        print(f"Transcription error: {e}")
        return ""

The ggml-base.en.bin model provides a good balance of accuracy and speed, running smoothly even on modest hardware.
Generating Responses with a Local LLM
For intelligence, we’ll use a lightweight LLM that can run on CPU. Ollama makes this easy:
def generate_response(transcription, model="qwen:0.5b"):
    """Generate a response to the transcribed text using a local LLM."""
    try:
        prompt = f"Please respond to this concisely: {transcription}"
        result = subprocess.run(
            ["ollama", "run", model],
            input=prompt,
            text=True,
            capture_output=True,
            check=True,
        )
        return result.stdout.strip()
    except Exception as e:
        print(f"LLM error: {e}")
        return "Sorry, I couldn't process that."

The Qwen 0.5B model is particularly well-suited for this task, requiring only about 400MB of RAM while still providing decent responses.
Converting Text to Speech
Finally, we’ll convert our text response back to natural-sounding speech using NVIDIA’s NeMo toolkit:
import torch
import torchaudio
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load models once at startup
fastpitch = FastPitchModel.from_pretrained("tts_en_fastpitch")
hifigan = HifiGanModel.from_pretrained("tts_en_lj_hifigan")

def text_to_speech(text):
    """Convert text to speech using NeMo TTS models."""
    fastpitch.eval()
    parsed = fastpitch.parse(text)
    spectrogram = fastpitch.generate_spectrogram(tokens=parsed)

    hifigan.eval()
    audio = hifigan.convert_spectrogram_to_audio(spec=spectrogram)

    # Save to WAV file
    torchaudio.save("response.wav", audio.cpu(), 22050)
    return "response.wav"

Putting It All Together
Now let’s combine everything into a simple voice assistant:
def voice_assistant():
    """Run the complete voice assistant pipeline."""
    # Record audio
    audio_file = record_audio()

    # Transcribe to text
    transcription = transcribe_audio(audio_file)
    print(f"You said: {transcription}")

    # Generate response
    response = generate_response(transcription)
    print(f"Assistant: {response}")

    # Convert response to speech
    speech_file = text_to_speech(response)

    # Play the response (the `play` command ships with SoX)
    subprocess.run(["play", speech_file])

# Main loop
while True:
    print("Press Enter to ask a question (or 'q' to quit)")
    if input() == 'q':
        break
    voice_assistant()

Performance Optimization Tips
To improve responsiveness:
- Use quantized models — The GGML versions of Whisper run much faster
- Preload models — Keep models loaded in memory between queries
- Adjust audio settings — Shorter recording duration and proper silence detection (see the sketch after this list)
- Use a faster LLM — Try models like Phi-2 for better speed/quality balance
- Batch processing — Process audio chunks in parallel when possible
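As an example of the audio-settings tip above, here is a minimal sketch of energy-based silence detection built on the same sounddevice setup. The block size, the threshold of 500, and the record_until_silence name are assumptions you would tune for your own microphone:

import numpy as np
import sounddevice as sd

def record_until_silence(sample_rate=16000, block_seconds=0.5,
                         silence_threshold=500, max_silent_blocks=3,
                         max_blocks=20):
    """Record in short blocks and stop after a few quiet blocks in a row."""
    blocks, silent = [], 0
    for _ in range(max_blocks):
        block = sd.rec(int(block_seconds * sample_rate),
                       samplerate=sample_rate, channels=1, dtype=np.int16)
        sd.wait()
        blocks.append(block)
        # Mean absolute amplitude as a crude loudness measure
        if np.abs(block).mean() < silence_threshold:
            silent += 1
            if silent >= max_silent_blocks:
                break
        else:
            silent = 0
    return np.concatenate(blocks)

The returned array can be written to a WAV file exactly as in record_audio(), so the rest of the pipeline doesn't change.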
Customization Ideas
Once you have the basic system working, consider these enhancements:
- Custom wake word detection — Add a lightweight model to trigger the assistant
- Domain-specific fine-tuning — Train the LLM on your personal knowledge base
- Voice cloning — Use your own voice for the responses
- Multi-turn conversations — Maintain context between interactions (sketched after this list)
- Local knowledge base — Connect to your notes, documents, or other personal data
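For the multi-turn idea, the simplest approach is to fold recent exchanges back into the prompt. Here is a minimal sketch that reuses the same ollama run call as before; the ConversationMemory class and its five-turn window are my own assumptions, not part of the pipeline above:

import subprocess

class ConversationMemory:
    """Keep the last few exchanges and prepend them to each new prompt."""

    def __init__(self, max_turns=5):
        self.turns = []
        self.max_turns = max_turns

    def build_prompt(self, user_text):
        history = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
        return f"{history}\nUser: {user_text}\nAssistant:" if history else user_text

    def ask(self, user_text, model="qwen:0.5b"):
        prompt = self.build_prompt(user_text)
        result = subprocess.run(
            ["ollama", "run", model],
            input=prompt, text=True, capture_output=True, check=True,
        )
        reply = result.stdout.strip()
        self.turns.append((user_text, reply))
        self.turns = self.turns[-self.max_turns:]  # Drop the oldest turns
        return reply

Swapping generate_response() for memory.ask() in the main loop keeps recent context without touching the rest of the pipeline.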
Conclusion
Building a local voice assistant gives you control, privacy, and customization that cloud services can’t match. With the approach outlined here, you can create a surprisingly capable system that runs entirely on your laptop’s CPU.
The code samples in this article provide a starting point that you can extend and customize to suit your specific needs. As these models continue to improve, the gap between cloud and local AI assistants will continue to narrow.
Ready to get started? All the code from this article is available in my GitHub repository.
👋 Hey, I’m Dani García — Senior ML Engineer working across startups, academia, and consulting.
I write practical guides and build tools to help you get faster results in ML.
💡 If this post helped you, clap and subscribe so you don’t miss the next one.
🚀 Take the next step:
- 🎁 Free “ML Second Brain” Template: the Notion system I use to track experiments & ideas. Grab your free copy.
- 📬 Spanish Data Science Newsletter: weekly deep dives & tutorials in your inbox. Join here.
- 📘 Full-Stack ML Engineer Guide: learn to build real-world ML systems end-to-end. Get the guide.
- 🤝 Work with Me: need help with ML, automation, or AI strategy? Let’s talk.
- 🔗 Connect on LinkedIn: share ideas, collaborate, or just say hi. Connect.
