Voice AI Engineer — Build a Custom Conversational Voice Pipeline for AI Companion App
Upwork · US · Budget: not specified · Expert level
Automatic Speech Recognition · Python · Machine Learning · Text to Speech · Artificial Intelligence · Natural Language Processing · Deep Learning · Chatbot Development
About the Project
We're building a personal AI chief of staff / companion app for founders and executives.
The app already has a working voice system (Hume AI TTS + Deepgram/Whisper STT), but we need to move to a fully custom voice pipeline that feels like talking to a real person, not a text-to-speech engine.
This is a dedicated, focused engagement. You'll own the entire voice pipeline end to end.
What We Need Built:
Custom TTS Voice
- Fine-tune or train a custom voice model with a specific character: warm, confident, slightly playful, natural pacing with real pauses and breath
- We have a reference voice (Hume voice ID) as a starting point for the character we want
- Models we're open to: StyleTTS2, XTTS v2, Kokoro, Piper, or your recommendation based on quality vs. latency tradeoffs
- Emotional modulation: the voice should sound different based on mood parameters (concerned, excited, calm, pushing back). Not dramatic shifts, subtle ones
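To illustrate the kind of subtle, parameterized modulation we mean, here is a minimal sketch of a mood-to-prosody mapping. It assumes a hypothetical `synthesize` call with `speed`/`pitch`/`energy` knobs; the real interface depends on whichever model (StyleTTS2, XTTS v2, etc.) ends up being used.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    speed: float   # 1.0 = neutral pacing
    pitch: float   # semitone offset from the neutral voice
    energy: float  # relative loudness / intensity

# Subtle offsets only -- the character should stay recognizably the same voice.
MOOD_PRESETS = {
    "calm":         Prosody(speed=0.95, pitch=-0.5, energy=0.90),
    "concerned":    Prosody(speed=0.90, pitch=-1.0, energy=0.85),
    "excited":      Prosody(speed=1.08, pitch=+1.0, energy=1.10),
    "pushing_back": Prosody(speed=1.00, pitch=-0.5, energy=1.05),
}

def synthesize_with_mood(tts_model, text: str, mood: str = "calm") -> bytes:
    """Map a mood label to prosody settings before handing text to TTS.

    `tts_model.synthesize(...)` is a stand-in for whatever inference call
    the chosen model actually exposes, not a real library API.
    """
    p = MOOD_PRESETS.get(mood, MOOD_PRESETS["calm"])
    return tts_model.synthesize(text, speed=p.speed, pitch=p.pitch, energy=p.energy)
```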
Sub-800ms End-to-End Latency
- Voice in → transcription → LLM response (first token) → TTS out → audio playing
- Streaming architecture: TTS should start generating audio from the first sentence while the LLM is still producing the rest (see the sketch after this list)
- We're on Vercel (serverless) for the main app but open to a dedicated voice server (Fly.io, Railway, bare metal) if needed for latency
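The core of the latency budget is sentence-level chunking: flush text to TTS at the first sentence boundary instead of waiting for the full LLM response. A minimal asyncio sketch, with the Claude token stream and the TTS worker left as placeholder callables:

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable

SENTENCE_END = re.compile(r"[.!?…]\s")

async def stream_llm_to_tts(
    tokens: AsyncIterator[str],
    speak: Callable[[str], Awaitable[None]],
) -> None:
    """Buffer streamed LLM tokens and hand each complete sentence to TTS
    immediately, so audio for sentence 1 plays while sentences 2+ are still
    being generated. `tokens` and `speak` are placeholders for the real
    Claude streaming client and TTS worker."""
    buf = ""
    async for tok in tokens:
        buf += tok
        m = SENTENCE_END.search(buf)
        while m:
            sentence, buf = buf[: m.end()].strip(), buf[m.end():]
            await speak(sentence)          # first audio starts here, not at end of response
            m = SENTENCE_END.search(buf)
    if buf.strip():
        await speak(buf.strip())           # flush the trailing fragment
```

In production `speak` would enqueue sentences to a TTS worker and audio player rather than synthesize inline, so transcription, generation, and playback overlap.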
Interruption Handling
- User starts talking mid-response → AI stops gracefully
- No awkward overlap or repeated content
- VAD (Voice Activity Detection) integrated into the pipeline
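As a rough illustration of the barge-in behavior: run VAD over the mic stream while the AI is speaking and cancel playback once sustained speech is detected. A sketch using webrtcvad, with the mic capture and playback task as placeholders for the real audio plumbing:

```python
import asyncio
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

async def barge_in_monitor(mic_frames, playback_task: asyncio.Task,
                           frames_to_trigger: int = 5) -> None:
    """Cancel the AI's playback as soon as the user starts talking.

    `mic_frames` is assumed to be an async iterator of 20 ms, 16 kHz,
    16-bit mono PCM frames; requiring several consecutive speech frames
    (~100 ms) avoids tripping on coughs or background noise.
    """
    vad = webrtcvad.Vad(2)  # aggressiveness: 0 = permissive, 3 = strictest
    consecutive = 0
    async for frame in mic_frames:
        if len(frame) != FRAME_BYTES:
            continue
        if vad.is_speech(frame, SAMPLE_RATE):
            consecutive += 1
            if consecutive >= frames_to_trigger and not playback_task.done():
                playback_task.cancel()   # stop TTS audio cleanly, no overlap or repeats
                return
        else:
            consecutive = 0
```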
Wake Word Detection
- Custom wake word trigger phrase
- Background listening on desktop (Electron app) and mobile (React Native, future)
- Low CPU/battery footprint
- Porcupine or openWakeWord, or your recommendation
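For reference, Porcupine's detection loop is roughly "feed fixed-size PCM frames, get back a keyword index". A hedged sketch below; the access key, the .ppn keyword file for the custom phrase, and the mic capture are placeholders, and openWakeWord would slot into the same loop:

```python
import pvporcupine  # pip install pvporcupine

def listen_for_wake_word(mic_frames, on_wake) -> None:
    """Low-footprint background listening until the custom wake phrase fires.

    `mic_frames` is assumed to yield frames of `porcupine.frame_length`
    16-bit samples at `porcupine.sample_rate`; the keyword file for a custom
    phrase would come from Picovoice's training tooling.
    """
    porcupine = pvporcupine.create(
        access_key="YOUR_PICOVOICE_ACCESS_KEY",     # placeholder
        keyword_paths=["custom_wake_phrase.ppn"],   # placeholder keyword file
    )
    try:
        for pcm in mic_frames:
            if porcupine.process(pcm) >= 0:         # -1 = nothing, >= 0 = keyword index
                on_wake()
    finally:
        porcupine.delete()
```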
Integration Points
- The voice pipeline needs to plug into our existing Next.js / TypeScript app
- Current flow: user speaks → Whisper transcription → Claude API (streaming) → Hume TTS → audio playback
- New flow should replace Hume TTS (and potentially Whisper) with your custom pipeline
- WebSocket or WebRTC based, your call on architecture
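To make the integration shape concrete, here is a minimal WebSocket sketch of the new flow. FastAPI is only one possible host for the voice server, and the STT, Claude, and custom-TTS stages are stubbed out:

```python
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket, WebSocketDisconnect  # pip install fastapi uvicorn

app = FastAPI()

# --- Placeholder stages: swap in real STT, Claude streaming, and the custom TTS ---
async def transcribe(audio: bytes) -> str:
    return "(transcript)"                  # real impl: Whisper or its replacement

async def stream_llm(text: str) -> AsyncIterator[str]:
    yield f"Echo: {text}"                  # real impl: Claude streaming, chunked by sentence

async def synthesize_stream(sentence: str) -> AsyncIterator[bytes]:
    yield b"\x00" * 640                    # real impl: custom TTS emitting PCM frames

@app.websocket("/voice")
async def voice_session(ws: WebSocket) -> None:
    """One duplex voice session: binary mic audio in, binary TTS audio out."""
    await ws.accept()
    try:
        while True:
            audio_chunk = await ws.receive_bytes()             # mic audio from the Next.js client
            text = await transcribe(audio_chunk)
            async for sentence in stream_llm(text):
                async for pcm in synthesize_stream(sentence):
                    await ws.send_bytes(pcm)                   # audio flows before the full reply exists
    except WebSocketDisconnect:
        pass
```

The Next.js client would open this socket, stream mic audio up, and play the PCM frames as they arrive.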
What We're NOT Looking For
- Someone who just integrates existing TTS APIs (ElevenLabs, Play.ht, etc.). We need custom model work.
- A research project. We need a production pipeline that works reliably.
- Someone unfamiliar with real-time audio. If you haven't dealt with audio streaming, buffer management, and latency optimization before, this isn't the right fit.
Ideal Background
- You've fine-tuned or trained TTS models (StyleTTS2, XTTS, Tortoise, VALL-E, or similar)
- You've built real-time voice pipelines with sub-second latency
- You understand WebSocket/WebRTC audio streaming
- You've worked with VAD and/or wake word detection
- Bonus: experience with emotional/expressive TTS or prosody control
- Bonus: experience with conversational AI products (not just TTS in isolation)
Engagement Details
- Type: Contract, 20-40 hrs/week depending on your availability
- Duration: 8-12 weeks estimated, could extend
- Communication: Daily async updates, 2-3 sync calls per week
- Stack context: Next.js, TypeScript, Supabase, Anthropic Claude, currently Hume AI for TTS
To Apply
Please include:
1. A specific example of a TTS model you've fine-tuned or a real-time voice system you've built. Link to a demo if possible.
2. Your recommended architecture for hitting sub-800ms e2e latency. Even a few sentences showing your thinking.
3. Your availability and hourly rate.
Generic proposals will be ignored. We're looking for someone who gets excited about making an AI voice feel genuinely alive.
Client: Spent $734,808.62 · Rating: 4.9 · Verified