Voice Tracking Teleprompter

A voice tracking teleprompter listens to what you’re saying and matches the script to your speech — you can speed up, slow down, pause, and repeat lines, and the text follows you. It’s the feature that lets you talk like a human instead of reading like a robot, and it’s the single biggest upgrade over the timed-scrolling teleprompters most people start with. This post covers why timed scrolling fails on real shoots, what voice tracking actually does differently, and why on-device tracking is the only version that genuinely works.

Why timed scrolling fails

Timed scrolling is the original teleprompter behaviour: the text rolls at a fixed speed and you have to keep up. It works for newsreaders reading polished scripts at a known pace, because that’s exactly the job broadcast teleprompters were designed for around 1950. It does not work for modern solo creators shooting scripted YouTube content.

The fundamental problem is rhythm. You don’t talk at a fixed speed. You pause for emphasis on the important lines. You slow down when you’re explaining something complex. You speed up through familiar phrases. You sometimes need to redo a line if you fluffed a word. Timed scrolling treats all of these as failures: the text keeps moving whether or not you’re keeping pace with it, so either you race to keep up (and sound rushed) or you fall behind (and lose your place).

The result with timed scrolling is delivery that sounds mechanical. The pace is the prompter’s, not yours. Audiences pick this up instantly even if they can’t name what’s wrong.

What voice tracking does differently

Voice tracking listens to your speech in real time and matches what you’re saying against the script. The scroll position updates based on where the match lands — so the text follows your speech instead of forcing your speech to follow the text.
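
To make the matching idea concrete, here is a minimal Python sketch. It is not Steady Cue’s actual engine. It assumes the speech recogniser hands over plain text, compares the last three words heard against a small window of the script around the current position, and moves the position only on an exact match. The class name ScriptTracker, the window sizes, and the three-word match length are all illustrative assumptions; a real matcher would also need to tolerate misrecognised words.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ScriptTracker:
    """Hypothetical tracker: follows the speaker through a script."""

    def __init__(self, script: str, look_ahead: int = 8,
                 look_behind: int = 8, n: int = 3):
        self.words = normalize(script)
        self.position = 0            # index of the next word we expect to hear
        self.look_ahead = look_ahead
        self.look_behind = look_behind
        self.n = n                   # how many trailing heard words must match

    def update(self, heard: str) -> int:
        """Move to just after the nearest match of the last n heard words."""
        tail = normalize(heard)[-self.n:]
        if len(tail) < self.n:
            return self.position     # too little speech: hold the script
        lo = max(0, self.position - self.look_behind)
        hi = min(len(self.words) - self.n, self.position + self.look_ahead)
        matches = [i for i in range(lo, hi + 1)
                   if self.words[i:i + self.n] == tail]
        if matches:
            # Prefer the match that moves the scroll position the least.
            best = min(matches, key=lambda i: abs(i + self.n - self.position))
            self.position = best + self.n
        return self.position


tracker = ScriptTracker(
    "Timed scrolling keeps moving. Voice tracking follows your speech "
    "instead of forcing you to keep up."
)
print(tracker.update("timed scrolling keeps"))        # 3
print(tracker.update("um"))                           # 3 (no match, script holds)
print(tracker.update("voice tracking follows your"))  # 8
print(tracker.update("voice tracking follows"))       # 7 (speaker backtracked)
```

In this sketch a filler word or a pause leaves the position untouched, and repeating a phrase pulls the position back to where that phrase ends in the script.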

The behavioural difference on a shoot day is enormous:

You can pause for emphasis. The script holds. When you resume, tracking picks up cleanly.

You can speed up through familiar passages or slow down on complex ones. The text moves at the pace you’re actually talking.

You can repeat a line, restart a sentence, or backtrack to redo a fluffed word. The matching engine handles small drift gracefully.

You can stop mid-take to sip water, glance at notes, or reset for a better take. The script doesn’t run off without you.

Suddenly you’re delivering content the way you’d naturally talk, not racing a moving page.

Cloud vs on-device voice tracking — the part that matters

There are two ways voice tracking can run, and the difference between them is bigger than most people realise.

Cloud-based voice tracking sends your microphone audio to a remote server (Google Cloud Speech-to-Text, Amazon Transcribe, or the browser’s built-in Web Speech API, which in Chrome sends audio to Google’s servers by default). The server processes the audio and returns text. Latency is typically 200–500ms per word, the session times out after about a minute of silence, and the whole thing stops working if your internet has a wobble.

On-device voice tracking runs the speech recognition locally on your phone or tablet. Apple’s SpeechAnalyzer on iOS 26+, Apple’s older SFSpeechRecognizer (when the app requires on-device recognition), and Vosk on Android all process audio without sending anything to a remote server. Latency is typically 50–100ms (two to ten times faster than cloud), it works offline, and properly implemented on-device tracking handles pauses without the session timing out.
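
To put those latency figures in terms a presenter would feel, the short Python sketch below converts recognition delay into how many words the script lags behind your voice. The 150 words-per-minute speaking rate is an assumption for illustration, and the function name lag_in_words is hypothetical; the millisecond ranges are the typical figures quoted above.

```python
def lag_in_words(latency_ms: int, speaking_wpm: int = 150) -> float:
    """Words spoken while one recognition result is still in flight."""
    return speaking_wpm * latency_ms / 60_000


# Latency ranges from the text; 150 wpm is an assumed speaking rate.
for label, low_ms, high_ms in [("cloud", 200, 500), ("on-device", 50, 100)]:
    low, high = lag_in_words(low_ms), lag_in_words(high_ms)
    print(f"{label}: {low:.3f} to {high:.3f} words behind")

# cloud: 0.500 to 1.250 words behind
# on-device: 0.125 to 0.250 words behind
```

Under that assumed rate, cloud tracking can trail by more than a full word, while on-device tracking stays within a quarter of a word.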

For real shoot-day conditions — Wi-Fi sometimes drops, you take long pauses between takes, you want consistent latency — on-device tracking is the only version that consistently works. Cloud-based tracking looks fine in demos and falls apart in real production.

Why on-device tracking is harder to build well

If on-device tracking is so obviously better, why don’t all teleprompter apps use it? Because it’s significantly harder to build well.

Cloud speech engines (Google, AWS, Apple’s server-side) handle the speech-recognition complexity for you: the app developer just calls an API. On-device engines require the app developer to handle the entire pipeline themselves: continuous audio capture, streaming recognition, partial-result handling, silence detection, session resumption after pauses, and the fuzzy text matching that maps recognised speech back to the script position.
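
As a rough illustration of just one of those stages, here is a minimal Python sketch of silence detection with resumption after a pause. It assumes audio arrives as frames of 16-bit PCM samples and uses a simple loudness threshold; the class name PauseDetector, the threshold of 500, and the three-frame hold are all assumptions that would need tuning per microphone. A production pipeline would read frames from the platform’s audio API and would likely use something more robust than raw loudness.

```python
import math
from typing import Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class PauseDetector:
    """Hypothetical detector: pauses after a run of quiet frames."""

    def __init__(self, threshold: float = 500.0, hold_frames: int = 3):
        self.threshold = threshold      # assumed loudness cut-off
        self.hold_frames = hold_frames  # quiet frames needed before pausing
        self.quiet_run = 0
        self.paused = False

    def feed(self, frame: Sequence[int]) -> str:
        """Return 'speech', 'quiet', 'paused', or 'resumed' for this frame."""
        if rms(frame) >= self.threshold:
            self.quiet_run = 0
            if self.paused:
                self.paused = False
                return "resumed"        # pick tracking back up where it stopped
            return "speech"
        self.quiet_run += 1
        if self.quiet_run >= self.hold_frames:
            self.paused = True
            return "paused"             # hold the script; do not end the session
        return "quiet"


loud = [2000, -2000] * 80   # synthetic "speech" frame
soft = [50, -50] * 80       # synthetic "silence" frame
detector = PauseDetector()
events = [detector.feed(f) for f in [loud, soft, soft, soft, soft, loud, loud]]
print(events)
# ['speech', 'quiet', 'quiet', 'paused', 'paused', 'resumed', 'speech']
```

The point of the design is that a pause is a state the session sits in, not an error that ends it, so a long gap between takes costs nothing.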

A teleprompter app team that wants to ship quickly defaults to the cloud API. A team that wants the product to work on a real shoot day puts in the months of engineering to do it on-device. Most apps in the category took the quicker path.

What we built and why

Steady Cue’s voice tracking is built natively, on-device, by working presenters who needed it for their own shoots. We’re Andy and Josh, Fiverr Pro top-rated sellers with 600+ and 1.8k five-star reviews between us. We coach other spokespeople up to that level through our presenting academy.

We tried every teleprompter on the market for our own paid client work, and every cloud-based voice tracking implementation failed us on real shoots — long pauses dropping the session, internet wobbles breaking tracking mid-take, latency that made us read out of sync. So we built the on-device version ourselves.

In our testing, it’s the smoothest voice tracking we’ve used.

Steady Cue’s voice tracking runs natively on your device. Try it for free at steadycue.com.