5 Tips for Better AI Dubbing Results with UltiVoice
Practical tips to get higher-quality dubbed video output from UltiVoice — audio preparation, voice cloning references, segment editing, subtitle timing, and export settings.
Getting great results from AI dubbing is as much about preparation as it is about the AI models. Here are five practical tips that consistently improve output quality.
1. Clean your source audio before importing
The single biggest factor in transcription accuracy is source audio quality. Whisper handles noisy audio reasonably well, but you will get noticeably fewer transcription errors — and therefore better translation and synthesis — if you reduce background noise first.
Quick workflow:
- Export just the audio track from your video using FFmpeg:
    ffmpeg -i input.mp4 -vn -ar 16000 audio.wav
- Run it through a noise-reduction tool (Audacity’s noise gate, or a dedicated denoise app).
- Import the clean audio as a separate track in UltiVoice using Settings → Source audio override.
This takes 5 extra minutes and can reduce transcription errors by 30–50% on difficult source material.
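If you prepare many videos, the extraction step above can be scripted. The following is a minimal sketch, not part of UltiVoice: the function names and the `audio_clean.wav` file name are hypothetical, and FFmpeg's `afftdn` denoise filter is swapped in for the Audacity step as an assumption about what counts as acceptable noise reduction for your material.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(video: str, wav_out: str, sample_rate: int = 16000) -> list[str]:
    """Same FFmpeg call as in the workflow: drop the video (-vn), resample (-ar)."""
    return ["ffmpeg", "-i", video, "-vn", "-ar", str(sample_rate), wav_out]


def build_denoise_cmd(wav_in: str, wav_out: str) -> list[str]:
    """Assumption: FFmpeg's afftdn filter (default settings) replaces the Audacity step."""
    return ["ffmpeg", "-i", wav_in, "-af", "afftdn", wav_out]


def prepare_audio(video: str, workdir: str = ".") -> Path:
    """Run both steps and return the path of the cleaned WAV file."""
    raw = Path(workdir) / "audio.wav"
    clean = Path(workdir) / "audio_clean.wav"  # hypothetical output name
    subprocess.run(build_extract_cmd(video, str(raw)), check=True)
    subprocess.run(build_denoise_cmd(str(raw), str(clean)), check=True)
    return clean
```

The cleaned file still has to be imported by hand through Settings → Source audio override. Listen to the result first, since aggressive denoising can damage speech as well as noise.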
2. Use a longer voice cloning reference
The minimum for voice cloning is 5 seconds, but 15–30 seconds produces noticeably more consistent output — especially on longer videos where the TTS engine needs to stay in character across hundreds of segments.
Pick a reference clip where the speaker is:
- Talking at their normal conversational pace (not presenting or shouting).
- In a quiet environment with no background music.
- Speaking a complete sentence — not mid-word or trailing off.
If the original video has background music behind the speech, run Demucs source separation first (Pipeline → Advanced → Separate background music), then extract the reference clip.
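Before importing a reference clip, it can help to check its length against the figures above. This is a small sketch using only Python's standard library. It assumes the clip is an uncompressed PCM WAV file, and the function names and category labels are illustrative rather than anything UltiVoice reports.

```python
import wave

MIN_SECONDS = 5.0           # minimum stated above
RECOMMENDED = (15.0, 30.0)  # range recommended above


def wav_duration(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference(seconds: float) -> str:
    """Classify a reference clip length against the guidance above."""
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds < RECOMMENDED[0]:
        return "usable, but longer is better"
    if seconds <= RECOMMENDED[1]:
        return "recommended"
    return "longer than the recommended range"
```

For example, `rate_reference(wav_duration("reference.wav"))` labels a clip before you load it. Length is only one criterion: pace, background noise, and complete sentences still need a listen.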
3. Review flagged segments first
After the pipeline runs, the Transcript tab highlights segments in yellow where the TTS output is significantly shorter or longer than the original audio slot. These timing mismatches are the segments most likely to sound unnatural in the final video.
Sort by the Timing column and fix the worst mismatches first. Usually this means editing the translated text to be closer in length to the original — a shorter sentence in the target language often works better than a longer, more literal translation.
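The idea of "worst mismatch first" can be expressed as a ratio between synthesized length and slot length. The sketch below is illustrative only: the `Segment` structure and the 15% tolerance are assumptions, and UltiVoice's actual flagging threshold is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    slot_seconds: float  # length of the original audio slot
    tts_seconds: float   # length of the synthesized audio

    @property
    def ratio(self) -> float:
        """Above 1.0 the TTS overruns the slot; below 1.0 it leaves a gap."""
        return self.tts_seconds / self.slot_seconds

    @property
    def mismatch(self) -> float:
        """Distance from a perfect fit; 0.0 means the TTS fills the slot exactly."""
        return abs(self.ratio - 1.0)


def worst_first(segments: list[Segment], tolerance: float = 0.15) -> list[Segment]:
    """Return segments outside the (assumed) tolerance, largest mismatch first."""
    flagged = [s for s in segments if s.mismatch > tolerance]
    return sorted(flagged, key=lambda s: s.mismatch, reverse=True)
```

A segment with a ratio well above 1.0 is a candidate for a shorter translation. One well below 1.0 may need a slightly fuller phrasing.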
4. Adjust subtitle timing after TTS, not before
It is tempting to edit subtitle timing in the Transcript tab right after transcription. Resist the urge — TTS synthesis shifts the actual audio timing. Do your timing passes after synthesis completes, using the Preview player to verify that subtitle display matches the dubbed audio, not the original.
The workflow:
- Run the full pipeline first.
- Open Preview → toggle to dubbed audio.
- Step through subtitles with arrow keys.
- Fix timing on any segments that are visually off.
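If you also keep subtitles outside the app, the same principle applies: write the file from the dubbed timings, not the original ones. Here is a minimal sketch that builds SRT text from cues timed against the dubbed audio. The cue tuples are a hypothetical input format, and whether UltiVoice exposes these timings for export is an assumption.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start, end, text) cues measured on the dubbed audio."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```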
5. Use “Copy original” video codec for fastest export
Unless you need to change resolution or frame rate, set the video codec in the Export tab to Copy original. This re-uses the original video bitstream without re-encoding — a 5-minute 1080p video exports in under 5 seconds instead of 3–5 minutes.
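You still get a newly encoded audio track (the dubbed audio); only the video frames are copied unchanged. Burnt-in subtitles are the exception: drawing text onto the picture requires re-encoding the video, so expect a full encode if you enable them.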
Use H.264 or H.265 encoding only when you need to:
- Downscale to a smaller resolution (720p for social media, 480p for messaging apps).
- Reduce file size for upload (streaming platforms have bitrate limits).
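- Change the container from MKV to MP4 when the source video codec is not supported in MP4 (a plain container change with a compatible codec does not need re-encoding).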
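To make the difference between copying and re-encoding concrete, here is a sketch of equivalent FFmpeg command lines built in Python. It is not what UltiVoice runs internally: the function name, the file names, and the choice of `libx264` and AAC are assumptions for illustration.

```python
from typing import Optional


def build_export_cmd(video: str, dubbed_audio: str, output: str,
                     reencode: bool = False,
                     height: Optional[int] = None) -> list[str]:
    """Build an FFmpeg command that pairs the original video with dubbed audio."""
    if height is not None and not reencode:
        # Scaling changes the pixels, so it cannot be combined with stream copy.
        raise ValueError("changing resolution requires reencode=True")
    cmd = ["ffmpeg", "-i", video, "-i", dubbed_audio,
           "-map", "0:v:0",   # video from the original file
           "-map", "1:a:0"]   # audio from the dubbed track
    if reencode:
        if height is not None:
            cmd += ["-vf", f"scale=-2:{height}"]  # keep aspect ratio, even width
        cmd += ["-c:v", "libx264"]
    else:
        cmd += ["-c:v", "copy"]  # reuse the original video bitstream
    cmd += ["-c:a", "aac", output]
    return cmd
```

With the defaults, only the audio is encoded, which is why the copy path finishes so much faster. Passing `reencode=True, height=720` produces the slower downscaling variant.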
These tips apply to any content type, but they matter most for interview-style videos with close-up talking heads — the format where voice naturalness is most scrutinised by viewers.
For the full workflow documentation, start with the Dubbing Workflow guide.