Voice Selection & Cloning
UltiVoice supports two voice modes for each target language: preset voices from the built-in library, and voice cloning, which reproduces the original speaker’s timbre and style.
Preset voices
The built-in voice library provides ready-to-use voices for each supported language. Preset voices are:
- Available immediately — no reference audio needed.
- Faster to synthesise than cloned voices.
- A good default for content where speaker identity does not matter (tutorials, explainers, product demos).
To select a preset voice:
- Open the Voice panel in the Pipeline tab.
- Under Voice mode, select Preset.
- Choose a language variant and gender from the dropdown.
- Click Preview to hear a sample.
Voice cloning
Voice cloning uses F5-TTS-Vi technology to synthesise speech that sounds like the original speaker. This produces a more natural dub for content where speaker consistency matters, such as interviews, narration, and personal vlogs.
Prerequisites
- A reference audio clip of the speaker you want to clone: 5–30 seconds of clean speech, minimal background noise, single speaker only.
- The clip can be extracted from the source video or provided as a separate audio file (WAV, MP3, or M4A).
- Voice cloning is available on all paid license tiers.
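The duration requirement above can be checked before a clip is added as a reference. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed WAV files only. The function names and the `short`/`ok`/`long` labels are hypothetical and are not part of UltiVoice; only the 5 to 30 second range comes from this page.

```python
import wave

MIN_SECONDS = 5.0   # lower bound of the documented reference-clip range
MAX_SECONDS = 30.0  # upper bound of the documented reference-clip range


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_duration(seconds: float) -> str:
    """Label a duration relative to the 5-30 second range.

    The labels are illustrative only; they are not UltiVoice's
    quality indicator and say nothing about background noise.
    """
    if seconds < MIN_SECONDS:
        return "short"
    if seconds > MAX_SECONDS:
        return "long"
    return "ok"
```

For example, `classify_duration(clip_duration_seconds("reference.wav"))` labels a local file named `reference.wav` (a placeholder name). The check covers duration only; noise and speaker count still need to be judged by ear.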
Clone a voice
- In the Voice panel, under Voice mode, select Clone voice.
- Click Add reference clip and select a WAV, MP3, or M4A file (or choose a segment from the imported video using the timeline picker).
- UltiVoice analyses the clip and shows a quality indicator:
  - Good: sufficient duration, clean audio.
  - Short: clip is under 5 seconds; synthesis will work but quality may vary.
  - Noisy: background noise detected; consider using a cleaner clip.
- Click Test synthesis to hear a sample phrase in the cloned voice before running the full pipeline.
- Optionally give the cloned voice a name to save it for future projects.
- Proceed to run the dubbing pipeline.
Per-speaker voice assignment
If your video has multiple speakers (e.g. an interview with two participants), you can assign different voices to different speakers:
- After transcription, open the Transcript tab.
- Each segment has a Speaker field. Assign a speaker label (Speaker A, Speaker B, etc.) to each segment, or use the auto-detect button to let the app group segments by voice similarity.
- In the Voice panel, assign a preset or cloned voice to each speaker label.
Multi-speaker dubbing is available on Standard and Professional license tiers.
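Conceptually, per-speaker assignment is a lookup from speaker label to voice, applied to every transcript segment. The sketch below illustrates that idea in Python. The `Voice` and `Segment` types, the `voices_for_segments` function, the voice names, and the fallback to a default voice for unassigned labels are all assumptions made for illustration; they do not describe UltiVoice's internals or any public API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    mode: str  # "preset" or "clone"
    name: str  # hypothetical voice identifier


@dataclass(frozen=True)
class Segment:
    speaker: str  # speaker label, e.g. "Speaker A"
    text: str


def voices_for_segments(segments, assignments, default):
    """Pair each segment with the voice assigned to its speaker label.

    Labels missing from `assignments` fall back to `default`
    (an assumed behaviour, chosen to keep the sketch total).
    """
    return [(seg, assignments.get(seg.speaker, default)) for seg in segments]


# Example data: a two-person interview with one cloned and one preset voice.
transcript = [
    Segment("Speaker A", "Welcome to the show."),
    Segment("Speaker B", "Thanks for having me."),
]
assignments = {
    "Speaker A": Voice("clone", "host-reference"),
    "Speaker B": Voice("preset", "female-1"),
}
plan = voices_for_segments(transcript, assignments, Voice("preset", "default"))
```

Keeping the mapping keyed by label rather than by segment means that relabelling a segment in the transcript changes its voice without touching the assignments.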
Tips for better cloning quality
- Use a clip where the speaker is talking at their normal pace; avoid whispers, shouts, or singing.
- Remove any background music before using a clip as a reference.
- Longer reference clips (15–30 seconds) generally produce better results than clips close to the 5-second minimum.
- If synthesis sounds robotic, try a reference clip taken from a different part of the video.
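When trying reference clips from different parts of a recording outside the app, a segment can be cut from a longer WAV file with Python's standard `wave` module. This is a minimal sketch that assumes the audio has already been exported as uncompressed WAV; `extract_segment` and its parameters are hypothetical names, and the in-app timeline picker remains the documented way to choose a segment.

```python
import wave


def extract_segment(src_path: str, dst_path: str,
                    start_s: float, length_s: float) -> float:
    """Copy up to `length_s` seconds starting at `start_s` into a new WAV.

    Returns the number of seconds actually written, which is shorter
    than `length_s` if the source ends first.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        params = src.getparams()
        # Clamp the start so a start past the end yields an empty clip.
        start_frame = min(int(start_s * rate), src.getnframes())
        src.setpos(start_frame)
        frames = src.readframes(int(length_s * rate))

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # the header frame count is fixed on close

    return len(frames) / bytes_per_frame / rate
```

A call such as `extract_segment("full.wav", "ref.wav", start_s=42.0, length_s=20.0)` (placeholder file names) would produce a 20-second candidate, which sits inside the 15–30 second range suggested above.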