Voice Selection & Cloning

import { Steps } from '@astrojs/starlight/components';

UltiVoice supports two voice modes for each target language: preset voices from the built-in library, and voice cloning, which reproduces the original speaker's timbre and style.

Preset voices

The built-in voice library provides ready-to-use voices for each supported language. Preset voices are:

Available immediately — no reference audio needed.
Faster to synthesise than cloned voices.
A good default for content where speaker identity does not matter (tutorials, explainers, product demos).

To select a preset voice:

Open the Voice panel in the Pipeline tab.
Under Voice mode, select Preset.
Choose a language variant and gender from the dropdown.
Click Preview to hear a sample.

Voice cloning

Voice cloning uses F5-TTS-Vi to synthesise speech that sounds like the original speaker. This produces a more natural dub for content where speaker consistency matters, such as interviews, narrations, and personal vlogs.

Prerequisites

A reference audio clip of the speaker you want to clone: 5–30 seconds of clean speech, minimal background noise, single speaker only.
The clip can be extracted from the source video or provided as a separate audio file (WAV, MP3, or M4A).
Voice cloning is available on all paid license tiers.
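If you prefer to prepare the reference clip outside the app, one option is to cut it from the source video with ffmpeg. The sketch below only builds the argument list and checks that the segment falls in the 5–30 second range from the prerequisites. The function name `buildExtractArgs` and the file names are hypothetical, and this is a manual alternative to the timeline picker, not a description of how UltiVoice extracts clips.

```typescript
// Hypothetical helper: builds ffmpeg arguments that cut an audio-only clip
// from a video. Assumes ffmpeg is installed; not part of UltiVoice itself.
const MIN_CLIP_SECONDS = 5;  // lower bound from the prerequisites
const MAX_CLIP_SECONDS = 30; // upper bound from the prerequisites

function buildExtractArgs(
  videoPath: string,
  startSeconds: number,
  durationSeconds: number,
  outputPath: string,
): string[] {
  if (startSeconds < 0) {
    throw new RangeError("startSeconds must be non-negative");
  }
  if (durationSeconds < MIN_CLIP_SECONDS || durationSeconds > MAX_CLIP_SECONDS) {
    throw new RangeError(
      `clip duration must be between ${MIN_CLIP_SECONDS} and ${MAX_CLIP_SECONDS} seconds`,
    );
  }
  return [
    "-ss", String(startSeconds),   // seek to the start of the segment
    "-t", String(durationSeconds), // keep only this many seconds
    "-i", videoPath,               // source video
    "-vn",                         // drop the video stream, keep audio
    outputPath,
  ];
}

// Example: a 20-second clip starting at 0:42.
console.log(["ffmpeg", ...buildExtractArgs("interview.mp4", 42, 20, "speaker-a.wav")].join(" "));
// prints: ffmpeg -ss 42 -t 20 -i interview.mp4 -vn speaker-a.wav
```

The resulting WAV file can then be added as a reference clip in the Voice panel.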

Clone a voice

In the Voice panel, under Voice mode, select Clone voice.
Click Add reference clip and select a WAV, MP3, or M4A file (or choose a segment from the imported video using the timeline picker).
UltiVoice analyses the clip and shows a quality indicator:
- Good — sufficient duration, clean audio.
- Short — clip is under 5 seconds; synthesis will work but quality may vary.
- Noisy — background noise detected; consider using a cleaner clip.
Click Test synthesis to hear a sample phrase in the cloned voice before running the full pipeline.
Optionally give the cloned voice a name to save it for future projects.
Proceed to run the dubbing pipeline.
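The quality indicator in step 3 can be pictured as a simple classification over the clip's measurements. The sketch below is illustrative only: the 5-second limit comes from this page, but the signal-to-noise metric, its 20 dB threshold, and the rule that Short is checked before Noisy are assumptions, not UltiVoice's documented behaviour.

```typescript
// Illustrative model of the Good / Short / Noisy indicator.
type ClipQuality = "Good" | "Short" | "Noisy";

interface ClipMeasurements {
  durationSeconds: number;
  signalToNoiseDb: number; // assumed noise metric; the app's real metric is not documented
}

const MIN_DURATION_SECONDS = 5; // from this page
const MIN_SNR_DB = 20;          // assumed threshold, for illustration only

function classifyClip(clip: ClipMeasurements): ClipQuality {
  // Assumption: a too-short clip is reported as Short even if it is also noisy.
  if (clip.durationSeconds < MIN_DURATION_SECONDS) return "Short";
  if (clip.signalToNoiseDb < MIN_SNR_DB) return "Noisy";
  return "Good";
}

console.log(classifyClip({ durationSeconds: 3, signalToNoiseDb: 35 }));  // prints: Short
console.log(classifyClip({ durationSeconds: 12, signalToNoiseDb: 10 })); // prints: Noisy
console.log(classifyClip({ durationSeconds: 20, signalToNoiseDb: 35 })); // prints: Good
```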

Per-speaker voice assignment

If your video has multiple speakers (e.g. an interview with two participants), you can assign different voices to different speakers:

After transcription, open the Transcript tab.
Each segment has a Speaker field. Assign a speaker label (Speaker A, Speaker B, etc.) to each segment, or use the auto-detect button to let the app group segments by voice similarity.
In the Voice panel, assign a preset or cloned voice to each speaker label.

Multi-speaker dubbing is available on Standard and Professional license tiers.
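Conceptually, per-speaker assignment is a lookup from speaker label to voice, applied to every transcript segment. The sketch below models that idea; the type names, the `resolveVoices` function, the voice ID, and the fallback voice for unlabelled speakers are all hypothetical and do not reflect UltiVoice's internal data model.

```typescript
// Hypothetical model of per-speaker voice assignment.
type VoiceChoice =
  | { mode: "preset"; voiceId: string }       // a voice from the built-in library
  | { mode: "clone"; referenceClip: string }; // a voice cloned from a reference clip

interface Segment {
  text: string;
  speaker: string; // label such as "Speaker A"
}

// Returns one voice per segment, in segment order.
// Assumption: segments whose label has no assignment use the fallback voice.
function resolveVoices(
  segments: Segment[],
  assignments: Map<string, VoiceChoice>,
  fallback: VoiceChoice,
): VoiceChoice[] {
  return segments.map((segment) => assignments.get(segment.speaker) ?? fallback);
}

const assignments = new Map<string, VoiceChoice>([
  ["Speaker A", { mode: "clone", referenceClip: "host.wav" }],
  ["Speaker B", { mode: "preset", voiceId: "preset-voice-1" }], // placeholder ID
]);

const segments: Segment[] = [
  { text: "Welcome to the show.", speaker: "Speaker A" },
  { text: "Thanks for having me.", speaker: "Speaker B" },
];

const voices = resolveVoices(segments, assignments, { mode: "preset", voiceId: "preset-voice-1" });
console.log(voices.map((voice) => voice.mode).join(", ")); // prints: clone, preset
```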

Tips for better cloning quality

Use a clip where the speaker is talking at their normal pace and volume; avoid whispering, shouting, or singing.
Remove any background music before using a clip as a reference.
Longer reference clips (15–30 seconds) generally produce better results than clips close to the 5-second minimum.
If synthesis sounds robotic, try a reference clip from a different part of the video.