Voice Selection & Cloning

UltiVoice supports two voice modes for each target language: preset voices from the built-in library, and voice cloning, which reproduces the original speaker’s timbre and style.

The built-in voice library provides ready-to-use voices for each supported language. Preset voices are:

  • Available immediately — no reference audio needed.
  • Faster to synthesise than cloned voices.
  • A good default for content where speaker identity does not matter (tutorials, explainers, product demos).

To select a preset voice:

  1. Open the Voice panel in the Pipeline tab.
  2. Under Voice mode, select Preset.
  3. Choose a language variant and gender from the dropdown.
  4. Click Preview to hear a sample.

Voice cloning uses F5-TTS-Vi technology to synthesise speech that sounds like the original speaker. This produces a more natural dub for content where speaker consistency matters, such as interviews, narration, and personal vlogs.

Voice cloning requires the following:

  • A reference audio clip of the speaker you want to clone: 5–30 seconds of clean speech, minimal background noise, single speaker only.
  • The clip can be extracted from the source video or provided as a separate WAV, MP3, or M4A file.
  • Voice cloning is available on all paid license tiers.
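
If you prefer to prepare the reference clip outside the app, the sketch below shows one way to cut a clip from the source video with the ffmpeg command-line tool. It is not part of UltiVoice: the function names, file names, and the choice of WAV output are assumptions for illustration, and it assumes ffmpeg is installed and on your PATH. The 5–30 second limits come from the requirements above.

```python
import subprocess

# Duration limits taken from the reference clip requirements above.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def build_extract_command(video, start, duration, out_wav):
    """Build an ffmpeg command that cuts an audio-only clip from a video.

    Hypothetical helper, not an UltiVoice API.
    """
    if start < 0:
        raise ValueError("start must not be negative")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError("duration must be between 5 and 30 seconds")
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-ss", f"{start:.3f}",     # seek to the start position in the input
        "-t", f"{duration:.3f}",   # read this many seconds from the input
        "-i", str(video),
        "-vn",                     # drop the video stream, keep audio only
        str(out_wav),
    ]


def extract_reference_clip(video, start, duration, out_wav):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_extract_command(video, start, duration, out_wav), check=True)
```

For example, `extract_reference_clip("interview.mp4", 12.5, 20, "speaker_ref.wav")` would write a 20-second clip starting 12.5 seconds into the video. The resulting file can then be added as a reference clip in the Voice panel.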

To clone a voice:

  1. In the Voice panel, under Voice mode, select Clone voice.

  2. Click Add reference clip and select a WAV, MP3, or M4A file (or choose a segment from the imported video using the timeline picker).

  3. UltiVoice analyses the clip and shows a quality indicator:

    • Good — sufficient duration, clean audio.
    • Short — clip is under 5 seconds; synthesis will work but quality may vary.
    • Noisy — background noise detected; consider using a cleaner clip.
  4. Click Test synthesis to hear a sample phrase in the cloned voice before running the full pipeline.

  5. Optionally give the cloned voice a name to save it for future projects.

  6. Proceed to run the dubbing pipeline.
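
To pre-check a clip before importing it, the quality indicator in step 3 can be approximated as follows. This is a rough sketch, not UltiVoice's actual analysis: the function names are hypothetical, noise detection is passed in as a flag rather than implemented, and reporting Noisy ahead of Short is an assumption. Only the 5-second threshold comes from the text above.

```python
import wave

# Threshold from the "Short" indicator described above.
MIN_SECONDS = 5.0


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def quality_label(duration_seconds, noise_detected):
    """Map clip properties to one of the three indicator labels.

    Assumption: noise is reported before short duration. The app may
    order or combine these checks differently.
    """
    if noise_detected:
        return "Noisy"
    if duration_seconds < MIN_SECONDS:
        return "Short"
    return "Good"
```

A call such as `quality_label(wav_duration_seconds("speaker_ref.wav"), noise_detected=False)` gives a quick indication of whether a clip is long enough. Note that the standard-library `wave` module reads uncompressed WAV files only, so MP3 and M4A clips would need a different reader.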

If your video has multiple speakers (e.g. an interview with two participants), you can assign different voices to different speakers:

  1. After transcription, open the Transcript tab.
  2. Each segment has a Speaker field. Assign a speaker label (Speaker A, Speaker B, etc.) to each segment, or use the auto-detect button to let the app group segments by voice similarity.
  3. In the Voice panel, assign a preset or cloned voice to each speaker label.

Multi-speaker dubbing is available on Standard and Professional license tiers.
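
Conceptually, the assignment above is a lookup from speaker label to voice that is applied to every transcript segment. The sketch below models that idea only. The `Voice` and `Segment` types, the field names, and the fallback to a default voice for unassigned labels are all assumptions for illustration, not UltiVoice's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    mode: str  # "preset" or "clone" (hypothetical values)
    name: str


@dataclass(frozen=True)
class Segment:
    text: str
    speaker: str  # label such as "Speaker A"


def voices_for_segments(segments, speaker_voices, default_voice):
    """Pick a voice for each segment from its speaker label.

    Segments whose label has no assigned voice fall back to default_voice
    (an assumption; the app may handle unassigned speakers differently).
    """
    return [speaker_voices.get(segment.speaker, default_voice) for segment in segments]


# Example: a two-person interview with one cloned and one preset voice.
host = Voice(mode="clone", name="Host clone")
guest = Voice(mode="preset", name="Preset female")
segments = [
    Segment("Welcome to the show.", "Speaker A"),
    Segment("Thanks for having me.", "Speaker B"),
    Segment("Let's get started.", "Speaker A"),
]
assigned = voices_for_segments(segments, {"Speaker A": host, "Speaker B": guest}, guest)
```

Here `assigned` holds one voice per segment in transcript order, so segments with the same speaker label always receive the same voice.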

For the best cloning results:

  • Use a clip where the speaker is talking at their normal pace. Avoid whispers, shouts, or singing.
  • Remove any background music before using a clip as a reference.
  • Longer reference clips (15–30 seconds) generally produce better results than the minimum 5 seconds.
  • If synthesis sounds robotic, try a reference clip from a different part of the video.