Generate text-to-speech audio

Generate AI speech from a script using a stock voice or your own voice clone. Narrate a video or build a full voiceover track without picking up a mic. You write your script in Write mode and assign a Speaker, and Descript generates the AI voiceover.

Plan availability

On current plans, this feature uses AI Credits. Learn more about tracking your Media Minutes and AI Credits.

Legacy and Sunset plans track usage differently. See our Understanding your Legacy and Sunset plan guide for details.

Supported text-to-speech (TTS) languages

  • Croatian
  • Czech
  • Danish
  • Dutch
  • English (US)
  • Finnish
  • French (FR)
  • German
  • Hungarian
  • Italian
  • Malay
  • Polish
  • Portuguese (Brazil)
  • Portuguese (Portugal)
  • Romanian
  • Slovak
  • Spanish (US)
  • Swedish
  • Turkish
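
If you prepare scripts outside Descript, for example in a batch of translated drafts, a quick check against the language list can catch an unsupported language before you start writing. The sketch below is a minimal illustration only: the helper name is_supported and the matching rule (case-insensitive, exact name) are assumptions, and Descript does not provide this function.

```python
# Language names copied from the list in this article.
# The helper below is hypothetical and not part of Descript.
SUPPORTED_TTS_LANGUAGES = {
    "Croatian", "Czech", "Danish", "Dutch", "English (US)",
    "Finnish", "French (FR)", "German", "Hungarian", "Italian",
    "Malay", "Polish", "Portuguese (Brazil)", "Portuguese (Portugal)",
    "Romanian", "Slovak", "Spanish (US)", "Swedish", "Turkish",
}


def is_supported(language: str) -> bool:
    """Return True if the name matches a listed language, ignoring case and outer spaces."""
    wanted = language.strip().casefold()
    return any(wanted == name.casefold() for name in SUPPORTED_TTS_LANGUAGES)


print(is_supported("german"))    # True
print(is_supported("Japanese"))  # False
```

The check matches full names only, so "Portuguese" on its own does not match; use the exact variant, such as "Portuguese (Brazil)".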

First, choose a voice generation model

Descript supports two AI speech models from ElevenLabs. Switch between them anytime in App Settings > AI models.

  • Multilingual v2 (default): Reliable, fast generation across all supported languages. The default model for new projects.
  • ElevenLabs v3: More natural and expressive AI speech, with support for tone tags to direct delivery (e.g. whisper, laugh, sigh). Slightly slower generation than v2. Not available on legacy plans.
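
The choice between the two models comes down to two questions: whether you want tone tags, and whether your plan allows v3. The sketch below restates that reasoning as a small function. It is an illustration only: the function name pick_model and its parameters are hypothetical, and the actual setting is changed in App Settings > AI models.

```python
V2 = "Multilingual v2"
V3 = "ElevenLabs v3"


def pick_model(wants_tone_tags: bool, on_legacy_plan: bool) -> str:
    """Suggest a model name based on the rules described in this article."""
    if on_legacy_plan:
        # v3 is not available on legacy plans, so v2 is the only option.
        return V2
    if wants_tone_tags:
        # Tone tags (e.g. whisper, laugh, sigh) are described as a v3 feature.
        return V3
    # Otherwise stay on the default, which generates slightly faster.
    return V2


print(pick_model(wants_tone_tags=True, on_legacy_plan=False))  # ElevenLabs v3
print(pick_model(wants_tone_tags=True, on_legacy_plan=True))   # Multilingual v2
```

The sketch treats tone tags as the only reason to switch. You may also prefer v3 purely for its more expressive delivery, in which case the slightly slower generation is the trade-off to weigh.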

Your selected model applies to all of your custom voice clones, not just stock voices. Custom voices tend to sound more consistent across generations and respond reliably to tone tags, which require the v3 model.

Stock voices are more expressive on the v3 model. They respond to tone tags more reliably and deliver more natural, animated speech. Because of these new capabilities, some voices may sound slightly different or vary in tone across generations.

The following stock voices are tuned for v3: Edward, Elizabeth, Grace, Joshua, Kyle, Libby, Michael, Owen, Ryan, Sarah, Simon, Ursula, Vernon.

Then, generate text-to-speech audio

The workflow differs slightly between single-speaker and multi-speaker TTS generation.

Single speaker

  1. Click Add speaker at the top of your composition and select a Speaker.
  2. Enter Write mode and type your script.
  3. When finished, click Done writing. Descript automatically generates AI speech for the entire script using your assigned Speaker.

Multiple speakers

  1. Enter Write mode before selecting a Speaker.
  2. Write or paste your full script into the script panel. When you're done, click Done writing to exit Write mode.
  3. Highlight a paragraph (or any portion of text), press the @ key, and assign a Speaker. Descript will generate TTS audio in that voice for the selection. Repeat for each section that needs a different Speaker.

After you generate

AI-generated speech behaves differently from recorded audio. To make timeline edits, apply fades or crossfades, or get precise control over playback, first convert it to a standard audio layer using the Convert to audio option in the clip's context menu.

FAQs and troubleshooting

Audio isn't generating

Confirm the Speaker has a voice assigned. If not, assign one in the Speaker card.

Mispronounced words

AI Speakers may occasionally mispronounce words. See our pronunciation guide.

Black frames appear after generating TTS

TTS and Regenerate don't work over sequences. If you use either one on a sequence, the video may be removed, leaving black frames in its place.

Workaround:

  1. Convert the AI voice clip into an audio layer.
  2. Cut the AI audio clip from the script (Cmd + X / Ctrl + X).
  3. Restore the original script track by expanding the clip in the timeline or using Undo.
  4. Paste the AI audio as a layer above the original script track.
  5. Split and mute the original script section using the Blade tool.

Unexpected background noise

Artifacts in a voice clone usually come from the audio originally used to train it. When recording training audio, try to avoid:

  • Static or sudden loud sounds
  • Background noise (traffic, appliances, music)
  • Excessive mouth noise or breathing