Generate text-to-speech audio

This article shows you how to generate text-to-speech (TTS) audio using an AI Speaker in Descript. You'll write your script in Write mode, assign a speaker, and Descript will create AI-generated voiceover.


Usage note

On current plans, this feature uses AI Credits. Learn more about tracking your Media Minutes and AI Credits.

Legacy and Sunset plans track usage differently. See our Understanding your Legacy and Sunset plan guide for details.

Available text-to-speech languages

Supported languages: Croatian, Czech, Danish, Dutch, English (US), Finnish, French (FR), German, Hungarian, Italian, Malay, Polish, Portuguese (Brazil), Portuguese (Portugal), Romanian, Slovak, Spanish (US), Swedish, and Turkish.

Before you start

  • You must be connected to the internet.
  • You’ll need to assign a voice—use an AI stock voice or create/select a custom AI Speaker.
  • If you're creating a custom AI Speaker, the consent statement must be recorded in English, even for non-English output.
  • For best performance, keep each paragraph under 1800 characters. Learn more.
  • (Optional) Select your preferred AI speech generation model in App settings. Some models support inline prompts to control voice behavior and add some non-speech audio.

How to generate text-to-speech

  1. Open or create a project.
  2. Write your script in Write mode. Start typing on any blank line or click Start writing. You’ll see a blue border and the label “Write mode.”
    Write mode in Descript

  3. Click Done writing to exit Write mode. You can return to it later to make changes or regenerate the audio.
  4. Click Add speaker and select a stock voice or AI Speaker.
    Adding a speaker in Descript

  5. Descript will generate the voiceover. A loading icon appears next to your text, followed by a green check mark once complete.
    Voice generation complete in Descript

Generate speech for selected text

Want to generate voice for only part of your script?

Highlight the specific text, press the @ key, and assign a speaker. Descript will generate voiceover only for the selected portion.

Generating speech for part of a script
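For example, to voice only the closing line of your script, highlight that sentence, press @, and choose a voice. Descript generates audio for the highlighted sentence and leaves the rest of the script untouched.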

Voice behavior with the ElevenLabs v3 (alpha) model

The ElevenLabs v3 model offers more natural and expressive AI speech, along with support for inline prompts to adjust tone, delivery, and non-verbal audio.

What does (alpha) mean?

The ElevenLabs v3 model is currently in alpha, which means it’s still being tested and improved. You may notice occasional issues like inconsistent tone, unexpected artifacts (such as music or background noise), slower generation speed, or inline prompts that don’t always behave as expected.

If you’ve selected the ElevenLabs v3 model in your App settings (not available to legacy users), here’s what to expect:

Modify output with inline prompts and tags

When using the ElevenLabs v3 model, you can add square bracket prompts—like [whispers], [laughs], or [sarcastic]—to adjust tone, pacing, emotion, or add sound effects to your AI-generated voice.

Common tags include:

  • Voice/emotion: [whispers], [laughs], [curious], [sighs], [excited], [mischievously]
  • Sound effects: [clapping], [swallows], [gulps], [explosion]

Note: Tag behavior can vary by voice, and not all tags will work with every speaker. Results may also vary depending on the length and complexity of your script.
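For example, a script line combining tags from the list above might read:

[excited] We just hit one million downloads! [laughs] I still can’t believe it. [whispers] And we’re already planning the next one.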

For full details about the ElevenLabs v3 model, see this guide from ElevenLabs.

Inline prompts will appear in your transcript. To remove them without affecting your audio, first convert your AI-generated speech to a standard audio layer, then correct your transcript to remove the bracketed prompt.
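For example, if your transcript reads “[whispers] Can you keep a secret?”, convert the generated speech to a standard audio layer first; you can then correct the transcript to delete “[whispers]” without changing the audio.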

Generation speed

Voice generation with ElevenLabs v3 is slightly slower than Multilingual v2, especially for long paragraphs or when prompts are included.

Tone and voice continuity

You may notice inconsistencies in how the output from this model sounds across paragraphs. This can include shifts in tone, volume, accent, or even speaker identity. These variations affect both custom and stock voices and are more likely to occur in longer or more complex scripts.

Custom AI Speakers vs stock voices

All custom voices are affected by your selected model. With v3, they tend to sound more consistent and respond reliably to inline prompt instructions.

With the v3 model, stock voices have become more expressive — they now respond to emotion tags and can deliver more natural, animated speech. Because of these new expressive capabilities, some voices may sound slightly different or vary in tone across generations.

The following stock voices support v3’s expressive features:
Edward, Elizabeth, Grace, Joshua, Kyle, Libby, Michael, Owen, Ryan, Sarah, Simon, Ursula, Vernon