Generate text-to-speech audio

Generate AI speech from a script using a stock voice or your own voice clone. Narrate a video or build a full voiceover track without picking up a mic. You write your script in Write mode and assign a Speaker, then Descript generates the AI voiceover.

This article covers

Supported languages
Choose a voice generation model
Generate text-to-speech audio
Direct AI voice performance with tone tags
After you generate
Known limitations
FAQs and troubleshooting

Plan availability

On current plans, this feature uses AI Credits. Learn more about tracking your Media Minutes and AI Credits.

Legacy and Sunset plans track usage differently. See our Understanding your Legacy and Sunset plan guide for details.

Supported text-to-speech (TTS) languages

Supported languages
English (US)	Finnish	Portuguese (Portugal)	Slovak
Croatian	French (FR)	Romanian	Turkish
Czech	German	Malay	Danish
Hungarian	Polish	Spanish (US)	Dutch
Italian	Portuguese (Brazil)	Swedish

First, choose a voice generation model

Descript supports two AI speech models from ElevenLabs. Switch between them anytime in App Settings > AI models.

Multilingual v2 (default): Reliable, fast generation across all supported languages. The default model for new projects.
ElevenLabs v3: More natural and expressive AI speech, with support for tone tags to direct delivery (e.g. whisper, laugh, sigh). Slightly slower generation than Multilingual v2. Not available on Legacy plans.

All custom voice clones are affected by your selected model. Custom voices tend to sound more consistent and respond reliably to tone tags.

Stock voices are more expressive on the v3 model. They respond to tone tags more reliably and deliver more natural, animated speech. Because of these new capabilities, some voices may sound slightly different or vary in tone across generations.

The following stock voices are tuned for v3: Edward, Elizabeth, Grace, Joshua, Kyle, Libby, Michael, Owen, Ryan, Sarah, Simon, Ursula, Vernon.

Then, generate text-to-speech audio

The workflow differs slightly depending on whether your script has a single Speaker or multiple Speakers.

Single speaker

Click Add speaker at the top of your composition and select a Speaker.
Enter Write mode and type your script.
When finished, click Done writing. Descript automatically generates AI speech for the entire script using your assigned Speaker.

Multiple speakers

Enter Write mode before selecting a Speaker.
Write or paste your full script into the script panel. When you're done, click Done writing to exit Write mode.
Highlight a paragraph (or any portion of text), press the @ key, and assign a Speaker. Descript will generate TTS audio in that voice for the selection. Repeat for each section that needs a different Speaker.

Direct AI voice performance with tone tags

With the ElevenLabs v3 model, you can add tone tags to direct how the AI speaks — for example, telling it to whisper, laugh, sigh, or speak seriously. Tone tags appear in your script as gray text inside parentheses and are interpreted by the model when audio is generated.

Tone tags aren't available on non-TTS content, on AI speech that's been converted to audio, or when using the Multilingual v2 model.

Add a tone tag

Highlight a selection in your script.
Click the Tone button (wave icon) in the selection toolbar.
Pick a preset like serious, sigh, laugh, or long pause — or choose Custom to write your own.

The tag appears in your script as gray text inside parentheses, and the AI uses it to shape its delivery on your next generation. Custom tone tags can't be longer than 150 characters.
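
If you draft scripts outside Descript, you can check custom tags against the 150-character limit before pasting. The following is a minimal sketch, not a Descript tool. It assumes that every parenthesized span in your draft is a tone tag and that the limit counts the text inside the parentheses; neither assumption is confirmed by this article. The function name is hypothetical.

```python
import re

MAX_CUSTOM_TAG_LENGTH = 150  # limit stated in this article

# Assumption: any text inside a pair of parentheses is a tone tag.
TAG_PATTERN = re.compile(r"\(([^()]*)\)")


def find_overlong_tags(script: str, limit: int = MAX_CUSTOM_TAG_LENGTH) -> list[str]:
    """Return the text of each parenthesized tag longer than `limit`.

    Assumption: length is measured without the surrounding parentheses.
    """
    return [
        match.group(1)
        for match in TAG_PATTERN.finditer(script)
        if len(match.group(1)) > limit
    ]


if __name__ == "__main__":
    draft = "(whisper) Keep your voice down. (" + "x" * 151 + ") This tag is too long."
    print(len(find_overlong_tags(draft)))  # prints 1
```

Ordinary parenthetical asides in your draft will also be picked up, so treat the result as a list of spans to review.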

Other ways to add a tone tag

The Tone button is the easiest way, but you can also:

Type it manually. While in Write mode, place your cursor where you want the tag, type an opening parenthesis (, enter your tag, and close it with ).
Use the action bar. Open it with Cmd + K / Ctrl + K and search for Inline note.

Where tone tags appear

Surface	Visible?
Your script (in the editor)	Yes — gray text in parentheses
Generated audio	Interpreted by the model, not spoken aloud
Exported transcripts	Yes — included as text
Captions	No
Share page transcript	No

Square brackets no longer supported for tone

In earlier versions, you could type prompts directly into your script using square brackets — for example, [whispers] or [sigh]. That approach is no longer supported on v3.

If your script contains raw [...], Descript will block AI speech generation and prompt you to fix it. To migrate an existing script, replace any square-bracket prompts with parentheses, or use the Tone button to insert tags through the new flow.
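
For long scripts that you keep as plain text outside Descript, the bracket-to-parenthesis replacement can be sketched as a short script. This is an illustration only, not a Descript feature; the function name is hypothetical, and it assumes every square-bracket pair in the text is a tone prompt.

```python
import re

# Matches a square-bracket pair with no nested brackets inside, e.g. [whispers].
BRACKET_PROMPT = re.compile(r"\[([^\[\]]+)\]")


def migrate_bracket_prompts(script: str) -> str:
    """Rewrite [prompt] as (prompt), leaving all other text unchanged."""
    return BRACKET_PROMPT.sub(r"(\1)", script)


if __name__ == "__main__":
    old_script = "[whispers] Don't wake them. [sigh] Fine, I'll go."
    print(migrate_bracket_prompts(old_script))
    # prints: (whispers) Don't wake them. (sigh) Fine, I'll go.
```

Because the sketch converts every bracket pair, review the result before pasting it back into Write mode in case some brackets weren't meant as prompts.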

After you generate

AI-generated speech behaves differently from recorded audio. To make timeline edits, apply fades or crossfades, or get precise control over playback, you'll need to convert it to a standard audio layer first.

Use the Convert to audio option in the clip's context menu.

Known limitations

Generation speed. Voice generation with ElevenLabs v3 is slightly slower than Multilingual v2, especially for long paragraphs or when tone tags are included.
Tone and voice continuity. Output from ElevenLabs v3 may sound inconsistent across paragraphs, with shifts in tone, volume, accent, or even speaker identity. These variations affect both custom and stock voices and are more likely to occur in longer or more complex scripts.
Custom tone tags are experimental. Short, direct instructions like (annoyed) or (slowly) tend to work, but longer descriptive prompts may be spoken aloud instead of treated as direction. If a custom tag isn't working as expected, try shortening it.
Tone tag scope. A tone tag affects delivery in the paragraph where it's added. The model decides how long the effect lasts — Descript doesn't control which exact words a tag applies to. Use paragraph breaks to reset the delivery.
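
To see how paragraph breaks bound your tone tags, you can list the tags found in each paragraph of a plain-text draft. This is a rough sketch under two assumptions that this article doesn't confirm: paragraphs in your draft are separated by blank lines, and every parenthesized span is a tone tag. It only shows which paragraph each tag sits in; it can't predict how long the model holds a tone within that paragraph.

```python
import re

# Assumption: any text inside a pair of parentheses is a tone tag.
TAG_PATTERN = re.compile(r"\(([^()]*)\)")


def tags_by_paragraph(script: str) -> list[list[str]]:
    """Return one list of tag texts per non-empty paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", script) if p.strip()]
    return [TAG_PATTERN.findall(p) for p in paragraphs]


if __name__ == "__main__":
    draft = (
        "(whisper) Quiet now.\n\n"
        "Back to a normal voice.\n\n"
        "(laugh) That was funny. (serious) But listen."
    )
    print(tags_by_paragraph(draft))
    # prints: [['whisper'], [], ['laugh', 'serious']]
```

A paragraph with an empty list, like the second one here, carries no tone direction of its own.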

FAQs and troubleshooting

Audio isn't generating

Confirm the Speaker has a voice assigned. If not, assign one in the Speaker card.

Mispronounced words

AI Speakers may occasionally mispronounce words. See our pronunciation guide.

Black frames appear after generating TTS

TTS and Regenerate don't work over sequences. If you use either on a sequence, the video may be removed, leaving black frames.

Workaround:

Convert the AI voice clip into an audio layer.
Cut the AI audio clip from the script (Cmd + X / Ctrl + X).
Restore the original script track by expanding the clip in the timeline or using Undo.
Paste the AI audio as a layer above the original script track.
Split and mute the original script section using the Blade tool.

Unexpected background noise

Artifacts usually come from the original training audio. In that audio, try to avoid:

Static or sudden loud sounds
Background noise (traffic, appliances, music)
Excessive mouth noise or breathing