Generate AI speech from a script using a stock voice or your own voice clone. Narrate a video or build a full voiceover track without picking up a mic. You'll create your script in Write mode, assign a Speaker, and Descript will generate an AI voiceover.
This article covers
- Supported languages
- Choose a voice generation model
- Generate text-to-speech audio
- Direct AI voice performance with tone tags
- After you generate
- Known limitations
- FAQs and troubleshooting
Plan availability
On current plans, this feature uses AI Credits. Learn more about tracking your Media Minutes and AI Credits.
Legacy and Sunset plans track usage differently. See our Understanding your Legacy and Sunset plan guide for details.
Supported text-to-speech (TTS) languages
| Supported languages | |||
|---|---|---|---|
| English (US) | Finnish | Portuguese (Portugal) | Slovak |
| Croatian | French (FR) | Romanian | Turkish |
| Czech | German | Malay | Danish |
| Hungarian | Polish | Spanish (US) | Dutch |
| Italian | Portuguese (Brazil) | Swedish | |
First, choose a voice generation model
Descript supports two AI speech models from ElevenLabs. Switch between them anytime in App Settings > AI models.
- Multilingual v2 (default): Reliable, fast generation across all supported languages. The default model for new projects.
- ElevenLabs v3: More natural and expressive AI speech, with support for tone tags to direct delivery (e.g. whisper, laugh, sigh). Slightly slower generation than v2. Not available on legacy plans.
All custom voice clones are affected by your selected model. Custom voices tend to sound more consistent and respond reliably to tone tags.
Stock voices are more expressive on the v3 model. They respond to tone tags more reliably and deliver more natural, animated speech. Because of these new capabilities, some voices may sound slightly different or vary in tone across generations.
The following stock voices are tuned for v3: Edward, Elizabeth, Grace, Joshua, Kyle, Libby, Michael, Owen, Ryan, Sarah, Simon, Ursula, Vernon.
Then, generate text-to-speech audio
The workflow differs slightly depending on whether your script has a single Speaker or multiple Speakers.
Single speaker
- Click Add speaker at the top of your composition and select a Speaker.
- Enter Write mode and type your script.
- When finished, click Done writing. Descript automatically generates AI speech for the entire script using your assigned Speaker.
Multiple speakers
- Enter Write mode before selecting a Speaker.
- Write or paste your full script into the script panel. When you're done, click Done writing to exit Write mode.
- Highlight a paragraph (or any portion of text), press the @ key, and assign a Speaker. Descript will generate TTS audio in that voice for the selection. Repeat for each section that needs a different Speaker.
Direct AI voice performance with tone tags
With the ElevenLabs v3 model, you can add tone tags to direct how the AI speaks — for example, telling it to whisper, laugh, sigh, or speak seriously. Tone tags appear in your script as gray text inside parentheses and are interpreted by the model when audio is generated.
Tone input isn't available on non-TTS content, on AI speech that's been converted to audio, or when using the v2 model.
Add a tone tag
- Highlight a selection in your script.
- Click the Tone button (wave icon) in the selection toolbar.
- Pick a preset like serious, sigh, laugh, or long pause — or choose Custom to write your own.
The tag appears in your script as gray text inside parentheses, and the AI uses it to shape its delivery on your next generation. Custom tone tags can't be longer than 150 characters.
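If you draft scripts outside Descript, a quick check like the one below can flag custom tags that exceed the 150-character limit before you paste them in. This is a minimal sketch, not a Descript feature: the function name is hypothetical, and it assumes that every parenthesized span in the text is a tone tag and that the limit counts only the text inside the parentheses.

```python
import re

# Assumption: a tone tag is any text wrapped in one pair of parentheses.
TAG_PATTERN = re.compile(r"\(([^()]*)\)")
MAX_TAG_LENGTH = 150  # limit stated in this article for custom tone tags


def find_overlong_tags(script: str, limit: int = MAX_TAG_LENGTH) -> list[str]:
    """Return the inner text of every parenthesized tag longer than `limit`."""
    return [tag for tag in TAG_PATTERN.findall(script) if len(tag) > limit]


if __name__ == "__main__":
    draft = "(whisper) Keep your voice down. (" + "very " * 40 + "slowly) Okay."
    for tag in find_overlong_tags(draft):
        print(f"Tag is {len(tag)} characters long; shorten it: {tag[:30]}...")
```

Ordinary parenthetical asides in your script will also match the pattern, so treat the results as a prompt to review rather than a definitive list.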
Other ways to add a tone tag
The Tone button is the easiest way, but you can also:
- Type it manually. While in Write mode, place your cursor where you want the tag, type an opening parenthesis (, enter your tag, and close it with ).
- Use the action bar. Open with command/ctrl + K and search for Inline note.
Where tone tags appear
| Surface | Visible? |
|---|---|
| Your script (in the editor) | Yes — gray text in parentheses |
| Generated audio | Interpreted by the model, not spoken aloud |
| Exported transcripts | Yes — included as text |
| Captions | No |
| Share page transcript | No |
Square brackets no longer supported for tone
In earlier versions, you could type prompts directly into your script using square brackets — for example, [whispers] or [sigh]. That approach is no longer supported on v3.
If your script contains raw [...], Descript will block AI speech generation and prompt you to fix it. To migrate an existing script, replace any square-bracket prompts with parentheses, or use the Tone button to insert tags through the new flow.
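For a long script kept as plain text outside Descript, the bracket-to-parenthesis replacement can be done in bulk before you paste the script back in. The sketch below is an illustration under assumptions: the function name is hypothetical, it treats every non-empty [...] span as a tone prompt, and it has no connection to any Descript API. Whether pasted parentheses are picked up as tone tags in the same way as typed ones is also an assumption, so check the result in the editor.

```python
import re

# Matches a non-empty, non-nested square-bracket span such as [whispers].
BRACKET_PROMPT = re.compile(r"\[([^\[\]]+)\]")


def brackets_to_parentheses(script: str) -> str:
    """Rewrite [prompt] as (prompt), leaving all other text unchanged."""
    return BRACKET_PROMPT.sub(r"(\1)", script)


if __name__ == "__main__":
    old_script = "[whispers] Don't wake them. [sigh] Too late."
    print(brackets_to_parentheses(old_script))
```

Square brackets used for other purposes, such as editorial notes, are converted too, so review the output before generating audio.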
After you generate
AI-generated speech behaves differently from recorded audio. To make timeline edits, apply fades or crossfades, or get precise control over playback, you'll need to convert it to a standard audio layer first.
Known limitations
- Generation speed. Voice generation with ElevenLabs v3 is slightly slower than Multilingual v2, especially for long paragraphs or when tone tags are included.
- Tone and voice continuity. Output from ElevenLabs v3 may sound inconsistent from one paragraph to the next. This can include shifts in tone, volume, accent, or even speaker identity. These variations affect both custom and stock voices and are more likely to occur in longer or more complex scripts.
- Custom tone tags are experimental. Short, direct instructions like (annoyed) or (slowly) tend to work, but longer descriptive prompts may be spoken aloud instead of treated as direction. If a custom tag isn't working as expected, try shortening it.
- Tone tag scope. A tone tag affects delivery in the paragraph where it's added. The model decides how long the effect lasts; Descript doesn't control which exact words a tag applies to. Use paragraph breaks to reset the delivery.
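Because a tag's effect is scoped to its paragraph, it can help to see which paragraphs of a draft carry tags and which will be delivered without direction. The following is a rough sketch for a plain-text copy of a script, assuming paragraphs are separated by blank lines and every parenthesized span is a tone tag; the function name is hypothetical and nothing here calls Descript.

```python
import re

# Assumption: a tone tag is any text wrapped in one pair of parentheses.
TAG_PATTERN = re.compile(r"\(([^()]*)\)")


def tags_by_paragraph(script: str) -> list[tuple[int, list[str]]]:
    """Pair each non-empty paragraph's 1-based index with the tags it contains."""
    paragraphs = [p for p in re.split(r"\n\s*\n", script) if p.strip()]
    return [(i, TAG_PATTERN.findall(p)) for i, p in enumerate(paragraphs, start=1)]


if __name__ == "__main__":
    draft = "(whisper) Keep quiet.\n\nBack to normal.\n\n(laugh) That was close. (sigh)"
    for index, tags in tags_by_paragraph(draft):
        print(index, tags or "no tags")
```

A paragraph listed with no tags is one where the delivery resets, which matches the advice to use paragraph breaks to end a tone.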
FAQs and troubleshooting
Audio isn't generating
Confirm the Speaker has a voice assigned. If not, assign one in the Speaker card.
Mispronounced words
AI Speakers may occasionally mispronounce words. See our pronunciation guide.
Black frames appear after generating TTS
TTS and Regenerate don't work over sequences. If you use either on a sequence, the video may be removed, leaving black frames.
Workaround:
- Convert the AI voice clip into an audio layer.
- Cut the AI audio clip from the script (Cmd + X/Ctrl + X).
- Restore the original script track by expanding the clip in the timeline or using Undo.
- Paste the AI audio as a layer above the original script track.
- Split and mute the original script section using the Blade tool.
Unexpected background noise
Artifacts usually come from the audio originally used to train the voice. In training audio, try to avoid:
- Static or sudden loud sounds
- Background noise (traffic, appliances, music)
- Excessive mouth noise or breathing