Generate AI speech from a script using a stock voice or your own voice clone. Narrate a video or build a full voiceover track without picking up a mic. You'll create your script in Write mode, assign a Speaker, and Descript will generate an AI voiceover.
This article covers:
- Supported languages
- Choose a voice generation model
- Generate text-to-speech audio
- After you generate
- Known limitations
- FAQs and troubleshooting
Plan availability
On current plans, this feature uses AI Credits. Learn more about tracking your Media Minutes and AI Credits.
Legacy and Sunset plans track usage differently. See our Understanding your Legacy and Sunset plan guide for details.
Supported text-to-speech (TTS) languages
| Supported languages | | | |
|---|---|---|---|
| English (US) | Finnish | Portuguese (Portugal) | Slovak |
| Croatian | French (FR) | Romanian | Turkish |
| Czech | German | Malay | Danish |
| Hungarian | Polish | Spanish (US) | Dutch |
| Italian | Portuguese (Brazil) | Swedish | |
First, choose a voice generation model
Descript supports two AI speech models from ElevenLabs. Switch between them anytime in App Settings > AI models.
- Multilingual v2 (default): Reliable, fast generation across all supported languages. The default model for new projects.
- ElevenLabs v3: More natural and expressive AI speech, with support for tone tags to direct delivery (e.g. whisper, laugh, sigh). Slightly slower generation than v2. Not available on legacy plans.
Your selected model also applies to custom voice clones. On v3, custom voices tend to sound more consistent and respond more reliably to tone tags.
Stock voices are more expressive on the v3 model. They respond to tone tags more reliably and deliver more natural, animated speech. Because of these new capabilities, some voices may sound slightly different or vary in tone across generations.
The following stock voices are tuned for v3: Edward, Elizabeth, Grace, Joshua, Kyle, Libby, Michael, Owen, Ryan, Sarah, Simon, Ursula, Vernon.
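With ElevenLabs v3 selected, you can write tone tags directly into your script to direct delivery. A short illustrative example (the specific tags shown, like [whispers] and [sighs], are examples of v3's bracketed audio tags; available tags depend on ElevenLabs' v3 support):

```
[excited] We just hit ten thousand downloads!
[whispers] But don't tell anyone yet.
[sighs] Okay, back to editing.
```

Tags direct how the following speech is delivered rather than being read aloud.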
Then, generate text-to-speech audio
The workflow differs slightly for single-speaker and multi-speaker TTS generation.
Single speaker
- Click Add speaker at the top of your composition and select a Speaker.
- Enter Write mode and type your script.
- When finished, click Done writing. Descript automatically generates AI speech for the entire script using your assigned Speaker.
Multiple speakers
- Enter Write mode before selecting a Speaker.
- Write or paste your full script into the script panel. When you're done, click Done writing to exit Write mode.
- Highlight a paragraph (or any portion of text), press the @ key, and assign a Speaker. Descript will generate TTS audio in that voice for the selection. Repeat for each section that needs a different Speaker.
After you generate
AI-generated speech behaves differently from recorded audio. To make timeline edits, apply fades or crossfades, or get precise control over playback, you'll need to convert it to a standard audio layer first.
FAQs and troubleshooting
Audio isn't generating
Confirm the Speaker has a voice assigned. If not, assign one in the Speaker card.
Mispronounced words
AI Speakers may occasionally mispronounce words. See our pronunciation guide.
Black frames appear after generating TTS
TTS and Regenerate don't work over sequences. If you use them on a sequence, the video may be removed.
Workaround:
- Convert the AI voice clip into an audio layer.
- Cut the AI audio clip from the script (Cmd + X / Ctrl + X).
- Restore the original script track by expanding the clip in the timeline or using Undo.
- Paste the AI audio as a layer above the original script track.
- Split and mute the original script section using the Blade tool.
Unexpected background noise
Artifacts usually come from the original training audio. Try to avoid:
- Static or sudden loud sounds
- Background noise (traffic, appliances, music)
- Excessive mouth noise or breathing