Advanced SSML: Creating a Podcast with Multiple Voices
Learn how to leverage WesenAI's unique long-form audio generation capabilities to create a podcast with multiple speakers using a single SSML file.
This tutorial demonstrates one of the most powerful and unique features of the WesenAI Text-to-Speech (TTS) API: generating long-form, multi-speaker audio from a single Speech Synthesis Markup Language (SSML) file. We will walk through creating a short podcast segment, highlighting the simplicity and capabilities that set WesenAI apart from other services.
The WesenAI Advantage: Long-Form Audio Dialogues
While services like AWS Polly, Google Cloud TTS, and Azure AI Speech are powerful, they often have significant limitations on the length of audio that can be generated in a single, synchronous request, and orchestrating dialogues between different voices can be complex.
WesenAI is engineered to overcome these challenges:
- Up to 1-Hour Audio Generation: Our asynchronous, job-based architecture is designed from the ground up to handle long-form content. You can submit an SSML document that results in up to one hour of audio, perfect for podcasts, audiobooks, and long-form narration.
- Seamless Voice Blending: Simply use the <voice> tag with a valid configId to switch between speakers within the same SSML document. There's no need to generate separate audio files and stitch them together manually. WesenAI handles the blending for you, creating a natural-sounding dialogue.
- Simplified Workflow: The entire podcast is a single API job. Submit your script, wait for it to complete, and download the final, mixed-down audio file.
Prerequisites
This tutorial assumes you are familiar with the basic workflow of our TTS API. If you haven't already, please review the TTS API Tutorial to understand how to submit jobs, poll for status, and retrieve results.
Step 1: Choose Your Cast (The Voices)
First, select the voices for your podcast hosts. You can get the full list from the /v1/meta/voices endpoint. For this example, let's assume we've chosen two distinct Amharic voices:
- Host 1 (Abebe): dawit
- Host 2 (Birtukan): almaz
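Before writing the script, it can help to confirm that the chosen voice IDs actually exist. The following is a minimal sketch, not an official client: it assumes the voices endpoint sits under the same base URL and accepts the same bearer token used in Step 3, and it treats the response body as opaque JSON because the response schema is not described here. The helper names `fetch_voices` and `missing_voices` are hypothetical, and the HTTP getter is passed in so that `requests.get` can be swapped for a test double.

```python
TTS_API_URL = "https://tts.api.wesen.ai/v1"  # base URL used in Step 3


def fetch_voices(api_key, http_get):
    """Return the decoded JSON body of GET /v1/meta/voices.

    http_get is any callable with the same interface as requests.get.
    Bearer authentication on this endpoint is an assumption.
    """
    response = http_get(
        f"{TTS_API_URL}/meta/voices",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def missing_voices(available_ids, wanted_ids):
    """Return the wanted voice IDs that are absent from available_ids, in order."""
    available = set(available_ids)
    return [voice_id for voice_id in wanted_ids if voice_id not in available]


# Usage (performs a real network call):
#   import requests
#   body = fetch_voices("YOUR_API_KEY", requests.get)
#   Extract the voice IDs from `body` according to the actual response
#   schema, then check: missing_voices(ids, ["dawit", "almaz"])
```

An empty list from `missing_voices` means every chosen host voice is available.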
Step 2: Write the Podcast Script in SSML
The magic happens in the SSML. We will structure a conversation using standard tags like <p> for paragraphs, <s> for sentences, <break> for pauses, and most importantly, <voice> to switch between our hosts.
Here is an example script for a short segment about the history of coffee:
<speak>
  <p>
    <voice name="dawit">
      <s>ሰላም እና ጤና ይስጥልኝ! እንኳን ወደ "ታሪክ እና ባህል" ፖድካስታችን በደህና መጣችሁ።</s>
      <s>እኔ አበበ ነኝ።</s>
    </voice>
  </p>
  <p>
    <voice name="almaz">
      <s>እኔ ደግሞ ብርቱካን።</s>
      <s>ዛሬ በጣም አስደሳች እና ዓለም-አቀፍ ተጽዕኖ ስላለው ጉዳይ ነው የምናወራው፤ ስለ ቡና አመጣጥ።</s>
    </voice>
  </p>
  <p>
    <voice name="dawit">
      <s>ልክ ነው ብርቱካን።</s>
      <s>ብዙ ሰው እንደሚገምተው፣ የቡና ታሪክ ከኢትዮጵያ ጋር የተቆራኘ ነው።</s>
      <s>አንድ ፍየል ጠባቂ የነበረው ካልዲ የተባለ ወጣት፣ ፍየሎቹ የቡና ፍሬዎችን ከበሉ በኋላ እንዴት ሃይል እንደሚሰጣቸው ተመለከተ ይባላል።</s>
    </voice>
  </p>
  <break time="1s"/>
  <p>
    <voice name="almaz">
      <s>አስገራሚ ታሪክ ነው!</s>
      <s>ከዚያ በኋላ የቡና አጠቃቀም ወደ መካከለኛው ምስራቅና ከዚያም ወደ አውሮፓ ተስፋፋ።</s>
      <s>በ 17ኛው ክፍለ ዘመን ለንደን ውስጥ "የቡና ቤቶች" የእውቀት እና የንግድ ማዕከል ሆነው ነበር።</s>
    </voice>
  </p>
  <p>
    <voice name="dawit">
      <s>በእርግጥም። ለዛሬው ዝግጅታችን ይህን ይመስል ነበር። በቀጣይ ሳምንት በሌላ ርዕስ እንገናኝ።</s>
    </voice>
  </p>
</speak>
This single SSML document defines the entire dialogue, including pauses and speaker changes.
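For longer episodes, writing this markup by hand gets tedious. One possible approach, sketched below, is to generate the document from a list of speaker turns. The function name `build_dialogue_ssml` and the turn format are illustrative choices, not part of the WesenAI API, and the sketch covers only the <p>, <voice>, and <s> tags (pauses via <break> would need to be added separately).

```python
from xml.sax.saxutils import escape, quoteattr


def build_dialogue_ssml(turns):
    """Build a <speak> document from (voice_name, [sentence, ...]) turns.

    Each turn becomes one <p> containing one <voice> with one <s> per sentence.
    Sentence text and voice names are XML-escaped so characters such as
    '&' or '<' cannot break the markup.
    """
    parts = ["<speak>"]
    for voice_name, sentences in turns:
        parts.append("  <p>")
        # quoteattr() returns the value already wrapped in quotes.
        parts.append(f"    <voice name={quoteattr(voice_name)}>")
        for sentence in sentences:
            parts.append(f"      <s>{escape(sentence)}</s>")
        parts.append("    </voice>")
        parts.append("  </p>")
    parts.append("</speak>")
    return "\n".join(parts)


# Usage: two turns, alternating between the hosts chosen in Step 1.
script = build_dialogue_ssml([
    ("dawit", ["ልክ ነው ብርቱካን።"]),
    ("almaz", ["አስገራሚ ታሪክ ነው!"]),
])
```

Keeping the script as plain data makes it easy to reorder turns or swap a host's voice without touching the markup.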
Step 3: Submit the Podcast Job
Now, submit this SSML to the TTS API just like any other job. The only difference is that your payload contains the ssml field instead of text.
import requests
import time

WESEN_API_KEY = "YOUR_API_KEY"
TTS_API_URL = "https://tts.api.wesen.ai/v1"

headers = {
    "Authorization": f"Bearer {WESEN_API_KEY}",
    "Content-Type": "application/json"
}

# The full SSML script from above
podcast_ssml = """
<speak>
  <p>
    <voice name="dawit">
      <s>ሰላም እና ጤና ይስጥልኝ! እንኳን ወደ "ታሪክ እና ባህል" ፖድካስታችን በደህና መጣችሁ።</s>
      <s>እኔ አበበ ነኝ።</s>
    </voice>
  </p>
  <p>
    <voice name="almaz">
      <s>እኔ ደግሞ ብርቱካን።</s>
      <s>ዛሬ በጣም አስደሳች እና ዓለም-አቀፍ ተጽዕኖ ስላለው ጉዳይ ነው የምናወራው፤ ስለ ቡና አመጣጥ።</s>
    </voice>
  </p>
  <p>
    <voice name="dawit">
      <s>ልክ ነው ብርቱካን።</s>
      <s>ብዙ ሰው እንደሚገምተው፣ የቡና ታሪክ ከኢትዮጵያ ጋር የተቆራኘ ነው።</s>
      <s>አንድ ፍየል ጠባቂ የነበረው ካልዲ የተባለ ወጣት፣ ፍየሎቹ የቡና ፍሬዎችን ከበሉ በኋላ እንዴት ሃይል እንደሚሰጣቸው ተመለከተ ይባላል።</s>
    </voice>
  </p>
  <break time="1s"/>
  <p>
    <voice name="almaz">
      <s>አስገራሚ ታሪክ ነው!</s>
      <s>ከዚያ በኋላ የቡና አጠቃቀም ወደ መካከለኛው ምስራቅና ከዚያም ወደ አውሮፓ ተስፋፋ።</s>
      <s>በ 17ኛው ክፍለ ዘመን ለንደን ውስጥ "የቡና ቤቶች" የእውቀት እና የንግድ ማዕከል ሆነው ነበር።</s>
    </voice>
  </p>
  <p>
    <voice name="dawit">
      <s>በእርግጥም። ለዛሬው ዝግጅታችን ይህን ይመስል ነበር። በቀጣይ ሳምንት በሌላ ርዕስ እንገናኝ።</s>
    </voice>
  </p>
</speak>
"""

payload = {
    "ssml": podcast_ssml,
    "format": "mp3",
    "configId": "almaz"  # A default voice is still required
}

submit_response = requests.post(f"{TTS_API_URL}/job", headers=headers, json=payload)

if submit_response.status_code == 201:
    job_id = submit_response.json().get("jobId")
    print(f"Podcast job submitted successfully. Job ID: {job_id}")
    # Proceed to poll for status and retrieve audio as shown in the main TTS tutorial.
else:
    print(f"Error submitting job: {submit_response.status_code} - {submit_response.text}")
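Because long-form jobs can take a while, the polling loop deserves a timeout. The sketch below is a generic helper, not the documented WesenAI client: the status endpoint and its response format are covered in the main TTS tutorial, so here the status lookup is a caller-supplied function, and the state names "completed" and "failed" are placeholder assumptions to replace with the values the API actually returns.

```python
import time


def wait_for_job(fetch_status, done_states=("completed",), failed_states=("failed",),
                 interval=5.0, timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job finishes, fails, or the timeout passes.

    fetch_status: zero-argument callable returning the current job state string.
    done_states / failed_states: assumed state names; adjust to the real API.
    Returns the final state on success, raises RuntimeError on a failed state
    and TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"Job ended in state {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout} seconds")
        sleep(interval)


# Usage: wrap whatever status request the main TTS tutorial describes, e.g.
#   wait_for_job(lambda: get_job_status(job_id))
# where get_job_status is your own function returning the state string.
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.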
Note: Even when using <voice> tags, you must provide a configId in the main request body. This acts as a default voice for any text outside of a <voice> tag.
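To see whether the default voice will actually be used, you can check a script for text that falls outside every <voice> element before submitting it. This is a local pre-flight sketch using Python's standard XML parser. The function name `text_outside_voice` is hypothetical, and it assumes the SSML has no XML namespace, as in the example above.

```python
import xml.etree.ElementTree as ET


def text_outside_voice(ssml):
    """Return the text fragments that are not inside any <voice> element.

    These fragments would be spoken with the default voice from configId.
    Raises xml.etree.ElementTree.ParseError if the SSML is not well-formed.
    """
    root = ET.fromstring(ssml)
    found = []

    def walk(element, inside_voice):
        inside = inside_voice or element.tag == "voice"
        if not inside and element.text and element.text.strip():
            found.append(element.text.strip())
        for child in element:
            walk(child, inside)
            # A child's tail text belongs to the parent element, so it is
            # judged by the parent's state, not the child's.
            if not inside and child.tail and child.tail.strip():
                found.append(child.tail.strip())

    walk(root, False)
    return found


# Usage: an empty list means every spoken word has an explicit voice.
stray = text_outside_voice(
    '<speak>Intro <voice name="almaz"><s>Hi</s></voice></speak>'
)
```

Here `stray` contains the single fragment "Intro", which would be read by the default voice.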
Conclusion
With WesenAI's TTS API, creating sophisticated, long-form audio content is no longer a complex orchestration task. By leveraging SSML and our powerful asynchronous job system, you can produce high-quality podcasts, audiobooks, and dynamic dialogues with minimal effort, allowing you to focus on creating great content rather than on technical limitations.