STT API Tutorial

A developer-focused guide to the WesenAI Speech-to-Text API, including file uploads, job-based transcription, and result retrieval based on the OpenAPI specification.

This tutorial walks through the primary workflow of the WesenAI Speech-to-Text (STT) API: uploading an audio file, polling the resulting transcription job, and retrieving the results.

Core Concepts: Asynchronous Transcription

Similar to our TTS service, audio transcription is an asynchronous process. The STT API is designed to handle audio files of varying lengths without blocking your application or causing request timeouts.

The primary workflow is as follows:

  1. Upload Audio File: You POST your audio file, together with the transcription parameters, to the /v1/job/upload endpoint. The server creates a transcription job and returns a jobId.
  2. Poll for Status: Using the jobId, you poll the /v1/job/{jobId}/status endpoint until the job is completed.
  3. Retrieve Results: Once complete, you retrieve the transcription from the /v1/job/{jobId}/result endpoint.

API Reference

  • Base URL: https://stt.api.wesen.ai
  • Authentication: Authorization: Bearer YOUR_API_KEY or X-API-Key: YOUR_API_KEY
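Either header can be attached to every request. A minimal sketch of both options (the key value is a placeholder):

```python
WESEN_API_KEY = "YOUR_API_KEY"  # placeholder; substitute your real key

# Option 1: Bearer token (used throughout this tutorial)
bearer_headers = {"Authorization": f"Bearer {WESEN_API_KEY}"}

# Option 2: dedicated API-key header
api_key_headers = {"X-API-Key": WESEN_API_KEY}
```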

Step 1: Uploading an Audio File

The process begins by uploading your audio file. The API accepts various audio formats, but WAV or MP3 is recommended.

POST /v1/job/upload

This endpoint accepts multipart/form-data. You must provide the audio file and the necessary parameters for transcription.

import requests

WESEN_API_KEY = "YOUR_API_KEY"
UPLOAD_URL = "https://stt.api.wesen.ai/v1/job/upload"
AUDIO_FILE_PATH = "path/to/your/audio.wav" # Replace with your file path

headers = {
    "Authorization": f"Bearer {WESEN_API_KEY}"
}

# Open the file in a context manager so the handle is closed after the upload.
with open(AUDIO_FILE_PATH, "rb") as audio_file:
    files = {
        "file": (AUDIO_FILE_PATH, audio_file, "audio/wav"),
        "provider": (None, "google"),
        "language": (None, "am-ET"),  # Amharic (Ethiopia)
        "includeWordBoundaries": (None, "true"),
    }
    upload_response = requests.post(UPLOAD_URL, headers=headers, files=files)

job_id = None
if upload_response.status_code == 201:
    job_info = upload_response.json()
    job_id = job_info.get("jobId")
    print(f"File uploaded and job created successfully. Job ID: {job_id}")
else:
    print(f"Error uploading file: {upload_response.status_code} - {upload_response.text}")

The response to a successful upload is a JobSubmissionResponse object, which contains the jobId needed for the next steps.

Step 2: Polling for Transcription Completion

With the jobId, you can now poll the /v1/job/{jobId}/status endpoint.

GET /v1/job/{jobId}/status

The logic here is identical to the TTS workflow. You poll this endpoint periodically until the job state becomes completed or failed.

import time

if job_id:
    status_url = f"https://stt.api.wesen.ai/v1/job/{job_id}/status"
    
    while True:
        status_response = requests.get(status_url, headers=headers)
        
        if status_response.status_code != 200:
            print(f"Error fetching status: {status_response.status_code} - {status_response.text}")
            break

        status_data = status_response.json()
        job_state = status_data.get("state")
        progress = status_data.get("progress", 0)

        print(f"Job state: {job_state} ({progress}%)")

        if job_state == "completed":
            print("Transcription complete!")
            break
        elif job_state == "failed":
            error_details = status_data.get("error", "Unknown error")
            print(f"Job failed: {error_details}")
            break
        
        time.sleep(5)
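The polling loop above can be wrapped in a small reusable helper. The sketch below adds an overall timeout and takes the status-fetching callable as a parameter so the loop can be exercised without network access; the field name `state` and the terminal values `completed`/`failed` follow the responses shown in this tutorial.

```python
import time

def wait_for_job(fetch_status, poll_interval=5, timeout=600):
    """Poll fetch_status() until the job reaches a terminal state or times out.

    fetch_status must return a status dict like {"state": ..., "progress": ...}.
    Returns the final status dict; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("state") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not reach a terminal state in time")
        time.sleep(poll_interval)
```

With the requests-based loop above, `fetch_status` would be something like `lambda: requests.get(status_url, headers=headers).json()`.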

Step 3: Retrieving the Transcription Result

Once the job state is completed, the transcription is ready. You can retrieve it from the /v1/job/{jobId}/result endpoint.

GET /v1/job/{jobId}/result

You can request the result as either json (default) or text. The JSON response includes the full transcript, confidence score, and word timings if requested.

if job_state == "completed":
    result_url = f"https://stt.api.wesen.ai/v1/job/{job_id}/result?format=json"
    result_response = requests.get(result_url, headers=headers)

    if result_response.status_code == 200:
        transcription_data = result_response.json()
        print("\n--- Transcription Result ---")
        print(f"Text: {transcription_data.get('text')}")
        print(f"Confidence: {transcription_data.get('confidence')}")
        
        word_timings = transcription_data.get('wordTimings')
        if word_timings:
            print("\n--- Word Timings ---")
            for word_info in word_timings:
                word = word_info.get('word')
                start = word_info.get('startTime')
                end = word_info.get('endTime')
                print(f"- Word: '{word}', Start: {start}s, End: {end}s")
    else:
        print(f"Error downloading result: {result_response.status_code} - {result_response.text}")
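If you only need the raw transcript, the `format=text` variant mentioned above skips JSON parsing entirely. A small helper for building the result URL (a sketch; the query parameter name `format` follows this tutorial's JSON example):

```python
BASE_URL = "https://stt.api.wesen.ai"

def result_url(job_id, as_text=False):
    """Build the result endpoint URL; format=text returns the plain transcript."""
    fmt = "text" if as_text else "json"
    return f"{BASE_URL}/v1/job/{job_id}/result?format={fmt}"
```

For example, `requests.get(result_url(job_id, as_text=True), headers=headers).text` would then yield the transcript as a plain string rather than a JSON object.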

Alternative Workflow: Transcribe from URL

If your audio file is already accessible via a public URL, you can skip the upload step and submit the job directly using the /v1/job/transcribe endpoint.

POST /v1/job/transcribe

Request Body (TranscriptionRequestDto):

  • audioUrl (string, required): A public URL to the audio file.
  • provider (string, required): google, azure, etc.
  • language (string, required): Language code (e.g., am-ET).

This method initiates the same asynchronous job, and you would use the same polling and result retrieval logic as shown above.
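Assuming the same Bearer authentication as the upload example, a URL-based submission might look like the sketch below. The request body fields follow the TranscriptionRequestDto above; the exact response shape should be checked against the OpenAPI specification, but it is expected to mirror the upload endpoint's JobSubmissionResponse.

```python
import requests

TRANSCRIBE_URL = "https://stt.api.wesen.ai/v1/job/transcribe"

def build_transcription_request(audio_url, provider="google", language="am-ET"):
    """Build the TranscriptionRequestDto body for /v1/job/transcribe."""
    return {"audioUrl": audio_url, "provider": provider, "language": language}

def submit_transcription_job(audio_url, api_key, **kwargs):
    """POST a URL-based transcription job and return the jobId."""
    response = requests.post(
        TRANSCRIBE_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_transcription_request(audio_url, **kwargs),
    )
    response.raise_for_status()
    return response.json().get("jobId")
```

The returned jobId feeds directly into the status-polling and result-retrieval code from Steps 2 and 3.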