Audio Services

Convert text to natural-sounding speech, transcribe audio to text, or generate sound effects — all through a single API with no account setup required.

    import { withSapiom } from "@sapiom/axios";
    import axios from "axios";
    import fs from "fs";

    // Create a Sapiom-wrapped Axios client
    const client = withSapiom(
      axios.create({ baseURL: "https://elevenlabs.services.sapiom.ai" }),
      {
        apiKey: process.env.SAPIOM_API_KEY,
        baseURL: "https://api.sapiom.ai",
        serviceName: "ElevenLabs TTS",
        agentName: "my-agent",
      }
    );

    // Convert text to speech - Sapiom tracks cost automatically
    const { data } = await client.post(
      "/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL",
      {
        text: "Hello! Welcome to Sapiom. This is a test of the text-to-speech API.",
        model_id: "eleven_multilingual_v2",
      },
      { responseType: "arraybuffer" }
    );

    // Save the audio to a file
    fs.writeFileSync("output.mp3", Buffer.from(data));
    console.log("Audio saved to output.mp3");

Sapiom routes audio requests to ElevenLabs, which provides state-of-the-art voice AI technology. The SDK handles payment negotiation automatically — you pay based on character count (TTS), audio duration (STT), or a flat rate (sound effects).

The service supports three operations:

  1. Text-to-Speech — Convert text to natural-sounding audio
  2. Speech-to-Text — Transcribe audio files to text
  3. Sound Effects — Generate sound effects from text descriptions

Powered by ElevenLabs. ElevenLabs provides industry-leading voice synthesis with natural intonation and emotional range across 29 languages.

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/text-to-speech/{voiceId}

Convert text to natural-sounding speech. The voice ID is specified in the URL path.

Popular voice IDs:

  • EXAVITQu4vr4xnSDxMaL — Sarah (female, soft)
  • JBFqnCBsd6RMkjVDRZzb — George (male, narrative)
  • 21m00Tcm4TlvDq8ikWAM — Rachel (female, calm)
  • AZnzlk1XvdvUeBnXmlld — Domi (female, strong)
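
Because the voice ID travels in the URL path, it can help to keep the IDs in one place and build the endpoint URL from a name. The sketch below uses the four IDs listed above; the constant names and the `ttsUrl` helper are illustrative and not part of the Sapiom SDK.

```typescript
// Voice IDs copied from the list above; the keys are illustrative names.
const VOICES = {
  sarah: "EXAVITQu4vr4xnSDxMaL",
  george: "JBFqnCBsd6RMkjVDRZzb",
  rachel: "21m00Tcm4TlvDq8ikWAM",
  domi: "AZnzlk1XvdvUeBnXmlld",
} as const;

type VoiceName = keyof typeof VOICES;

const TTS_BASE_URL = "https://elevenlabs.services.sapiom.ai/v1/text-to-speech";

// Hypothetical helper: returns the text-to-speech endpoint URL for a named voice.
function ttsUrl(voice: VoiceName): string {
  return `${TTS_BASE_URL}/${encodeURIComponent(VOICES[voice])}`;
}
```

Passing `ttsUrl("george")` as the request URL would then target the George voice.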

Request body parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to convert to speech (max 5000 characters) |
| model_id | string | No | Model for synthesis (default: eleven_multilingual_v2) |
| output_format | string | No | Audio format (default: mp3_44100_128) |

Output format options:

  • MP3: mp3_22050_32, mp3_44100_64, mp3_44100_128, mp3_44100_192
  • PCM: pcm_16000, pcm_22050, pcm_24000, pcm_44100
  • Opus: opus_48000_64, opus_48000_128

Example request body:

    {
      "text": "Welcome to our application. How can I help you today?",
      "model_id": "eleven_multilingual_v2"
    }
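
Since `text` is capped at 5000 characters, longer scripts need to be split across several requests. The following is a minimal sketch of one way to do that, preferring sentence boundaries. `splitForTts` is a hypothetical helper, and it counts characters as UTF-16 code units, which is an assumption about how the service measures length.

```typescript
const MAX_TTS_CHARS = 5000; // limit stated in the parameter table

// Hypothetical helper: split text into chunks of at most maxChars,
// breaking after sentence-ending punctuation where possible.
function splitForTts(text: string, maxChars: number = MAX_TTS_CHARS): string[] {
  if (maxChars < 1) throw new RangeError("maxChars must be at least 1");
  const chunks: string[] = [];
  let rest = text.trim();

  while (rest.length > maxChars) {
    const window = rest.slice(0, maxChars);

    // Find the last ".", "!" or "?" that is followed by whitespace in the window.
    let cut = -1;
    const sentenceEnd = /[.!?]\s/g;
    let match: RegExpExecArray | null;
    while ((match = sentenceEnd.exec(window)) !== null) {
      cut = match.index + 1; // keep the punctuation in the current chunk
    }

    // Fall back to the last space, then to a hard cut at the limit.
    if (cut <= 0) cut = window.lastIndexOf(" ");
    if (cut <= 0) cut = maxChars;

    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }

  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each chunk can then be sent as the `text` of its own request, and the resulting audio buffers concatenated or played in order.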

The response is binary audio data with the appropriate Content-Type header:

  • audio/mpeg for MP3 formats
  • audio/pcm for PCM formats
  • audio/basic for μ-law/A-law formats

The X-Character-Count header contains the number of characters processed.
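
The format identifiers listed above appear to follow a `codec_sampleRate[_bitrate]` pattern, which makes it possible to predict the response's Content-Type before sending a request. The sketch below assumes that reading of the identifiers (sample rate in Hz, bitrate in kbps); `parseOutputFormat` is a hypothetical helper, and it leaves the content type undefined for Opus because this page does not state it.

```typescript
type Codec = "mp3" | "pcm" | "opus";

interface OutputFormat {
  codec: Codec;
  sampleRateHz: number;
  bitrateKbps?: number; // absent for PCM identifiers
  contentType?: string; // only set for codecs whose type is documented above
}

// Content types taken from the response description above.
const CONTENT_TYPES: Partial<Record<Codec, string>> = {
  mp3: "audio/mpeg",
  pcm: "audio/pcm",
};

// Hypothetical helper: parse an identifier such as "mp3_44100_128".
function parseOutputFormat(format: string): OutputFormat {
  const match = /^(mp3|pcm|opus)_(\d+)(?:_(\d+))?$/.exec(format);
  if (!match) throw new Error(`Unrecognized output format: ${format}`);
  const codec = match[1] as Codec;
  return {
    codec,
    sampleRateHz: Number(match[2]),
    bitrateKbps: match[3] !== undefined ? Number(match[3]) : undefined,
    contentType: CONTENT_TYPES[codec],
  };
}
```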

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/speech-to-text

Transcribe audio to text.

Request body parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| audioBase64 | string | Yes | Base64-encoded audio content |
| durationSeconds | number | Yes | Audio duration in seconds (required for pricing) |
| fileName | string | No | Original filename for logging |
| modelId | string | No | Transcription model (default: scribe_v1) |
| languageCode | string | No | Language code (auto-detected if not specified) |

Supported languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Chinese, Korean, and more.
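
The request body can be assembled from raw audio bytes by Base64-encoding them and attaching the duration. This is a minimal sketch using Node's `Buffer`; `buildTranscriptionRequest` is a hypothetical helper, and the check that the duration is a positive number is an assumption rather than a documented rule.

```typescript
interface TranscriptionRequest {
  audioBase64: string;
  durationSeconds: number;
  fileName?: string;
  modelId?: string;
  languageCode?: string;
}

// Hypothetical helper: build the speech-to-text request body from raw bytes.
function buildTranscriptionRequest(
  audio: Uint8Array,
  durationSeconds: number,
  options: { fileName?: string; modelId?: string; languageCode?: string } = {}
): TranscriptionRequest {
  // Assumption: a usable duration is a finite number greater than zero.
  if (!Number.isFinite(durationSeconds) || durationSeconds <= 0) {
    throw new RangeError("durationSeconds must be a positive number");
  }
  return {
    audioBase64: Buffer.from(audio).toString("base64"),
    durationSeconds,
    ...options,
  };
}
```

In practice the bytes would come from something like `fs.readFileSync("meeting-recording.mp3")`, with the duration measured separately, since the helper does not inspect the audio itself.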

Example request body:

    {
      "audioBase64": "SGVsbG8gV29ybGQh...",
      "durationSeconds": 30.5,
      "fileName": "meeting-recording.mp3",
      "languageCode": "en"
    }

Example response:

    {
      "text": "Hello and welcome to today's meeting. We have several items on the agenda...",
      "language_code": "en",
      "language_probability": 0.98,
      "words": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.5
        }
      ]
    }
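
The per-word `start` and `end` timestamps in the response can be used to break a transcript into caption-style segments. The sketch below starts a new segment whenever the pause between two words exceeds a threshold. `groupWords` and the 0.6 second default are illustrative choices, and the code assumes every entry in `words` is a spoken word with the three fields shown in the example response.

```typescript
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface Segment {
  text: string;
  start: number;
  end: number;
}

// Hypothetical helper: merge consecutive words into segments, splitting
// wherever the silence between two words is longer than maxGapSeconds.
function groupWords(words: Word[], maxGapSeconds: number = 0.6): Segment[] {
  const segments: Segment[] = [];
  for (const word of words) {
    const last = segments[segments.length - 1];
    if (last !== undefined && word.start - last.end <= maxGapSeconds) {
      last.text += ` ${word.text}`;
      last.end = word.end;
    } else {
      segments.push({ text: word.text, start: word.start, end: word.end });
    }
  }
  return segments;
}
```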

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/sound-effects

Generate sound effects from text descriptions.

Request body parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Description of the sound effect to generate |
| durationSeconds | number | No | Duration in seconds, 0.5-22.0 (default: 2.0) |
| promptInfluence | number | No | How literally to follow the prompt, 0.0-1.0 (default: 0.3) |

Example request body:

    {
      "text": "Cinematic braam, horror atmosphere",
      "durationSeconds": 3.0,
      "promptInfluence": 0.5
    }

The response is binary MP3 audio data with Content-Type: audio/mpeg.
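
Checking the documented ranges on the client side avoids paying a round trip for a request that would be rejected. The following sketch validates `durationSeconds` (0.5 to 22.0) and `promptInfluence` (0.0 to 1.0) and omits any field the caller did not set, so the service defaults apply. `buildSoundEffectRequest` is a hypothetical helper, and rejecting empty descriptions is an assumption.

```typescript
interface SoundEffectRequest {
  text: string;
  durationSeconds?: number;
  promptInfluence?: number;
}

function assertInRange(name: string, value: number, min: number, max: number): void {
  if (!Number.isFinite(value) || value < min || value > max) {
    throw new RangeError(`${name} must be between ${min} and ${max}, got ${value}`);
  }
}

// Hypothetical helper: build a sound-effects request body, validating
// optional fields against the ranges in the parameter table.
function buildSoundEffectRequest(
  text: string,
  durationSeconds?: number,
  promptInfluence?: number
): SoundEffectRequest {
  // Assumption: an empty description is never useful, so reject it early.
  if (text.trim().length === 0) throw new Error("text must not be empty");

  const request: SoundEffectRequest = { text };
  if (durationSeconds !== undefined) {
    assertInRange("durationSeconds", durationSeconds, 0.5, 22.0);
    request.durationSeconds = durationSeconds;
  }
  if (promptInfluence !== undefined) {
    assertInRange("promptInfluence", promptInfluence, 0.0, 1.0);
    request.promptInfluence = promptInfluence;
  }
  return request;
}
```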

Endpoints:

  • POST https://elevenlabs.services.sapiom.ai/v1/text-to-speech/price
  • POST https://elevenlabs.services.sapiom.ai/v1/speech-to-text/price
  • POST https://elevenlabs.services.sapiom.ai/v1/sound-effects/price

Get the estimated cost before making a request. Each price endpoint accepts the same parameters as its corresponding main endpoint and returns a quote:

    {
      "price": "$0.012",
      "currency": "USD"
    }
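
Because the price comes back as a formatted string, it has to be parsed before it can be compared with a budget. This is a minimal sketch that assumes the quote always has the shape shown above (a leading `$` followed by a decimal number, with `currency` set to `"USD"`); `parsePrice` and `withinBudget` are hypothetical helpers.

```typescript
interface PriceQuote {
  price: string;    // e.g. "$0.012"
  currency: string; // e.g. "USD"
}

// Hypothetical helper: convert a quote such as "$0.012" to a number of dollars.
function parsePrice(quote: PriceQuote): number {
  const match = /^\$(\d+(?:\.\d+)?)$/.exec(quote.price);
  if (!match) throw new Error(`Unrecognized price format: ${quote.price}`);
  return Number(match[1]);
}

// Hypothetical helper: true when the quoted price does not exceed maxUsd.
function withinBudget(quote: PriceQuote, maxUsd: number): boolean {
  if (quote.currency !== "USD") {
    throw new Error(`Unexpected currency: ${quote.currency}`);
  }
  return parsePrice(quote) <= maxUsd;
}
```

A caller could request a quote from the matching price endpoint, check `withinBudget(quote, 0.05)`, and only then send the real request.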

Error codes:

| Code | Description |
|---|---|
| 400 | Invalid request; check parameters |
| 402 | Payment required; ensure you're using the Sapiom SDK |
| 404 | Voice or model not found |
| 413 | Text or audio too large |
| 429 | Rate limit exceeded |
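
One way to act on these status codes is to map each one to a next step and to retry only rate-limited requests, with an increasing delay. The sketch below is illustrative: the action names, `actionForStatus`, and `backoffDelayMs` are hypothetical, and retrying on 429 with exponential backoff is a common convention rather than documented Sapiom behavior.

```typescript
type ErrorAction =
  | "fix-request"          // 400: invalid parameters
  | "check-sdk"            // 402: payment was not negotiated
  | "check-voice-or-model" // 404: unknown voice or model
  | "reduce-size"          // 413: text or audio too large
  | "retry-later"          // 429: rate limited
  | "unknown";

// Hypothetical helper: map a documented status code to a suggested action.
function actionForStatus(status: number): ErrorAction {
  switch (status) {
    case 400: return "fix-request";
    case 402: return "check-sdk";
    case 404: return "check-voice-or-model";
    case 413: return "reduce-size";
    case 429: return "retry-later";
    default:  return "unknown";
  }
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
// attempt is zero-based. The default values are arbitrary choices.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```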

Complete example:

    import { withSapiom } from "@sapiom/axios";
    import axios from "axios";

    const client = withSapiom(axios.create(), {
      apiKey: process.env.SAPIOM_API_KEY,
    });

    const baseUrl = "https://elevenlabs.services.sapiom.ai/v1";

    // Generate a podcast intro with text-to-speech
    async function createPodcastIntro(title: string, host: string) {
      const script = `Welcome to ${title}. I'm your host, ${host}. Let's dive in.`;
      const voiceId = "JBFqnCBsd6RMkjVDRZzb"; // George (male, narrative)

      const response = await client.post(
        // The voice ID goes in the URL path, as in the endpoint reference
        `${baseUrl}/text-to-speech/${voiceId}`,
        {
          text: script,
          output_format: "mp3_44100_192",
        },
        { responseType: "arraybuffer" }
      );
      return Buffer.from(response.data);
    }

    // Transcribe an audio recording
    async function transcribeRecording(audioBase64: string, duration: number) {
      const { data } = await client.post(`${baseUrl}/speech-to-text`, {
        audioBase64,
        durationSeconds: duration,
        languageCode: "en",
      });
      return data.text;
    }

    // Create a custom sound effect
    async function generateTransitionSound() {
      const response = await client.post(
        `${baseUrl}/sound-effects`,
        {
          text: "Soft whoosh transition, podcast style",
          durationSeconds: 1.5,
        },
        { responseType: "arraybuffer" }
      );
      return Buffer.from(response.data);
    }

    // Usage
    const introAudio = await createPodcastIntro("Tech Weekly", "Alex");
    console.log("Intro audio size:", introAudio.byteLength, "bytes");

    const transitionSfx = await generateTransitionSound();
    console.log("Transition audio size:", transitionSfx.byteLength, "bytes");

Pricing:

| Operation | Price | Unit |
|---|---|---|
| Text-to-Speech | $0.24 | per 1,000 characters |
| Speech-to-Text | $0.08 | per minute |
| Sound Effects | $0.08 | flat per generation |

Minimums:

  • Text-to-Speech: $0.001 minimum per request
  • Speech-to-Text: $0.01 minimum per request

Example costs:

  • 500 character TTS: ~$0.12
  • 5 minute transcription: ~$0.40
  • Sound effect: $0.08
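
The example costs above follow directly from the rates and minimums in this section, and the arithmetic can be written out as a small estimator. This is a rough sketch that assumes usage is prorated linearly with no rounding and that a minimum simply acts as a floor on the charge; the function names are hypothetical, and the price endpoints remain the authoritative source for a quote.

```typescript
// Rates and minimums copied from the pricing section (USD).
const RATES = {
  ttsPer1000Chars: 0.24,
  sttPerMinute: 0.08,
  soundEffectFlat: 0.08,
};
const MINIMUMS = { tts: 0.001, stt: 0.01 };

// Text-to-speech: characters / 1000 * rate, floored at the minimum.
function estimateTtsCost(characters: number): number {
  return Math.max(MINIMUMS.tts, (characters / 1000) * RATES.ttsPer1000Chars);
}

// Speech-to-text: minutes * rate, floored at the minimum.
function estimateSttCost(durationSeconds: number): number {
  return Math.max(MINIMUMS.stt, (durationSeconds / 60) * RATES.sttPerMinute);
}

// Sound effects: flat price regardless of duration.
function estimateSoundEffectCost(): number {
  return RATES.soundEffectFlat;
}
```

For instance, `estimateTtsCost(500)` works out to 500 / 1000 × 0.24, or about $0.12, and `estimateSttCost(300)` to 5 × 0.08, or about $0.40, matching the figures listed above.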