Audio Services

Convert text to natural-sounding speech, transcribe audio to text, or generate sound effects — all through a single API with no account setup required.

Quick Example

The same request is shown with Axios first, then with Fetch.

import { withSapiom } from "@sapiom/axios";
import axios from "axios";
import fs from "fs";

// Create a Sapiom-wrapped Axios client
const client = withSapiom(
  axios.create({ baseURL: "https://elevenlabs.services.sapiom.ai" }),
  {
    apiKey: process.env.SAPIOM_API_KEY,
    baseURL: "https://api.sapiom.ai",
    serviceName: "ElevenLabs TTS",
    agentName: "my-agent",
  }
);

// Convert text to speech - Sapiom tracks cost automatically
const { data } = await client.post(
  "/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL",
  {
    text: "Hello! Welcome to Sapiom. This is a test of the text-to-speech API.",
    model_id: "eleven_multilingual_v2",
  },
  { responseType: "arraybuffer" }
);

// Save the audio to a file
fs.writeFileSync("output.mp3", Buffer.from(data));
console.log("Audio saved to output.mp3");

import { createFetch } from "@sapiom/fetch";
import fs from "fs";

// Create a Sapiom-tracked fetch function
const sapiomFetch = createFetch({
  apiKey: process.env.SAPIOM_API_KEY,
  baseURL: "https://api.sapiom.ai",
  serviceName: "ElevenLabs TTS",
  agentName: "my-agent",
});

// Convert text to speech - SDK handles payment/auth automatically
const response = await sapiomFetch(
  "https://elevenlabs.services.sapiom.ai/v1/text-to-speech/EXAVITQu4vr4xnSDxMaL",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: "Hello! Welcome to Sapiom. This is a test of the text-to-speech API.",
      model_id: "eleven_multilingual_v2",
    }),
  }
);

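// Fail loudly on HTTP errors instead of writing an error body to disk
if (!response.ok) {
  throw new Error(`Text-to-speech request failed with status ${response.status}`);
}

// Save the audio to a file
const buffer = await response.arrayBuffer();
fs.writeFileSync("output.mp3", Buffer.from(buffer));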
console.log("Audio saved to output.mp3");

How It Works

Sapiom routes audio requests to ElevenLabs, which performs the speech synthesis, transcription, and sound generation. The SDK handles payment negotiation automatically: you pay based on character count (TTS), audio duration (STT), or a flat rate (sound effects).

The service supports three operations:

Text-to-Speech — Convert text to natural-sounding audio
Speech-to-Text — Transcribe audio files to text
Sound Effects — Generate sound effects from text descriptions

Provider

Powered by ElevenLabs. ElevenLabs provides industry-leading voice synthesis with natural intonation and emotional range across 29 languages.

API Reference

Text-to-Speech

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/text-to-speech/{voiceId}

Convert text to natural-sounding speech. The voice ID is specified in the URL path.

Popular voice IDs:

EXAVITQu4vr4xnSDxMaL — Sarah (female, soft)
JBFqnCBsd6RMkjVDRZzb — George (male, narrative)
21m00Tcm4TlvDq8ikWAM — Rachel (female, calm)
AZnzlk1XvdvUeBnXmlld — Domi (female, strong)

Request

Parameter	Type	Required	Description
`text`	string	Yes	Text to convert to speech (max 5000 characters)
`model_id`	string	No	Model for synthesis (default: `eleven_multilingual_v2`)
`output_format`	string	No	Audio format (default: `mp3_44100_128`)

Output format options:

MP3: mp3_22050_32, mp3_44100_64, mp3_44100_128, mp3_44100_192
PCM: pcm_16000, pcm_22050, pcm_24000, pcm_44100
Opus: opus_48000_64, opus_48000_128
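
The identifiers above appear to follow a `codec_sampleRate[_bitrate]` pattern. The sketch below splits an identifier into those parts, which can help when choosing a file extension or configuring a decoder. The pattern is inferred from the listed values rather than stated by the service, and `parseOutputFormat` is a hypothetical helper, not part of the Sapiom SDK.

```typescript
// Parsed form of an output_format identifier such as "mp3_44100_128".
interface OutputFormat {
  codec: string;        // e.g. "mp3", "pcm", "opus"
  sampleRateHz: number; // e.g. 44100
  bitrateKbps?: number; // present for the MP3 and Opus identifiers, absent for PCM
}

// Assumption: identifiers follow codec_sampleRate[_bitrate], as the list suggests.
function parseOutputFormat(format: string): OutputFormat {
  const match = /^([a-z]+)_(\d+)(?:_(\d+))?$/.exec(format);
  if (!match) {
    throw new Error(`Unrecognized output format: ${format}`);
  }
  const parsed: OutputFormat = {
    codec: match[1],
    sampleRateHz: Number(match[2]),
  };
  if (match[3] !== undefined) {
    parsed.bitrateKbps = Number(match[3]);
  }
  return parsed;
}

console.log(parseOutputFormat("mp3_44100_128"));
console.log(parseOutputFormat("pcm_16000"));
```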

{
  "text": "Welcome to our application. How can I help you today?",
  "model_id": "eleven_multilingual_v2"
}
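
Because `text` is capped at 5000 characters, longer scripts have to be sent as several requests. The following is a minimal sketch of one way to split text, preferring sentence boundaries. `splitForTts` is a hypothetical helper, and it assumes the limit is counted the same way as JavaScript string length, which the reference does not confirm.

```typescript
const MAX_TTS_CHARS = 5000; // limit stated in the parameter table

// Split text into chunks no longer than maxChars, cutting after sentence-ending
// punctuation when possible, then at a space, then with a hard cut.
function splitForTts(text: string, maxChars: number = MAX_TTS_CHARS): string[] {
  const chunks: string[] = [];
  let remaining = text.trim();

  while (remaining.length > maxChars) {
    const head = remaining.slice(0, maxChars);
    let cut = Math.max(
      head.lastIndexOf(". "),
      head.lastIndexOf("! "),
      head.lastIndexOf("? ")
    );
    if (cut >= 0) {
      cut += 1; // keep the punctuation mark with its sentence
    } else {
      cut = head.lastIndexOf(" ");
    }
    if (cut <= 0) {
      cut = maxChars; // no natural boundary found: hard cut
    }
    chunks.push(remaining.slice(0, cut).trim());
    remaining = remaining.slice(cut).trim();
  }

  if (remaining.length > 0) {
    chunks.push(remaining);
  }
  return chunks;
}

console.log(splitForTts("One. Two. Three.", 10)); // → [ 'One. Two.', 'Three.' ]
```

Each chunk can then be posted to the Text-to-Speech endpoint separately and the resulting audio concatenated by the caller.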

Response

The response is binary audio data with the appropriate Content-Type header:

audio/mpeg for MP3 formats
audio/pcm for PCM formats
audio/basic for μ-law/A-law formats

The X-Character-Count header contains the number of characters processed.
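
As an illustration, the two headers mentioned above can be read into a small summary object. This sketch assumes the headers are reachable through a `get(name)` method, as on the Fetch API `Headers` object; with Axios they are exposed differently, so the lookup would need adapting. `describeTtsResponse` and both interfaces are hypothetical names.

```typescript
// Minimal shape shared with the Fetch API's Headers object, declared locally
// so the sketch does not depend on a particular HTTP client.
interface HeaderReader {
  get(name: string): string | null;
}

interface TtsResponseInfo {
  contentType: string | null;    // e.g. "audio/mpeg"
  characterCount: number | null; // parsed X-Character-Count, or null if absent or invalid
}

function describeTtsResponse(headers: HeaderReader): TtsResponseInfo {
  const contentType = headers.get("Content-Type");
  const rawCount = headers.get("X-Character-Count");
  const count = rawCount === null ? NaN : Number.parseInt(rawCount, 10);
  return {
    contentType,
    characterCount: Number.isNaN(count) ? null : count,
  };
}
```

With the Fetch client this could be called as `describeTtsResponse(response.headers)` before saving the audio.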

Speech-to-Text

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/speech-to-text

Transcribe audio to text.

Request

Parameter	Type	Required	Description
`audioBase64`	string	Yes	Base64-encoded audio content
`durationSeconds`	number	Yes	Audio duration in seconds (required for pricing)
`fileName`	string	No	Original filename for logging
`modelId`	string	No	Transcription model (default: `scribe_v1`)
`languageCode`	string	No	Language code (auto-detected if not specified)

Supported languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Chinese, Korean, and more.

{
  "audioBase64": "SGVsbG8gV29ybGQh...",
  "durationSeconds": 30.5,
  "fileName": "meeting-recording.mp3",
  "languageCode": "en"
}
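
A request body like the one above can be assembled from raw audio bytes. The sketch below is one possible approach: `buildSpeechToTextRequest` and `toBase64` are hypothetical helpers, and the caller is assumed to already know the audio duration, since the reference does not describe how to obtain it.

```typescript
// Request body for the speech-to-text endpoint, using the documented field names.
interface SpeechToTextRequest {
  audioBase64: string;
  durationSeconds: number;
  fileName?: string;
  modelId?: string;
  languageCode?: string;
}

// Encode raw bytes as base64. In Node.js, Buffer.from(bytes).toString("base64")
// gives the same result; btoa is used so the sketch also runs in browsers.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Validate the duration and assemble the JSON body.
function buildSpeechToTextRequest(
  audio: Uint8Array,
  durationSeconds: number,
  optional: Omit<SpeechToTextRequest, "audioBase64" | "durationSeconds"> = {}
): SpeechToTextRequest {
  if (!Number.isFinite(durationSeconds) || durationSeconds <= 0) {
    throw new Error("durationSeconds must be a positive number");
  }
  return { ...optional, audioBase64: toBase64(audio), durationSeconds };
}
```

In Node.js, the bytes could come from `fs.readFileSync("meeting-recording.mp3")`, since a `Buffer` is also a `Uint8Array`.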

Response

{
  "text": "Hello and welcome to today's meeting. We have several items on the agenda...",
  "language_code": "en",
  "language_probability": 0.98,
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 0.5
    }
  ]
}
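
The `words` array makes it possible to pull out the part of a transcript that falls in a given time window. The following sketch types the response shown above and extracts such an excerpt. It assumes `start` and `end` are offsets in seconds, which the example values suggest but the reference does not state; `excerpt` is a hypothetical helper.

```typescript
interface TranscriptWord {
  text: string;
  start: number; // assumed to be seconds from the start of the audio
  end: number;
}

interface Transcript {
  text: string;
  language_code: string;
  language_probability: number;
  words: TranscriptWord[];
}

// Return the words that start inside [fromSeconds, toSeconds), joined by spaces.
function excerpt(transcript: Transcript, fromSeconds: number, toSeconds: number): string {
  return transcript.words
    .filter((word) => word.start >= fromSeconds && word.start < toSeconds)
    .map((word) => word.text)
    .join(" ");
}

const sample: Transcript = {
  text: "Hello and welcome",
  language_code: "en",
  language_probability: 0.98,
  words: [
    { text: "Hello", start: 0.0, end: 0.5 },
    { text: "and", start: 0.6, end: 0.8 },
    { text: "welcome", start: 0.9, end: 1.4 },
  ],
};

console.log(excerpt(sample, 0.5, 2)); // → "and welcome"
```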

Sound Effects

Endpoint: POST https://elevenlabs.services.sapiom.ai/v1/sound-effects

Generate sound effects from text descriptions.

Request

Parameter	Type	Required	Description
`text`	string	Yes	Description of the sound effect to generate
`durationSeconds`	number	No	Duration in seconds, 0.5-22.0 (default: 2.0)
`promptInfluence`	number	No	How literally to follow the prompt, 0.0-1.0 (default: 0.3)

{
  "text": "Cinematic braam, horror atmosphere",
  "durationSeconds": 3.0,
  "promptInfluence": 0.5
}
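
Checking the documented ranges on the client can avoid a round trip that would end in a 400 response. This is a minimal sketch under the assumption that both ranges are inclusive; `validateSoundEffectRequest` is a hypothetical helper, not an SDK function.

```typescript
interface SoundEffectRequest {
  text: string;
  durationSeconds?: number; // 0.5-22.0, default 2.0
  promptInfluence?: number; // 0.0-1.0, default 0.3
}

// Return a list of problems; an empty list means the request looks valid.
function validateSoundEffectRequest(request: SoundEffectRequest): string[] {
  const problems: string[] = [];

  if (request.text.trim().length === 0) {
    problems.push("text must not be empty");
  }

  const duration = request.durationSeconds;
  if (duration !== undefined && !(duration >= 0.5 && duration <= 22.0)) {
    problems.push("durationSeconds must be between 0.5 and 22.0");
  }

  const influence = request.promptInfluence;
  if (influence !== undefined && !(influence >= 0 && influence <= 1)) {
    problems.push("promptInfluence must be between 0.0 and 1.0");
  }

  return problems;
}
```

Writing the range checks as negated conjunctions means `NaN` is rejected as well.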

Response

The response is binary MP3 audio data with Content-Type: audio/mpeg.

Price Estimation

Endpoints:

POST https://elevenlabs.services.sapiom.ai/v1/text-to-speech/price
POST https://elevenlabs.services.sapiom.ai/v1/speech-to-text/price
POST https://elevenlabs.services.sapiom.ai/v1/sound-effects/price

Get the estimated cost before making a request. Each price endpoint accepts the same parameters as its main endpoint and returns the estimate as JSON:

{
  "price": "$0.012",
  "currency": "USD"
}
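
Since `price` arrives as a string with a currency symbol, it needs converting before it can be compared against a budget. The sketch below shows one way to do that, assuming the format is always a `$` followed by a decimal number as in the example; `parsePrice` and `withinBudget` are hypothetical helpers.

```typescript
interface PriceEstimate {
  price: string;    // e.g. "$0.012"
  currency: string; // e.g. "USD"
}

// Convert the price string to a number, assuming a leading "$" and a decimal amount.
function parsePrice(estimate: PriceEstimate): number {
  const digits = estimate.price.trim().replace(/^\$/, "");
  const amount = Number(digits);
  if (digits.length === 0 || !Number.isFinite(amount)) {
    throw new Error(`Cannot parse price: ${estimate.price}`);
  }
  return amount;
}

// True when the estimate is in USD and does not exceed the given budget.
function withinBudget(estimate: PriceEstimate, maxUsd: number): boolean {
  return estimate.currency === "USD" && parsePrice(estimate) <= maxUsd;
}

console.log(withinBudget({ price: "$0.012", currency: "USD" }, 0.05)); // → true
```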

Error Codes

Code	Description
400	Invalid request — check parameters
402	Payment required — ensure you’re using the Sapiom SDK
404	Voice or model not found
413	Text or audio too large
429	Rate limit exceeded
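
The codes in the table can be mapped to messages and a retry decision in calling code. In this sketch the messages come from the table, while the retry policy (retry only on 429) is an assumption rather than documented behavior; `adviseOnError` is a hypothetical helper.

```typescript
interface ErrorAdvice {
  message: string;
  retryable: boolean; // assumption: only rate limiting is worth retrying
}

const ERROR_ADVICE: Record<number, ErrorAdvice> = {
  400: { message: "Invalid request: check parameters", retryable: false },
  402: { message: "Payment required: ensure you are using the Sapiom SDK", retryable: false },
  404: { message: "Voice or model not found", retryable: false },
  413: { message: "Text or audio too large", retryable: false },
  429: { message: "Rate limit exceeded", retryable: true },
};

// Look up advice for an HTTP status, with a generic fallback for unlisted codes.
function adviseOnError(status: number): ErrorAdvice {
  const advice: ErrorAdvice | undefined = ERROR_ADVICE[status];
  if (advice !== undefined) {
    return advice;
  }
  return { message: `Unexpected status ${status}`, retryable: false };
}

console.log(adviseOnError(429).retryable); // → true
```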

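Complete Example

The following example combines all three operations, first with Axios and then with Fetch.

import { withSapiom } from "@sapiom/axios";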
import axios from "axios";

const client = withSapiom(axios.create(), {
  apiKey: process.env.SAPIOM_API_KEY,
});

const baseUrl = "https://elevenlabs.services.sapiom.ai/v1";

async function createPodcastIntro(title: string, host: string) {
  // Generate podcast intro with TTS
  const script = `Welcome to ${title}. I'm your host, ${host}. Let's dive in.`;

  const response = await client.post(
    // The voice ID goes in the URL path, per the Text-to-Speech reference
    `${baseUrl}/text-to-speech/JBFqnCBsd6RMkjVDRZzb`,
    {
      text: script,
      output_format: "mp3_44100_192",
    },
    { responseType: "arraybuffer" }
  );

  return Buffer.from(response.data);
}

async function transcribeRecording(audioBase64: string, duration: number) {
  // Transcribe an audio recording
  const { data } = await client.post(`${baseUrl}/speech-to-text`, {
    audioBase64,
    durationSeconds: duration,
    languageCode: "en",
  });

  return data.text;
}

async function generateTransitionSound() {
  // Create a custom sound effect
  const response = await client.post(
    `${baseUrl}/sound-effects`,
    {
      text: "Soft whoosh transition, podcast style",
      durationSeconds: 1.5,
    },
    { responseType: "arraybuffer" }
  );

  return Buffer.from(response.data);
}

// Usage
const introAudio = await createPodcastIntro("Tech Weekly", "Alex");
console.log("Intro audio size:", introAudio.byteLength, "bytes");

const transitionSfx = await generateTransitionSound();
console.log("Transition audio size:", transitionSfx.byteLength, "bytes");

import { createFetch } from "@sapiom/fetch";

const fetch = createFetch({
  apiKey: process.env.SAPIOM_API_KEY,
});

const baseUrl = "https://elevenlabs.services.sapiom.ai/v1";

async function createPodcastIntro(title: string, host: string) {
  // Generate podcast intro with TTS
  const script = `Welcome to ${title}. I'm your host, ${host}. Let's dive in.`;

  // The voice ID goes in the URL path, per the Text-to-Speech reference
  const response = await fetch(`${baseUrl}/text-to-speech/JBFqnCBsd6RMkjVDRZzb`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: script,
      output_format: "mp3_44100_192",
    }),
  });

  return Buffer.from(await response.arrayBuffer());
}

async function transcribeRecording(audioBase64: string, duration: number) {
  // Transcribe an audio recording
  const response = await fetch(`${baseUrl}/speech-to-text`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      audioBase64,
      durationSeconds: duration,
      languageCode: "en",
    }),
  });

  const data = await response.json();
  return data.text;
}

async function generateTransitionSound() {
  // Create a custom sound effect
  const response = await fetch(`${baseUrl}/sound-effects`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: "Soft whoosh transition, podcast style",
      durationSeconds: 1.5,
    }),
  });

  return Buffer.from(await response.arrayBuffer());
}

// Usage
const introAudio = await createPodcastIntro("Tech Weekly", "Alex");
console.log("Intro audio size:", introAudio.byteLength, "bytes");

const transitionSfx = await generateTransitionSound();
console.log("Transition audio size:", transitionSfx.byteLength, "bytes");

Pricing

Operation	Price	Unit
Text-to-Speech	$0.24	per 1,000 characters
Speech-to-Text	$0.08	per minute
Sound Effects	$0.08	flat per generation

Minimums:

Text-to-Speech: $0.001 minimum per request
Speech-to-Text: $0.01 minimum per request

Example costs:

500-character TTS: ~$0.12
5-minute transcription: ~$0.40
Sound effect: $0.08
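
The example costs follow directly from the rates and minimums above, and can be reproduced with a few lines of arithmetic. This is a local approximation only: how the service rounds amounts is not documented here, so the price endpoints remain the authoritative source. The function names are hypothetical.

```typescript
// Rates and minimums copied from the pricing tables above (USD).
const TTS_PER_1000_CHARS = 0.24;
const TTS_MINIMUM = 0.001;
const STT_PER_MINUTE = 0.08;
const STT_MINIMUM = 0.01;
const SOUND_EFFECT_FLAT = 0.08;

// Text-to-speech: billed per character, subject to the per-request minimum.
function estimateTtsCost(characters: number): number {
  return Math.max(TTS_MINIMUM, (characters / 1000) * TTS_PER_1000_CHARS);
}

// Speech-to-text: billed per minute of audio, subject to the per-request minimum.
function estimateSttCost(durationSeconds: number): number {
  return Math.max(STT_MINIMUM, (durationSeconds / 60) * STT_PER_MINUTE);
}

// Sound effects: flat price per generation.
function estimateSoundEffectCost(): number {
  return SOUND_EFFECT_FLAT;
}

console.log(estimateTtsCost(500).toFixed(2)); // → "0.12"
console.log(estimateSttCost(300).toFixed(2)); // → "0.40"
console.log(estimateTtsCost(1).toFixed(3));   // → "0.001" (minimum applies)
```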