voicebox.tts package

Submodules

voicebox.tts.amazonpolly module

class voicebox.tts.amazonpolly.AmazonPolly(client: PollyClient, voice_id: Literal['Aditi', 'Adriano', 'Amy', 'Andres', 'Aria', 'Arlet', 'Arthur', 'Astrid', 'Ayanda', 'Bianca', 'Brian', 'Burcu', 'Camila', 'Carla', 'Carmen', 'Celine', 'Chantal', 'Conchita', 'Cristiano', 'Daniel', 'Danielle', 'Dora', 'Elin', 'Emma', 'Enrique', 'Ewa', 'Filiz', 'Gabrielle', 'Geraint', 'Giorgio', 'Gregory', 'Gwyneth', 'Hala', 'Hannah', 'Hans', 'Hiujin', 'Ida', 'Ines', 'Isabelle', 'Ivy', 'Jacek', 'Jan', 'Jasmine', 'Jihye', 'Jitka', 'Joanna', 'Joey', 'Justin', 'Kajal', 'Karl', 'Kazuha', 'Kendra', 'Kevin', 'Kimberly', 'Laura', 'Lea', 'Liam', 'Lisa', 'Liv', 'Lotte', 'Lucia', 'Lupe', 'Mads', 'Maja', 'Marlene', 'Mathieu', 'Matthew', 'Maxim', 'Mia', 'Miguel', 'Mizuki', 'Naja', 'Niamh', 'Nicole', 'Ola', 'Olivia', 'Pedro', 'Penelope', 'Raveena', 'Remi', 'Ricardo', 'Ruben', 'Russell', 'Ruth', 'Sabrina', 'Salli', 'Seoyeon', 'Sergio', 'Sofie', 'Stephen', 'Suvi', 'Takumi', 'Tatyana', 'Thiago', 'Tomoko', 'Vicki', 'Vitoria', 'Zayd', 'Zeina', 'Zhiyu'], engine: Literal['generative', 'long-form', 'neural', 'standard'] = None, language_code: Literal['ar-AE', 'arb', 'ca-ES', 'cmn-CN', 'cs-CZ', 'cy-GB', 'da-DK', 'de-AT', 'de-CH', 'de-DE', 'en-AU', 'en-GB', 'en-GB-WLS', 'en-IE', 'en-IN', 'en-NZ', 'en-SG', 'en-US', 'en-ZA', 'es-ES', 'es-MX', 'es-US', 'fi-FI', 'fr-BE', 'fr-CA', 'fr-FR', 'hi-IN', 'is-IS', 'it-IT', 'ja-JP', 'ko-KR', 'nb-NO', 'nl-BE', 'nl-NL', 'pl-PL', 'pt-BR', 'pt-PT', 'ro-RO', 'ru-RU', 'sv-SE', 'tr-TR', 'yue-CN'] = None, lexicon_names: Sequence[str] = None, sample_rate: Literal[8000, 16000] = 16000)[source]

Bases: TTS

TTS using Amazon Polly.

See the Amazon Polly documentation for full descriptions of the parameters.

Supports SSML: ✔ (docs)

client: PollyClient

Boto3 Polly client, created with, for example:

>>> import boto3
>>> session = boto3.Session(...)
>>> client = session.client('polly')

engine: Literal['generative', 'long-form', 'neural', 'standard'] = None: Specifies the engine (generative, long-form, neural, or standard) for Amazon Polly to use when processing input text for speech synthesis.

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

language_code: Literal['ar-AE', 'arb', 'ca-ES', 'cmn-CN', 'cs-CZ', 'cy-GB', 'da-DK', 'de-AT', 'de-CH', 'de-DE', 'en-AU', 'en-GB', 'en-GB-WLS', 'en-IE', 'en-IN', 'en-NZ', 'en-SG', 'en-US', 'en-ZA', 'es-ES', 'es-MX', 'es-US', 'fi-FI', 'fr-BE', 'fr-CA', 'fr-FR', 'hi-IN', 'is-IS', 'it-IT', 'ja-JP', 'ko-KR', 'nb-NO', 'nl-BE', 'nl-NL', 'pl-PL', 'pt-BR', 'pt-PT', 'ro-RO', 'ru-RU', 'sv-SE', 'tr-TR', 'yue-CN'] = None: Optional language code for the Synthesize Speech request.

lexicon_names: Sequence[str] = None: List of one or more pronunciation lexicon names you want the service to apply during synthesis.

sample_rate: Literal[8000, 16000] = 16000: Sample rate of returned audio. Must be 8000 or 16000.

voice_id: Literal['Aditi', 'Adriano', 'Amy', 'Andres', 'Aria', 'Arlet', 'Arthur', 'Astrid', 'Ayanda', 'Bianca', 'Brian', 'Burcu', 'Camila', 'Carla', 'Carmen', 'Celine', 'Chantal', 'Conchita', 'Cristiano', 'Daniel', 'Danielle', 'Dora', 'Elin', 'Emma', 'Enrique', 'Ewa', 'Filiz', 'Gabrielle', 'Geraint', 'Giorgio', 'Gregory', 'Gwyneth', 'Hala', 'Hannah', 'Hans', 'Hiujin', 'Ida', 'Ines', 'Isabelle', 'Ivy', 'Jacek', 'Jan', 'Jasmine', 'Jihye', 'Jitka', 'Joanna', 'Joey', 'Justin', 'Kajal', 'Karl', 'Kazuha', 'Kendra', 'Kevin', 'Kimberly', 'Laura', 'Lea', 'Liam', 'Lisa', 'Liv', 'Lotte', 'Lucia', 'Lupe', 'Mads', 'Maja', 'Marlene', 'Mathieu', 'Matthew', 'Maxim', 'Mia', 'Miguel', 'Mizuki', 'Naja', 'Niamh', 'Nicole', 'Ola', 'Olivia', 'Pedro', 'Penelope', 'Raveena', 'Remi', 'Ricardo', 'Ruben', 'Russell', 'Ruth', 'Sabrina', 'Salli', 'Seoyeon', 'Sergio', 'Sofie', 'Stephen', 'Suvi', 'Takumi', 'Tatyana', 'Thiago', 'Tomoko', 'Vicki', 'Vitoria', 'Zayd', 'Zeina', 'Zhiyu']: Voice ID to use for the synthesis.

voicebox.tts.cache module

class voicebox.tts.cache.CachedTTS(tts: TTS, cache: MutableMapping)[source]

Bases: TTS

Wraps a TTS instance in a cache to reduce calls to the TTS.

classmethod build(tts: ~voicebox.tts.tts.TTS, max_size: int | float = 60, size_func: ~typing.Literal['bytes', 'count', 'seconds'] | ~typing.Callable[[~typing.Any], int | float] = 'seconds', cache_class: ~typing.Type[~cachetools.Cache] = <class 'cachetools.LRUCache'>) → CachedTTS[source]

Constructs a cache that by default will keep the most recently used 60 seconds of audio, and wraps the given TTS instance in the cache so the TTS is only called for text not contained in the cache.

Parameters:

tts – The TTS instance to wrap. Will be called for text not contained in the cache.
max_size – The maximum size of the cache, as determined by size_func. Defaults to 60 (seconds).
size_func – The function that measures the size of each item. If set to ‘seconds’ (default), then max_size will be in units of audio seconds. If set to ‘bytes’, then max_size will be in units of audio bytes. If set to ‘count’, then max_size will be simply the number of audio clips to cache. Alternatively, any function that takes an Audio instance as input and returns a size value can be passed in.
cache_class – The Cache class used to construct the cache. Defaults to cachetools.LRUCache, a Least Recently Used cache.

Returns:

An instance of CachedTTS.
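
The size-bounded caching described above can be sketched without any dependencies. This is an illustration only, not voicebox's or cachetools' implementation: `SizedLRU`, `seconds`, `slow_tts`, and `cached_speech` are hypothetical names, and a clip is represented as a plain `(samples, sample_rate)` tuple instead of an Audio instance.

```python
from collections import OrderedDict


def seconds(clip):
    """Size of a clip in seconds; a clip here is (samples, sample_rate)."""
    samples, sample_rate = clip
    return len(samples) / sample_rate


class SizedLRU:
    """Stand-in LRU cache bounded by total size rather than item count."""

    def __init__(self, max_size, size_func):
        self.max_size = max_size
        self.size_func = size_func
        self._items = OrderedDict()
        self._total = 0

    def __contains__(self, key):
        return key in self._items

    def __getitem__(self, key):
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def __setitem__(self, key, value):
        if key in self._items:
            self._total -= self.size_func(self._items.pop(key))
        self._items[key] = value
        self._total += self.size_func(value)
        # Evict least recently used entries until the total fits again.
        while self._total > self.max_size and self._items:
            _, evicted = self._items.popitem(last=False)
            self._total -= self.size_func(evicted)


calls = []  # records every time the (hypothetical) engine is really invoked


def slow_tts(text):
    calls.append(text)
    return ([0.0] * 250, 10)  # 25 seconds of silence at 10 samples/second


cache = SizedLRU(max_size=60, size_func=seconds)


def cached_speech(text):
    # Only call the engine for text that is not already cached.
    if text in cache:
        return cache[text]
    clip = slow_tts(text)
    cache[text] = clip
    return clip


for text in ["a", "b", "a", "c", "b"]:
    cached_speech(text)
```

With 25-second clips and a 60-second budget, only two clips fit at once, so the third distinct text evicts the least recently used one and the engine ends up being called four times for five requests.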

cache: MutableMapping

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

tts: TTS

class voicebox.tts.cache.PrerecordedTTS(texts_to_audios: Mapping[str | SSML, Audio], fallback_tts: TTS = None)[source]

Bases: TTS

Returns audio from a map of message texts to Audio instances. Useful for playing back pre-recorded messages. Also supports an optional fallback TTS instance for messages not in the map.

Parameters:

texts_to_audios – Mapping of message texts to Audio instances.
fallback_tts – Optional fallback TTS instance that will be used if a text is not found in texts_to_audios.
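
The lookup-then-fallback behavior can be sketched as follows. The names `lookup_speech` and `fake_engine` are hypothetical, and strings stand in for Audio instances. What the class does on a miss when no fallback is configured is not stated here, so raising `KeyError` is an assumption of this sketch.

```python
def lookup_speech(text, texts_to_audios, fallback=None):
    """Return the pre-recorded clip for text, else defer to the fallback."""
    if text in texts_to_audios:
        return texts_to_audios[text]
    if fallback is None:
        # Assumption: surface the miss as a KeyError.
        raise KeyError(text)
    return fallback(text)


def fake_engine(text):
    # Hypothetical stand-in for a real engine's get_speech.
    return f"synthesized:{text}"


recordings = {"startup": "clip:startup.wav"}

hit = lookup_speech("startup", recordings, fallback=fake_engine)
miss = lookup_speech("something new", recordings, fallback=fake_engine)
```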

fallback_tts: TTS = None

classmethod from_tts(tts: TTS, texts: Iterable[str | SSML], use_as_fallback: bool = True) → PrerecordedTTS[source]

Returns a PrerecordedTTS instance using audio generated by the given TTS instance.

Parameters:

tts – The TTS instance to use.
texts – The texts to generate audio of.
use_as_fallback – If True, then the given TTS instance will be used as a fallback for texts not in the map.

Example

>>> tts = ...some tts...
>>> tts = PrerecordedTTS.from_tts(
...     tts,
...     texts=[
...         'I say this all the time.',
...         'Hello there!',
...     ],
... )

classmethod from_wav_files(texts_to_files: Mapping[str | SSML, IO[bytes] | str | Path], fallback_tts: TTS = None) → PrerecordedTTS[source]

Returns a PrerecordedTTS instance using audio from the specified wav files.

Parameters:

texts_to_files – Mapping of texts to wav files.
fallback_tts – Optional fallback TTS instance.

Example

>>> tts = PrerecordedTTS.from_wav_files({
...     'startup': 'audio/startup.wav',
...     'no_internet': 'audio/no_internet.wav',
... })

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

texts_to_audios: Mapping[str | SSML, Audio]

voicebox.tts.cache.SizeFunc

Returns the size of the given item.

alias of Callable[[Any], int | float]

voicebox.tts.elevenlabs module

class voicebox.tts.elevenlabs.ElevenLabsTTS(*, voice_id: str, api_key: str = None, client: ElevenLabs = None, sample_rate: int = 32000, convert_kwargs: dict[str, Any] = None)[source]

Bases: TTS

TTS using the ElevenLabs API.

Supports SSML: ✔ (docs)

Parameters:

voice_id – Voice to use. See here for a list of valid voice IDs.
api_key – (Optional) Your ElevenLabs API key. If this and client are not given, then the client will pull the API key from the ELEVENLABS_API_KEY env var. Note: Cannot be used with the client arg!
client – (Optional) An elevenlabs.client.ElevenLabs instance. Use this if you want to further customize the client behavior. Note: Cannot be used with the api_key arg!
sample_rate – (Optional) PCM audio sample rate. Defaults to 32kHz. This is used to set the output_format of the request. See here for valid options. Note: You must pick a sample rate from one of the output_format options beginning with pcm_! Other codecs are not supported.
convert_kwargs – (Optional) Additional kwargs to pass to the client.text_to_speech.convert call. See here for all options: https://elevenlabs.io/docs/api-reference/text-to-speech/convert

property api_key: str

client: ElevenLabs

convert_kwargs: dict[str, Any]

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

property output_format

sample_rate: int

voice_id: str

voicebox.tts.espeakng module

class voicebox.tts.espeakng.ESpeakConfig(amplitude: int = None, word_gap_seconds: float = None, capitals: int = None, line_length: int = None, pitch: int = None, speed: int = None, voice: str = None, no_final_pause: bool = False, speak_punctuation: bool | str = False, exe_path: str = 'espeak-ng', timeout: float = None)[source]

Bases: object

Configuration for the eSpeak NG engine.

Run “espeak-ng -h” for more information on these options.

amplitude: int = None

capitals: int = None

exe_path: str = 'espeak-ng'

line_length: int = None

no_final_pause: bool = False

pitch: int = None

speak_punctuation: bool | str = False

speed: int = None

timeout: float = None

voice: str = None

word_gap_seconds: float = None
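
As a rough illustration of how such a configuration could translate into an `espeak-ng` command line, the sketch below builds an argument list from a subset of the fields. The flags used (`-a`, `-g`, `-p`, `-s`, `-v`, `-z`, `--stdout`) are the ones listed by `espeak-ng -h`, where `-g` is given in units of 10 ms. How voicebox itself maps the config to flags is not documented here, so `SketchConfig` and `build_argv` are hypothetical and the mapping is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchConfig:
    """Hypothetical subset of the fields listed above."""
    amplitude: Optional[int] = None
    word_gap_seconds: Optional[float] = None
    pitch: Optional[int] = None
    speed: Optional[int] = None
    voice: Optional[str] = None
    no_final_pause: bool = False
    exe_path: str = "espeak-ng"


def build_argv(config: SketchConfig, text: str) -> List[str]:
    argv = [config.exe_path, "--stdout"]  # write WAV data to stdout
    word_gap = None
    if config.word_gap_seconds is not None:
        # Assumption: -g takes units of 10 ms, so convert from seconds.
        word_gap = round(config.word_gap_seconds * 100)
    for flag, value in [
        ("-a", config.amplitude),
        ("-g", word_gap),
        ("-p", config.pitch),
        ("-s", config.speed),
        ("-v", config.voice),
    ]:
        if value is not None:  # unset options are left to eSpeak NG defaults
            argv += [flag, str(value)]
    if config.no_final_pause:
        argv.append("-z")
    argv.append(text)
    return argv


argv = build_argv(
    SketchConfig(speed=140, voice="en-us", word_gap_seconds=0.05, no_final_pause=True),
    "Hello",
)
```

The resulting list could be passed to `subprocess.run` on a machine where eSpeak NG is installed. The sketch stops short of that so it runs anywhere.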

class voicebox.tts.espeakng.ESpeakNG(config: ~voicebox.tts.espeakng.ESpeakConfig = <factory>)[source]

Bases: TTS

TTS using the eSpeak NG engine.

You may need to install it:

On Debian/Ubuntu: sudo apt install espeak-ng

Supports SSML: ✔ (docs)

Parameters:

config – Optional configuration for the eSpeak NG engine. If not given, a default config will be used.

config: ESpeakConfig

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

voicebox.tts.googlecloudtts module

class voicebox.tts.googlecloudtts.GoogleCloudTTS(client: ~google.cloud.texttospeech_v1.services.text_to_speech.client.TextToSpeechClient, voice_params: ~google.cloud.texttospeech_v1.types.cloud_tts.VoiceSelectionParams, audio_config: ~google.cloud.texttospeech_v1.types.cloud_tts.AudioConfig = <factory>, timeout: float = _MethodDefault._DEFAULT_VALUE)[source]

Bases: TTS

TTS using Google Cloud TTS.

You will need to set up a Google Cloud project with billing enabled. See this quickstart guide to get started.

Supports SSML: ✔ (docs)

audio_config: AudioConfig

client: TextToSpeechClient

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

timeout: float = <object object>

voice_params: VoiceSelectionParams

voicebox.tts.gtts module

class voicebox.tts.gtts.gTTS(temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-gtts-', **gtts_kwargs: dict[str, Any])[source]

Bases: Mp3FileTTS

Online TTS using the gTTS library, which queries the Google Translate TTS API under the hood.

Supports SSML: ✘

Parameters:

gtts_kwargs – These will be passed to the gtts.gTTS constructor. See the gTTS docs for options.

generate_speech_audio_file(text: str | SSML, audio_file_path: Path) → None[source]: Generates a speech audio file from the given text.

gtts_kwargs: Dict[str, Any]

voicebox.tts.parlertts module

voicebox.tts.picotts module

class voicebox.tts.picotts.PicoTTS(pico2wave_path: str = 'pico2wave', language: str = None, temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-pico-tts-')[source]

Bases: WavFileTTS

TTS using Pico TTS.

You may need to install it:

On Debian/Ubuntu: sudo apt install libttspico-utils

Supports SSML: ✘

generate_speech_audio_file(text: str | SSML, file_path: Path) → None[source]: Generates a speech audio file from the given text.

language: str = None

pico2wave_path: str = 'pico2wave'

voicebox.tts.pyttsx3 module

class voicebox.tts.pyttsx3.Pyttsx3TTS(engine: Engine = None, temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-pyttsx3-')[source]

Bases: WavFileTTS

TTS using pyttsx3.

Parameters:

engine – (Optional) The pyttsx3 engine to use. If not given, a new engine will be created via pyttsx3.init().
temp_file_dir – (Optional) The directory to save temporary audio files to. If not given, then the default temporary directory will be used.
temp_file_prefix – (Optional) The prefix to use for temporary audio files. Defaults to ‘voicebox-pyttsx3-’.

generate_speech_audio_file(text: str | SSML, file_path: Path) → None[source]: Generates a speech audio file from the given text.

voicebox.tts.tts module

class voicebox.tts.tts.AudioFileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]

Bases: TTS, ABC

Base class for text-to-speech engines that generate audio files.

abstractmethod generate_speech_audio_file(text: str | SSML, audio_file_path: Path) → None[source]: Generates a speech audio file from the given text.

abstractmethod get_audio_file_type() → str[source]: Returns the file type of the audio files generated by this TTS.

abstractmethod get_audio_from_file(file_path: Path) → Audio[source]: Returns an Audio instance from the given file path.

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

temp_file_dir: str | None

temp_file_prefix: str
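
The three abstract methods suggest a template-method design: get_speech generates a file, then reads it back. The sketch below mirrors that shape using only the standard library. `FileBackedSpeech` and `SilenceWav` are hypothetical stand-ins, the temporary-file handling is an assumption rather than the library's actual behavior, and a `(frames, sample_rate)` tuple stands in for an Audio instance.

```python
import tempfile
import wave
from abc import ABC, abstractmethod
from pathlib import Path


class FileBackedSpeech(ABC):
    """Hypothetical stand-in mirroring the abstract hooks listed above."""

    def __init__(self, temp_file_dir=None, temp_file_prefix="sketch-"):
        self.temp_file_dir = temp_file_dir
        self.temp_file_prefix = temp_file_prefix

    @abstractmethod
    def generate_speech_audio_file(self, text, audio_file_path): ...

    @abstractmethod
    def get_audio_file_type(self): ...

    @abstractmethod
    def get_audio_from_file(self, file_path): ...

    def get_speech(self, text):
        # Assumption: write into a temporary directory that is removed afterwards.
        with tempfile.TemporaryDirectory(
            prefix=self.temp_file_prefix, dir=self.temp_file_dir
        ) as tmp:
            path = Path(tmp) / f"speech.{self.get_audio_file_type()}"
            self.generate_speech_audio_file(text, path)
            return self.get_audio_from_file(path)


class SilenceWav(FileBackedSpeech):
    """Toy engine: writes 0.1 s of 16-bit mono silence per character."""

    def get_audio_file_type(self):
        return "wav"

    def generate_speech_audio_file(self, text, audio_file_path):
        with wave.open(str(audio_file_path), "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)
            f.setframerate(8000)
            f.writeframes(b"\x00\x00" * 800 * len(text))

    def get_audio_from_file(self, file_path):
        with wave.open(str(file_path), "rb") as f:
            return f.getnframes(), f.getframerate()


frames, rate = SilenceWav().get_speech("hello")
```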

class voicebox.tts.tts.FallbackTTS(ttss: ~typing.Sequence[~voicebox.tts.tts.TTS], exceptions_to_catch: ~typing.Tuple[~typing.Type[BaseException]] = (<class 'Exception'>,), log: ~logging.Logger = <Logger voicebox.tts.tts (WARNING)>)[source]

Bases: TTS

Attempts to call the TTSs in order, returning results from the first TTS that does not raise an exception.

Useful if you have e.g. an online TTS that you want to use primarily, and want to fall back to an offline TTS in case something goes wrong.

Parameters:

ttss – The TTSs to try, in order.
exceptions_to_catch – The exceptions to catch and log when calling the TTSs. If an exception is raised that is not in this tuple, then it will not be caught.
log – The logger to use for logging exceptions.
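
A minimal sketch of the try-in-order pattern, using plain callables in place of TTS instances. `first_success`, `online`, and `offline` are hypothetical names. What the class does when every TTS fails is not stated here, so raising `RuntimeError` in that case is an assumption of this sketch.

```python
import logging

log = logging.getLogger("fallback_sketch")


def first_success(text, engines, exceptions_to_catch=(Exception,)):
    """Return the result of the first engine that does not raise."""
    last_error = None
    for index, engine in enumerate(engines):
        try:
            return engine(text)
        except exceptions_to_catch as e:
            # Exceptions outside exceptions_to_catch propagate immediately.
            log.warning("Engine %d failed: %r", index, e)
            last_error = e
    # Assumption: signal total failure with a RuntimeError.
    raise RuntimeError("All engines failed") from last_error


def online(text):
    raise ConnectionError("simulated network outage")


def offline(text):
    return f"audio for {text!r}"


result = first_success("Hello there!", [online, offline])
```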

exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,)

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

handle_exception(e: BaseException, tts: TTS, tts_index: int) → None[source]

log: Logger = <Logger voicebox.tts.tts (WARNING)>

ttss: Sequence[TTS]

class voicebox.tts.tts.Mp3FileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]

Bases: AudioFileTTS, ABC

get_audio_file_type() → str[source]: Returns the file type of the audio files generated by this TTS.

get_audio_from_file(file_path: Path) → Audio[source]: Returns an Audio instance from the given file path.

class voicebox.tts.tts.RetryTTS(tts: ~voicebox.tts.tts.TTS, max_attempts: int = 3, exceptions_to_catch: ~typing.Tuple[~typing.Type[BaseException]] = (<class 'Exception'>, ), log: ~logging.Logger = <Logger voicebox.tts.tts (WARNING)>)[source]

Bases: TTS

If an exception occurs while getting speech from the given TTS, retry until max_attempts is reached.

Parameters:

tts – The TTS to call.
max_attempts – The maximum number of attempts to make.
exceptions_to_catch – The exceptions to catch and log when calling the TTS. If an exception is raised that is not in this tuple, then it will not be caught.
log – The logger to use for logging exceptions.
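
The retry loop can be sketched in the same style. `get_with_retries` and `flaky_engine` are hypothetical, and re-raising the last exception once max_attempts is exhausted is an assumption, since the behavior after the final failure is not described here.

```python
import logging

log = logging.getLogger("retry_sketch")


def get_with_retries(text, engine, max_attempts=3, exceptions_to_catch=(Exception,)):
    """Call engine(text), retrying on the listed exceptions."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return engine(text)
        except exceptions_to_catch as e:
            log.warning("Attempt %d of %d failed: %r", attempt, max_attempts, e)
            if attempt == max_attempts:
                raise  # Assumption: out of attempts, surface the last error.


attempts = []


def flaky_engine(text):
    # Fails on the first two calls, then succeeds.
    attempts.append(text)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return f"audio for {text!r}"


result = get_with_retries("Hello there!", flaky_engine)
```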

exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,)

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

handle_exception(e: BaseException, attempt: int) → None[source]

log: Logger = <Logger voicebox.tts.tts (WARNING)>

max_attempts: int = 3

tts: TTS

class voicebox.tts.tts.TTS[source]

Bases: ABC

Base class for text-to-speech engines.

abstractmethod get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

class voicebox.tts.tts.WavFileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]

Bases: AudioFileTTS, ABC

get_audio_file_type() → str[source]: Returns the file type of the audio files generated by this TTS.

get_audio_from_file(file_path: Path) → Audio[source]: Returns an Audio instance from the given file path.

voicebox.tts.utils module

voicebox.tts.utils.add_optional_items(d: dict, items: Iterable[Tuple[K, V | None]]) → dict[source]: Adds items with non-null values to the given dict.
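
A sketch of the documented behavior, written under a different name (`add_non_null_items`, hypothetical) to make clear it is not the library's source. Whether the real function mutates the dict in place or returns a copy is not stated here, so this version mutates and returns the same dict as an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def add_non_null_items(d: Dict[K, V], items: Iterable[Tuple[K, Optional[V]]]) -> Dict[K, V]:
    """Add each (key, value) pair whose value is not None."""
    for key, value in items:
        if value is not None:  # falsy values such as 0 or '' are still added
            d[key] = value
    return d


# Building a request payload where unset options are simply omitted.
request = add_non_null_items(
    {"VoiceId": "Joanna"},
    [("Engine", None), ("LanguageCode", "en-US"), ("SampleRate", 0)],
)
```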

voicebox.tts.utils.get_audio_from_mp3(file) → Audio[source]: Returns an Audio instance from an MP3 file.

voicebox.tts.utils.get_audio_from_samples(samples: ndarray, sample_rate: int) → Audio[source]

Takes raw integer samples and a sample rate, and returns an Audio instance whose signal is scaled to the range [-1, 1).

Parameters:

samples – The raw samples as a numpy array. dtype must be int8, int16, or int32.
sample_rate – The sample rate of the samples in Hz.
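
The scaling can be illustrated with plain Python integers in place of a numpy array, to keep the sketch dependency-free. Dividing by 2**(bits - 1) is the usual convention that maps signed integers onto [-1, 1). Whether voicebox uses exactly this divisor is an assumption, and `scale_int_samples` is a hypothetical name.

```python
def scale_int_samples(samples, bits):
    """Scale signed integer samples of the given bit depth to [-1, 1)."""
    if bits not in (8, 16, 32):
        raise ValueError("bits must be 8, 16, or 32")
    full_scale = 2 ** (bits - 1)  # e.g. 32768 for 16-bit samples
    return [s / full_scale for s in samples]


# The most negative int16 maps to exactly -1.0; the most positive stays below 1.0.
scaled = scale_int_samples([-32768, 0, 16384, 32767], bits=16)
```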

voicebox.tts.utils.get_audio_from_wav_file(file_or_path: IO[bytes] | str | Path) → Audio[source]: Returns an Audio instance from a WAV file.

voicebox.tts.voiceai module

class voicebox.tts.voiceai.VoiceAiTTS(api_key: str, voice_id: str = None, temperature: float = None, top_p: float = None, model: str = None, language: str = None, api_url: str = 'https://dev.voice.ai/api/v1/tts/speech', extra_json: dict[str, Any] = None, extra_headers: dict[str, str] = None, request_kwargs: dict[str, Any] = None)[source]

Bases: TTS

TTS using the Voice.AI API.

Supports SSML: ✔ (docs)

Parameters:

api_key – Your Voice.AI API key. Create one here: https://voice.ai/app/dashboard/developers
voice_id – (Optional) Voice ID. If omitted, the default built-in voice is used.
temperature – (Optional) Temperature for generation (0.0-2.0).
top_p – (Optional) Top-p sampling parameter (0.0-1.0).
model – (Optional) TTS model to use. See here for options: https://voice.ai/docs/api-reference/text-to-speech/generate-speech#body-model-one-of-0
language – (Optional) Language code (ISO 639-1 format, e.g., ‘en’, ‘es’, ‘fr’). Defaults to ‘en’ if not provided.
api_url – (Optional) Override the default API URL.
extra_json – (Optional) Extra request parameters to put in the JSON payload.
extra_headers – (Optional) Extra headers to add to the request.
request_kwargs – (Optional) Extra kwargs to pass to the requests.post() call.

get_speech(text: str | SSML) → Audio[source]: Returns audio of the given text.

Module contents

voicebox.tts.default_tts() → TTS[source]: Returns a new instance of the default TTS, PicoTTS.