voicebox.tts package
Submodules
voicebox.tts.amazonpolly module
- class voicebox.tts.amazonpolly.AmazonPolly(client: PollyClient, voice_id: Literal['Aditi', 'Adriano', 'Amy', 'Andres', 'Aria', 'Arlet', 'Arthur', 'Astrid', 'Ayanda', 'Bianca', 'Brian', 'Burcu', 'Camila', 'Carla', 'Carmen', 'Celine', 'Chantal', 'Conchita', 'Cristiano', 'Daniel', 'Danielle', 'Dora', 'Elin', 'Emma', 'Enrique', 'Ewa', 'Filiz', 'Gabrielle', 'Geraint', 'Giorgio', 'Gregory', 'Gwyneth', 'Hala', 'Hannah', 'Hans', 'Hiujin', 'Ida', 'Ines', 'Isabelle', 'Ivy', 'Jacek', 'Jan', 'Jasmine', 'Jihye', 'Jitka', 'Joanna', 'Joey', 'Justin', 'Kajal', 'Karl', 'Kazuha', 'Kendra', 'Kevin', 'Kimberly', 'Laura', 'Lea', 'Liam', 'Lisa', 'Liv', 'Lotte', 'Lucia', 'Lupe', 'Mads', 'Maja', 'Marlene', 'Mathieu', 'Matthew', 'Maxim', 'Mia', 'Miguel', 'Mizuki', 'Naja', 'Niamh', 'Nicole', 'Ola', 'Olivia', 'Pedro', 'Penelope', 'Raveena', 'Remi', 'Ricardo', 'Ruben', 'Russell', 'Ruth', 'Sabrina', 'Salli', 'Seoyeon', 'Sergio', 'Sofie', 'Stephen', 'Suvi', 'Takumi', 'Tatyana', 'Thiago', 'Tomoko', 'Vicki', 'Vitoria', 'Zayd', 'Zeina', 'Zhiyu'], engine: Literal['generative', 'long-form', 'neural', 'standard'] = None, language_code: Literal['ar-AE', 'arb', 'ca-ES', 'cmn-CN', 'cs-CZ', 'cy-GB', 'da-DK', 'de-AT', 'de-CH', 'de-DE', 'en-AU', 'en-GB', 'en-GB-WLS', 'en-IE', 'en-IN', 'en-NZ', 'en-SG', 'en-US', 'en-ZA', 'es-ES', 'es-MX', 'es-US', 'fi-FI', 'fr-BE', 'fr-CA', 'fr-FR', 'hi-IN', 'is-IS', 'it-IT', 'ja-JP', 'ko-KR', 'nb-NO', 'nl-BE', 'nl-NL', 'pl-PL', 'pt-BR', 'pt-PT', 'ro-RO', 'ru-RU', 'sv-SE', 'tr-TR', 'yue-CN'] = None, lexicon_names: Sequence[str] = None, sample_rate: Literal[8000, 16000] = 16000)[source]
Bases: TTS
TTS using Amazon Polly.
See the Amazon Polly documentation for full descriptions of the parameters.
- client: PollyClient
Boto3 Polly client, created by e.g.
>>> import boto3
>>> session = boto3.Session(...)
>>> client = session.client('polly')
- engine: Literal['generative', 'long-form', 'neural', 'standard'] = None
Specifies the engine (standard, neural, long-form, or generative) for Amazon Polly to use when processing input text for speech synthesis.
- language_code: Literal['ar-AE', 'arb', 'ca-ES', 'cmn-CN', 'cs-CZ', 'cy-GB', 'da-DK', 'de-AT', 'de-CH', 'de-DE', 'en-AU', 'en-GB', 'en-GB-WLS', 'en-IE', 'en-IN', 'en-NZ', 'en-SG', 'en-US', 'en-ZA', 'es-ES', 'es-MX', 'es-US', 'fi-FI', 'fr-BE', 'fr-CA', 'fr-FR', 'hi-IN', 'is-IS', 'it-IT', 'ja-JP', 'ko-KR', 'nb-NO', 'nl-BE', 'nl-NL', 'pl-PL', 'pt-BR', 'pt-PT', 'ro-RO', 'ru-RU', 'sv-SE', 'tr-TR', 'yue-CN'] = None
Optional language code for the Synthesize Speech request.
- lexicon_names: Sequence[str] = None
List of one or more pronunciation lexicon names you want the service to apply during synthesis.
- sample_rate: Literal[8000, 16000] = 16000
Sample rate of returned audio. Must be 8000 or 16000.
- voice_id: Literal['Aditi', 'Adriano', 'Amy', 'Andres', 'Aria', 'Arlet', 'Arthur', 'Astrid', 'Ayanda', 'Bianca', 'Brian', 'Burcu', 'Camila', 'Carla', 'Carmen', 'Celine', 'Chantal', 'Conchita', 'Cristiano', 'Daniel', 'Danielle', 'Dora', 'Elin', 'Emma', 'Enrique', 'Ewa', 'Filiz', 'Gabrielle', 'Geraint', 'Giorgio', 'Gregory', 'Gwyneth', 'Hala', 'Hannah', 'Hans', 'Hiujin', 'Ida', 'Ines', 'Isabelle', 'Ivy', 'Jacek', 'Jan', 'Jasmine', 'Jihye', 'Jitka', 'Joanna', 'Joey', 'Justin', 'Kajal', 'Karl', 'Kazuha', 'Kendra', 'Kevin', 'Kimberly', 'Laura', 'Lea', 'Liam', 'Lisa', 'Liv', 'Lotte', 'Lucia', 'Lupe', 'Mads', 'Maja', 'Marlene', 'Mathieu', 'Matthew', 'Maxim', 'Mia', 'Miguel', 'Mizuki', 'Naja', 'Niamh', 'Nicole', 'Ola', 'Olivia', 'Pedro', 'Penelope', 'Raveena', 'Remi', 'Ricardo', 'Ruben', 'Russell', 'Ruth', 'Sabrina', 'Salli', 'Seoyeon', 'Sergio', 'Sofie', 'Stephen', 'Suvi', 'Takumi', 'Tatyana', 'Thiago', 'Tomoko', 'Vicki', 'Vitoria', 'Zayd', 'Zeina', 'Zhiyu']
Voice ID to use for the synthesis.
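The fields above line up with the parameters of the boto3 Polly synthesize_speech call. The following is a minimal sketch of how such a request could be assembled, assuming that unset optional fields are simply left out and that raw PCM output is requested; neither assumption is confirmed by this page. build_polly_request is a hypothetical helper, not part of voicebox.

```python
from typing import Any, Optional, Sequence


def build_polly_request(
    text: str,
    voice_id: str,
    engine: Optional[str] = None,
    language_code: Optional[str] = None,
    lexicon_names: Optional[Sequence[str]] = None,
    sample_rate: int = 16000,
) -> dict[str, Any]:
    """Hypothetical helper: builds keyword arguments that could be passed
    as ``client.synthesize_speech(**request)`` on a boto3 Polly client."""
    if sample_rate not in (8000, 16000):
        raise ValueError('sample_rate must be 8000 or 16000')

    request: dict[str, Any] = {
        'Text': text,
        'VoiceId': voice_id,
        'OutputFormat': 'pcm',           # assumption: raw PCM is requested
        'SampleRate': str(sample_rate),  # Polly takes the rate as a string
    }
    # Only include the optional parameters that were actually set.
    if engine is not None:
        request['Engine'] = engine
    if language_code is not None:
        request['LanguageCode'] = language_code
    if lexicon_names is not None:
        request['LexiconNames'] = list(lexicon_names)
    return request


request = build_polly_request('Hello!', voice_id='Joanna', engine='neural')
print(request)
```

Leaving unset options out of the request lets the service apply its own defaults, for example the default engine for the chosen voice.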
voicebox.tts.cache module
- class voicebox.tts.cache.CachedTTS(tts: TTS, cache: MutableMapping)[source]
Bases: TTS
Wraps a TTS instance in a cache to reduce calls to the TTS.
- classmethod build(tts: TTS, max_size: int | float = 60, size_func: Literal['bytes', 'count', 'seconds'] | Callable[[Any], int | float] = 'seconds', cache_class: Type[Cache] = <class 'cachetools.LRUCache'>) CachedTTS[source]
Constructs a cache that by default will keep the most recently used 60 seconds of audio, and wraps the given TTS instance in the cache so the TTS is only called for text not contained in the cache.
- Parameters:
tts – The TTS instance to wrap. Will be called for text not contained in the cache.
max_size – The maximum size of the cache, as determined by size_func. Defaults to 60 (seconds).
size_func – The function that measures the size of each item. If set to 'seconds' (default), then max_size will be in units of audio seconds. If set to 'bytes', then max_size will be in units of audio bytes. If set to 'count', then max_size will be simply the number of audio clips to cache. Alternatively, any function that takes an Audio instance as input and returns a size value can be passed in.
cache_class – The Cache class used to construct the cache. Defaults to cachetools.LRUCache, a Least Recently Used cache.
- Returns:
An instance of CachedTTS.
- cache: MutableMapping
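The wrapping idea described for CachedTTS can be sketched without voicebox at all. The example below is an illustration under stated assumptions, not the library's implementation: CachedSpeaker and fake_engine are hypothetical stand-ins, audio is represented as plain bytes, and any MutableMapping (here a dict) plays the role of the cache.

```python
from typing import Callable, List, MutableMapping


class CachedSpeaker:
    """Hypothetical stand-in: wraps a text -> audio callable in a cache."""

    def __init__(self, synthesize: Callable[[str], bytes],
                 cache: MutableMapping[str, bytes]):
        self.synthesize = synthesize
        self.cache = cache

    def get_speech(self, text: str) -> bytes:
        try:
            return self.cache[text]
        except KeyError:
            # Cache miss: call the (possibly slow or billed) engine once.
            audio = self.synthesize(text)
            self.cache[text] = audio
            return audio


calls: List[str] = []


def fake_engine(text: str) -> bytes:
    """Records each call so cache hits can be told apart from misses."""
    calls.append(text)
    return text.encode()


speaker = CachedSpeaker(fake_engine, cache={})
speaker.get_speech('Hello there!')
speaker.get_speech('Hello there!')  # served from the cache
print(len(calls))  # prints 1
```

The audio is returned from a local variable rather than re-read from the cache, so a bounded cache that declines or immediately evicts an item cannot cause a lookup failure.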
- class voicebox.tts.cache.PrerecordedTTS(texts_to_audios: Mapping[str | SSML, Audio], fallback_tts: TTS = None)[source]
Bases: TTS
Returns audio from a map of message texts to Audio instances. Useful for playing back pre-recorded messages. Also supports an optional fallback TTS instance for messages not in the map.
- Parameters:
texts_to_audios – Mapping of message texts to Audio instances.
fallback_tts – Optional fallback TTS instance that will be used if a text is not found in the map.
- classmethod from_tts(tts: TTS, texts: Iterable[str | SSML], use_as_fallback: bool = True) PrerecordedTTS[source]
Returns a PrerecordedTTS instance using audio generated by the given TTS instance.
- Parameters:
tts – The TTS instance to use.
texts – The texts to generate audio of.
use_as_fallback – If True, then the given TTS instance will be used as a fallback for texts not in the map.
Example
>>> tts = ...some tts...
>>> tts = PrerecordedTTS.from_tts(
...     tts,
...     texts=[
...         'I say this all the time.',
...         'Hello there!',
...     ],
... )
- classmethod from_wav_files(texts_to_files: Mapping[str | SSML, IO[bytes] | str | Path], fallback_tts: TTS = None) PrerecordedTTS[source]
Returns a PrerecordedTTS instance using audio from the specified wav files.
- Parameters:
texts_to_files – Mapping of texts to wav files.
fallback_tts – Optional fallback TTS instance.
Example
>>> tts = PrerecordedTTS.from_wav_files({
...     'startup': 'audio/startup.wav',
...     'no_internet': 'audio/no_internet.wav',
... })
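The lookup-then-fallback behaviour of PrerecordedTTS can be illustrated with a small stand-alone sketch. PrerecordedSpeaker is a hypothetical stand-in, audio is represented as bytes, and raising KeyError when no fallback is configured is an assumption; this page does not say what the library does in that case.

```python
from typing import Callable, Mapping, Optional


class PrerecordedSpeaker:
    """Hypothetical stand-in for the map-with-fallback idea."""

    def __init__(self, texts_to_audios: Mapping[str, bytes],
                 fallback: Optional[Callable[[str], bytes]] = None):
        self.texts_to_audios = texts_to_audios
        self.fallback = fallback

    def get_speech(self, text: str) -> bytes:
        if text in self.texts_to_audios:
            return self.texts_to_audios[text]  # pre-recorded clip
        if self.fallback is not None:
            return self.fallback(text)         # synthesize on demand
        # Assumption: with no fallback, an unknown text is an error.
        raise KeyError(f'No pre-recorded audio for: {text!r}')


speaker = PrerecordedSpeaker(
    {'startup': b'<startup clip>'},
    fallback=lambda text: b'<synthesized: ' + text.encode() + b'>',
)
print(speaker.get_speech('startup'))
print(speaker.get_speech('shutdown'))
```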
- voicebox.tts.cache.SizeFunc
Returns the size of the given item.
alias of Callable[[Any], int | float]
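To make the role of a size function concrete, here is a sketch of a least-recently-used cache whose capacity is measured by a pluggable size function, in the spirit of the 'seconds' and 'count' options of CachedTTS.build. Everything here (Clip, SizedLRU, size_in_seconds, size_in_count) is a hypothetical stand-in built on the standard library; the real class delegates to cachetools. As a simplification, a single clip larger than max_size is kept rather than rejected.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, List, Union

Number = Union[int, float]


@dataclass
class Clip:
    """Stand-in for an audio clip (not the voicebox Audio class)."""
    samples: List[float]
    sample_rate: int


def size_in_seconds(clip: Clip) -> float:
    return len(clip.samples) / clip.sample_rate


def size_in_count(clip: Clip) -> int:
    return 1  # every clip counts the same


class SizedLRU:
    """Tiny LRU cache whose capacity is measured by ``size_func``."""

    def __init__(self, max_size: Number,
                 size_func: Callable[[Clip], Number]):
        self.max_size = max_size
        self.size_func = size_func
        self._items: 'OrderedDict[str, Clip]' = OrderedDict()

    def __contains__(self, key: str) -> bool:
        return key in self._items

    def get(self, key: str) -> Clip:
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, clip: Clip) -> None:
        self._items[key] = clip
        self._items.move_to_end(key)
        # Evict least recently used entries until the total fits.
        while self.total_size() > self.max_size and len(self._items) > 1:
            self._items.popitem(last=False)

    def total_size(self) -> Number:
        return sum(self.size_func(c) for c in self._items.values())


def make_clip(seconds: int) -> Clip:
    return Clip(samples=[0.0] * (seconds * 1000), sample_rate=1000)


cache = SizedLRU(max_size=60, size_func=size_in_seconds)
cache.put('a', make_clip(30))
cache.put('b', make_clip(20))
cache.put('c', make_clip(25))  # total would be 75 s, so 'a' is evicted
print(cache.total_size())  # prints 45.0
```

Swapping in size_in_count with max_size=2 would instead keep the two most recently used clips regardless of their length.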
voicebox.tts.elevenlabs module
- class voicebox.tts.elevenlabs.ElevenLabsTTS(*, voice_id: str, api_key: str = None, client: ElevenLabs = None, sample_rate: int = 32000, convert_kwargs: dict[str, Any] = None)[source]
Bases: TTS
TTS using the ElevenLabs API.
- Parameters:
voice_id – Voice to use. See here for a list of valid voice IDs.
api_key – (Optional) Your ElevenLabs API key. If this and client are not given, then the client will pull the API key from the ELEVENLABS_API_KEY env var. Note: Cannot be used with the client arg!
client – (Optional) An elevenlabs.client.ElevenLabs instance. Use this if you want to further customize the client behavior. Note: Cannot be used with the api_key arg!
sample_rate – (Optional) PCM audio sample rate. Defaults to 32 kHz. This is used to set the output_format of the request. See here for valid options. Note: You must pick a sample rate from one of the output_format options beginning with pcm_! Other codecs are not supported.
convert_kwargs – (Optional) Additional kwargs to pass to the client.text_to_speech.convert call. See here for all options: https://elevenlabs.io/docs/api-reference/text-to-speech/convert
- property api_key: str
- client: ElevenLabs
- convert_kwargs: dict[str, Any]
- property output_format
- sample_rate: int
- voice_id: str
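Two of the rules in the parameter notes above can be expressed as a short sketch: api_key and client are mutually exclusive, and the sample rate selects a PCM output_format. Both helpers are hypothetical. The exception type is an assumption, and the pcm_<rate> pattern is inferred from the note about options beginning with pcm_ rather than taken from the library source.

```python
from typing import Any, Optional


def check_client_args(api_key: Optional[str], client: Optional[Any]) -> None:
    """Rejects the one combination the parameter notes rule out."""
    if api_key is not None and client is not None:
        # Assumption: the exception type voicebox uses is not documented here.
        raise ValueError('Pass either api_key or client, not both.')


def pcm_output_format(sample_rate: int) -> str:
    """Formats an output_format value for raw PCM at the given rate."""
    return f'pcm_{sample_rate}'


check_client_args(api_key='my-key', client=None)  # accepted
print(pcm_output_format(32000))  # prints pcm_32000
```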
voicebox.tts.espeakng module
- class voicebox.tts.espeakng.ESpeakConfig(amplitude: int = None, word_gap_seconds: float = None, capitals: int = None, line_length: int = None, pitch: int = None, speed: int = None, voice: str = None, no_final_pause: bool = False, speak_punctuation: bool | str = False, exe_path: str = 'espeak-ng', timeout: float = None)[source]
Bases: object
Configuration for the eSpeak NG engine.
Run “espeak-ng -h” for more information on these options.
- amplitude: int = None
- capitals: int = None
- exe_path: str = 'espeak-ng'
- line_length: int = None
- no_final_pause: bool = False
- pitch: int = None
- speak_punctuation: bool | str = False
- speed: int = None
- timeout: float = None
- voice: str = None
- word_gap_seconds: float = None
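The configuration fields correspond to options of the espeak-ng command line. The sketch below shows one way such a config could be turned into an argument list. Config and build_args are hypothetical stand-ins; the use of --stdout and the conversion of word_gap_seconds to the 10 ms units of the -g option are assumptions about how a wrapper might work, not a description of voicebox internals.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Config:
    """Stand-in mirroring most of the ESpeakConfig fields listed above."""
    amplitude: Optional[int] = None
    word_gap_seconds: Optional[float] = None
    capitals: Optional[int] = None
    line_length: Optional[int] = None
    pitch: Optional[int] = None
    speed: Optional[int] = None
    voice: Optional[str] = None
    no_final_pause: bool = False
    speak_punctuation: Union[bool, str] = False
    exe_path: str = 'espeak-ng'


def build_args(config: Config) -> List[str]:
    # Assumption: the audio is read back from standard output.
    args = [config.exe_path, '--stdout']
    if config.amplitude is not None:
        args += ['-a', str(config.amplitude)]
    if config.word_gap_seconds is not None:
        # -g is expressed in units of 10 ms at the default speed.
        args += ['-g', str(round(config.word_gap_seconds * 100))]
    if config.capitals is not None:
        args += ['-k', str(config.capitals)]
    if config.line_length is not None:
        args += ['-l', str(config.line_length)]
    if config.pitch is not None:
        args += ['-p', str(config.pitch)]
    if config.speed is not None:
        args += ['-s', str(config.speed)]
    if config.voice is not None:
        args += ['-v', config.voice]
    if config.no_final_pause:
        args.append('-z')
    if config.speak_punctuation is True:
        args.append('--punct')  # speak all punctuation
    elif isinstance(config.speak_punctuation, str):
        args.append(f'--punct={config.speak_punctuation}')  # only these
    return args


args = build_args(Config(speed=150, voice='en-us', word_gap_seconds=0.05))
print(args)
```

The resulting list could be passed to subprocess.run together with the text to speak and, if desired, a timeout.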
- class voicebox.tts.espeakng.ESpeakNG(config: ESpeakConfig = <factory>)[source]
Bases: TTS
TTS using the eSpeak NG engine.
You may need to install it:
On Debian/Ubuntu:
sudo apt install espeak-ng
- Parameters:
config – Optional configuration for the eSpeak NG engine. If not given, a default config will be used.
- config: ESpeakConfig
voicebox.tts.googlecloudtts module
- class voicebox.tts.googlecloudtts.GoogleCloudTTS(client: TextToSpeechClient, voice_params: VoiceSelectionParams, audio_config: AudioConfig = <factory>, timeout: float = _MethodDefault._DEFAULT_VALUE)[source]
Bases: TTS
TTS using Google Cloud TTS.
You will need to set up a Google Cloud project with billing enabled. See this quickstart guide to get started.
- audio_config: AudioConfig
- client: TextToSpeechClient
- timeout: float = <object object>
- voice_params: VoiceSelectionParams
voicebox.tts.gtts module
- class voicebox.tts.gtts.gTTS(temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-gtts-', **gtts_kwargs: dict[str, Any])[source]
Bases: Mp3FileTTS
Online TTS using the gTTS library, which queries the Google Translate TTS API under the hood.
Supports SSML: ✘
- Parameters:
gtts_kwargs – These will be passed to the gtts.gTTS constructor. See the gTTS docs for options.
- generate_speech_audio_file(text: str | SSML, audio_file_path: Path) None[source]
Generates a speech audio file from the given text.
- gtts_kwargs: Dict[str, Any]
voicebox.tts.parlertts module
voicebox.tts.picotts module
- class voicebox.tts.picotts.PicoTTS(pico2wave_path: str = 'pico2wave', language: str = None, temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-pico-tts-')[source]
Bases: WavFileTTS
TTS using Pico TTS.
You may need to install it:
On Debian/Ubuntu:
sudo apt install libttspico-utils
Supports SSML: ✘
- generate_speech_audio_file(text: str | SSML, file_path: Path) None[source]
Generates a speech audio file from the given text.
- language: str = None
- pico2wave_path: str = 'pico2wave'
voicebox.tts.pyttsx3 module
- class voicebox.tts.pyttsx3.Pyttsx3TTS(engine: Engine = None, temp_file_dir: str = None, temp_file_prefix: str = 'voicebox-pyttsx3-')[source]
Bases: WavFileTTS
TTS using pyttsx3.
- Parameters:
engine – (Optional) The pyttsx3 engine to use. If not given, a new engine will be created via pyttsx3.init().
temp_file_dir – (Optional) The directory to save temporary audio files to. If not given, then the default temporary directory will be used.
temp_file_prefix – (Optional) The prefix to use for temporary audio files. Defaults to 'voicebox-pyttsx3-'.
voicebox.tts.tts module
- class voicebox.tts.tts.AudioFileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]
Bases: TTS, ABC
Base class for text-to-speech engines that generate audio files.
- abstractmethod generate_speech_audio_file(text: str | SSML, audio_file_path: Path) None[source]
Generates a speech audio file from the given text.
- abstractmethod get_audio_file_type() str[source]
Returns the file type of the audio files generated by this TTS.
- abstractmethod get_audio_from_file(file_path: Path) Audio[source]
Returns an Audio instance from the given file path.
- temp_file_dir: str | None
- temp_file_prefix: str
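The three abstract methods suggest a template-method pattern: create a temporary file, let the subclass write audio into it, load it back, then clean up. The sketch below shows that flow under stated assumptions. FileBackedSpeaker and EchoSpeaker are hypothetical stand-ins, audio is represented as bytes, and the exact temp-file handling in voicebox may differ.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class FileBackedSpeaker(ABC):
    """Stand-in for the file-based pattern (not the voicebox class)."""

    def __init__(self, temp_file_dir: Optional[str], temp_file_prefix: str):
        self.temp_file_dir = temp_file_dir
        self.temp_file_prefix = temp_file_prefix

    def get_speech(self, text: str) -> bytes:
        fd, name = tempfile.mkstemp(
            suffix='.' + self.get_audio_file_type(),
            prefix=self.temp_file_prefix,
            dir=self.temp_file_dir,  # None means the default temp directory
        )
        os.close(fd)  # the subclass reopens the file by path
        path = Path(name)
        try:
            self.generate_speech_audio_file(text, path)
            return self.get_audio_from_file(path)
        finally:
            path.unlink()  # always remove the temporary file

    @abstractmethod
    def generate_speech_audio_file(self, text: str, audio_file_path: Path) -> None: ...

    @abstractmethod
    def get_audio_file_type(self) -> str: ...

    @abstractmethod
    def get_audio_from_file(self, file_path: Path) -> bytes: ...


class EchoSpeaker(FileBackedSpeaker):
    """Toy engine that "synthesizes" by writing the text itself."""
    last_path: Optional[Path] = None

    def generate_speech_audio_file(self, text: str, audio_file_path: Path) -> None:
        self.last_path = audio_file_path
        audio_file_path.write_bytes(text.encode())

    def get_audio_file_type(self) -> str:
        return 'wav'

    def get_audio_from_file(self, file_path: Path) -> bytes:
        return file_path.read_bytes()


speaker = EchoSpeaker(temp_file_dir=None, temp_file_prefix='demo-')
audio = speaker.get_speech('hello')
print(audio)  # prints b'hello'
```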
- class voicebox.tts.tts.FallbackTTS(ttss: Sequence[TTS], exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,), log: Logger = <Logger voicebox.tts.tts (WARNING)>)[source]
Bases: TTS
Attempts to call the TTSs in order, returning results from the first TTS that does not raise an exception.
Useful if you have e.g. an online TTS that you want to use primarily, and want to fall back to an offline TTS in case something goes wrong.
- Parameters:
ttss – The TTSs to try, in order.
exceptions_to_catch – The exceptions to catch and log when calling the TTSs. If an exception is raised that is not in this tuple, then it will not be caught.
log – The logger to use for logging exceptions.
- exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,)
- log: Logger = <Logger voicebox.tts.tts (WARNING)>
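The try-in-order behaviour of FallbackTTS can be sketched as a plain loop. first_success and the two toy engines are hypothetical, audio is represented as bytes, and raising RuntimeError when every engine fails is an assumption; this page does not document what the library does in that case.

```python
import logging
from typing import Callable, Sequence, Tuple, Type

log = logging.getLogger('fallback-demo')


def first_success(
    engines: Sequence[Callable[[str], bytes]],
    text: str,
    exceptions_to_catch: Tuple[Type[BaseException], ...] = (Exception,),
) -> bytes:
    """Returns the result of the first engine that does not raise."""
    for engine in engines:
        try:
            return engine(text)
        except exceptions_to_catch as error:
            # Log the failure and move on to the next engine.
            log.warning('Engine failed (%s); trying the next one.', error)
    # Assumption: exhausting every engine is reported as an error.
    raise RuntimeError('No engine could handle the text')


def online_engine(text: str) -> bytes:
    raise ConnectionError('network unreachable')


def offline_engine(text: str) -> bytes:
    return b'offline:' + text.encode()


result = first_success([online_engine, offline_engine], 'hi')
print(result)  # prints b'offline:hi'
```

Exceptions outside exceptions_to_catch propagate immediately, which is why the default of Exception leaves KeyboardInterrupt untouched.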
- class voicebox.tts.tts.Mp3FileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]
Bases: AudioFileTTS, ABC
- class voicebox.tts.tts.RetryTTS(tts: TTS, max_attempts: int = 3, exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,), log: Logger = <Logger voicebox.tts.tts (WARNING)>)[source]
Bases: TTS
If an exception occurs while getting speech from the given TTS, retry until max_attempts is reached.
- Parameters:
tts – The TTS to call.
max_attempts – The maximum number of attempts to make.
exceptions_to_catch – The exceptions to catch and log when calling the TTS. If an exception is raised that is not in this tuple, then it will not be caught.
log – The logger to use for logging exceptions.
- exceptions_to_catch: Tuple[Type[BaseException]] = (<class 'Exception'>,)
- log: Logger = <Logger voicebox.tts.tts (WARNING)>
- max_attempts: int = 3
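A retry wrapper of this kind usually boils down to a bounded loop. The sketch below is one plausible reading of the description, not the library's code: call_with_retries and flaky are hypothetical, audio is represented as bytes, and re-raising the last exception once max_attempts is exhausted is an assumption.

```python
import logging
from typing import Callable, Tuple, Type

log = logging.getLogger('retry-demo')


def call_with_retries(
    func: Callable[[str], bytes],
    text: str,
    max_attempts: int = 3,
    exceptions_to_catch: Tuple[Type[BaseException], ...] = (Exception,),
) -> bytes:
    """Calls ``func(text)``, retrying on the listed exceptions."""
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return func(text)
        except exceptions_to_catch as error:
            log.warning('Attempt %d of %d failed: %s',
                        attempt, max_attempts, error)
            if attempt == max_attempts:
                raise  # assumption: the final error is re-raised
    raise AssertionError('unreachable')


state = {'attempts': 0}


def flaky(text: str) -> bytes:
    """Fails twice, then succeeds."""
    state['attempts'] += 1
    if state['attempts'] < 3:
        raise ConnectionError('temporary outage')
    return text.encode()


result = call_with_retries(flaky, 'hi')
print(result, state['attempts'])  # prints b'hi' 3
```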
- class voicebox.tts.tts.WavFileTTS(temp_file_dir: str | None, temp_file_prefix: str)[source]
Bases: AudioFileTTS, ABC
voicebox.tts.utils module
- voicebox.tts.utils.add_optional_items(d: dict, items: Iterable[Tuple[K, V | None]]) dict[source]
Adds items with non-null values to the given dict.
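Behaviour consistent with this signature and description can be sketched in a few lines. add_optional_items_sketch is a hypothetical re-implementation for illustration, not the library source; in particular, mutating the dict in place and returning the same object is an assumption.

```python
from typing import Hashable, Iterable, Optional, Tuple, TypeVar

K = TypeVar('K', bound=Hashable)
V = TypeVar('V')


def add_optional_items_sketch(
    d: dict,
    items: Iterable[Tuple[K, Optional[V]]],
) -> dict:
    """Copies (key, value) pairs into ``d``, skipping None values."""
    for key, value in items:
        if value is not None:
            d[key] = value
    return d  # assumption: the same dict is returned for chaining


params = add_optional_items_sketch(
    {'text': 'Hello'},
    [('voice', 'en-us'), ('speed', None), ('pitch', 0)],
)
print(params)  # prints {'text': 'Hello', 'voice': 'en-us', 'pitch': 0}
```

Testing against None rather than truthiness keeps legitimate falsy values such as 0 or False.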
- voicebox.tts.utils.get_audio_from_mp3(file) Audio[source]
Returns an Audio instance from an MP3 file.
- voicebox.tts.utils.get_audio_from_samples(samples: ndarray, sample_rate: int) Audio[source]
Takes raw int-typed samples and a sample rate, and returns an Audio instance with signal properly scaled to range [-1, 1).
- Parameters:
samples – The raw samples as a numpy array. dtype must be int8, int16, or int32.
sample_rate – The sample rate of the samples in Hz.
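The half-open range [-1, 1) follows from dividing signed integers by the magnitude of their most negative value. The sketch below shows the arithmetic with plain Python lists and an explicit bit depth; scale_int_samples is a hypothetical helper, whereas the library function works on numpy arrays and presumably infers the bit depth from the dtype.

```python
from typing import List, Sequence


def scale_int_samples(samples: Sequence[int], bits: int) -> List[float]:
    """Scales signed integer samples of the given bit depth to [-1, 1)."""
    if bits not in (8, 16, 32):
        raise ValueError('bits must be 8, 16, or 32')
    full_scale = 2 ** (bits - 1)  # e.g. 32768 for 16-bit samples
    return [s / full_scale for s in samples]


scaled = scale_int_samples([-32768, 0, 16384, 32767], bits=16)
print(scaled[:3])  # prints [-1.0, 0.0, 0.5]
```

The most negative value maps exactly to -1.0, while the most positive value (32767 for 16-bit samples) maps to just under 1.0, which is why the upper bound is exclusive.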
voicebox.tts.voiceai module
- class voicebox.tts.voiceai.VoiceAiTTS(api_key: str, voice_id: str = None, temperature: float = None, top_p: float = None, model: str = None, language: str = None, api_url: str = 'https://dev.voice.ai/api/v1/tts/speech', extra_json: dict[str, Any] = None, extra_headers: dict[str, str] = None, request_kwargs: dict[str, Any] = None)[source]
Bases: TTS
TTS using the Voice.AI API.
- Parameters:
api_key – Your Voice.AI API key. Create one here: https://voice.ai/app/dashboard/developers
voice_id – (Optional) Voice ID. If omitted, the default built-in voice is used.
temperature – (Optional) Temperature for generation (0.0-2.0).
top_p – (Optional) Top-p sampling parameter (0.0-1.0).
model – (Optional) TTS model to use. See here for options: https://voice.ai/docs/api-reference/text-to-speech/generate-speech#body-model-one-of-0
language – (Optional) Language code (ISO 639-1 format, e.g., ‘en’, ‘es’, ‘fr’). Defaults to ‘en’ if not provided.
api_url – (Optional) Override the default API URL.
extra_json – (Optional) Extra request parameters to put in the JSON payload.
extra_headers – (Optional) Extra headers to add to the request.
request_kwargs – (Optional) Extra kwargs to pass to the requests.post() call.