Spanish on Mac

git clone https://github.com/jpgallegoar/Spanish-F5.git
cd Spanish-F5
python3.11 -m venv venv
source venv/bin/activate
python -m pip install torch torchvision torchaudio
python -m pip install soundfile librosa gradio_client
python -m pip install -e .

This part is mainly needed on Mac because of its shared memory:

In the file Spanish-F5/src/f5_tts/infer/utils_infer.py, line 342, we replace:

audio, sr = torchaudio.load(ref_audio)

With:

import soundfile as sf
import torch

data, sr = sf.read(ref_audio)
audio = torch.FloatTensor(data).unsqueeze(0) if data.ndim == 1 else torch.FloatTensor(data).T

Now we start it. The first launch takes a long time because it downloads the model:

./venv/bin/f5-tts_infer-gradio --port 7860 --host 127.0.0.1
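Because the first launch can take a long time, a client script may want to wait until the server accepts connections before calling it. The following is a minimal sketch using only the standard library; `wait_for_port` and its default timeout are hypothetical names and values, not part of Spanish-F5 or Gradio. An open port only shows that the server is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the server above the call would be `wait_for_port("127.0.0.1", 7860)`.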

Put the ffmpeg executable on the PATH. We download it from https://evermeet.cx/ffmpeg/:

cp ffmpeg Spanish-F5/venv/bin/ffmpeg
chmod +x Spanish-F5/venv/bin/ffmpeg
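To check from Python which ffmpeg will actually be used, a lookup along these lines may help. It is only a sketch: `find_ffmpeg` is a hypothetical helper that prefers the copy placed in the virtual environment and otherwise falls back to whatever `shutil.which` finds on the PATH.

```python
import os
import shutil
from typing import Optional


def find_ffmpeg(venv_bin: str) -> Optional[str]:
    """Prefer the ffmpeg inside the virtualenv, else fall back to the system PATH."""
    candidate = os.path.join(venv_bin, "ffmpeg")
    # The copy only counts if it exists and is executable (see the chmod above)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # shutil.which returns None when no ffmpeg is on the PATH
    return shutil.which("ffmpeg")
```

With the layout above this would be called as `find_ffmpeg("Spanish-F5/venv/bin")`.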

To view the API:

python -c "from gradio_client import Client; print(Client('http://127.0.0.1:7860/').view_api())"

Start the server without an internet connection:

GRADIO_ANALYTICS_ENABLED=False ./venv/bin/f5-tts_infer-gradio --port 7860 --host 127.0.0.1

File generar_voz.py

import os
import shutil
import subprocess
import sys
from gradio_client import Client, handle_file

# 1. Automatically detect the folder of this experiment
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

ruta_audio_ref = os.path.join(BASE_DIR, "voz.wav")
ruta_texto_ref = os.path.join(BASE_DIR, "voz.txt")

# Check the command-line arguments (script, input, output)
if len(sys.argv) < 3:
    print("❌ Error: missing arguments.")
    print("Usage: python generar_voz.py <input_file.txt> <output_file.mp3>")
    sys.exit(1)

# Names passed as parameters
nombre_entrada = sys.argv[1]
nombre_salida_mp3 = sys.argv[2]

ruta_entrada_texto = os.path.join(BASE_DIR, nombre_entrada)
ruta_salida_mp3 = os.path.join(BASE_DIR, nombre_salida_mp3)
ruta_temporal_wav = os.path.join(BASE_DIR, "temp_salida.wav")

# Check that the reference files exist
if not os.path.exists(ruta_audio_ref) or not os.path.exists(ruta_texto_ref):
    print(f"❌ Error: 'voz.wav' or 'voz.txt' not found in: {BASE_DIR}")
    sys.exit(1)

if not os.path.exists(ruta_entrada_texto):
    print(f"❌ Error: input file '{nombre_entrada}' not found in: {BASE_DIR}")
    sys.exit(1)

# 2. Connect to the local Gradio server
client = Client("http://127.0.0.1:7860/")

# 3. Read the texts
with open(ruta_texto_ref, "r", encoding="utf-8") as f:
    texto_referencia = f.read().strip()

with open(ruta_entrada_texto, "r", encoding="utf-8") as f:
    nuevo_texto = f.read().strip()

if not nuevo_texto:
    print(f"⚠️ The file '{nombre_entrada}' is empty.")
    sys.exit(1)

print(f"📖 Reading: {nombre_entrada}")
print(f"🗣️ Text: \"{nuevo_texto}\"")
print("📥 Processing in Spanish-F5...")

try:
    # 4. API call
    resultado = client.predict(
        handle_file(ruta_audio_ref),   # ref_audio_orig
        texto_referencia,              # ref_text
        nuevo_texto,                   # gen_text
        "F5-TTS",                      # model
        False,                         # remove_silence
        0.15,                          # cross_fade_duration
        0.88,                          # speed
        api_name="/infer"
    )

    ruta_temporal_gradio = resultado[0]

    # 5. Save the temporary WAV produced by the model
    # (shutil.move also works when Gradio's temp folder is on another filesystem)
    if os.path.exists(ruta_temporal_wav):
        os.remove(ruta_temporal_wav)
    shutil.move(ruta_temporal_gradio, ruta_temporal_wav)

    # 6. Convert to MP3 with the ffmpeg from the virtual environment
    print("🎵 Converting to MP3...")
    ruta_ffmpeg = os.path.abspath(os.path.join(BASE_DIR, "..", "..", "venv", "bin", "ffmpeg"))

    # If it is not there, fall back to the ffmpeg on the general PATH
    if not os.path.exists(ruta_ffmpeg):
        ruta_ffmpeg = "ffmpeg"

    if os.path.exists(ruta_salida_mp3):
        os.remove(ruta_salida_mp3)

    # An argument list avoids shell quoting problems with odd file names;
    # check=True raises if ffmpeg fails, so success is not reported by mistake
    comando_ffmpeg = [
        ruta_ffmpeg, "-loglevel", "error", "-i", ruta_temporal_wav,
        "-codec:a", "libmp3lame", "-qscale:a", "2", ruta_salida_mp3,
    ]
    subprocess.run(comando_ffmpeg, check=True)

    # Clean up the temporary file
    if os.path.exists(ruta_temporal_wav):
        os.remove(ruta_temporal_wav)

    print(f"✅ Success! MP3 file generated at: {ruta_salida_mp3}")

except Exception as e:
    print(f"❌ Error: {e}")
    sys.exit(1)

Usage:

python generar_voz.py entrada.txt salida.mp3
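To convert several text files in one go, the input and output names can be paired up first. This is only a sketch: `plan_batch` is a hypothetical helper, and it assumes the input .txt files sit in the same folder as generar_voz.py, since the script resolves both names relative to its own folder.

```python
from pathlib import Path
from typing import List, Tuple


def plan_batch(folder: str) -> List[Tuple[str, str]]:
    """Pair every input .txt in `folder` with an .mp3 name of the same stem."""
    pairs = []
    for txt in sorted(Path(folder).glob("*.txt")):
        # voz.txt is the transcript of the reference audio, not an input text
        if txt.name == "voz.txt":
            continue
        pairs.append((txt.name, txt.with_suffix(".mp3").name))
    return pairs
```

Each resulting pair can then be passed to the script as `python generar_voz.py <input> <output>`.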

COQUI TTS

We install pyenv so that we can use Python 3.11

curl https://pyenv.run | bash

We add this to .bashrc

# Load pyenv automatically by appending
# the following to 
# ~/.bash_profile if it exists, otherwise ~/.profile (for login shells)
# and ~/.bashrc (for interactive shells) :

export PYENV_ROOT="$HOME/.pyenv"
[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init - bash)"

# Restart your shell for the changes to take effect.

# Load pyenv-virtualenv automatically by adding
# the following to ~/.bashrc:

eval "$(pyenv virtualenv-init -)"

We switch to Python 3.11

pyenv install 3.11.9
pyenv global 3.11.9

Restart the shell

git clone https://github.com/idiap/coqui-ai-TTS.git
cd coqui-ai-TTS

We make sure Python is version 3.11

python --version

We create the virtual environment

python -m venv venv --clear
source venv/bin/activate
pip install -U pip setuptools wheel
pip install --no-cache-dir torch torchaudio torchcodec
pip install -e .

We can list the models:

tts --list_models

Now it works. The first run downloads the model and takes quite a while:

tts --text "Hola, esto es una prueba de Coqui TTS funcionando en español." --model_name tts_models/es/css10/vits --out_path salida.wav

Training COQUI TTS with XTTS

To train Coqui with our own voice we use the xtts_v2 fine-tuning model, but every run reloads the models and takes at least 15 seconds. The ideal is to train VITS on a lot of audio, after which a run takes less than 1 second. I explain that further down.

We need a stable install:

git clone https://github.com/idiap/coqui-ai-TTS.git
cd coqui-ai-TTS
python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install torch torchaudio torchcodec
pip install -e .

I had problems, so I ran:

pip install "torch<2.6" "torchaudio<2.6" --force-reinstall

tts \
  --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
  --text "Hola, estoy probando mi propia voz clonada" \
  --speaker_wav /ruta/a/tu_audio.wav \
  --language_idx es \
  --out_path salida.wav

It takes 17 seconds
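To see how much of those 17 seconds goes into loading the model and how much into the synthesis itself, each stage can be timed separately. A minimal sketch follows; `timed` and the `results` dictionary are hypothetical names, not part of Coqui TTS.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Store the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so partial timings are not lost
        results[label] = time.perf_counter() - start
```

In a Python script the model loading and the inference call would each go inside their own `with timed("load", results):` and `with timed("inference", results):` block.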

We can compute the conditioning data for our voice (the GPT conditioning latents and the speaker embedding) once and pass it in on later runs

calcular_voz.py

import os
import sys
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# 1. Path of the XTTS v2 snapshot folder in the local Hugging Face cache
# (the containing folder itself, without "model.pth" or "config.json" at the end)
checkpoint_dir = os.path.expanduser("~/.cache/huggingface/hub/models--tts-hub--XTTS-v2/snapshots/345b56f6fbe25cca7103f7f34e471b8fe8e4945f")
config_file = os.path.join(checkpoint_dir, "config.json")

print(f"-> Loading XTTS v2 model from: {checkpoint_dir}")

if not os.path.exists(config_file):
    print(f"Error: 'config.json' was not found in {checkpoint_dir}")
    sys.exit(1)

# 2. Load the configuration from the JSON file
config = XttsConfig()
config.load_json(config_file)

# 3. Initialise the XTTS model
model = Xtts(config)
model.load_checkpoint(config, checkpoint_dir=checkpoint_dir, eval=True)

# 4. List the reference WAV files
folder = "./mis_audios"
if not os.path.exists(folder):
    os.makedirs(folder)
    print(f"\n[!] The folder '{folder}' has been created.")
    print("Please put your .wav files there and run this script again.")
    sys.exit(1)

audio_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.wav')]

if not audio_files:
    print(f"\n[!] Error: no .wav files were found in the folder '{folder}'")
    print("Put the audio files you want to build your voice from in there.")
    sys.exit(1)

print(f"\nProcessing {len(audio_files)} files to create your voice print...")

# 5. Compute the latents from all the audio files in the folder
gpt_conditioning_latents, speaker_embedding = model.get_conditioning_latents(audio_path=audio_files)

# 6. Save the computed vectors
voice_data = {
    "gpt_conditioning_latents": gpt_conditioning_latents,
    "speaker_embedding": speaker_embedding
}

output_path = "mi_voz_entrenada.pth"
torch.save(voice_data, output_path)
print(f"\n[OK] Done! Your voice has been averaged and saved to '{output_path}'")

We record WAV files of our voice into the mis_audios directory

python calcular_voz.py

-> Loading XTTS v2 model from: .cache/huggingface/hub/models--tts-hub--XTTS-v2/snapshots/345b56f6fbe25cca7103f7f34e471b8fe8e4945f

Processing 1 files to create your voice print...

[OK] Done! Your voice has been averaged and saved to 'mi_voz_entrenada.pth'

Then we run the file hablar.py

hablar.py

import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# 1. Text to synthesise and output file
texto_a_decir = "Hola, ahora estoy generando voz al vuelo de forma mucho más rápida y directa."
archivo_salida = "resultado_rapido2.wav"

# 2. Paths of the base model
checkpoint_dir = os.path.expanduser("~/.cache/huggingface/hub/models--tts-hub--XTTS-v2/snapshots/345b56f6fbe25cca7103f7f34e471b8fe8e4945f")
config_file = os.path.join(checkpoint_dir, "config.json")

# 3. Load the configuration and the model
config = XttsConfig()
config.load_json(config_file)
model = Xtts(config)
model.load_checkpoint(config, checkpoint_dir=checkpoint_dir, eval=True)

# 4. Load the precomputed voice print (.pth)
print("Loading your precomputed voice print...")
voice_data = torch.load("mi_voz_entrenada.pth", weights_only=False)

# Map the saved names to the ones the .inference() method expects
gpt_cond_latent = voice_data["gpt_conditioning_latents"]
speaker_embedding = voice_data["speaker_embedding"]

# 5. Inference, using the exact argument names from the Coqui code
print("Synthesising audio on the fly...")
outputs = model.inference(
    text=texto_a_decir,
    language="es",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=config.temperature,
    length_penalty=config.length_penalty,
    repetition_penalty=config.repetition_penalty,
    top_k=config.top_k,
    top_p=config.top_p,
)

# 6. Save the result to a .wav file
audio_tensor = torch.tensor(outputs["wav"]).unsqueeze(0)
torchaudio.save(archivo_salida, audio_tensor, 24000)

print(f"Done! Audio generated at: {archivo_salida}")
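Both scripts hard-code the snapshot hash, which can differ on another machine or after the model is downloaded again. A possible way to look the folder up instead is sketched below. `find_snapshot` is a hypothetical helper, and it assumes the cache layout seen in the path above (a `snapshots/<hash>/config.json` inside the model folder).

```python
import glob
import os
from typing import Optional


def find_snapshot(hub_dir: str, repo_folder: str = "models--tts-hub--XTTS-v2") -> Optional[str]:
    """Return the most recently modified snapshot folder that holds a config.json."""
    pattern = os.path.join(hub_dir, repo_folder, "snapshots", "*", "config.json")
    # Sort by modification time so the newest snapshot comes last
    matches = sorted(glob.glob(pattern), key=os.path.getmtime)
    return os.path.dirname(matches[-1]) if matches else None
```

With the cache location used above, the call would be `find_snapshot(os.path.expanduser("~/.cache/huggingface/hub"))`, and a `None` result means the model has not been downloaded yet.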

Training COQUI TTS with VITS

We need a great many hours of audio, and we have to prepare a dataset

We create the following file structure

mi_dataset_vits/
├── wavs/
│   ├── 00001.wav
│   ├── 00002.wav
│   └── 00003.wav ... (all your audio files)
└── metadata.csv
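Since training needs many hours of audio, it helps to know how much has been recorded so far. The sketch below adds up the duration of the files in wavs/ with the standard `wave` module. `total_seconds` is a hypothetical helper, and it assumes plain PCM WAV files, because `wave` may reject other encodings.

```python
import wave
from pathlib import Path


def total_seconds(wav_dir: str) -> float:
    """Sum the duration, in seconds, of every .wav file directly inside wav_dir."""
    total = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total += wav.getnframes() / wav.getframerate()
    return total
```

Dividing the result of `total_seconds("mi_dataset_vits/wavs")` by 3600 gives the hours of audio available.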

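The metadata.csv file must follow the LJSpeech format, with three fields separated by "|": the file name without .wav, the text, and the normalised text. As far as I know, the ljspeech formatter reads the text from the third field, so the text is repeated there:

00001|Hola, esta es la primera frase que grabé para mi modelo.|Hola, esta es la primera frase que grabé para mi modelo.
00002|El entrenamiento de inteligencia artificial requiere paciencia.|El entrenamiento de inteligencia artificial requiere paciencia.
00003|Generar voz al vuelo ahora será inmediato.|Generar voz al vuelo ahora será inmediato.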
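Before starting a long training run it can pay off to check metadata.csv against the wavs/ folder. The following is a sketch, not part of Coqui TTS: `check_metadata` is a hypothetical helper, and its default of three "|"-separated fields rests on the assumption that the ljspeech formatter takes the text from the third field.

```python
import os
from typing import List


def check_metadata(dataset_dir: str, expected_fields: int = 3) -> List[str]:
    """Return a list of human-readable problems found in metadata.csv."""
    problems = []
    meta = os.path.join(dataset_dir, "metadata.csv")
    with open(meta, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split("|")
            # Every line needs the expected number of fields, none of them empty
            if len(fields) != expected_fields or any(not f.strip() for f in fields):
                problems.append(f"line {number}: expected {expected_fields} non-empty fields")
                continue
            # The first field is the WAV file name without its extension
            wav = os.path.join(dataset_dir, "wavs", fields[0] + ".wav")
            if not os.path.isfile(wav):
                problems.append(f"line {number}: missing {wav}")
    return problems
```

An empty list from `check_metadata("mi_dataset_vits")` means every line has the expected shape and a matching audio file.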

Training script

entrenar_vits.py

import os
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits

# 1. Folder paths
PATH_DATASET = "/Users/T054810/Personal/IA/coqui-ai-TTS/mi_dataset_vits"
PATH_SALIDA = "/Users/T054810/Personal/IA/coqui-ai-TTS/resultado_entrenamiento"


def main():
    # 2. Configure the dataset in LJSpeech format
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_name="metadata.csv",
        path=PATH_DATASET
    )

    # 3. Configure the VITS architecture
    # (no audio= argument: VitsConfig then keeps its default audio settings,
    # so the WAV files should match config.audio.sample_rate)
    config = VitsConfig(
        run_name="mi_voz_vits_rapida",
        batch_size=16,       # If you get an out-of-memory error, lower it to 8 or 4
        eval_batch_size=8,
        num_loader_workers=2,
        num_eval_loader_workers=2,
        run_eval=True,
        test_delay_epochs=5,
        epochs=100,          # Raise it if you want the model to learn more (e.g. 500)
        text_cleaner="multilingual_cleaners",
        language="es",       # Spanish
        phoneme_language="es",
        phoneme_cache_path=os.path.join(PATH_SALIDA, "phoneme_cache"),
        datasets=[dataset_config],
        output_path=PATH_SALIDA
    )

    # 4. Load the audio and text samples
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

    # 5. Initialise the VITS model
    # (init_from_config also builds the audio processor and the tokenizer)
    model = Vits.init_from_config(config)

    # 6. Configure the Trainer
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path=PATH_SALIDA,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )

    # 7. Launch the training
    print("Starting training of the VITS model with your voice...")
    trainer.fit()


# The guard matters on macOS, where data loader workers are started with
# "spawn" and would otherwise re-run the whole script
if __name__ == "__main__":
    main()
