TTS & SVC Survey (For Fun)

Posted on 2024-10-06, updated on 2025-04-27, filed under misc

#Tacotron2

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, ICASSP 2018

#VITS

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, ICML 2021
Text-to-speech (TTS) task
Approach: end-to-end

#SoftVC / HuBERT / Voice Conversion

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion, ICASSP 2022
Voice conversion (VC) task
Uses the HuBERT model

#MoeGoe

Open-sourced by Bilibili uploader CjangCjengh: https://github.com/CjangCjengh/MoeGoe
Packages VITS as a ready-to-use tool
Demo of the Tacotron2-based version: https://www.bilibili.com/video/BV1rV4y177Z7/
MoeGoe tool, VITS-based version: https://www.bilibili.com/video/BV1A8411t7sK/

#FastPitch

#VITS2

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design, Interspeech 2023
Text-to-speech (TTS) task

#SoVITS / so-vits-svc 🔥🔥🔥

Open-sourced by Bilibili uploader Rcell: https://github.com/innnky/so-vits-svc (since deleted)
Singing voice conversion (SVC) task
Implementation: replaces the text encoder in VITS with the HuBERT-based content encoder from SoftVC
Recognition: named one of TIME's Best Inventions of 2023: https://time.com/collection/best-inventions-2023/6327135/so-vits-svc/
Some pretrained voice models: https://huggingface.co/spaces/zomehwh/vits-models/tree/main/pretrained_models
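The encoder swap described above can be pictured with a toy sketch. Everything here is hypothetical and for illustration only: `toy_text_encoder`, `toy_unit_encoder`, and `ToySynthesizer` are made-up stand-ins, not code from so-vits-svc, VITS, or SoftVC. The only point is that the decoder consumes a sequence of per-frame features and does not care whether they came from text or from source audio.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Features = List[List[float]]  # one feature vector per frame


def toy_text_encoder(text: str) -> Features:
    """Hypothetical stand-in for a TTS text encoder: one vector per character."""
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def toy_unit_encoder(audio: Sequence[float], hop: int = 4) -> Features:
    """Hypothetical stand-in for a HuBERT-style content encoder:
    one vector (mean, max) per hop of audio samples."""
    frames = [audio[i:i + hop] for i in range(0, len(audio), hop)]
    return [[sum(f) / len(f), max(f)] for f in frames]


@dataclass
class ToySynthesizer:
    """Toy 'decoder' that only sees frame features, never the raw input."""
    encoder: Callable[..., Features]
    speaker_gain: float = 1.0  # crude placeholder for target-speaker conditioning

    def synthesize(self, source) -> List[float]:
        feats = self.encoder(source)
        # Collapse each feature vector to one output value per frame.
        return [self.speaker_gain * sum(vec) for vec in feats]


# TTS configuration: the input is text.
tts = ToySynthesizer(encoder=toy_text_encoder)
# SVC configuration: same decoder, but the input is source audio.
svc = ToySynthesizer(encoder=toy_unit_encoder, speaker_gain=0.5)

tts_out = tts.synthesize("abc")
svc_out = svc.synthesize([0.0, 1.0, 2.0, 3.0, 4.0, 4.0])
```

In this sketch, switching from TTS to voice conversion changes only the `encoder` argument; a real system would also need pitch features, a vocoder or waveform decoder, and training.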

#AI 峰哥

Demo: https://www.bilibili.com/video/BV1w24y1c7z9/
Idea: FastPitch + an NLP model
MassTTS: https://github.com/anyvoiceai/MassTTS
ChatGLM-6B: https://github.com/lich99/ChatGLM-finetune-LoRA

#BERT-VITS2

Open-sourced by fishaudio on Bilibili: https://www.bilibili.com/video/BV18E421371Q/ https://github.com/fishaudio/Bert-VITS2
Inspired by AI 峰哥
Implementation: replaces the text encoder in VITS2 with BERT

#OpenVoice

Instant voice cloning (zero-shot TTS) task

#RVC

Open-sourced by Bilibili uploader 花儿不哭: https://www.bilibili.com/video/BV1pm4y1z7Gm/
Voice conversion (VC) task
Needs roughly 10 minutes of sample audio

#GPT-SoVITS

Open-sourced by Bilibili uploader 花儿不哭: https://www.bilibili.com/video/BV12g4y1m7Uw/
Few-shot / zero-shot TTS (voice cloning) task
Follow-up to RVC; needs only a 5-10 s audio sample

#fish-speech

https://github.com/fishaudio/fish-speech
Zero-shot TTS task
Based on an autoregressive model
A single 5 s audio sample is enough to generate speech

#ChatTTS

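New version of AI 峰哥: https://github.com/2noise/ChatTTS
TTS task aimed at dialogue scenarios, with fine-grained control over prosody
Supports different voice timbres via LoRA fine-tuning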
