Ever wanted to clone a specific voice for your AI projects without going through the tedious process of training a custom model from scratch? In this tutorial, I demonstrate how to use Tortoise-TTS to generate high-quality, expressive text-to-speech using just a short audio sample of the target voice.

I cover the technical setup—including essential Python requirements, DeepSpeed for faster generation, and the importance of using clean .wav files—to ensure your voice clones sound professional. Whether you're narrating books or creating unique AI content, this workflow is a game-changer.

Original YouTube Tutorial: https://youtu.be/EUdcmU6X8i8
How to Separate Voice with Instrument: https://youtu.be/POdHe42WQEE
Fast TTS for Slow GPU / No GPU: https://youtu.be/6DcSJWI32JY

Source/Model Resources:
* Tortoise-TTS Repository: https://github.com/neonbjb/tortoise-tts
* MrQ AI-Voice-Cloning: https://git.ecker.tech/mrq/ai-voice-cloning

Video Details:
* Original Publish Date: April 8, 2024
* Focus: Voice Cloning / Text-to-Speech / Tortoise-TTS / Workflow Optimization
* Tested on an RTX 4060 Ti with 16 GB of VRAM

Follow lordcaocao2025 on Dailymotion for more technical AI research and generative workflow guides!

---
Connect with me:
📺 YouTube: https://www.youtube.com/@CaoCao2025
📱 TikTok: https://www.tiktok.com/@caocao20250
💎 Patreon: https://www.patreon.com/cw/Caocao2025

#VoiceCloning #TortoiseTTS #AI #TextToSpeech #DeepSpeed #AITutorial #lordcaocao2025
Transcript
00:00Hello guys, today we're going to learn how to make text-to-speech voices with very good pronunciation,
00:08using any voice that you want or have, without even needing to train it.
00:15Any voice is going to work as long as you have a WAV file of that voice,
00:21and voice-only audio is really better. You also get random intonation and a bunch of settings.
00:30It really works wonders, and it's great for reading a book, making a review, and other stuff.
00:37Here's an example.
00:39Though fierce as tiger soldiers be, battles are won by strategy.
00:45A hero comes, he gains renown, already destined for a crown.
00:52Though fierce as tiger soldiers be, battles are won by strategy.
00:58A hero comes, he gains renown, already destined for a crown.
01:07Though fierce as tiger soldiers be, battles are won by strategy.
01:13A hero comes, he gains renown, already destined for a crown.
01:22Though fierce as tiger soldiers be, battles are won by strategy.
01:27A hero comes, he gains renown, already destined for a crown.
01:34Though fierce as tiger soldiers be, battles are won by strategy.
01:39A hero comes, he gains renown, already destined for a crown.
01:48Though fierce as tiger soldiers be, battles are won by strategy.
01:54A hero comes, he gains renown, already destined for a crown.
02:02Though fierce as tiger soldiers be, battles are won by strategy.
02:07A hero comes, he gains renown, already destined for a crown.
02:14A hero comes, he gains renown, already destined for a crown.
02:15Okay, that's the example.
02:16What you need to do first is install Python.
02:20I suggest you use Python 3.9.13. It could work with other Python versions,
02:28but you will have a hard time increasing its speed without 3.9,
02:34because if you want to use a wheel for DeepSpeed, I only know that it works with Python 3.9.
02:42So I suggest you install that.
02:44You could also have a bunch of Python versions on your PC, it doesn't matter, but make sure you install this one.
02:51Then, when you want to run or set up AI Voice Cloning for the first time, just change
02:58the name of the other Python.
03:00For example, I have Python 3.9 and Python 3.10, so I just changed the name of Python 3.10.
03:07So when the environment variable is read, it's not going to find 3.10 and will use Python 3.9 instead.
03:15But if you want to do anything else later on, just restore it and turn off 3.9 by
03:21changing its name.
03:22It's as simple as that.
03:24Or you could edit the environment variable and delete the 3.9 entry.
03:27But this way I think it's simpler.
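
A quick way to confirm which interpreter actually gets picked up after renaming folders or editing the environment variable is to ask Python itself. This is only a minimal sketch: the helper name `is_required_python` is made up for illustration, and the 3.9 requirement is taken from the video's remark about the DeepSpeed wheel, not from any official documentation.

```python
import sys

# Assumption from the video: the DeepSpeed wheel is only known to work with 3.9.
REQUIRED = (3, 9)


def is_required_python(version_info=None, required=REQUIRED):
    """Return True when the major.minor version matches `required`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_required_python():
        print(f"OK: running Python {found}")
    else:
        print(f"Warning: running Python {found}, expected 3.9.x")
```

Running this with the same `python` command the setup script will use shows whether the rename trick worked before anything is installed.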
03:30After Python, you also need the Visual Studio libraries for C++, for Python, and other stuff.
03:40But make sure you have the Python ones.
03:42I also have other videos on installing Python and the Visual Studio libraries.
03:48I put the links below, and also one for Git.
03:51So make sure you have these three, and maybe FFmpeg.
03:55I don't think it's needed, but it can't hurt to have it if you run into trouble.
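
Before cloning anything, it may help to check that the prerequisites above are reachable from the command prompt. The sketch below uses only the standard library; the helper name `missing_tools` and the executable names in the list are assumptions, so adjust them to match your own install.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the tool names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; FFmpeg is optional according to the video.
    missing = missing_tools(["python", "git", "ffmpeg"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All tools found.")
```

The `which` parameter exists only so the check can be exercised without touching the real PATH.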
04:00Next, as usual, just copy this link.
04:04Now we go to the command prompt.
04:07Go to where you want to install it.
04:12As usual, git clone.
04:17And it's going to create a folder.
04:23Done.
04:25Okay, now you will have this folder.
04:27First, what you need to do is set up CUDA.
04:33The setup-cuda.bat is here.
04:35Just double-click it.
04:38It's going to do the rest for you.
04:42And this thing, I installed it using an RTX 4060 Ti Super.
04:49No, not Super, 16 gigs.
04:51I also used a GTX 1660 Super for this.
04:54It installed and it works, but it's really slow.
04:57So I don't recommend using it with that.
05:07Okay, it's going to install.
05:10It's going to take quite a while, so.
05:14Okay, when it asks this, you could just select the directory, I guess.
05:21Doesn't matter.
05:24And then it's done.
05:25You just run start.bat here.
05:30The very first time, it's going to be very slow.
05:35So don't worry.
05:36It's just how it's set up.
05:39Because it needs to download things.
05:42Maybe I'll just skip it.
05:43Let it download things.
05:44Okay, after it's finished, you know,
05:47the first time it's going to download a bunch of things.
05:50Pretty big for the models.
05:522 gigs and 1.18 gigs.
05:57So it's going to take some time after that.
05:59It downloads more stuff.
06:01And when it informs you it's finished,
06:04you're going to be able to access it on your localhost.
06:08So let's access localhost.
06:10Yeah, this is it.
06:12You need to check the setting.
06:14Everything is fine for now.
06:16You don't need to change anything.
06:19And yeah, this is what I'm talking about.
06:22DeepSpeed.
06:22For now, you cannot use it.
06:24For now, just use whatever you have.
06:27Set the voice to random.
06:29But you need to use the experimental settings.
06:34So I suggest.
06:36Well, now let's just test this out.
06:39Hello guys.
06:42The first time you do it,
06:44it's going to take some time too.
06:47Because it's going to download another diffuser.
06:52And then it starts generating things.
06:56Hello guys.
06:58Welcome back with me.
06:59CaoCao 2025.
07:01Since you use random voice,
07:03it's going to change every time you generate.
07:06This on itself is pretty good.
07:08Hello guys.
07:10Welcome back with me.
07:11CaoCao 2025.
07:13And it's going to keep using other voice.
07:17Because you don't decide what voice you use.
07:19Hello guys.
07:20Welcome back with me.
07:21CaoCao 2025.
07:23And you could play with, you know,
07:25this setting.
07:27And set it to ultra fast, of course.
07:30Hello guys.
07:31Welcome back with me.
07:33CaoCao 2025.
07:34And the more samples you use,
07:36the better it's going to be.
07:37Iterations are also like that.
07:40But the more you use,
07:43the longer it will take.
07:44Hello guys.
07:45Welcome back with me.
07:46CaoCao 2025.
07:51Hello guys.
07:53Welcome back with me.
07:54Saw it.
07:56Yeah, sometimes it generates a bad result.
07:58So remember,
08:00to set this up.
08:02Hello guys.
08:03Welcome back with me.
08:05CaoCao 2025.
08:08And play with the setup.
08:11If you want it fast,
08:12it's going to use a lot.
08:13So don't use fast, you know.
08:161630, I think it's pretty good.
08:18Or 1632.
08:20Hello guys.
08:21Welcome back with me.
08:22CaoCao 2025.
08:25If you only need text-to-speech,
08:27then we're done.
08:28But of course,
08:29we don't want to use this only.
08:31We need to use other people's voices, right?
08:34So the first step is done.
08:36If you're able to do this and it plays,
08:39then you could move on to the second step.
08:41That is preparing a voice file.
08:44You could take the voice file from the internet,
08:47from YouTube and stuff,
08:48but make sure you convert it to WAV.
08:51And if the voice has background music and stuff,
08:54you could remove the background music using Mangio RVC.
09:00You know, the video is in the description,
09:02on removing or separating vocals and instruments.
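
The conversion to WAV mentioned above can be scripted instead of done by hand. The following is a sketch under stated assumptions: it expects the `ffmpeg` executable to be on PATH, the function names are hypothetical, and the 22050 Hz mono defaults are an assumption chosen for illustration, since the video does not state a sample rate.

```python
import subprocess


def build_wav_command(source, target, sample_rate=22050, channels=1):
    """Build an ffmpeg argument list that converts `source` to a WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it already exists
        "-i", str(source),        # input file (mp3, m4a, mp4, ...)
        "-ac", str(channels),     # number of audio channels
        "-ar", str(sample_rate),  # output sample rate in Hz (assumed default)
        str(target),
    ]


def convert_to_wav(source, target, **options):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_wav_command(source, target, **options), check=True)
```

Calling `convert_to_wav("clip.mp3", "clip.wav")` would then produce a file ready for the vocal separation and noise cleanup steps.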
09:07For example, I have a bunch of voices.
09:11Like Ryuji voices.
09:13I got it from YouTube here.
09:15Damn it.
09:16All this bumping around is hurting my ass.
09:19And I call for you to go up later.
09:21Or let me be frank.
09:23It's Frank Underwood voice.
09:27I mean, after all, we shared everything, you and I.
09:30And make sure you take out that water thing's voice.
09:35For me.
09:37The best bro I know.
09:39We understand some voice.
09:40Okay.
09:41What you need to do is go to the ai-voice-cloning folder,
09:45and here you will find a voices folder.
09:49Create a new folder here.
09:52For example, you name it Barney Stinson.
09:56And then paste the WAV file here.
10:00And you could add other folders too for other voices.
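
The folder layout described above (one subfolder per voice inside the voices folder, each holding WAV files) can be sketched as a small helper. The function name `add_voice` and the example paths are illustrative only; the exact location of the voices folder depends on where the repository was cloned.

```python
import shutil
from pathlib import Path


def add_voice(voices_dir, voice_name, wav_files):
    """Copy WAV files into voices_dir/voice_name, creating it if needed."""
    target = Path(voices_dir) / voice_name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for wav in wav_files:
        wav = Path(wav)
        if wav.suffix.lower() != ".wav":
            # The video recommends WAV only, so reject anything else.
            raise ValueError(f"not a WAV file: {wav}")
        copied.append(Path(shutil.copy2(wav, target / wav.name)))
    return copied
```

For example, `add_voice("ai-voice-cloning/voices", "Barney Stinson", ["barney_01.wav"])` would create the folder and copy the clip into it, assuming those paths exist on your machine.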
10:04Okay.
10:05I added a bunch of other voices here.
10:09You know, you could put in one voice file,
10:12or you could put in a bunch of voice files like here, you know.
10:18It's going to generate models for them later on.
10:24So, for Barney Stinson, it hasn't generated models yet.
10:30So, we're going to use this as the example.
10:33Okay.
10:34After you add the voice files, just refresh the voice list.
10:37And you will see the folder that you created.
10:40For now, let's pick Barney Stinson.
10:44And let's generate here.
10:47Since this is the first time you're doing it,
10:50it's going to take some time for it to generate a voice file.
10:56But it's not going to be that long.
10:58And remember, the longer the files are,
11:03and the more files you have, the longer it's going to take.
11:09So, make sure you remember that.
11:11I mean, a two or three minute voice file already generates quite a good voice.
11:20Longer, of course, is going to generate better
11:23intonation and stuff.
11:25But remember, make sure you have a voice that is consistent.
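
To see whether a voice folder is in the two-to-three-minute range mentioned above, the total length of its WAV files can be added up with the standard `wave` module. This is a rough sketch: the function names are made up, and `wave` only reads uncompressed PCM WAV files, so other encodings would raise an error.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as handle:
        return handle.getnframes() / handle.getframerate()


def total_voice_seconds(voice_dir):
    """Sum the durations of all .wav files directly inside `voice_dir`."""
    return sum(wav_seconds(p) for p in sorted(Path(voice_dir).glob("*.wav")))
```

Dividing the result by 60 gives minutes, which makes it easy to compare against the rough guideline from the video.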
11:29Okay, this is already done.
11:34Okay, it's a little like Barney.
11:35Maybe we could increase the samples.
11:41I mean, if you have DeepSpeed, you could use more samples with increased speed.
11:52And, you know, the intonation is going to change depending on the seed.
11:57Hello, guys. Welcome back with me, CaoCao 2025.
12:06Sorry, P.
12:07Hello, guys. Welcome back with me, CaoCao 2025.
12:11Yeah, I think P is better.
12:18It's going to be legendary.
12:21It's going to be, wait for it, legendary.
12:31So, yeah, it's going to take some time if you don't have DeepSpeed.
12:35And don't use a GTX 1660 Super for this.
12:37It works, but even for one sentence,
12:42it just takes too long to load.
12:44I had already, you know, given up on this back when I didn't have the RTX 4060.
12:52Hello, guys. Welcome back with me, CaoCao 2025.
12:57It's going to be legend.
13:00Wait for it. Dary, legendary.
13:04Legendary. Okay, whatever.
13:07And now Frank Underwood.
13:10Let's generate it.
13:11If you have a good GPU, you could use standard.
13:15As usual, this is the first time we do it, so it's going to take some time.
13:20And what makes it better than other text-to-speech tools is that it has random intonation.
13:32So you could keep generating until you get the kind of voice that you want.
13:38And it could also use emotion from here, but I don't find it really exact, you know,
13:47especially if you use this custom one.
13:50Hello, guys.
13:53Welcome back with me, CaoCao 2025.
13:59It's going to be legend.
14:03Wait for it.
14:05Dary, legendary.
14:08Legendary.
14:09Okay, my Frank Underwood is not cleaned up,
14:14unlike the Barney Stinson one, so it generates noise.
14:18So make sure you clean up the voice first.
14:20First, separate the voice and the instruments using RVC,
14:29Mangio RVC.
14:31And then the second step is to remove the noise that's left using Audacity.
14:39That's what I usually do.
14:41So that's what you should do, or else you will have noise.
14:46If you have noise in your voice files, you will have noise here.
14:50Hello, guys.
14:52Welcome back with me, CaoCao 2025.
14:58It's going to be legend.
15:01Wait for it.
15:03Dary, legendary.
15:06Let's try using emotion.
15:11I mean, we could increase its speed.
15:16Hello, guys.
15:18Hello, guys.
15:18Welcome back with me, CaoCao 2025.
15:23There's a noise.
15:27Please come at me.
15:33Okay, that's for Underwood.
15:35Oh, this is the setup I had for Ryuji.
15:38As usual, it's reading from the latents, because we already created the latents.
15:46Hello, guys.
15:47Welcome back with me, CaoCao 2020.
15:55He's gonna...
15:57Wait.
15:59Wait for it.
16:01Dary, legendary.
16:04Okay.
16:06That's really bad.
16:09Hello, guys.
16:10Welcome back with me, CaoCao 2020.
16:15It's gonna be legend.
16:18Wait for it.
16:19Dary, legendary.
16:22Okay.
16:23Okay, guys.
16:24So you could use DeepSpeed to increase the speed here.
16:27Just check this DeepSpeed box for a speed bump.
16:31And you need to add the DeepSpeed wheel...
16:34Yeah, wheel.
16:36The DeepSpeed wheel for Python 3.9.
16:38I'm not the one who created it.
16:40It's created by Jared.
16:42So you should check his YouTube.
16:46Go to his YouTube channel for DeepSpeed and how to install it.
16:51But after that, I will show you the difference between having that and not having that.
16:57Okay, guys.
16:58Let's try it.
16:59The Z264.
17:02This is using DeepSpeed.
17:06And this is actually the setup that I usually use.
17:10So, yeah.
17:12And...
17:13Hello, guys.
17:14Welcome back with me, CaoCao 2025.
17:18It's gonna be legend.
17:19Wait for it.
17:21Dary, legendary.
17:23Okay, it's very fast, right?
17:25It's way different from before.
17:28And, you know, with DeepSpeed, I actually only use this setup.
17:34And it usually already generates good voices.
17:37Just remember the setup here.
17:418 seconds for a 7-second voice clip.
17:44It's already pretty good.
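
The "8 seconds for 7 seconds of voice" figure above is easier to compare across settings as a ratio of generation time to audio length. A minimal sketch follows; the function name `realtime_factor` is illustrative, and only the 8 and 7 come from the video.

```python
def realtime_factor(generation_seconds, audio_seconds):
    """Seconds of compute spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    # 8 seconds of generation for a 7-second clip, as quoted in the video.
    print(round(realtime_factor(8, 7), 2))  # prints 1.14
```

A value near 1.0 means generation runs at roughly real time, and values below 1.0 mean the clip is produced faster than it plays back.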
17:45Guys, welcome back with me, CaoCao 2025.
17:48It's gonna be legend.
17:50Wait for it.
17:51Dary, legendary.
17:52Dary, legendary.
17:54Okay.
17:55This is what I usually use.
17:58And, you know, let's use a proper sentence, not something like this.
18:05Welcome back with me, CaoCao 2025.
18:08It's gonna be legend.
18:09Wait for it.
18:10Dary, legendary.
18:12I generate this Frank Underwood quote.
18:1640 seconds, 18 seconds.
18:19Money is the McMansion in Sarasota that starts falling apart after 10 years.
18:26Power is the old stone building that stands for centuries.
18:31I cannot respect someone who doesn't see the difference.
18:34We haven't tried a woman's voice yet, so I'm going to give you an example of a woman's voice.
18:40Money is the McMansion in Sarasota that starts falling apart after 10 years.
18:47Power is the old stone building that stands for centuries.
18:57If you find that it's not really good, you could increase the samples or iterations.
19:02It's the McMansion in Sarasota that starts falling apart after 10 years.
19:07Power is the old stone building that stands for centuries.
19:11I cannot respect someone who doesn't see the difference.
19:29Okay.
19:45Let's try my favorite voice.
19:49And, you know, you could do long passages like this.
19:54I usually double this, you know, like 10 sentences, 11 sentences, and it can work.
20:01So you don't need to do one sentence at a time.
20:05A lot of sentences works.
20:10Depending on your VRAM, of course.
20:11So if you have very little VRAM, don't do a lot at once because it's going to cause problems.
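
On a card with little VRAM, one workaround in the spirit of the advice above is to feed the text in smaller batches of sentences. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the sentence splitter is a simple punctuation rule rather than a full tokenizer, and the batch size to use depends on your own hardware.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_text(text, max_sentences=5):
    """Group sentences into chunks of at most `max_sentences` each."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    sentences = split_sentences(text)
    return [
        " ".join(sentences[start:start + max_sentences])
        for start in range(0, len(sentences), max_sentences)
    ]
```

Each chunk can then be pasted into the prompt box and generated separately, at the cost of intonation possibly varying between chunks.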
20:18And make sure you use WAV files and not MP3 or other stuff,
20:26because it's going to consume VRAM.
20:29Even a very small MP3 is going to cost a lot more VRAM.
20:34So don't use that.
20:36Of course, long passages can take a long time to generate.
20:39But this took 50 seconds for 66 seconds of audio.
20:41The inhabitants of China are known to the world as Chinese.
20:47They speak of themselves as the people of Han.
20:51As Han is name of a dynasty, it hardly denote the origin of the people.
20:56Many theories, based more or less upon religious myths, have been advanced to show whence the first inhabitants of China
21:04came.
21:05But their correctness must necessarily await further scientific discoveries.
21:11All accounts, however, agree that the basin of the Yellow River was the cradle of the Chinese culture.
21:16And that their ancestors were a nomadic people who, some 5,000 or 6,000 years ago, migrated from the
21:24northwestern part of Asia.
21:26And finally settled in the northern central part of what is now China.
21:33Okay, you could also use a voice in another language as your base, but the result is not going to be really
21:40great.
21:40For example, this is not using English.
21:44The voice is in another language.
21:47And we're going to hear how it talks.
21:52And remember, this is using DeepSpeed for the speed bump.
21:57If you don't use DeepSpeed, you know, don't generate long samples.
22:02It's going to take a very long time.
22:04And DeepSpeed only affects the samples.
22:07So if you've already reduced them, like you only use one or two samples, it's not going to increase your speed
22:13much.
22:15The inhabitants of China are known to the world as Chinese.
22:20They speak of themselves as the people of Han.
22:23As Han is name of the dynasty, it hardly denotes the origin of the people.
22:28Many theories, based more or less upon religious myths, have been advanced to show whence the first inhabitants of China
22:35came.
22:36But their correctness must necessarily await further scientific discoveries.
22:41All accounts, however, agree that the basin of the Yellow River was the cradle of the Chinese culture.
22:46And that their ancestors were a nomadic people who, some 5,000 or 6,000 years ago, migrated from the
22:53northwestern part of Asia and finally settled in the north and central part of Asia and finally settled in the
22:59northern central part of what is now China.
23:01Okay, it's not really that bad, right, using other languages. And this one is using an English woman's voice.
23:10Okay, that's this thing, because I have other models for this.
23:16You could actually train your own models here, but I think that's for another time.
23:26That's right.
23:27The inhabitants of China are known to the world as Chinese.
23:30They speak of themselves as the people of Han.
23:34As Han is name of a dynasty, it hardly denotes the origin of the people.
23:39Many theories, based more or less upon religious myths, have been advanced to show whence the first inhabitants of China
23:47came.
23:48But their correctness must necessarily await further scientific discoveries.
23:52All accounts, however, agree that the basin of the Yellow River was the cradle of the Chinese culture and that
23:59their ancestors were a nomadic people who, some 5,000 or 6,000 years ago, migrated from the northwestern part
24:06of Asia and finally settled in the northern central part of what is now China.
24:13Okay, we could also use other diffusion or autoregressive models if you want to train.
24:22For example, I trained Lisa and Air for 500 epochs or 200, I kind of forgot.
24:29So let's use these random models, not random, but the original models that we created.
24:38So you will see the difference, or not, and whether it's worth training new models or simply using the original
24:49models.
24:50This is the same Lisa one, but using the model that was trained on her own voice.
24:58Because you could actually train a model here using this.
25:12The inhabitants of China are known to the world as Chinese.
25:17They speak of themselves as the people of Han.
25:20As Han is the name of a dynasty, it hardly denotes the origin of the people.
25:25Many theories, based more or less upon religious myths, have been advanced to show whence the first inhabitants of China
25:32came.
25:33But their correctness must necessarily await further scientific discoveries.
25:38All accounts, however, agree that the basin of the Yellow River was the cradle of the Chinese culture,
25:44and that their ancestors were nomadic people who, some 5,000 or 6,000 years ago, migrated from the northwestern
25:51part of Asia
25:52and finally settled in the northern central part of Asia, and finally settled in the northern central part of what
25:59is now China.
26:01Yeah, okay.
26:04And that's pretty good with the noise, you know, you could actually handle the noise later using other stuff like
26:12decreasing sample,
26:14or just manually removing it using AutoCity, because, you know, increasing sample might take a lot of your GPU,
26:26but using AutoCity or other software, just, you know, faster.
26:32Anyways, I think that's it for today's video, hope it helped, thank you for watching,
26:38don't forget to like, share, and comment, and see you again on the next episode, have a nice day.