This guide is meant to help others jump in, and to serve as future training material. We may eventually add these services to the FLCCC ecosystem officially.
Step 1:
The source is an AVI file, which is difficult to work with. Use HandBrake or a similar tool to convert it to m4v format. The Web 720p profile should be fine.
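If HandBrake is not available, ffmpeg (already used in the later steps) can do a comparable conversion. The helper below is only a sketch: the name avi_to_m4v is hypothetical, and the CRF, preset, and bitrate values are assumptions rather than the exact settings of HandBrake's Web 720p profile.

```shell
# Hypothetical helper: convert an AVI to a 720p H.264/AAC .m4v with ffmpeg.
# Quality values are illustrative assumptions, not HandBrake's profile settings.
avi_to_m4v() {
  local src="$1"
  local dst="${src%.*}.m4v"   # same name, .m4v extension
  # scale=-2:720   resize to 720 lines, keep aspect ratio (upscales smaller sources)
  # +faststart     move the index to the front for web playback
  # -f mp4         force the MP4 muxer regardless of the .m4v extension
  ffmpeg -i "$src" \
    -vf "scale=-2:720" \
    -c:v libx264 -crf 22 -preset medium \
    -c:a aac -b:a 160k \
    -movflags +faststart \
    -f mp4 "$dst"
}
```

Call it with the source file, for example avi_to_m4v ./my-story-maureen-volante.avi, and it writes the .m4v next to the source.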
Step 2:
Extract just the audio, in mp3 format. I used ffmpeg for this.
ffmpeg -i ./my-story-maureen-volante.m4v -vn -acodec libmp3lame -q:a 2 my-story-maureen-volante.mp3
Step 3:
The audio needs to be cleaned up as best we can. It’s not critical for it to have zero noise and be “production perfect”. It is important for it to be LOUDER, with as little noise as possible. The AI needs to be able to train on the voice and to generate a good transcription. It can’t do either if the audio is a noisy whisper, and it struggles even with a clean whisper.
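The notes above do not say which tool was used for the cleanup. One possible approach, sketched here with ffmpeg's built-in filters, is to cut low rumble, denoise, and then normalize loudness in a single pass. The helper name clean_audio and every filter value are assumptions to be tuned by ear, not the settings actually used.

```shell
# Hypothetical helper: reduce noise and raise loudness in one ffmpeg pass.
# All filter values are starting-point assumptions; adjust them by ear.
clean_audio() {
  local src="$1" dst="$2"
  # highpass  drop low-frequency rumble below 80 Hz
  # afftdn    FFT-based denoiser; nf is the assumed noise floor in dB
  # loudnorm  EBU R128 loudness normalization toward -16 LUFS
  # -ar 44100 resample back down, since loudnorm raises the sample rate internally
  ffmpeg -i "$src" \
    -af "highpass=f=80,afftdn=nf=-25,loudnorm=I=-16:TP=-1.5:LRA=11" \
    -ar 44100 \
    -c:a libmp3lame -q:a 2 "$dst"
}
```

For example, clean_audio my-story-don-cutter.mp3 my-story-don-cutter-cleaned.mp3. A dedicated audio editor may still give better results on very noisy recordings.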
Step 4:
Break the audio into 4-minute segments to stay within the limits of the various AI tools.
ffmpeg -i my-story-don-cutter-cleaned.mp3 -f segment -segment_time 240 -c copy my-story-don-cutter-cleaned-cut%03d.mp3
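To confirm the split worked, it can help to print each segment's length before uploading. This is a small sketch using ffprobe, which ships with ffmpeg; the helper name list_segment_durations is hypothetical. Every segment except the last should come out at roughly 240 seconds.

```shell
# Hypothetical helper: print each file name and its duration in seconds.
list_segment_durations() {
  local f
  for f in "$@"; do
    printf '%s\t' "$f"
    # Ask ffprobe for only the container duration, with no labels
    ffprobe -v error -show_entries format=duration \
      -of default=noprint_wrappers=1:nokey=1 "$f"
  done
}
```

Run it against the segment files with a glob, for example list_segment_durations *-cut*.mp3.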
Step 5:
Generate a new custom voice in ElevenLabs (https://elevenlabs.io). Upload your 4-minute segments as the training material and give the voice a decent prompt/description.
Step 6a:
One approach is to use the Speech Synthesis tool in ElevenLabs. It lets you take one of your trained voices and have it regenerate audio from either a script or an audio source. You’ll want to use an audio source, so that the result keeps the current timings and still matches the lips in the video. It should ignore some of the noise in these audio files as best it can. You can play with some of the generation settings to clean it up a bit further.
Step 6b:
The other approach is to use the Dubbing tool. This will take your source audio (in 4-minute segments) and let you translate from English to English. It will train the voice, extract a transcript, then re-dub the audio in the trained voice while keeping the timings.
Step 7:
Open the original m4v in a video editor. Separate the audio track, then delete it. Add the re-dubbed audio tracks and export/remux the video. I then run it through HandBrake one last time as a final optimization.
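The audio swap can also be done from the command line instead of an editor. The sketch below joins the re-dubbed segments with ffmpeg's concat demuxer, then copies the original video stream and replaces the audio. The helper name remux_redub is hypothetical, and re-encoding the audio to AAC at 160k is an assumption. Segment file names containing single quotes are not handled.

```shell
# Hypothetical helper: join re-dubbed segments and swap them in for the original audio.
# Usage: remux_redub ORIGINAL.m4v OUTPUT.m4v SEGMENT1.mp3 SEGMENT2.mp3 ...
remux_redub() {
  local video="$1" out="$2"
  shift 2
  local list joined seg
  list="$(mktemp ./segments.XXXXXX)"   # concat list, created in the current directory
  joined="${out%.*}-redub.mp3"
  for seg in "$@"; do
    printf "file '%s'\n" "$seg" >> "$list"
  done
  # Join the segments in order without re-encoding
  ffmpeg -f concat -safe 0 -i "$list" -c copy "$joined" || { rm -f "$list"; return 1; }
  rm -f "$list"
  # Keep the original video stream untouched; take audio only from the joined file
  ffmpeg -i "$video" -i "$joined" -map 0:v:0 -map 1:a:0 \
    -c:v copy -c:a aac -b:a 160k -f mp4 "$out"
}
```

Pass the segments in playback order. Because the video stream is copied rather than re-encoded, this step is fast and does not lose video quality.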