When we have particularly challenging videos, one approach is to just generate a good transcript. We can still train the voice, but instead of translating or synthesizing from the audio, we synthesize from the transcript. The problem is timing the result to the existing video; aligning it manually would take a lot of man-hours.
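The timing problem can be made concrete: given where each line occurs in the original video and how long the synthesized clip runs, you can compute how much each clip must be stretched or compressed to fit its slot. A minimal sketch, with placeholder timings (the function name and data are illustrative, not part of any real tool):

```python
def stretch_ratios(original_segments, synth_durations):
    """For each segment, compute the time-stretch factor that makes the
    synthesized audio fit the slot the speaker occupied on screen.

    original_segments: list of (start_sec, end_sec) taken from the video.
    synth_durations: list of synthesized clip lengths in seconds.
    Returns a list of ratios (>1 means the clip must be sped up to fit).
    """
    ratios = []
    for (start, end), synth_len in zip(original_segments, synth_durations):
        slot = end - start
        if slot <= 0:
            raise ValueError("segment must have positive duration")
        ratios.append(synth_len / slot)
    return ratios

# Placeholder numbers for illustration only.
segments = [(0.0, 2.5), (3.0, 6.0)]
synth = [5.0, 1.5]
print(stretch_ratios(segments, synth))  # → [2.0, 0.5]
```

Ratios far from 1.0 are where the pacing mismatch would be most noticeable, so a pass like this could also flag segments that need human attention.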
The alternative is custom avatars. You train on the video to create an avatar, then feed the trained voice the transcript. The pacing will be different, but it will be their words coming from their face. It may still be jarring, since it won't match their exact cadence and mannerisms, and it is further down the deepfake route.
This is impossible for the current batch because these services require consent. You have to publish proof of consent, which is generally a video of that same person saying they consent to their likeness being used in this manner.
In the future, we could make sure to capture that consent up front.