Abstract: Many recent Text-to-Speech (TTS) models that use zero-shot voice cloning can reproduce the emotional tone of the reference speech. However, they frequently ...