Jinzuomu Zhong

Email: Jinzuomu.Zhong@ed.ac.uk

Research keywords: Controllable Speech Synthesis, Speech LLM, Accent, Prosody

Bio:

Jinzuomu has an interdisciplinary background bridging Engineering, Linguistics, and NLP/Speech, with a BEng in Automation Science and a BA in English from Beihang University, China, and an MSc in Speech and Language Processing from the University of Edinburgh, UK. In addition, he spent four wonderful years in industry working on multilingual speech synthesis, where he developed NLP/Speech technologies for over 40 languages.

Motivated by the challenge of reducing hallucination in speech LLMs, he is particularly interested in controllable speech synthesis, seeking to bridge the gap between industry and academia to advance responsible speech generation. His work includes building multilingual grapheme-to-phoneme pipelines for pronunciation control, developing automatic prosody annotation for prosody control, and researching speaker-accent disentanglement for accent control in speech synthesis.

PhD research:

Recent zero-shot text-to-speech systems demonstrate an impressive ability to generate arbitrary content in virtually any voice from only a small amount of speaker data. However, these systems raise significant ethical concerns, particularly around explainability and fairness.

It remains unclear how these models generate aspects of speech that are not explicitly controlled, such as prosody, style, and accent. This lack of transparency can lead to unintended biases, such as misrepresenting the intended speaker's voice or emotion.

Additionally, these data-intensive systems often struggle to perform equitably across diverse speakers, especially for underrepresented accents, languages, or voice conditions. Ensuring fairness and accessibility for a wide range of linguistic and vocal identities is essential for the responsible deployment of these technologies.

Supervisors: Korin Richmond, Simon King