Evolution of Performance Metrics for Accurate Evaluation of Speech-to-Speech Translation Models: A Literature Review

Authors

  • Dr. Gabriel O. Sobola

Keywords:

BERTScore, Bilingual Evaluation Understudy (BLEU) scores, BLASER, Leaderboards, Mean Opinion Score Naturalness (MOSN), Mean Opinion Score Similarity (MOSS), Recall Oriented Understudy for Gisting Evaluation Longest Common Subsequence (ROUGE-L), Speech-to-speech metrics, Word Error Rate (WER)

Abstract

The translation of speech in a source language to speech in a target language with generative artificial intelligence is an actively explored area of research. It aims to overcome global language barriers and thereby enable seamless communication between the individuals involved. Speech-to-speech translation is well developed for high-resource languages such as English, Spanish, French and Chinese. Currently, objective metrics such as the Bilingual Evaluation Understudy (BLEU) score, and subjective metrics such as Mean Opinion Score Naturalness (MOSN) and Mean Opinion Score Similarity (MOSS), are used to evaluate the output of speech-to-speech models. However, low-resource languages, especially indigenous African languages, remain underdeveloped in the area of speech processing applications. The output speech in the target language needs to be evaluated to determine its closeness to the ground truth, as well as how natural and intelligible it is to the intended listeners. This paper reviews the trend from current metrics to emerging ones such as Recall Oriented Understudy for Gisting Evaluation Longest Common Subsequence (ROUGE-L) and BLASER. The application of speech models' metrics on various leaderboards and modern AI platforms is also discussed. The outcome shows that while the BLEU score and MOSN are prevalent for speech models, there is a need to explore metrics such as ROUGE-L and BERTScore, which are machine translation metrics, because of their benefits.
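As a minimal sketch of how the text-based metrics named above are commonly applied in practice, the example below scores an ASR transcript of translated speech against a reference transcript using BLEU (via the sacrebleu package) and ROUGE-L (via the rouge_score package). The tooling, the sentences and the use of ASR transcripts are assumptions for illustration; the review itself does not prescribe any particular implementation.

```python
# Minimal sketch: scoring an ASR transcript of speech-to-speech translation
# output against a reference transcript with BLEU and ROUGE-L.
# Assumes the sacrebleu and rouge_score packages are installed; the example
# sentences are invented for illustration only.
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the meeting will hold tomorrow morning"]   # ASR of translated speech
references = ["the meeting holds tomorrow morning"]       # ground-truth transcript

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L: longest-common-subsequence precision, recall and F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"]
print(f"ROUGE-L F1: {rouge_l.fmeasure:.3f}")
```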

Published

2025-11-07

How to Cite

Evolution of Performance Metrics for Accurate Evaluation of Speech-to-Speech Translation Models: A Literature Review. (2025). London Journal of Engineering Research, 25(4), 65-91. https://journalspress.uk/index.php/LJER/article/view/1679