Researchers and developers need tools that provide insights beyond quality scores, underscoring the importance of explainability in machine translation (MT) evaluation metrics.
Explainable metrics not only help researchers understand how well an MT model performs, but also why it performs the way it does, enabling targeted refinements and bridging the gap between model outputs and human interpretation.
Recent developments highlight the industry’s shift towards explainability. For example, InstructScore, developed by the University of California, Google, and Carnegie Mellon University, and Unbabel’s xTOWER highlight the value of explainable metrics. InstructScore uses large language models (LLMs) to provide quality scores and detailed error explanations, while xTOWER, an LLM, produces high-quality error explanations and uses them to suggest corrections.
Despite these advances, MT researchers still struggle to find tools that can fully interpret and evaluate the performance of MT models at a detailed level and that are also easy to use. In an October 20, 2024 paper, researchers from the University of California, Santa Barbara, and Carnegie Mellon University underlined the “need for an integrated solution that combines comprehensive model evaluation with user-friendly interfaces and advanced analytical capabilities.”
To meet this need, they developed Translation Canvas, an evaluation toolkit focused on explainability, accessibility, and flexibility.
Translation Canvas provides an intuitive interface and supports fine-grained evaluation, identifying specific error spans and providing natural language explanations. The toolkit currently includes three evaluation metrics – BLEU, COMET and InstructScore – giving researchers multiple perspectives on model performance.
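For context on what these metrics involve, below is a minimal sketch of how BLEU and COMET scores are typically computed, assuming the commonly used sacrebleu and unbabel-comet Python packages. It illustrates the underlying metrics, not Translation Canvas’s internal code, and the example sentences and checkpoint name are placeholders.

```python
# Illustrative only: scoring a toy hypothesis/reference pair with BLEU and COMET.
# Requires: pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Der Bericht wurde gestern veröffentlicht."]
hypotheses = ["The report was published yesterday."]
references = ["The report was released yesterday."]

# BLEU: n-gram overlap between hypotheses and references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMET: a neural metric that also considers the source sentence.
# "Unbabel/wmt22-comet-da" is one commonly used checkpoint (an assumption here).
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)
comet_out = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8,
    gpus=0,  # run on CPU for this small example
)
print(f"COMET system score: {comet_out.system_score:.3f}")
```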
The Translation Canvas dashboard provides a comprehensive overview of MT model performance, detailing the distribution of errors and enabling comparative analysis between MT systems. This helps researchers quickly identify areas where one model underperforms relative to others. Additionally, the tool includes a robust search function that allows researchers to filter results by error type, severity or content, making targeted analysis more efficient.
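To make the filtering idea concrete, here is a hypothetical sketch of how fine-grained error annotations could be represented and queried by error type, severity, or system. The schema and example records are invented for illustration and are not Translation Canvas’s actual data model.

```python
# Hypothetical illustration of filtering fine-grained MT error annotations.
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    system: str       # which MT system produced the output
    segment_id: int   # index of the translated segment
    span: str         # the offending text span in the hypothesis
    error_type: str   # e.g. "mistranslation", "omission"
    severity: str     # e.g. "minor", "major"
    explanation: str  # natural-language explanation of the error

annotations = [
    ErrorAnnotation("system_a", 3, "bank", "mistranslation", "major",
                    "'bank' refers to a riverbank in the source, not a financial institution."),
    ErrorAnnotation("system_b", 7, "", "omission", "minor",
                    "The date mentioned in the source is missing from the translation."),
]

# Filter: all major errors, mirroring the kind of targeted search a dashboard supports.
major_errors = [a for a in annotations if a.severity == "major"]

# Filter: errors of a given type for one system.
sys_a_mistranslations = [
    a for a in annotations
    if a.system == "system_a" and a.error_type == "mistranslation"
]

for err in major_errors:
    print(f"[{err.system} #{err.segment_id}] {err.error_type}: {err.explanation}")
```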
The researchers noted that Translation Canvas is specifically designed for the translation research community, “where understanding the nuances of model errors and performance is critical for further improvements.”
Previous tools, such as Ghent University’s MATEO project, also offer web-based platforms for multiple metrics. Translation Canvas builds on this with natural language explanations of errors, powered by InstructScore, and advanced instance-level analysis.
To assess the effectiveness of Translation Canvas, the researchers conducted a user evaluation study with participants who had experience with MT and existing MT evaluation metrics.
Users rated it highly for both enjoyment and usability, particularly appreciating its emphasis on error types and its quick analysis process. The graphical presentations and error sorting saved significant time on fine-grained analyses, and the support for multi-system analysis was highlighted as a key convenience.
“Our evaluation shows that users find the system useful, enjoyable, and as convenient as command-line evaluation tools,” the researchers said.
Even those new to MT evaluation were able to get up to speed quickly: new users reported needing just ten minutes to get started with a custom dataset. This ease of use reflects the system’s effective balance between functionality and usability, meeting the need for tools that support both rapid onboarding and advanced analytics.
The researchers recognize that human evaluation remains essential for capturing the subtleties of translation quality, and they plan to further improve Translation Canvas based on user feedback. With user permission, they collect feedback on source texts, references, model output, and rankings to continually refine the tool. Users can withdraw consent at any time, ensuring control over their data and feedback.
Authors: Chinmay Dandekar, Wenda Xu, Xi Xu, Siqi Ouyang and Lei Li