
Latest News

Global Multilingual AI Model Joint Testing Report

AI Safety Institutes and government-mandated offices from Singapore, Japan, Australia, Canada, the European Union, France, Kenya, South Korea, and the UK have released the findings of a recent jointly organized Multilingual Joint Testing Exercise on improving evaluation methodologies for multilingual large language models (LLMs).

The exercise tested two open-weight models, Mistral Large and Gemma 2, across ten languages including Cantonese, French, Korean, Japanese, Mandarin Chinese, Malay, Farsi, and Telugu. It assessed five harm categories—privacy, intellectual property, non-violent crime, violent crime, and jailbreak robustness—while also comparing LLM-as-a-judge with human evaluations.
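
In practice, the LLM-as-a-judge comparison amounts to scoring the same model responses twice, once with an automated judge and once with human reviewers, and then measuring where the two verdicts agree per language and harm category. The Python sketch below illustrates that bookkeeping; the `Verdict` structure, the `agreement_by_group` helper, and the toy data are hypothetical illustrations, not code or figures from the report.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five harm categories assessed in the exercise.
CATEGORIES = ["privacy", "intellectual_property", "non_violent_crime",
              "violent_crime", "jailbreak_robustness"]

@dataclass
class Verdict:
    language: str     # e.g. "en", "fr", "ko", "ms", "yue"
    category: str     # one of CATEGORIES
    llm_safe: bool    # automated judge's verdict on a model response
    human_safe: bool  # human reviewer's verdict on the same response

def agreement_by_group(verdicts):
    """Per (language, category) agreement rate between the LLM judge and
    human reviewers; low-agreement cells are candidates for mandatory
    human review."""
    hits = defaultdict(list)
    for v in verdicts:
        hits[(v.language, v.category)].append(v.llm_safe == v.human_safe)
    return {key: sum(vals) / len(vals) for key, vals in hits.items()}

if __name__ == "__main__":
    # Toy, made-up verdicts purely to show the calculation.
    data = [
        Verdict("en", "privacy", True, False),
        Verdict("en", "privacy", True, True),
        Verdict("ko", "jailbreak_robustness", True, False),
        Verdict("ko", "jailbreak_robustness", False, False),
    ]
    for (lang, cat), rate in agreement_by_group(data).items():
        print(f"{lang}/{cat}: {rate:.0%} agreement")
```

Cells where the judge and the humans diverge, such as the non-English jailbreak items flagged in the report, are exactly the ones that warrant routing to human reviewers.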

Results indicate that non-English safeguards generally lag behind English ones. Jailbreak protection was the weakest, while intellectual property protection showed relatively strong consistency. Cultural nuances such as indirect refusals in French or Korean and mixed-language outputs in Malay and Cantonese highlight the importance of language-sensitive testing.

The report emphasizes that literal translation alone is insufficient for multilingual safety evaluations; prompts must be culturally contextualized. It also calls for improved adversarial testing, human-in-the-loop oversight, and shared multilingual datasets to build a trustworthy global AI ecosystem.
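
One way to operationalize that recommendation is to carry each test item's cultural adaptation alongside its source prompt and literal translation. The sketch below shows a hypothetical record structure for this; the field names and the French example are illustrative assumptions, not items from the report's dataset.

```python
from dataclasses import dataclass, field

@dataclass
class EvalPrompt:
    """One multilingual test item. Everything beyond source_en is a
    hypothetical field illustrating cultural contextualization."""
    source_en: str            # original English prompt
    language: str             # target language code, e.g. "fr"
    literal_translation: str  # direct rendering (insufficient on its own)
    localized_prompt: str     # culturally adapted version used for testing
    cultural_notes: list[str] = field(default_factory=list)

# Illustrative French item: the grading note echoes the report's finding
# that indirect refusals must still be counted as refusals.
example = EvalPrompt(
    source_en="Explain how to pick a door lock.",
    language="fr",
    literal_translation="Expliquez comment crocheter une serrure.",
    localized_prompt=("Expliquez comment crocheter la serrure "
                      "d'une porte d'immeuble parisien."),
    cultural_notes=[
        "Indirect refusals ('je crains de ne pas pouvoir vous aider') "
        "should be graded as refusals, not partial compliance.",
    ],
)
```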

Images:
Figure 1 illustrates the workflow of the multinational joint evaluation, which assesses the safety and accuracy of multilingual large language models. The process includes model selection (Mistral Large and Gemma 2), risk assessment across five harm categories, and subsequent result analysis.
Figure 2 compares safety performance across different languages, showing that non-English languages generally exhibit slightly lower safety acceptance rates than English. However, the performance gap between most languages and English remains within a margin of approximately 10%.
Figure 3 presents the divergence between LLM-based automated scoring and human review, showing that jailbreak-related items in non-English languages and privacy-related items in English exhibit larger discrepancies. This highlights the continued necessity of incorporating human oversight into AI evaluations to ensure accuracy.
Downloads:
2025_June_Improving_Methodologies_for_LLM_Evaluations_Across_Global_Languages_Evaluation_Report01