Introduction
Large language models are undeniably reshaping language technology. Yet while models are claimed to "support X languages", the research community still lacks clear answers to core questions: what does multilinguality really mean, and how should we evaluate it? Counting languages in the training data or translating benchmarks is likely not enough. Multilingual evaluation today suffers from duplicated effort, inconsistent practices, limited comparability across works, and a generally poor understanding of the underlying theoretical and practical problems.
The Workshop on Multilingual and Multicultural Evaluation (MME) aims to bring the community together around three goals:
- Create a dedicated venue for multilingual evaluation resources and methodology.
- Advance and standardize evaluation practices to improve accuracy, scalability, and fairness.
- Integrate cultural and social perspectives into multilingual evaluation.
Call for Papers
We invite submissions on topics including, but not limited to:
- Evaluation resources beyond English or Western-centric perspectives and materials
- Annotation methodology and procedures
- Evaluation protocols: ranking vs. direct assessment, rubric-based vs. reference-based vs. reference-free, prompt variations, etc.
- Complex and practical tasks: multimodality, fairness, long I/O, tool use, code-switching, literary text, etc.
- Sociocultural and cognitive variation affecting model use and evaluation across languages
- Scalable evaluation of cultural and factual knowledge
- Efficient evaluation of a massive number of languages and tasks
- Metrics, LLM judges, and reward models
- Standardization in reporting and comparison of multilingual performance
- AI-assisted evaluation: data, methods, metrics, and standards
- Other position, application, or theory contributions
We welcome both archival and non-archival papers; both will be presented at the workshop. In addition, archival papers will be published in the ACL Anthology. An archival submission cannot be under review at or accepted by another archival venue. ARR-reviewed papers can be committed directly.
All archival submissions must follow the ACL style guidelines and be anonymized for double-blind review. Short papers may be up to 4 pages and long papers up to 8 pages, excluding references and appendices. Upon acceptance, one additional page is allowed for the camera-ready version. Non-archival submissions have no formatting or anonymization requirements.
Please submit your work by December 19, 2025 through the direct submission link below. ARR (meta-)reviewed papers can be committed by January 5, 2026 using the commitment link below.
Key Dates
All deadlines are 11:59 PM UTC-12:00 ("Anywhere on Earth").
- Direct submission deadline: December 19, 2025 (direct submission link)
- ARR-reviewed paper commitment deadline: January 5, 2026 (commitment link)
- Notification of acceptance: January 23, 2026
- Camera-ready deadline: February 3, 2026
- Workshop: 14:00-17:30, March 28, 2026
Speakers
- Sebastian Ruder
- Freda Shi
- Wenda Xu
Accepted Papers
- LLMs as Span Annotators: A Comparative Study of LLMs and Humans;
- On the Credibility of Evaluating LLMs using Survey Questions;
- An improved Code-Switching Detection System for some Indic Languages;
- Vinclat: Evaluating Reasoning, Cognition and Culture in One Game;
- Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality;
- The Anthropology of Food: How NLP can Help us Unravel the Food cultures of the World;
- LLM-as-a-qualitative-judge: automating error analysis in natural language generation;
- Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages;
- Cross-lingual and cross-country approaches to argument component detection: a comparative study;
- UNSC-Bench: Evaluating LLM Diplomatic Role-Playing Through UN Security Council Vote Prediction;
- Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America;
- Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems;
- Query-Following vs Context-Anchoring: How LLMs Handle Cross-Turn Language Switching;
- Generating Difficult-to-Translate Texts;
- 'A Woman is More Culturally Knowledgeable than A Man?': The Effect of Personas on Cultural Norm Interpretation in LLMs;
Further, we welcome the non-archival presentation of the following works:
- Ranking Entropy, Coverage Gap, and Support Set Index: Simple Signals for Multilingual Data Quality;
- MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs;
- CLM-Bench: Benchmarking and Analyzing Cross-lingual Misalignment of LLMs in Knowledge Editing;
- Pearmut: Human Evaluation of Translation Made Trivial;
- Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?;
- Evaluating the Evaluator: Human–LLM Alignment in Multilingual Cultural Reasoning;
- Multi-Agent Multimodal Models for Multicultural Text to Image Generation;
- MentorQA: Multi-Agent Multilingual Question Answering for Long-Form Mentorship Content;
- Cross-Cultural Meme Transcreation with Vision-Language Models;
- Robust Code-Switched Speech Recognition with Whisper: A Comparative Study of Full Fine-Tuning and LoRA on SEAME;
- WORLDVIEW: A Multilingual Benchmark Revealing U.S. Cultural Dominance, Stereotyping and Diversity Failures in Text-to-Image Models;
- Beyond Character Matching: Proposing Psycholinguistically Informed Metrics for Chinese Radical Knowledge in LLMs;
- Glints of Gold or Troubling Waters? Can a School of Merged Monolingual Goldfish Models Swim in Bilingual Seas?;
- Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets;
- VietMix: A Naturally-Occurring Parallel Corpus and Augmentation Framework for Vietnamese-English Code-Mixed Machine Translation;
- Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties;
- SwissGov-RSD: A Human-annotated, Cross-lingual, Document-level Benchmark for Recognition of Semantic Difference at the Token-level;
- When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation;
- How Important is ‘Perfect’ English for Machine Translation Prompts?;
- RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams;
- Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks;
- Form and Meaning in Intrinsic Multilingual Evaluations;
- FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition;
- Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas;
- BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data;
- MAPS: A Multilingual Benchmark for Agent Performance and Security;
- LLMs and Cultural Values: The Impact of Prompt Language and Explicit Cultural Framing;
- Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish;
Organizers
- Pinzhen Chen (Queen's University Belfast)
- Vilém Zouhar (ETH Zurich)
- Hanxu Hu (University of Zurich)
- Simran Khanuja (CMU)
- Wenhao Zhu (ByteDance)
- Barry Haddow (University of Edinburgh)
- Alexandra Birch (University of Edinburgh)
- Alham Fikri Aji (MBZUAI)
- Rico Sennrich (University of Zurich)
- Sara Hooker (Adaptable Intelligence)
Please reach out to mme-workshop@googlegroups.com with any questions or inquiries. This workshop follows ACL's Anti-Harassment Policy.
Program Committee
- A B M Ashikur Rahman (King Fahd University of Petroleum and Minerals)
- Ahrii Kim (Soongsil University)
- Aishwarya Jadhav (University of California, Berkeley)
- Ashok Urlana (International Institute of Information Technology, Hyderabad)
- Bhavitvya Malik (University of Edinburgh)
- Dayyán O'Brien (University of Edinburgh)
- Esther Ploeger (Aalborg University)
- Houda Bouamor (Carnegie Mellon University, Qatar)
- Jacqueline Rowe (University of Edinburgh)
- Junxiao Liu (Nanjing University)
- Koel Dutta Chowdhury (Saarland University)
- Laurie Burchell (Common Crawl Foundation)
- Lorenzo Proietti (Sapienza University of Rome)
- Mateusz Klimaszewski (Warsaw University of Technology)
- Nikita Kiran Yeole (Virginia Tech)
- Nikita Moghe (Amazon)
- Niyati Bafna (Johns Hopkins University)
- Ona de Gibert (University of Helsinki)
- Peter Devine (University of Edinburgh)
- Pinzhen Chen (Queen's University Belfast)
- Shehenaz Hossain (Munster Technological University)
- Shenbin Qian (University of Oslo)
- Sherrie Shen (University of Edinburgh)
- Songbo Hu (University of Cambridge)
- Stefano Perrella (Sapienza University of Rome)
- Vilém Zouhar (ETH Zurich)
- Xu Huang (Nanjing University)
- Yogen Vilas Chaudhari (PowerSchool)
- Zheng Zhao (University of Edinburgh)
- Zhijun Wang (Nanjing University)