arXiv (2025) · Under review at ICLR 2026

MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen

[Figure: MMA-ASIA benchmark overview]

Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79% require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.

Background
Multimodal understanding and reasoning often degrade outside Western, high-resource settings, and culture-aware evaluation has lagged behind
Problem
Quantify cultural awareness across Asian languages and modalities, and verify that models answer for the right reasons
Method
  • Human-curated multiple-choice benchmark across 8 Asian countries and 10 languages, with 27,000 questions; over 79% require multi-step cultural reasoning
  • First dataset input-aligned across text, image, and speech, enabling direct cross-modal transfer tests
  • Five-dimensional protocol: country disparities, cross-lingual and cross-modal consistency, generalization, grounding validity
  • A grounding validation module detects shortcut learning; Vision-ablated Prefix Replay (VPR) probes modality divergence
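Because every question is aligned at the input level, a consistency score can be read off by comparing a model's answers to the same question across modalities (or across languages). The sketch below is illustrative only: the function name `consistency_rate`, the prediction layout, and the "identical answer under every condition" criterion are assumptions made for this example, not the metric defined in the paper.

```python
from collections import defaultdict


def consistency_rate(predictions, conditions):
    """Fraction of questions answered identically under every condition.

    predictions: dict mapping (question_id, condition) -> chosen option,
                 where a condition is a modality such as "text" or a
                 language code (hypothetical layout for this sketch).
    conditions:  the conditions that must all be present for a question
                 to be counted.
    """
    by_question = defaultdict(dict)
    for (question_id, condition), answer in predictions.items():
        by_question[question_id][condition] = answer

    # Keep only questions that were answered under every condition.
    complete = [
        answers for answers in by_question.values()
        if all(c in answers for c in conditions)
    ]
    if not complete:
        return 0.0

    # A question is consistent when all conditions yield the same option.
    consistent = sum(
        1 for answers in complete
        if len({answers[c] for c in conditions}) == 1
    )
    return consistent / len(complete)
```

The same function covers cross-lingual consistency by passing language codes as the conditions. Note that it measures agreement only, not correctness: a model that gives the same wrong answer in every modality still counts as consistent, so it would be read alongside accuracy rather than in place of it.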
Results
  • Reveals cultural-awareness gaps across countries and languages
  • Demonstrates cross-modal inconsistency and shortcut-learning risks in current models
Role
  • Led the Korean subset: taxonomy design, image collection, and QA authoring guidelines
  • Evaluated diverse VLMs and calibrated quality-control criteria
  • Contributed analysis and writing