Efficient and Robust Training and Inference Techniques for Multimodal Large Language Models

Posted by: Liang Huili (梁慧丽) | Published: 2024-05-26


TIME

May 27, 2024, 8:30 - 10:00

VENUE

Room 308, School of Information Management and Engineering

Tencent Meeting: 148-343-905

SPEAKER

Chaoya Jiang (蒋超亚) is a Ph.D. student at the National Engineering Research Center of Software Engineering, Peking University, supervised by Professor Shikun Zhang. His research topics include efficient training and inference of multimodal large language models and hallucination mitigation for multimodal large language models. He has published more than ten papers at top conferences such as CVPR, ICCV, ICLR, ACL, MM, AAAI, and EMNLP. He won the first prize of the Beijing Science and Technology Progress Award in 2023 and was supported by the National Natural Science Foundation's Young Student Basic Research Project (Doctoral Student) in 2023.


TITLE

Efficient and Robust Training and Inference Techniques for Multimodal Large Language Models


ABSTRACT

Large language models, exemplified by ChatGPT, have attracted significant attention for their strong generalization capabilities and practical effectiveness. Multimodal large language models, which can process both visual and textual data, are becoming a new focus of research. However, they still face three major challenges. First, the massive volume of pre-training data is of mixed quality, leading to high training costs. Second, current training strategies are sensitive to data noise and do not sufficiently optimize the distribution of cross-modal semantic representations, which significantly impacts model robustness. Third, the enormous scale of model parameters and the complexity of decoding long feature sequences create significant bottlenecks for inference efficiency. This project systematically investigates these three issues across the key stages of data preparation, model training, and model inference. In the data preparation stage, we design methods for selecting and synthesizing high-quality data, guided by multi-granularity cross-modal semantics, to improve the quality and efficiency of training data. In the model training stage, we propose an optimization mechanism for the representation space, based on hallucination enhancement and cross-modal mutual information constraints, to boost the model's robustness in noisy environments. In the model inference stage, we explore a lossless dynamic reduction mechanism for long feature sequences to reduce the computational overhead of inference. Through these studies, we expect to significantly improve training and inference efficiency and to enhance model robustness.

