Speaker
Chang Liu
Machine Learning Research Scientist at TikTok, Singapore
Time
Wednesday, February 26, 2025
10:00–11:00 AM
Venue
Conference Room 308, College Building
Abstract
Vision and language are fundamental ways humans interact with the world. Multi-modal tasks that integrate these two modalities, such as Referring Expression Segmentation (RES), are crucial for both foundational research and practical applications. Given an image and a textual expression referring to an object within it, RES aims to identify the target object by generating a segmentation mask for it. Combining two critical sub-tasks, language understanding and image segmentation, RES is one of the most challenging yet significant vision-language tasks. In this talk, I will discuss the background and challenges of RES, and introduce several of our works on this task, including: 1) a CNN-based method that integrates two different traditional pipelines, 2) the first Transformer-based RES framework and its subsequent improvements, and 3) a generalized benchmark beyond classic RES that addresses its definitional limitations. Finally, I will discuss emerging related tasks, datasets, open challenges, and future directions in this field.
Biography
Chang Liu is a Machine Learning Research Scientist at TikTok, Singapore, specializing in research on multi-modal large language models (MLLMs). He received his B.Eng. degree from Harbin Institute of Technology (HIT), China, in 2018, and his M.S. and Ph.D. degrees from Nanyang Technological University (NTU), Singapore, in 2019 and 2024, respectively. From 2023 to 2024, he served as a Research Scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore. His research interests include complex visual understanding, image and video segmentation, multi-modal learning, and generative models. He has published more than 10 papers in top-tier international AI conferences and journals such as CVPR, ICCV, and TPAMI, with nearly a thousand citations. His first-author paper, GRES, was recognized as a CVPR 2023 Highlight. Additionally, he has been a primary organizer of workshops and competitions at conferences such as CVPR and ECCV, organizing challenges like MOSE, MeViS, and LSVOS, which have attracted hundreds of teams worldwide.
Homepage:
https://scholar.google.com/citations?user=XlQP0GIAAAAJ&hl=zh-CN
