
Yidong Huang

Hi, my name is Yidong Huang, and I’m a master’s student at the University of Michigan, studying computer science.

I specialize in Embodied Artificial Intelligence and the development of task-learning agents that interact with humans and environments. My focus is on enabling these agents to make autonomous decisions by observing multiple modalities.

Aside from work, I’m interested in all kinds of sports, playing and developing games, watching animation, and connecting with all kinds of people. If we share any interests, please contact me via email!

I’m actively seeking a summer internship for 2024, and I will be applying to PhD programs in Fall 2025. If you have any opportunities, please feel free to contact me.

EDUCATION

  • 2023-09-01 – now

    University of Michigan

    MS in Computer Science

  • 2021-09-01 – 2023-04-30

    University of Michigan

    B.S.E in Computer Science

  • 2019-09-01 – 2023-08-01

    Shanghai Jiao Tong University

    B.S.E in Electronic and Computer Engineering

  • 2016-09-01 – 2019-06-30

    No. 2 High School of East China Normal University

RESEARCH EXPERIENCE

  • Situated Language and Embodied Dialogue (SLED) lab

    Undergraduate Research Assistant

  • Intelligent Networked Systems Lab

    Undergraduate Research Assistant

RESEARCH INTERESTS

  • Embodied AI
  • Natural Language Processing
  • Diffusion Models
  • Reinforcement Learning for NLP

PROJECTS

09/2022 – 11/2023

Inversion-Free Image Editing with Natural Language
Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the difficulty of balancing consistency with accuracy; 3) the lack of compatibility with the efficient consistency sampling used in consistency models. To address these issues, we start by asking whether the inversion process can be eliminated for editing. We show that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as multi-step consistency sampling. We name this the Denoising Diffusion Consistent Model (DDCM), and note that it implies a virtual inversion strategy without explicit inversion during sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them, we present inversion-free editing (InfEdit), which allows consistent and faithful editing for both rigid and non-rigid semantic changes, catering to intricate modifications without compromising the image's integrity or requiring explicit inversion. Through extensive experiments, InfEdit shows strong performance on various editing tasks while maintaining a seamless workflow (under 3 seconds on a single A40), demonstrating its potential for real-time applications.
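
For intuition only, here is a very loose Python sketch of the virtual-inversion idea, not the paper's exact update rule. It assumes a hypothetical denoise(x_t, t, prompt) function that predicts a clean image from a noisy sample, and a forward-diffusion noising schedule chosen purely for illustration:

```python
import torch

def add_noise(x0, noise, t, alphas_cumprod=None):
    """Standard forward diffusion q(x_t | x_0); the schedule here is an
    illustrative choice, not the one used in InfEdit."""
    if alphas_cumprod is None:
        alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def inversion_free_edit_sketch(x_src, denoise, src_prompt, tgt_prompt, timesteps):
    """Rough sketch: edit a *known* source image without explicit inversion.

    x_src     : the known initial sample (source image), shape (B, C, H, W)
    denoise   : hypothetical model that predicts the clean image from a noisy
                sample at timestep t under a text prompt
    timesteps : decreasing list of integer timesteps for multi-step sampling
    """
    x_tgt = x_src.clone()
    for t in timesteps:
        noise = torch.randn_like(x_src)           # shared noise for both branches
        x_src_t = add_noise(x_src, noise, t)      # noise the known source directly
        x_tgt_t = add_noise(x_tgt, noise, t)      # noise the current edited estimate
        x0_src = denoise(x_src_t, t, src_prompt)  # reconstruction of the source
        x0_tgt = denoise(x_tgt_t, t, tgt_prompt)  # prediction under the target prompt
        # "Virtual inversion": transfer the prediction difference onto the
        # known source instead of storing an explicit inversion trajectory.
        x_tgt = x_src + (x0_tgt - x0_src)
    return x_tgt
```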

03/2023 – 06/2023

CycleNet, Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image conditioning. However, enabling unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency remains a critical challenge. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Beyond scene- and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k images) and requires minimal computational resources (1 GPU) to train.
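
As a rough illustration of the cycle-consistency regularizer, and under my own simplifying assumptions rather than CycleNet's actual training objective, the sketch below uses a hypothetical text-guided translator translate(x, prompt) built on a pre-trained diffusion model and penalizes the round-trip reconstruction error:

```python
import torch.nn.functional as F

def cycle_consistency_loss(x_a, translate, prompt_a, prompt_b):
    """Sketch of a cycle-consistency term (not CycleNet's exact objective).

    translate(x, prompt) is a hypothetical text-guided image translator that
    maps an image toward the domain described by the prompt.
    """
    x_ab = translate(x_a, prompt_b)    # forward translation A -> B
    x_aba = translate(x_ab, prompt_a)  # backward translation B -> A
    return F.l1_loss(x_aba, x_a)       # penalize deviation from the original image
```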

02/2022 – 06/2022

DOROTHIE, Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents
We tackled the limitations of vision-language navigation tasks, where most existing approaches cannot navigate in a continuous, dynamic environment or communicate with humans in free-form dialogue. To address these limitations and collect data, we extended the traditional Wizard of Oz study and proposed the duo-wizard setup. This allowed us to introduce dynamic changes to the environment and tasks in the simulation, providing a more realistic testbed for evaluating an agent's ability to communicate and navigate.

11/2021 – 12/2021

A-ESRGAN, Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
In the field of computer vision, my work focuses on a novel approach to blind image super-resolution (SR), a task aimed at restoring low-resolution images affected by complex, unknown distortions. My key contribution is the development of A-ESRGAN, a Generative Adversarial Network (GAN) model featuring an attention U-Net based, multi-scale discriminator. This model is the first to integrate an attention U-Net structure as a discriminator in a GAN for blind SR. My research addresses the limitations of existing GAN structures that neglect an image's structural features, leading to issues like twisted lines and background anomalies. A-ESRGAN overcomes these through its discriminator design, which focuses on structural details across multiple scales and yields more perceptually realistic high-resolution images. This work not only sets a new benchmark on the no-reference natural image quality evaluator (NIQE) metric, but also demonstrates, through extensive ablation studies, how the RRDB-based generator in A-ESRGAN effectively leverages image structural features, outperforming previous models in blind SR tasks.
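
For intuition only, here is a heavily simplified stand-in for the multi-scale discriminator idea; it omits the attention gates on the skip connections and the RRDB generator of the real A-ESRGAN. The same per-pixel U-Net-style discriminator is applied at full and half resolution, so the adversarial feedback covers both fine textures and coarser structure:

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyUNetDiscriminator(nn.Module):
    """Stand-in for an attention U-Net discriminator (attention gates omitted)."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1),
                                  nn.LeakyReLU(0.2))
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                                nn.Conv2d(ch, 1, 3, stride=1, padding=1))

    def forward(self, x):
        return self.up(self.down(x))  # per-pixel real/fake logits

class MultiScaleDiscriminator(nn.Module):
    """Runs separate discriminators at full and half resolution."""
    def __init__(self):
        super().__init__()
        self.d_full = TinyUNetDiscriminator()
        self.d_half = TinyUNetDiscriminator()

    def forward(self, x):
        return self.d_full(x), self.d_half(F.avg_pool2d(x, 2))
```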

RECENT PUBLICATIONS

[All Publications]
  1. A-ESRGAN: Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
    Wei, Zihao, Huang, Yidong, Chen, Yuang, Zheng, Chenhao, and Gao, Jingnan
    In Pacific Rim International Conference on Artificial Intelligence, pp. 16–27, 2023
  2. CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
    Xu, Sihan, Ma, Ziqiao, Huang, Yidong, Lee, Honglak, and Chai, Joyce
    In Thirty-seventh Conference on Neural Information Processing Systems, 2023
  3. Inversion-Free Image Editing with Natural Language
    Xu, Sihan, Huang, Yidong, Pan, Jiayi, Ma, Ziqiao, and Chai, Joyce
    arXiv preprint arXiv:2312.04965, 2023