Yidong Huang

Hi, my name is Yidong Huang, and I’m a senior student at the University of Michigan, majoring in computer science.

I specialize in Embodied Artificial Intelligence and Generative Models. My research focuses on advancing AI agents’ ability to understand and interact with the world. By developing models and algorithms, I aim to enable these agents to perceive their environment, comprehend human instructions, and perform tasks autonomously.

Aside from work, I’m interested in all kinds of sports, playing and developing games, watching animations, and connecting with all kinds of people. If we share any interests, please contact me via email!

I will be applying to PhD programs for Fall 2025. If you know of any opportunities, please feel free to contact me.


  • 2023-09-01 – now

    University of Michigan

    MS in Computer Science

  • 2021-09-01 – 2023-04-30

    University of Michigan

    B.S.E in Computer Science

  • 2019-09-01 – 2023-08-01

    Shanghai Jiao Tong University

    B.S.E in Electronic and Computer Engineering

  • 2016-09-01 – 2019-06-30

    No. 2 High School of East China Normal University


  • Boson AI

    Machine Learning Intern

  • Situated Language and Embodied Dialogue (SLED) lab

    Undergraduate Research Assistant

  • Intelligent Networked Systems Lab

    Undergraduate Research Assistant


  • Embodied AI
  • Natural Language Processing
  • Diffusion Models
  • Reinforcement Learning for NLP


10/2023 – 03/2024

DriVLMe, Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.

09/2022 – 11/2023

Inversion-Free Image Editing with Natural Language
Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency with accuracy; 3) the lack of compatibility with efficient consistency sampling methods used in consistency models. To address the above issues, we start by asking ourselves if the inversion process can be eliminated for editing. We show that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling. We name this Denoising Diffusion Consistent Model (DDCM), and note that it implies a virtual inversion strategy without explicit inversion in sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them, we present inversion-free editing (InfEdit), which allows for consistent and faithful editing for both rigid and non-rigid semantic changes, catering to intricate modifications without compromising on the image's integrity and explicit inversion. Through extensive experiments, InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40), demonstrating the potential for real-time applications.

03/2023 – 06/2023

CycleNet, Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Besides scene- and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train.

02/2022 – 06/2022

DOROTHIE, Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents
We tackled the limitations of vision-language navigation tasks, where most existing approaches were limited in their ability to navigate in a continuous and dynamic environment and to communicate with humans in free-form dialogue. To address these limitations and collect data, we extended the traditional Wizard of Oz study and proposed the duo-wizard setup. This allowed us to add dynamic changes in the environment and tasks to the simulation, providing a more realistic testbed for evaluating an agent's ability to communicate and navigate.

11/2021 – 12/2021

A-ESRGAN, Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
In the field of Computer Vision, my work focuses on Blind Image Super-Resolution (SR), a task aimed at restoring low-resolution images affected by complex, unknown distortions. My key contribution is the development of A-ESRGAN, a Generative Adversarial Network (GAN) model featuring an attention U-Net-based, multi-scale discriminator. This model is the first to integrate an attention U-Net structure as a discriminator in a GAN for addressing blind SR challenges. My research addresses the limitations of existing GAN structures that neglect an image's structural features, leading to issues like twisted lines and background anomalies. A-ESRGAN overcomes these through its design, which enables enhanced focus on structural details across multiple scales and produces more perceptually realistic high-resolution images. This work not only sets a new benchmark in the non-reference natural image quality evaluator (NIQE) metric but also demonstrates, through extensive ablation studies, how the RRDB-based generator in A-ESRGAN effectively leverages image structural features, outperforming previous models in blind SR tasks.


[All Publications]
  1. DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
    Huang, Yidong, Sansom, Jacob, Ma, Ziqiao, Gervits, Felix, and Chai, Joyce
    arXiv e-prints, 2024
  2. A-ESRGAN: Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
    Wei, Zihao, Huang, Yidong, Chen, Yuang, Zheng, Chenhao, and Gao, Jingnan
    In Pacific Rim International Conference on Artificial Intelligence, pp. 16–27, 2023
  3. CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
    Xu, Sihan, Ma, Ziqiao, Huang, Yidong, Lee, Honglak, and Chai, Joyce
    In Thirty-seventh Conference on Neural Information Processing Systems, 2023