
Yidong Huang

Hi, my name is Yidong Huang, and I’m a master’s student at the University of Michigan, studying computer science.

I specialize in Embodied Artificial Intelligence and the development of task-learning agents that interact with humans and environments. My focus is on enabling these agents to make autonomous decisions by observing multiple modalities.

Aside from work, I’m interested in all kinds of sports, playing and developing games, watching animation, and connecting with all kinds of people. If we share any interests, please contact me via email!

I’m actively seeking a summer internship for 2024, and I will be applying to PhD programs in Fall 2025. If you have any opportunities, please feel free to contact me.

EDUCATION

  • 2023-09-01 – now

    University of Michigan

    MS in Computer Science

  • 2021-09-01 – 2023-04-30

    University of Michigan

    B.S.E in Computer Science

  • 2019-09-01 – 2023-08-01

    Shanghai Jiao Tong University

    B.S.E in Electronic and Computer Engineering

  • 2016-09-01 – 2019-06-30

    No. 2 High School of East China Normal University

RESEARCH EXPERIENCE

  • Situated Language and Embodied Dialogue (SLED) lab

    Undergraduate Research Assistant

  • Intelligent Networked Systems Lab

    Undergraduate Research Assistant

RESEARCH INTERESTS

  • Embodied AI
  • Natural Language Processing
  • Diffusion Models
  • Reinforcement Learning for NLP

PROJECTS

09/2022 – 11/2023

Inversion-Free Image Editing with Natural Language
Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the difficulty of balancing consistency with accuracy; 3) the lack of compatibility with the efficient consistency sampling used in consistency models. To address these issues, we start by asking whether the inversion process can be eliminated for editing. We show that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as multi-step consistency sampling. We name this the Denoising Diffusion Consistent Model (DDCM), and note that it implies a virtual inversion strategy without explicit inversion during sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them, we present inversion-free editing (InfEdit), which allows consistent and faithful editing for both rigid and non-rigid semantic changes, catering to intricate modifications without compromising the image's integrity or requiring explicit inversion. Through extensive experiments, InfEdit shows strong performance on various editing tasks while maintaining a seamless workflow (under 3 seconds on a single A40), demonstrating its potential for real-time applications.
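
For intuition only, here is a very loose Python sketch of the virtual-inversion idea, not the paper's exact update rule. It assumes a hypothetical denoise(x_t, t, prompt) function that predicts a clean image from a noisy sample, and a forward-diffusion noising schedule chosen purely for illustration:

```python
import torch

def add_noise(x0, noise, t, alphas_cumprod=None):
    """Standard forward diffusion q(x_t | x_0); the schedule here is an
    illustrative choice, not the one used in InfEdit."""
    if alphas_cumprod is None:
        alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
    a = alphas_cumprod[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def inversion_free_edit_sketch(x_src, denoise, src_prompt, tgt_prompt, timesteps):
    """Rough sketch: edit a *known* source image without explicit inversion.

    x_src     : the known initial sample (source image), shape (B, C, H, W)
    denoise   : hypothetical model that predicts the clean image from a noisy
                sample at timestep t under a text prompt
    timesteps : decreasing list of integer timesteps for multi-step sampling
    """
    x_tgt = x_src.clone()
    for t in timesteps:
        noise = torch.randn_like(x_src)           # shared noise for both branches
        x_src_t = add_noise(x_src, noise, t)      # noise the known source directly
        x_tgt_t = add_noise(x_tgt, noise, t)      # noise the current edited estimate
        x0_src = denoise(x_src_t, t, src_prompt)  # reconstruction of the source
        x0_tgt = denoise(x_tgt_t, t, tgt_prompt)  # prediction under the target prompt
        # "Virtual inversion": transfer the prediction difference onto the
        # known source instead of storing an explicit inversion trajectory.
        x_tgt = x_src + (x0_tgt - x0_src)
    return x_tgt
```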

03/2023 – 06/2023

CycleNet, Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image conditioning. However, enabling unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency remains a critical challenge. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Beyond scene- and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k images) and requires minimal computational resources (1 GPU) to train.
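
As a rough illustration of the cycle-consistency regularizer, and under my own simplifying assumptions rather than CycleNet's actual training objective, the sketch below uses a hypothetical text-guided translator translate(x, prompt) built on a pre-trained diffusion model and penalizes the round-trip reconstruction error:

```python
import torch.nn.functional as F

def cycle_consistency_loss(x_a, translate, prompt_a, prompt_b):
    """Sketch of a cycle-consistency term (not CycleNet's exact objective).

    translate(x, prompt) is a hypothetical text-guided image translator that
    maps an image toward the domain described by the prompt.
    """
    x_ab = translate(x_a, prompt_b)    # forward translation A -> B
    x_aba = translate(x_ab, prompt_a)  # backward translation B -> A
    return F.l1_loss(x_aba, x_a)       # penalize deviation from the original image
```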

02/2022 – 06/2022

DOROTHIE, Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents
We tackled the limitations of vision-language navigation tasks, where most existing approaches cannot navigate in a continuous, dynamic environment or communicate with humans in free-form dialogue. To address these limitations and collect data, we extended the traditional Wizard of Oz study and proposed the duo-wizard setup. This allowed us to introduce dynamic changes to the environment and tasks in the simulation, providing a more realistic testbed for evaluating an agent's ability to communicate and navigate.

11/2021 – 12/2021

A-ESRGAN, Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
In the field of computer vision, my work focuses on a novel approach to blind image super-resolution (SR), a task aimed at restoring low-resolution images affected by complex, unknown distortions. My key contribution is the development of A-ESRGAN, a Generative Adversarial Network (GAN) model featuring an attention U-Net based, multi-scale discriminator. This model is the first to integrate an attention U-Net structure as a discriminator in a GAN for blind SR. My research addresses the limitations of existing GAN structures that neglect an image's structural features, leading to issues like twisted lines and background anomalies. A-ESRGAN overcomes these through its discriminator design, which focuses on structural details across multiple scales and yields more perceptually realistic high-resolution images. This work not only sets a new benchmark on the no-reference natural image quality evaluator (NIQE) metric, but also demonstrates, through extensive ablation studies, how the RRDB-based generator in A-ESRGAN effectively leverages image structural features, outperforming previous models in blind SR tasks.
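
For intuition only, here is a heavily simplified stand-in for the multi-scale discriminator idea; it omits the attention gates on the skip connections and the RRDB generator of the real A-ESRGAN. The same per-pixel U-Net-style discriminator is applied at full and half resolution, so the adversarial feedback covers both fine textures and coarser structure:

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyUNetDiscriminator(nn.Module):
    """Stand-in for an attention U-Net discriminator (attention gates omitted)."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1),
                                  nn.LeakyReLU(0.2))
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                                nn.Conv2d(ch, 1, 3, stride=1, padding=1))

    def forward(self, x):
        return self.up(self.down(x))  # per-pixel real/fake logits

class MultiScaleDiscriminator(nn.Module):
    """Runs separate discriminators at full and half resolution."""
    def __init__(self):
        super().__init__()
        self.d_full = TinyUNetDiscriminator()
        self.d_half = TinyUNetDiscriminator()

    def forward(self, x):
        return self.d_full(x), self.d_half(F.avg_pool2d(x, 2))
```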

RECENT PUBLICATIONS

[All Publications]
  1. A-ESRGAN: Training Real-World Blind Super-Resolution with Attention U-Net Discriminators
    Wei, Zihao, Huang, Yidong, Chen, Yuang, Zheng, Chenhao, and Gao, Jingnan
    In Pacific Rim International Conference on Artificial Intelligence, pp. 16–27, 2023
  2. CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
    Xu, Sihan, Ma, Ziqiao, Huang, Yidong, Lee, Honglak, and Chai, Joyce
    In Thirty-seventh Conference on Neural Information Processing Systems, 2023
  3. Inversion-Free Image Editing with Natural Language
    Xu, Sihan, Huang, Yidong, Pan, Jiayi, Ma, Ziqiao, and Chai, Joyce
    arXiv preprint arXiv:2312.04965, 2023