Research

For a complete list of publications, please see the Google Scholar page.

2024

  1. FlexAttention for Efficient High-Resolution Vision-Language Models
  2. RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
  3. CoNav: A Benchmark for Human-Centered Collaborative Navigation
  4. COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
  5. SALMON: Self-Alignment with Instructable Reward Models
  6. Thin-Shell Object Manipulations With Differentiable Physics Simulations
  7. Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning
    In AAAI Conference on Artificial Intelligence
  8. 3D-VLA: A 3D Vision-Language-Action Generative World Model
  9. Building Cooperative Embodied Agents Modularly with Large Language Models
  10. EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
  11. HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments
  12. MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

2023

  1. DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning
  2. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
    Advances in Neural Information Processing Systems
  3. CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
  4. A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
  5. Learning Vision-and-Language Navigation from YouTube Videos
  6. 3D-LLM: Injecting the 3D World into Large Language Models
    Advances in Neural Information Processing Systems
  7. Masked Motion Encoding for Self-Supervised Video Representation Learning

2022

  1. Learning Active Camera for Multi-Object Navigation
  2. Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation

2021

  1. RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

2020

  1. Relation Attention for Temporal Action Localization
    IEEE Transactions on Multimedia
  2. Location-aware Graph Convolutional Networks for Video Question Answering
  3. Generating Visually Aligned Sound From Videos
    IEEE Transactions on Image Processing
  4. Foley Music: Learning to Generate Music from Videos
  5. Dense Regression Network for Video Grounding

2019

  1. Breaking Winner-Takes-All: Iterative-Winners-Out Networks for Weakly Supervised Temporal Action Localization
    IEEE Transactions on Image Processing
  2. Self-supervised Moving Vehicle Tracking with Stereo Sound

2018

  1. The Sound of Pixels
    In The European Conference on Computer Vision (ECCV)