Research
Our long-term goal is to develop general-purpose embodied agents that can understand and interact with the physical world and other intelligent beings as flexibly as humans do. Ultimately, we aim to bring embodied general intelligence to both virtual and physical environments.
Lab Publications
2024
FlexAttention for Efficient High-Resolution Vision-Language Models · arXiv · 30 Jul 2024 · arXiv:2407.20228
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation · arXiv · 18 Jun 2024 · arXiv:2311.01455
CoNav: A Benchmark for Human-Centered Collaborative Navigation · arXiv · 05 Jun 2024 · arXiv:2406.02425
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation · arXiv · 17 Apr 2024 · arXiv:2404.10775
SALMON: Self-Alignment with Instructable Reward Models · arXiv · 11 Apr 2024 · arXiv:2310.05910
Thin-Shell Object Manipulations With Differentiable Physics Simulations · arXiv · 02 Apr 2024 · arXiv:2404.00451
Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning · Proceedings of the AAAI Conference on Artificial Intelligence · 24 Mar 2024 · doi:10.1609/aaai.v38i2.27888
3D-VLA: A 3D Vision-Language-Action Generative World Model · arXiv · 15 Mar 2024 · arXiv:2403.09631
Building Cooperative Embodied Agents Modularly with Large Language Models · arXiv · 20 Feb 2024 · arXiv:2307.02485
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction · arXiv · 07 Feb 2024 · arXiv:2205.14756
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments · arXiv · 24 Jan 2024 · arXiv:2401.12975
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World · arXiv · 17 Jan 2024 · arXiv:2401.08577
2023
DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning · arXiv · 12 Dec 2023 · arXiv:2312.05783
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision · arXiv · 05 Dec 2023 · arXiv:2305.03047
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding · arXiv · 07 Nov 2023 · arXiv:2311.03354
$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models · arXiv · 17 Aug 2023 · arXiv:2308.07997
Learning Vision-and-Language Navigation from YouTube Videos · arXiv · 25 Jul 2023 · arXiv:2307.11984
3D-LLM: Injecting the 3D World into Large Language Models · arXiv · 25 Jul 2023 · arXiv:2307.12981
Generating Visually Aligned Sound from Videos · arXiv · 19 Jul 2023 · arXiv:2008.00820
Masked Motion Encoding for Self-Supervised Video Representation Learning · arXiv · 24 Mar 2023 · arXiv:2210.06096
2022
Learning Active Camera for Multi-Object Navigation · arXiv · 17 Oct 2022 · arXiv:2210.07505
Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation · arXiv · 17 Oct 2022 · arXiv:2210.07506
2021
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning · arXiv · 16 Mar 2021 · arXiv:2011.07949
2020
Relation Attention for Temporal Action Localization · IEEE Transactions on Multimedia · 01 Oct 2020 · doi:10.1109/TMM.2019.2959977
Location-aware Graph Convolutional Networks for Video Question Answering · arXiv · 21 Aug 2020 · arXiv:2008.09105
Foley Music: Learning to Generate Music from Videos · arXiv · 22 Jul 2020 · arXiv:2007.10984
Dense Regression Network for Video Grounding · arXiv · 08 Apr 2020 · arXiv:2004.03545
2019
Breaking Winner-Takes-All: Iterative-Winners-Out Networks for Weakly Supervised Temporal Action Localization · IEEE Transactions on Image Processing · 01 Dec 2019 · doi:10.1109/TIP.2019.2922108
Self-supervised Moving Vehicle Tracking with Stereo Sound · arXiv · 28 Oct 2019 · arXiv:1910.11760