White-Box Transformers via Sparse Rate Reduction
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin D. Haeffele, Yi Ma
Conference on Neural Information Processing Systems (NeurIPS) 2023
Paper / arXiv / Code
In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens. This leads to a family of white-box transformer-like deep network architectures which are mathematically fully interpretable. Despite their simplicity, experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers such as ViT.
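As a rough illustration of the lossy coding rate that the abstract refers to, the sketch below assumes the log-det form used in the rate reduction literature, R(Z; eps) = 1/2 logdet(I + d/(n eps^2) Z Z^T), for a token matrix Z with d rows and n columns. The function name `coding_rate` and the value of `eps` are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate of tokens Z (shape d x n), assuming the log-det form
    R(Z; eps) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T)."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)) for larger matrices
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet

# Hypothetical toy data: 8 tokens in 4 dimensions with equal total energy.
# "spread" tokens cover all 4 axes; "compressed" tokens lie on a single axis.
spread = np.tile(np.eye(4), 2)            # shape (4, 8)
compressed = np.zeros((4, 8))
compressed[0, :] = 1.0

rate_spread = coding_rate(spread)
rate_compressed = coding_rate(compressed)
```

Under this assumed form, tokens confined to a one-dimensional subspace receive a lower rate than tokens of the same energy spread over all dimensions, which is the sense in which compression lowers the coding rate.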






Emergence of Segmentation with Minimalistic White-Box Transformers
Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
Conference on Parsimony and Learning (CPAL) 2024
Paper / arXiv / Code
Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection. Previous works have shown that segmentation properties emerge in vision transformers (ViTs) trained using self-supervised methods such as DINO, but not in those trained on supervised classification tasks. In this study, we probe whether segmentation emerges in transformer-based models solely as a result of intricate self-supervised learning mechanisms, or if the same emergence can be achieved under much broader conditions through proper design of the model architecture. Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network. Our results suggest a path to design white-box foundation models that are simultaneously highly performant and mathematically fully interpretable.






Masked Completion via Structured Diffusion with White-Box Transformers
Druv Pai, Ziyang Wu, Sam Buchanan, Yaodong Yu, Yi Ma
International Conference on Learning Representations (ICLR) 2024
Paper / arXiv / Code
Modern learning frameworks often train deep neural networks with massive amounts of unlabeled data to learn representations by solving simple pretext tasks, then use the representations as foundations for downstream tasks. These networks are empirically designed; as such, they are usually not interpretable, their representations are not structured, and their designs are potentially redundant. White-box deep networks, in which each layer explicitly identifies and transforms structures in the data, present a promising alternative. However, existing white-box architectures have only been shown to work at scale in supervised settings with labeled data, such as classification. In this work, we provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning. We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture, called CRATE-MAE, in which the role of each layer is mathematically fully interpretable: they transform the data distribution to and from a structured representation. Extensive empirical evaluations confirm our analytical insights. CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets while using only ~30% of the parameters compared to the standard masked autoencoder with the same model configuration. The representations learned by CRATE-MAE have explicit structure and also contain semantic meaning.






RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
Yao Mu*, Tianxing Chen*, Zanxin Chen*, Shijia Peng*, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding and Ping Luo
European Conference on Computer Vision (ECCV) 2024 Workshop, Best Paper Award
Project Page / Paper / arXiv / Code
Using the COBOT Magic platform, we have collected diverse data on tool usage, human-robot interaction, and mobile manipulation. We present a cost-effective approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented towards functionality.






A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
Maomao Li, Yu Li, Tianyu Yang, Yunfei Liu, Dongxu Yue, Zhihui Lin, Dong Xu
The Conference on Computer Vision and Pattern Recognition (CVPR) 2024
Project Page / Paper / Code
This paper presents a video inversion approach for zero-shot video editing, which models the input video with a low-rank representation during the inversion process. Existing video editing methods usually apply the typical 2D DDIM inversion or a naive spatial-temporal DDIM inversion before editing, which leverages a time-varying representation for each frame to derive the noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature in an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame applies the fixed and global representation for inversion, which is friendlier to temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion achieves consistent improvement on two state-of-the-art video editing methods.






CO3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving
Runjian Chen, Yao Mu, Runsen Xu, Wenqi Shao, Chenhan Jiang, Hang Xu, Zhenguo Li, Ping Luo
International Conference on Learning Representations (ICLR) 2023
Paper / Code
Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised learning on point clouds in outdoor scenes remains challenging because previous methods need to reconstruct the whole scene and capture partial views for the contrastive objective. This is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representations for outdoor-scene point clouds in an unsupervised manner.






OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Wenqi Shao*, Mengzhao Chen*, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo
International Conference on Learning Representations (ICLR) 2024, Spotlight
arXiv / Code
We propose OmniQuant, a novel post-training quantization (PTQ) technique for large language models (LLMs) that enhances performance in diverse quantization settings, particularly in extremely low-bit quantization. It introduces two key innovations: Learnable Weight Clipping (LWC) to optimize weight clipping thresholds and Learnable Equivalent Transformation (LET) to handle activation outliers by shifting quantization challenges to weights. OmniQuant operates within a differentiable framework using block-wise error minimization, enabling efficient optimization for both weight-only and weight-activation quantization.
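The weight clipping idea can be sketched as uniform min-max quantization in which two factors scale the clipping range. In the sketch below, `gamma` and `beta` are fixed numbers standing in for the clipping strengths that the abstract describes as learnable; the function name and the exact min-max formulation are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def quantize_with_clipping(w, n_bits=4, gamma=1.0, beta=1.0):
    """Uniform asymmetric fake-quantization of weights w.

    gamma and beta (assumed to lie in (0, 1]) shrink the upper and lower
    clipping thresholds. Here they are fixed; in a learnable scheme they
    would be optimized instead.
    """
    levels = 2**n_bits - 1
    w_max = gamma * w.max()
    w_min = beta * w.min()
    step = (w_max - w_min) / levels            # quantization step size
    zero_point = -np.round(w_min / step)       # integer offset for w_min
    q = np.clip(np.round(w / step) + zero_point, 0, levels)
    return (q - zero_point) * step             # dequantized weights

# Hypothetical weights: evenly spaced values in [-1, 1]
w = np.linspace(-1.0, 1.0, 101)
w_8bit = quantize_with_clipping(w, n_bits=8)
w_4bit_clipped = quantize_with_clipping(w, n_bits=4, gamma=0.5, beta=0.5)
```

Shrinking `gamma` and `beta` trades a larger error on the few extreme weights for a finer step size on the rest, which is the trade-off a learnable threshold is meant to tune.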






CLOVER: Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation
Qingwen Bu*, Jia Zeng*, Li Chen*, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, Yi Ma, Hongyang Li
Conference on Neural Information Processing Systems (NeurIPS) 2024
arXiv / Code
Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions from feedback and initiates replans as needed.






Towards Synergistic, Generalized and Efficient Dual-System for Robotic Manipulation
Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, Yu Qiao
Preprint
arXiv / Page
We introduce RoboDual, a synergistic dual-system that supplements the merits of both generalist and specialist policies. A diffusion transformer-based specialist is devised for multi-step action rollouts, exquisitely conditioned on the high-level task understanding and discretized action output of a vision-language-action (VLA) based generalist.






DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima*, Katrin Renz*, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li
European Conference on Computer Vision (ECCV) 2024, Oral
Project Page / Paper / arXiv / Code
We explore how vision-language models (VLMs) trained on web-scale data can enhance generalization and interactivity in end-to-end driving systems. Unlike recent single-round VQA approaches, human drivers reason in multiple steps, starting from object localization to estimating interactions and planning actions. To mimic this process, we propose Graph VQA, a task that models graph-structured reasoning across perception, prediction, and planning. We introduce DriveLM-Data, a dataset based on nuScenes and CARLA, and a VLM-based baseline, DriveLM-Agent, for jointly addressing Graph VQA and autonomous driving. Our work aims to advance the integration of VLMs into driving systems and provides publicly available resources to support future research.






An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models
Yunzhe Hu, Difan Zou, Dong Xu
Conference on Neural Information Processing Systems (NeurIPS) 2024
arXiv
We investigate the properties and limitations of a recently proposed Transformer-like model, Coding Rate Reduction Transformer (CRATE), that is designed by unrolling optimization on Sparse Rate Reduction (SRR) and believed to be interpretable by construction. We also unveil the causal relationship between SRR and the generalization of this model family. Our findings reveal that SRR indeed has a positive and relatively strong correlation with generalization, and can be incorporated as training regularization. However, it is still far from being principled guidance to design better models.






Learning 3D Garment Animation from Trajectories of A Piece of Cloth
Yidi Shao, Chen Change Loy, Bo Dai
Conference on Neural Information Processing Systems (NeurIPS) 2024
Project Page / Code
In this paper, instead of using garment-wise supervised learning, we adopt a disentangled scheme to learn how to animate observed garments: 1) learning constitutive behaviors from the observed cloth; 2) dynamically animating various garments constrained by the learned constitutive laws. Specifically, we propose the Energy Unit network (EUNet) to model the constitutive relations in the form of energy directly from the observed trajectories of a piece of cloth. We further apply the pre-trained EUNet to animate various garments based on energy optimizations. The disentangled scheme alleviates the need for garment data and enables us to robustly animate garments.