1 Center for AI Research, VinUniversity
2 VinRobotics
3 University of Arkansas
4 Technical University of Denmark
5 Hanoi University of Science and Technology
6 KAIST
7 Monash University
8 Oldenburg University
9 DFKI
10 University of Stuttgart
11 IMPRS-IS
12 Stanford University
13 Technische Universität Darmstadt
†Project Leads.
Abstract
Validated on: GR00T-N1.5, π0, SmolVLA
Inference: up to 66% of layers pruned
Memory: footprint reduced by 33%
Training: 40–50% time and cost saved
Vision-Language-Action (VLA) models are highly capable, but they impose prohibitive computational costs at fine-tuning and deployment time. We identify an architectural property of leading continuous-control foundation models (π0, GR00T-N1.5): although they are trained on diverse physical trajectories, these models exhibit severe layer-wise representational redundancy.
To exploit this, we introduce CLP (CKA-Guided Layer Pruning), a training-free structural compression pipeline. Using Centered Kernel Alignment (CKA) computed from a single forward pass, CLP identifies and removes redundant twin layers, permanently reducing model depth by up to 50% across both the VLM backbone and the continuous-control policy head, without any auxiliary routing or early-exit modules.
Downstream fine-tuning of the streamlined architecture delivers a dual benefit: 40–50% faster training and up to 30% faster real-time inference, while matching or exceeding full-scale performance. We validate across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 real-world manipulation tasks spanning 4 robotic embodiments. The results indicate that advanced VLAs require significantly fewer layers than previously assumed.
Contributions
Accessible adaptation of SOTA VLAs
π0 and GR00T-N1.5 are fine-tuned at reduced depth, which lowers memory, training, and inference cost with no auxiliary modules.
Pre-finetuning compression via CLP
CKA identifies redundant transformer blocks, and CLP removes them entirely before adaptation. The compression step itself requires no training.
Multi-embodiment validation
Validated on 3 simulation benchmarks and 10 real-world tasks across 4 robots. CLP is platform-agnostic and acts as an effective structural regularizer under limited data.
Method
CKA-Guided Layer Pruning (CLP)
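As a rough illustration of the similarity measure behind CLP, the sketch below computes CKA between the activations of two layers. It assumes the linear-kernel form of CKA and activations flattened to shape `(n_samples, dim)`. The function name `linear_cka` and the synthetic inputs are illustrative and are not taken from the CLP code.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    # Center every feature over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Cross-similarity term and the two self-similarity terms (linear kernel).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    self_x = np.linalg.norm(x.T @ x, ord="fro")
    self_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (self_x * self_y))

# Synthetic stand-ins for layer activations (assumption: not real VLA features).
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
twin = 2.0 * acts @ rotation          # same information, rotated and rescaled
unrelated = rng.normal(size=(2000, 8))

print(linear_cka(acts, twin))       # close to 1: a "twin" representation
print(linear_cka(acts, unrelated))  # close to 0: unrelated representation
```

Because linear CKA is invariant to orthogonal transforms and isotropic scaling, two layers can score near 1 even when their raw activations differ numerically, which is what makes it a plausible detector of redundant layers.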
Key Findings
Finding 01
VLA layers are highly redundant
CKA heatmaps reveal large contiguous blocks of layers in π0 and GR00T-N1.5 that produce nearly identical representations. These blocks span the VLM backbone, the action expert, and the DiT heads.
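One way to turn such a heatmap into a pruning decision is sketched below. It assumes a precomputed layer-by-layer CKA matrix and uses a greedy rule with a hypothetical threshold of 0.9. Both the rule and the threshold are illustrative assumptions, not the selection criterion of CLP itself.

```python
import numpy as np

def layers_to_keep(sim: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy pass over a (n_layers, n_layers) CKA matrix.

    A layer is dropped when its output is nearly identical (CKA >= threshold)
    to the output of the most recently kept layer.
    """
    keep = [0]
    for layer in range(1, sim.shape[0]):
        if sim[keep[-1], layer] < threshold:
            keep.append(layer)
    return keep

# Toy heatmap with three redundant blocks: layers {0, 1, 2}, {3}, {4, 5}.
blocks = [0, 0, 0, 1, 2, 2]
sim = np.array([[0.95 if a == b else 0.30 for b in blocks] for a in blocks])
np.fill_diagonal(sim, 1.0)

keep = layers_to_keep(sim)
print(keep)  # [0, 3, 4]
```

In this toy case the six-layer stack collapses to one representative per block, mirroring the contiguous block structure described above.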
Finding 02
Up to 2/3 of layers can be pruned with minimal loss
Both π0 and GR00T-N1.5 maintain flat performance up to 50% pruning, achieving a 1.39–1.42× FLOPs reduction with a negligible drop in success rate.
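The structural side of pruning can be sketched with a toy stack of residual blocks. The blocks, their scales, and the `keep` indices below are all made up for illustration. Real VLA blocks are transformer layers, and the helper names are hypothetical.

```python
import numpy as np

def make_block(scale: float):
    """Toy residual block h + scale * tanh(h), standing in for a transformer block."""
    return lambda h: h + scale * np.tanh(h)

def prune_stack(blocks: list, keep: list[int]) -> list:
    """Return a new, shallower stack holding only the kept blocks, in order."""
    return [blocks[i] for i in sorted(set(keep))]

def forward(blocks: list, h: np.ndarray) -> np.ndarray:
    for block in blocks:
        h = block(h)
    return h

# Blocks with tiny scales barely change the hidden state (near-redundant).
full = [make_block(s) for s in (0.5, 0.01, 0.01, 0.4, 0.01, 0.3)]
pruned = prune_stack(full, keep=[0, 3, 5])

h0 = np.ones(4)
drift = float(np.abs(forward(full, h0) - forward(pruned, h0)).max())
depth_ratio = len(full) / len(pruned)  # 2.0 for this toy stack
print(depth_ratio, drift)
```

Halving the depth gives a depth ratio of 2.0 here, yet the reported end-to-end gain at 50% pruning is 1.39–1.42×. A plausible reading is that components outside the pruned blocks still contribute to total cost, so the depth ratio acts as an upper bound rather than a prediction.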
Finding 03
Trains faster — and often performs better
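CLP-pruned models train significantly faster while often matching or exceeding baseline success rates. For example, on BananaToPot the pruned model reaches 75% success versus 65% for the full model, with 2.9 h of training instead of 5.1 h.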
Finding 04
Pruned models recover base model behavior after finetuning
PCA of hidden states shows that representations realign closely with those of the base model after fine-tuning, indicating that CLP preserves representational structure despite removing up to half the layers.
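A comparison of this kind could be set up roughly as follows: fit PCA on base-model hidden states, project both models into the same low-dimensional space, and measure how far apart matched samples land. The data below is synthetic, and the names `fit_pca`, `project`, `gap`, and `spread` are illustrative assumptions rather than the analysis code behind this finding.

```python
import numpy as np

def fit_pca(h: np.ndarray, k: int = 2):
    """Return the mean and top-k principal directions of h (n_samples, dim)."""
    mean = h.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(h - mean, full_matrices=False)
    return mean, vt[:k]

def project(h: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project hidden states onto previously fitted principal directions."""
    return (h - mean) @ components.T

rng = np.random.default_rng(0)
# Synthetic base-model hidden states with a few dominant directions.
base = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)
# Stand-in for a pruned, fine-tuned model evaluated on the same inputs.
tuned = base + 0.05 * rng.normal(size=base.shape)

mean, comps = fit_pca(base, k=2)
z_base = project(base, mean, comps)
z_tuned = project(tuned, mean, comps)

gap = float(np.linalg.norm(z_base - z_tuned, axis=1).mean())  # per-sample shift
spread = float(z_base.std(axis=0).mean())                     # scale of the cloud
print(gap, spread)
```

A gap that is small relative to the spread would correspond to the "realigned" picture described above. Pruning changes depth but not hidden width, so both models can share one projection.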
Experiments & Results
Real-world Manipulation
UR10 (Single Arm): Groceries ToBasket, Open Kettle, Close Kettle
UR5 (Single Arm): Serve Napkin, ScrewdriverToBasket
ALOHA (Single Arm): BananaToPot, CubeToDrawer, Block Stacking
ALOHA (Bimanual): Fold Shorts, Fly Towel
Task success rate (%) and training efficiency across 4 robot embodiments.
| Robot | Task | SR (%) | Time (h) | SR-CLP (%) | Time-CLP (h) | Saved |
|---|---|---|---|---|---|---|
| UR10 Single Arm | Groceries ToBasket | 90 | 11.8 | 89 | 8.0 | |
| UR10 Single Arm | Open Kettle | 100 | 2.8 | 95 | 1.4 | |
| UR10 Single Arm | Close Kettle | 100 | 3.0 | 100 | 1.5 | |
| UR5 Single Arm | Serve Napkin | 45 | 1.1 | 65 | 0.7 | |
| UR5 Single Arm | ScrewdriverToBasket | 15 | 1.5 | 30 | 1.1 | |
| ALOHA Single Arm | BananaToPot | 65 | 5.1 | 75 | 2.9 | |
| ALOHA Single Arm | CubeToDrawer | 75 | 5.6 | 60 | 3.2 | |
| ALOHA Single Arm | Block Stacking | 80 | 2.8 | 75 | 1.4 | |
| ALOHA Bimanual | Fold Shorts | 90 | 6.5 | 95 | 4.4 | |
| ALOHA Bimanual | Fly Towel | 75 | 3.2 | 70 | 2.1 | |
| Average / Total | | 73.5 | 43.4 | 75.9 | 27.7 | |
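The relative training-time saving for a task can be derived from the Time (h) and Time-CLP (h) columns above. The short sketch below does this for three rows. It assumes "saved" means the relative reduction in training hours, and the helper name `time_saved` is hypothetical.

```python
def time_saved(full_hours: float, clp_hours: float) -> float:
    """Relative training-time reduction of the pruned model, in percent."""
    return 100.0 * (full_hours - clp_hours) / full_hours

# (Time, Time-CLP) in hours, copied from three rows of the table above.
rows = {
    "Open Kettle": (2.8, 1.4),
    "BananaToPot": (5.1, 2.9),
    "Fold Shorts": (6.5, 4.4),
}

for task, (full, clp) in rows.items():
    print(f"{task}: {time_saved(full, clp):.1f}% saved")
# Open Kettle: 50.0% saved
# BananaToPot: 43.1% saved
# Fold Shorts: 32.3% saved
```

Per-task savings vary with the task and embodiment, so individual rows can fall on either side of the headline 40–50% range.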
Simulation Experiments
Benchmarks: LIBERO, RoboCasa, SimplerEnv
LIBERO benchmark — comparison with training-free acceleration methods.
| Method | Spatial | Object | Goal | Long | Avg SR | Speedup ↑ |
|---|---|---|---|---|---|---|
| OpenVLA-OFT group | | | | | | |
| FastV | 94.6 | 95.8 | 94.0 | 88.8 | 93.3 | 1.44× |
| DivPrune | 92.4 | 91.2 | 89.0 | 84.8 | 89.4 | 1.46× |
| EfficientVLA | 96.5 | 91.1 | 96.0 | 72.1 | 88.9 | 1.52× |
| π0 group | | | | | | |
| π0 | 94.6 | 98.2 | 95.4 | 90.0 | 94.6 | 1.00× |
| π0-SpecPrune-VLA | 96.6 | 98.0 | 95.2 | 84.2 | 93.5 | 1.31× |
| π0-CLP (Ours) | 95.0 | 99.2 | 95.0 | 86.4 | 93.9 | 1.39× |
| GR00T-N1.5 group | | | | | | |
| GR00T-N1.5 | 90.8 | 98.4 | 95.4 | 91.0 | 93.9 | 1.00× |
| GR00T-N1.5-CLP (Ours) | 89.4 | 98.8 | 95.8 | 88.6 | 93.0 | 1.42× |
| SmolVLA group | | | | | | |
| SmolVLA | 71.8 | 92.2 | 87.4 | 57.2 | 77.15 | 1.00× |
| SmolVLA-CLP (Ours) | 75.6 | 93.0 | 81.6 | 56.2 | 76.75 | 1.47× |
GR00T-N1.5 on SimplerEnv (WidowX) — task success rate (%).
| Task | GR00T-N1.5 | GR00T-N1.5-CLP (Ours) |
|---|---|---|
| Carrot Plate | 26 | 34 (↑8 pts) |
| Eggplant Basket | 34 | 14 |
| Spoon Towel | 18 | 38 (↑20 pts) |
| Stack Cube | 8 | 4 |
| Eggplant Sink | 8 | 16 (↑8 pts) |
| Close Drawer | 12 | 24 (↑12 pts) |
| Open Drawer | 10 | 10 |
| Average | 16.57 | 20.0 (↑3.43 pts) |
| Training Time (hrs) | 22.9 | 15.7 (↓31%) |
π0 on LIBERO — trained on 10% data (% Success Rate).
π0 on RoboCasa — 30 demonstrations (% Success Rate).
Citation
@article{NguyenGiaBinh2026FewerLayers,
title={Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think},
author={Nguyen Gia Binh and Trong-Bao Ho and Thien-Loc Ha and Khoa Vo and Philip Lund M{\o}ller and Quang Tan Nguyen and Long Dinh and Tung Minh Luu and Tuan Quang Dam and Vu N. Duong and Trung Le and Nghi D. Q. Bui and Minh Nhat Vu and Tran Nguyen Le and An Thai Le and Ngan Le and Daniel Sonntag and James Zou and Jan Peters and Duy Minh Ho Nguyen and Vien Anh Ngo},
journal={arXiv preprint},
year={2026},
url={https://clpvla.github.io/}
}