CLP-VLA

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

Gia-Binh Nguyen¹,²,*, Trong-Bao Ho², Thien-Loc Ha², Khoa Vo³, Philip Lund Møller⁴, Quang T. Nguyen², Long Dinh¹,², Tung M. Luu⁶, Tuan Dam⁵, Vu Duong¹, Trung Le⁷, Nghi D. Q. Bui¹, Minh Vu¹,², Tran Nguyen Le⁴, An Thai Le¹,²,¹³, Ngan Le³, Daniel Sonntag⁸,⁹, James Zou¹², Jan Peters⁹,¹³, Duy M. H. Nguyen†,⁹,¹⁰,¹¹, Ngo Anh Vien†,¹,²

¹Center for AI Research, VinUniversity ²VinRobotics ³University of Arkansas ⁴Technical University of Denmark ⁵Hanoi University of Science and Technology ⁶KAIST ⁷Monash University ⁸Oldenburg University ⁹DFKI ¹⁰University of Stuttgart ¹¹IMPRS-IS ¹²Stanford University ¹³Technische Universität Darmstadt
†Project leads.

Paper: coming soon | Code: coming soon | arXiv

Abstract

VLA models are far more compressible than we assumed.

Validated on: GR00T-N1.5, π0, SmolVLA

Inference: up to 66% of layers pruned

Memory: footprint reduced by 33%

Training: 40–50% time and cost saved

Vision-Language-Action (VLA) models are highly capable but impose prohibitive computational costs at fine-tuning and deployment time. We show that leading continuous-control foundation models (π0, GR00T-N1.5) share a non-obvious architectural property: although they are trained on diverse physical trajectories, many of their layers produce nearly redundant representations.

To exploit this, we introduce CLP, a training-free structural compression pipeline. CLP computes Centered Kernel Alignment (CKA) between layers from a single forward pass, then identifies and removes redundant twin layers. This permanently reduces model depth by up to 50% across both the VLM backbone and the continuous-control policy head, without any auxiliary routing or early-exit modules.

Fine-tuning the streamlined architecture brings a dual benefit: 40–50% faster training and up to 30% faster real-time inference, while matching or exceeding full-scale performance. We validate on three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 real-world manipulation tasks spanning 4 robotic embodiments, indicating that advanced VLAs need significantly fewer layers than previously assumed.

Contributions

Three concrete advances in efficient VLA adaptation.

Accessible adaptation of SOTA VLAs

π0 and GR00T-N1.5 are fine-tuned at reduced depth, lowering memory, training, and inference cost with no auxiliary modules.

Pre-finetuning compression via CLP

CKA identifies redundant transformer blocks, which are removed entirely before adaptation; the pruning step itself requires no training.

Multi-embodiment validation

Validated on 3 sim benchmarks and 10 real-world tasks across 4 robots. CLP is platform-agnostic and acts as an effective structural regularizer under limited data.

Method

One forward pass. No extra modules. Up to 50% faster training.

CKA-Guided Layer Pruning (CLP)

CKA Profiling

Feed sample batches through the pretrained VLA and compute pairwise CKA similarity across all layers.
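The page does not say which CKA variant is used. As a minimal sketch, assuming linear CKA on per-layer activations flattened to shape (n_samples, hidden_dim), the profiling step might look like this. The names `linear_cka` and `pairwise_cka` are illustrative, not the authors' code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def pairwise_cka(layer_acts: list[np.ndarray]) -> np.ndarray:
    """Symmetric (n_layers, n_layers) matrix of CKA scores."""
    n = len(layer_acts)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(layer_acts[i], layer_acts[j])
    return sim
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why it can compare layers whose hidden states live in differently rotated bases.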

Block Identification

Group consecutive high-similarity layers (CKA ≥ τ) into redundant blocks; keep only the first layer of each.
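The grouping rule can be sketched as a single greedy pass over a layer-by-layer similarity matrix. One detail is an assumption here: each layer is compared against the first (kept) layer of the current block rather than against its immediate predecessor, since the page does not specify which. The function name and the example threshold are hypothetical.

```python
def select_layers_to_keep(sim, tau):
    """Return indices of layers kept after collapsing redundant blocks.

    sim: square matrix (nested lists or array) of pairwise CKA scores.
    tau: similarity threshold; layers with CKA >= tau to the block's
         first layer are treated as redundant and dropped.
    """
    if len(sim) == 0:
        return []
    keep = [0]
    anchor = 0  # first layer of the current block (assumed comparison point)
    for layer in range(1, len(sim)):
        if sim[anchor][layer] >= tau:
            continue  # same block as the anchor: prune this layer
        anchor = layer  # similarity dropped: start a new block
        keep.append(layer)
    return keep


# Toy similarity matrix: layers 0-2 form one block, layers 3-4 another.
toy_sim = [
    [1.00, 0.97, 0.95, 0.40, 0.35],
    [0.97, 1.00, 0.96, 0.45, 0.38],
    [0.95, 0.96, 1.00, 0.50, 0.42],
    [0.40, 0.45, 0.50, 1.00, 0.93],
    [0.35, 0.38, 0.42, 0.93, 1.00],
]
kept = select_layers_to_keep(toy_sim, tau=0.9)  # [0, 3]
```

Raising τ makes the blocks smaller and keeps more layers; with τ = 0.96 the same toy matrix keeps layers 0, 2, 3, and 4.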

Training-free Compression

Prune 33–66% of transformer layers with ~30% FLOP reduction — no routing, no auxiliary modules.
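Once the kept indices are known, the compression step itself is just a filter over the layer stack. The sketch below is framework-agnostic and uses plain Python lists; `prune_layers` and `pruned_fraction` are hypothetical helper names, and the 12-layer stack is made up for illustration.

```python
def prune_layers(layers, keep):
    """Return only the layers whose indices appear in `keep`, in order."""
    keep_set = set(keep)
    return [layer for i, layer in enumerate(layers) if i in keep_set]


def pruned_fraction(n_total, n_kept):
    """Fraction of layers removed, e.g. 0.5 means half the depth is gone."""
    return 1.0 - n_kept / n_total


# Hypothetical 12-block stack where every third block is kept.
stack = [f"block_{i}" for i in range(12)]
compact = prune_layers(stack, keep=[0, 3, 6, 9])
fraction = pruned_fraction(len(stack), len(compact))  # 8 of 12 removed
```

In a PyTorch model the filtered list would typically be wrapped back into a `torch.nn.ModuleList` so that the pruned layers are no longer stored or executed.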

Fine-tune & Deploy

Fine-tune the pruned model on the target task. Training time drops by up to 50%, and inference speeds up by as much as 30–50%.

Key Findings

What we discovered inside large VLA models.

Finding 01

VLA layers are highly redundant

CKA heatmaps reveal large contiguous blocks in π0 and GR00T-N1.5 producing nearly identical representations — spanning the VLM backbone, action expert, and DiT heads.

Finding 02

Up to 2/3 of layers can be pruned with minimal loss

Both π0 and GR00T-N1.5 hold success rates roughly flat up to 50% pruning, reaching a 1.39–1.42× reduction in FLOPs with a negligible drop in success rate.
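A reduction factor and a percentage reduction are two views of the same quantity. Assuming the factor is the ratio of baseline cost to pruned cost, a factor of 1.39 corresponds to roughly 28% less compute and 1.42 to roughly 30%, which lines up with the "~30% FLOP reduction" quoted in the method steps. The helper names below are illustrative.

```python
def cost_reduction(factor: float) -> float:
    """Fraction of cost removed, given baseline_cost / pruned_cost."""
    return 1.0 - 1.0 / factor


def factor_from_reduction(reduction: float) -> float:
    """Inverse mapping: baseline_cost / pruned_cost for a removed fraction."""
    return 1.0 / (1.0 - reduction)


low = cost_reduction(1.39)   # about 0.28
high = cost_reduction(1.42)  # about 0.30
```

Note that the layer fraction removed (up to 50% or more) is larger than the compute fraction removed, which suggests that components outside the pruned transformer blocks account for part of the total cost.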

Finding 03

Trains faster — and often performs better

CLP trains significantly faster while matching or exceeding baseline success rates on several tasks, e.g. 75% vs. 65% success on BananaToPot with 2.9 h of training instead of 5.1 h.

Finding 04

Pruned models recover base model behavior after finetuning

Hidden state PCA shows representations realign closely with the base model post-finetuning — confirming CLP preserves representational structure despite removing up to half the layers.
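One way to run this kind of comparison is to fit PCA on the base model's hidden states and project both models into that shared basis. The sketch below is an assumption about the analysis, not the authors' procedure: it requires hidden states collected on the same inputs in the same order, and `mean_projection_gap` is a hypothetical summary metric.

```python
import numpy as np


def fit_pca(h: np.ndarray, k: int = 2):
    """Return the mean and the top-k principal directions of h (n, dim)."""
    mean = h.mean(axis=0)
    _, _, vt = np.linalg.svd(h - mean, full_matrices=False)
    return mean, vt[:k]


def project(h: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project hidden states onto a previously fitted PCA basis."""
    return (h - mean) @ components.T


def mean_projection_gap(base_h: np.ndarray, other_h: np.ndarray, k: int = 2) -> float:
    """Average distance between paired points in the base model's PCA space."""
    mean, comps = fit_pca(base_h, k)
    diff = project(base_h, mean, comps) - project(other_h, mean, comps)
    return float(np.linalg.norm(diff, axis=1).mean())
```

A small gap after fine-tuning, compared with the gap right after pruning, would be consistent with the realignment described above.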

Experiments & Results

Validated across simulation and 4 real-world robot embodiments.

Real-world Manipulation

UR10 (single arm): GroceriesToBasket, Open Kettle, Close Kettle

UR5 (single arm): Serve Napkin, ScrewdriverToBasket

ALOHA (single arm): BananaToPot, CubeToDrawer, Block Stacking

ALOHA (bimanual): Fold Shorts, Fly Towel

Task success rate (SR, %) and training time (h) across 4 robot embodiments, for the full model and its CLP-pruned counterpart. "Saved" is the relative reduction in training time.

Robot	Task	SR (%)	Time (h)	SR-CLP (%)	Time-CLP (h)	Saved
UR10 Single Arm	GroceriesToBasket	90	11.8	89	8.0	32%
UR10 Single Arm	Open Kettle	100	2.8	95	1.4	50%
UR10 Single Arm	Close Kettle	100	3.0	100	1.5	50%
UR5 Single Arm	Serve Napkin	45	1.1	65	0.7	36%
UR5 Single Arm	ScrewdriverToBasket	15	1.5	30	1.1	27%
ALOHA Single Arm	BananaToPot	65	5.1	75	2.9	43%
ALOHA Single Arm	CubeToDrawer	75	5.6	60	3.2	43%
ALOHA Single Arm	Block Stacking	80	2.8	75	1.4	50%
ALOHA Bimanual	Fold Shorts	90	6.5	95	4.4	32%
ALOHA Bimanual	Fly Towel	75	3.2	70	2.1	34%
Average / Total		73.5	43.4	75.9	27.7	36%
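The "Saved" column matches the relative reduction in training time, (baseline − CLP) / baseline, rounded to the nearest percent. A short check on a few rows of the table above, with `time_saved_pct` as an illustrative helper name:

```python
def time_saved_pct(baseline_hours: float, clp_hours: float) -> float:
    """Relative training-time reduction in percent."""
    return (baseline_hours - clp_hours) / baseline_hours * 100.0


# (task, baseline hours, CLP hours) copied from the table above.
rows = [
    ("GroceriesToBasket", 11.8, 8.0),
    ("Open Kettle", 2.8, 1.4),
    ("ScrewdriverToBasket", 1.5, 1.1),
    ("BananaToPot", 5.1, 2.9),
    ("Total", 43.4, 27.7),
]
saved = {task: round(time_saved_pct(base, clp)) for task, base, clp in rows}
# saved == {"GroceriesToBasket": 32, "Open Kettle": 50,
#           "ScrewdriverToBasket": 27, "BananaToPot": 43, "Total": 36}
```

The bottom-row 36% is computed from the total hours rather than as the mean of the per-task percentages.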

Simulation Experiments

Benchmarks: LIBERO, RoboCasa, SimplerEnv

LIBERO benchmark: comparison with training-free acceleration methods (success rate, %).

Method	Spatial	Object	Goal	Long	Avg SR	Speedup ↑
OpenVLA-OFT group
FastV	94.6	95.8	94.0	88.8	93.3	1.44×
DivPrune	92.4	91.2	89.0	84.8	89.4	1.46×
EfficientVLA	96.5	91.1	96.0	72.1	88.9	1.52×
π0 group
π0	94.6	98.2	95.4	90.0	94.6	1.00×
π0-SpecPrune-VLA	96.6	98.0	95.2	84.2	93.5	1.31×
π0-CLP (Ours)	95.0	99.2	95.0	86.4	93.9	1.39×
GR00T-N1.5 group
GR00T-N1.5	90.8	98.4	95.4	91.0	93.9	1.00×
GR00T-N1.5-CLP (Ours)	89.4	98.8	95.8	88.6	93.0	1.42×
SmolVLA group
SmolVLA	71.8	92.2	87.4	57.2	77.15	1.00×
SmolVLA-CLP (Ours)	75.6	93.0	81.6	56.2	76.75	1.47×

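GR00T-N1.5 on SimplerEnv (WidowX): task success rate (%). Arrows on success rates give the gain over the base model in percentage points; the arrow on training time is a relative reduction.

Task	GR00T-N1.5	GR00T-N1.5-CLP (Ours)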
Carrot Plate	26	34 ↑8%
Eggplant Basket	34	14
Spoon Towel	18	38 ↑20%
Stack Cube	8	4
Eggplant Sink	8	16 ↑8%
Close Drawer	12	24 ↑12%
Open Drawer	10	10
Average	16.57	20.0 ↑3.43%
Training Time (hrs)	22.9	15.7 ↓31%

π0 on LIBERO — trained on 10% data (% Success Rate).

π0 on RoboCasa — 30 demonstrations (% Success Rate).

Citation

BibTeX

@article{NguyenGiaBinh2026FewerLayers,
  title={Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think},
  author={Nguyen Gia Binh and Trong-Bao Ho and Thien-Loc Ha and Khoa Vo and Philip Lund M{\o}ller and Quang Tan Nguyen and Long Dinh and Tung Minh Luu and Tuan Quang Dam and Vu N. Duong and Trung Le and Nghi D. Q. Bui and Minh Nhat Vu and Tran Nguyen Le and An Thai Le and Ngan Le and Daniel Sonntag and James Zou and Jan Peters and Duy Minh Ho Nguyen and Vien Anh Ngo},
  journal={arXiv preprint},
  year={2026},
  url={https://clpvla.github.io/}
}