Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

1Center for AI Research, VinUniversity   2VinRobotics   3University of Arkansas   4Technical University of Denmark   5Hanoi University of Science and Technology   6KAIST   7Monash University   8Oldenburg University   9DFKI   10University of Stuttgart   11IMPRS-IS   12Stanford University   13Technische Universität Darmstadt  
Project Leads.


Overview of the proposed CLP framework. CLP prunes representationally redundant transformer layers via CKA, reducing network depth by up to 66% and training/inference cost by up to 50%. Fine-tuning restores the latent geometry of the compressed model, enabling competitive performance across three simulation benchmarks, 10 real-world tasks, and four robotic embodiments.

Abstract

VLA models are far more compressible than we assumed.

Validated on

NVIDIA GR00T-N1.5
π0
SmolVLA

Inference: up to 66% of layers pruned

Memory: footprint reduced by 33%

Training: 40–50% time and cost saved

Despite their remarkable capabilities, Vision-Language-Action (VLA) models impose prohibitive computational costs at fine-tuning and deployment time. We reveal a non-trivial architectural property of leading continuous-control foundation models (π0, GR00T-N1.5): despite training on diverse physical trajectories, these models exhibit severe layer-wise representational redundancy.

To exploit this, we introduce CLP, a training-free structural compression pipeline. From a single forward pass, CLP computes Centered Kernel Alignment (CKA) between layers, then identifies and removes redundant twin layers, permanently reducing model depth by up to 50% across both the VLM backbone and the continuous-control policy head, without any auxiliary routing or early-exit modules.

Downstream fine-tuning of the streamlined architecture delivers a dual benefit: 40–50% faster training and up to 30% faster real-time inference, while matching or exceeding full-scale performance. We validate across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 real-world manipulation tasks spanning 4 robotic embodiments, indicating that advanced VLAs require significantly fewer layers than previously assumed.

Contributions

Three concrete advances in efficient VLA adaptation.

1

Accessible adaptation of SOTA VLAs

π0 and GR00T-N1.5 fine-tuned at reduced depth — lower memory, training, and inference cost with no auxiliary modules.

2

Pre-finetuning compression via CLP

CKA identifies redundant transformer blocks and removes them entirely before adaptation — no retraining required.

3

Multi-embodiment validation

Validated on 3 sim benchmarks and 10 real-world tasks across 4 robots. CLP is platform-agnostic and acts as an effective structural regularizer under limited data.

Method

One forward pass. No extra modules. Up to 50% faster training.


CKA-Guided Layer Pruning (CLP)

01

CKA Profiling

Feed sample batches through the pretrained VLA and compute pairwise CKA similarity across all layers.

02

Block Identification

Group consecutive high-similarity layers (CKA ≥ τ) into redundant blocks; keep only the first layer of each.

03

Training-free Compression

Prune 33–66% of transformer layers with ~30% FLOP reduction — no routing, no auxiliary modules.

04

Fine-tune & Deploy

Fine-tune the pruned model on the target task. Training time drops by up to 50%; inference is up to 30–50% faster.
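
Steps 01 and 02 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes linear CKA (the page does not state which CKA variant is used), assumes each layer's activations have already been flattened to a (samples, features) matrix, and assumes a layer joins a block when its CKA with the block's first layer is at least τ. The function names linear_cka, cka_matrix, and layers_to_keep are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(activations: list[np.ndarray]) -> np.ndarray:
    """Pairwise CKA over per-layer activations (step 01)."""
    n = len(activations)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(activations[i], activations[j])
    return sim


def layers_to_keep(sim: np.ndarray, tau: float) -> list[int]:
    """Indices of the first layer of each redundant block (step 02).

    Assumption: a layer extends the current block while its CKA with the
    block's first layer stays >= tau; otherwise it starts a new block.
    """
    keep = [0]
    for layer in range(1, sim.shape[0]):
        if sim[keep[-1], layer] < tau:
            keep.append(layer)
    return keep
```

Given a list of transformer blocks `layers`, the pruned stack of step 03 would then be `[layers[i] for i in layers_to_keep(sim, tau)]`. The value of τ is a tuning choice that the page does not specify.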

Key Findings

What we discovered inside large VLA models.

CKA similarity heatmaps

Finding 01

VLA layers are highly redundant

CKA heatmaps reveal large contiguous blocks in π0 and GR00T-N1.5 producing nearly identical representations — spanning the VLM backbone, action expert, and DiT heads.

Layer pruning ratio vs success rate

Finding 02

Up to 2/3 of layers can be pruned with minimal loss

Both π0 and GR00T-N1.5 maintain flat performance up to 50% pruning, achieving a 1.39–1.42× FLOPs reduction with a negligible drop in success rate.

Training time comparison

Finding 03

Trains faster — and often performs better

CLP-pruned models train significantly faster while often matching or exceeding baseline success rates, e.g. 75% vs 65% on BananaToPot with training time cut from 5.1 h to 2.9 h.

Hidden state PCA

Finding 04

Pruned models recover base model behavior after finetuning

Hidden state PCA shows representations realign closely with the base model post-finetuning — confirming CLP preserves representational structure despite removing up to half the layers.
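
One way such a comparison could be set up is sketched below. The page does not describe how the PCA was computed, so this is an assumption: the principal axes are fitted on the base model's hidden states and both models are projected into that shared basis. The function name pca_project is hypothetical, and it assumes both models' hidden states have the same feature dimension.

```python
import numpy as np


def pca_project(reference: np.ndarray, other: np.ndarray, k: int = 2):
    """Fit PCA on `reference` (n_samples, dim) and project both sets onto
    its top-k principal axes so they can be plotted in one coordinate frame."""
    mean = reference.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    basis = vt[:k].T  # shape (dim, k)
    return (reference - mean) @ basis, (other - mean) @ basis
```

If the fine-tuned pruned model has recovered the base model's behavior, its projected points should fall close to the base model's points for the same inputs.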


Experiments & Results

Validated across simulation and 4 real-world robot embodiments.

Real-world Manipulation

UR10 — Single Arm

Groceries ToBasket

Open Kettle

Close Kettle

UR5 — Single Arm

Serve Napkin

ScrewdriverToBasket

ALOHA — Single Arm

BananaToPot

CubeToDrawer

Block Stacking

ALOHA — Bimanual

Fold Shorts

Fly Towel

Task success rate (%) and training efficiency across 4 robot embodiments.

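| Robot | Task | SR (%) | Time (h) | SR-CLP (%) | Time-CLP (h) | Saved |
| --- | --- | --- | --- | --- | --- | --- |
| UR10, Single Arm | Groceries ToBasket | 90 | 11.8 | 89 | 8.0 | 32% |
| UR10, Single Arm | Open Kettle | 100 | 2.8 | 95 | 1.4 | 50% |
| UR10, Single Arm | Close Kettle | 100 | 3.0 | 100 | 1.5 | 50% |
| UR5, Single Arm | Serve Napkin | 45 | 1.1 | 65 | 0.7 | 36% |
| UR5, Single Arm | ScrewdriverToBasket | 15 | 1.5 | 30 | 1.1 | 27% |
| ALOHA, Single Arm | BananaToPot | 65 | 5.1 | 75 | 2.9 | 43% |
| ALOHA, Single Arm | CubeToDrawer | 75 | 5.6 | 60 | 3.2 | 43% |
| ALOHA, Single Arm | Block Stacking | 80 | 2.8 | 75 | 1.4 | 50% |
| ALOHA, Bimanual | Fold Shorts | 90 | 6.5 | 95 | 4.4 | 32% |
| ALOHA, Bimanual | Fly Towel | 75 | 3.2 | 70 | 2.1 | 34% |
| Average / Total | | 73.5 | 43.4 | 75.9 | 27.7 | 36% |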
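
The Saved column appears to be the relative reduction in training time. A minimal sketch of that arithmetic, using a hypothetical helper name and two rows from the table (Groceries ToBasket and Close Kettle):

```python
def time_saved_percent(baseline_hours: float, clp_hours: float) -> float:
    """Relative training-time reduction of the pruned model, in percent."""
    return 100.0 * (baseline_hours - clp_hours) / baseline_hours


print(round(time_saved_percent(11.8, 8.0)))  # 32
print(round(time_saved_percent(3.0, 1.5)))   # 50
```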

Simulation Experiments

LIBERO

RoboCasa

SimplerEnv

LIBERO benchmark — comparison with training-free acceleration methods.

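| Group | Method | Spatial | Object | Goal | Long | Avg SR | Speedup ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | FastV | 94.6 | 95.8 | 94.0 | 88.8 | 93.3 | 1.44× |
| OpenVLA-OFT | DivPrune | 92.4 | 91.2 | 89.0 | 84.8 | 89.4 | 1.46× |
| OpenVLA-OFT | EfficientVLA | 96.5 | 91.1 | 96.0 | 72.1 | 88.9 | 1.52× |
| π0 | π0 | 94.6 | 98.2 | 95.4 | 90.0 | 94.6 | 1.00× |
| π0 | π0-SpecPrune-VLA | 96.6 | 98.0 | 95.2 | 84.2 | 93.5 | 1.31× |
| π0 | π0-CLP (Ours) | 95.0 | 99.2 | 95.0 | 86.4 | 93.9 | 1.39× |
| GR00T-N1.5 | GR00T-N1.5 | 90.8 | 98.4 | 95.4 | 91.0 | 93.9 | 1.00× |
| GR00T-N1.5 | GR00T-N1.5-CLP (Ours) | 89.4 | 98.8 | 95.8 | 88.6 | 93.0 | 1.42× |
| SmolVLA | SmolVLA | 71.8 | 92.2 | 87.4 | 57.2 | 77.15 | 1.00× |
| SmolVLA | SmolVLA-CLP (Ours) | 75.6 | 93.0 | 81.6 | 56.2 | 76.75 | 1.47× |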

GR00T-N1.5 on SimplerEnv (WidowX) — task success rate (%).

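| Task | GR00T-N1.5 | GR00T-N1.5-CLP (Ours) |
| --- | --- | --- |
| Carrot Plate | 26 | 34 ↑8% |
| Eggplant Basket | 34 | 14 |
| Spoon Towel | 18 | 38 ↑20% |
| Stack Cube | 8 | 4 |
| Eggplant Sink | 8 | 16 ↑8% |
| Close Drawer | 12 | 24 ↑12% |
| Open Drawer | 10 | 10 |
| Average | 16.57 | 20.0 ↑3.43% |
| Training Time (hrs) | 22.9 | 15.7 ↓31% |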

π0 on LIBERO — trained on 10% data (% Success Rate).

π0 on RoboCasa — 30 demonstrations (% Success Rate).

Citation

BibTeX

@article{NguyenGiaBinh2026FewerLayers,
  title={Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think},
  author={Nguyen Gia Binh and Trong-Bao Ho and Thien-Loc Ha and Khoa Vo and Philip Lund M{\o}ller and Quang Tan Nguyen and Long Dinh and Tung Minh Luu and Tuan Quang Dam and Vu N. Duong and Trung Le and Nghi D. Q. Bui and Minh Nhat Vu and Tran Nguyen Le and An Thai Le and Ngan Le and Daniel Sonntag and James Zou and Jan Peters and Duy Minh Ho Nguyen and Vien Anh Ngo},
  journal={arXiv preprint},
  year={2026},
  url={https://clpvla.github.io/}
}