MiniVLA-(01):基于 Qwen2.5-VL 的端到端自动驾驶轨迹预测 Demo
问题解释
我还是不太明白:上面的案例是怎么用 Qwen-VLA(或其他模型)预测未来轨迹的?未来轨迹的距离数值是怎么推断出来的?请详细解释。
这个问题问到了核心。我来从最直觉的层面讲清楚"轨迹是怎么算出来的"。
先澄清一个关键误解
模型不是在"推算距离",它是在做模式匹配。
打个比方:你教一个小孩骑自行车。你不会给他讲物理公式(速度×时间=距离),你只是让他反复骑,摔了爬起来再骑。骑了几百次后,他看到前方有个弯道,身体会"自动"知道该怎么转弯——他不是在计算,他是在回忆过去类似场景下"正确的动作是什么"。
这段代码的训练过程和这个道理完全一样。
一、训练阶段:老师怎么教的
训练的本质是一个"出题→答题→对答案→纠错"的循环。
出题:给模型看一张前方道路的照片,加一句指令"直行",问它"接下来车该怎么走?"
答题:模型输出 6 个坐标点,比如 [(0.5, 0.0), (1.1, 0.0), (1.8, 0.1), ...]
对答案:从 nuScenes 数据集里拿出人类司机当时实际走过的轨迹,比如 [(0.8, 0.0), (1.6, 0.0), (2.4, 0.0), ...]
纠错:计算预测和真实之间的差距(loss),然后调整模型参数,让下次预测更接近真实。
重复这个过程成千上万次,模型就学会了:"看到这种场景+这种指令,应该输出什么样的轨迹坐标。"
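这个"出题→答题→对答案→纠错"的循环,可以用一个极简的玩具例子体会(示意代码:用一个线性层拟合随机生成的"轨迹"数据,并非本文完整模型):

```python
import torch
import torch.nn as nn

# 玩具设定(假设):8 维特征代替"场景理解",输出 12 个数 = 6 个点 × (x, y)
torch.manual_seed(0)
feats = torch.randn(100, 8)            # 100 条"场景特征"
true_w = torch.randn(8, 12)
targets = feats @ true_w               # 100 条"人类司机真实轨迹"(人造答案)

head = nn.Linear(8, 12)                # 一个极简"轨迹头"
loss_fn = nn.SmoothL1Loss()
opt = torch.optim.Adam(head.parameters(), lr=1e-2)

first_loss = None
for step in range(200):                # 反复"答题→对答案→纠错"
    pred = head(feats)                 # 答题:预测轨迹
    loss = loss_fn(pred, targets)      # 对答案:和真实轨迹比差距
    opt.zero_grad()
    loss.backward()                    # 纠错:计算梯度
    opt.step()                         # 更新参数,让下次预测更接近真实
    if first_loss is None:
        first_loss = loss.item()

print(first_loss > loss.item())        # 训练后损失比初始更小
```

真实训练和这个循环结构完全一样,只是"特征"换成了 VLM 对图像+指令的理解,"答案"换成了 nuScenes 里人类司机的真实轨迹。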
二、真实轨迹从哪来?这才是关键
你问的"距离怎么算出来",答案藏在数据集的真实轨迹里。我把代码这段展开讲:
# nuScenes 数据里记录了采集车在每个时刻的全局位置
# 比如:
# t=0 时刻,车在 (1000.0, 500.0) ← 当前位置
# t=0.5s 时,车在 (1001.2, 500.1) ← 0.5秒后
# t=1.0s 时,车在 (1002.5, 500.3) ← 1.0秒后
# t=1.5s 时,车在 (1003.9, 500.8)
# t=2.0s 时,车在 (1005.4, 501.5)
# t=2.5s 时,车在 (1007.0, 502.4)
# t=3.0s 时,车在 (1008.8, 503.6) ← 3.0秒后
这些全局坐标是怎么来的?是真实采集车上的高精度 GPS + IMU(惯性测量单元)在实际行驶过程中记录下来的。采集车在波士顿和新加坡街头开了好几百公里,每一帧都记录了精确的位置。
然后代码做了归一化——减掉当前位置,变成相对坐标:
# 减去当前位置 (1000.0, 500.0)
# 相对轨迹变成(假设车头朝 x 轴正方向,y 正方向为左):
# (0.0, 0.0)  ← 原点(当前位置)
# (1.2, 0.1)  ← 前方 1.2 米,向左偏 0.1 米
# (2.5, 0.3)  ← 前方 2.5 米,向左偏 0.3 米
# (3.9, 0.8)
# (5.4, 1.5)
# (7.0, 2.4)
# (8.8, 3.6)  ← 前方 8.8 米,向左偏 3.6 米
所以这些数字的单位是米,代表的是"相对于当前车辆位置,未来每个时间步车应该在哪里"。
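上面的"全局坐标 → 相对坐标"归一化,用 NumPy 可以直接复现(数值沿用上文示例):

```python
import numpy as np

# 采集车在各时刻的全局位置(米),第一行是当前位置
global_traj = np.array([
    [1000.0, 500.0],   # t=0,当前位置
    [1001.2, 500.1],   # t=0.5s
    [1002.5, 500.3],   # t=1.0s
    [1003.9, 500.8],
    [1005.4, 501.5],
    [1007.0, 502.4],
    [1008.8, 503.6],   # t=3.0s
])

# 归一化:每个点减去当前位置,得到"相对于当前车的位移"
relative = global_traj - global_traj[0]

print(relative[1])    # 0.5 秒后:前方 1.2 米、偏 0.1 米
print(relative[-1])   # 3.0 秒后:前方 8.8 米、偏 3.6 米
```

减法之后数字一下子变小、变"可学习"了——模型拟合的目标从上千的全局坐标变成了几米量级的相对位移。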
三、模型预测时到底发生了什么
训练完成后,模型推理的过程是这样的:
输入:一张新的道路照片 + "turn left"
│
▼
┌─────────────────────────┐
│ Qwen2.5-VL(冻结) │
│ │
│ "我看到一条右弯的路, │
│ 路上有两辆车, │
│ 指令说要左转" │
│ │
│ → 压缩成 2048 个数字 │ ← 特征向量
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ 轨迹头(训练好的 MLP) │
│ │
│ 2048 个数字 │
│ → 512 个数字 │
│ → 256 个数字 │
│ → 12 个数字 │ ← 就是 6×2 的轨迹坐标
└─────────────────────────┘
│
▼
输出:[(0.8, 0.3), (1.5, 0.9), (2.1, 1.8),
(2.6, 2.9), (3.0, 4.1), (3.2, 5.4)]
含义:未来6个时间步,车应该
往前走3.2米,同时向左偏5.4米(左转)
轨迹头本质上就是一个函数:它接收 2048 个数字(场景理解的浓缩),输出 12 个数字(6个点的 xy 坐标)。这个函数的参数是通过训练学来的。
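这个"2048 → 512 → 256 → 12"的轨迹头可以单独写出来验证形状(示意代码,维度沿用上文,不含训练好的权重):

```python
import torch
import torch.nn as nn

# 轨迹头:把 2048 维特征逐层压缩成 12 个数字
traj_head = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 12),            # 12 个数字 = 6 个轨迹点 × (x, y)
)

feature = torch.randn(1, 2048)     # 假设这是 VLM 输出的特征向量
traj = traj_head(feature).view(-1, 6, 2)   # reshape 成 6 个 (x, y) 点
print(traj.shape)                  # torch.Size([1, 6, 2])
```

未训练时输出是随机数;训练的全部意义,就是把这几千个权重调到"输出像人类司机轨迹"为止。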
四、一个具体的直觉类比
想象你是出租车调度员,桌上有一块屏幕(摄像头画面)和一个对讲机(驾驶指令)。
训练阶段:你跟车跑了 10000 趟,每趟都记下"看到这种路况 + 听到这种指令 → 司机实际走了什么路线"。跑多了以后,你脑子里形成了模式——"噢,看到前方有红灯+指令说右转,司机一般会先减速直行 3 米,然后右偏 5 米绕过去"。
推理阶段:新来一个路况画面+一条指令,你不需要计算物理公式,你直接从经验里调出最匹配的模式,说出"往前走 X 米,往右偏 Y 米"。
模型做的事情和你一模一样,只不过你的"经验"存在神经元里,模型的"经验"存在轨迹头的权重矩阵里。
五、那具体的距离数字是怎么"冒出来的"?
轨迹头最后一层是:
nn.Linear(256, 12) # 256个输入 → 12个输出
这个 Linear 层的数学本质是:
output[0] = w[0,0]*x[0] + w[0,1]*x[1] + ... + w[0,255]*x[255] + b[0]
output[1] = w[1,0]*x[0] + w[1,1]*x[1] + ... + w[1,255]*x[255] + b[1]
...
output[11] = w[11,0]*x[0] + ... + w[11,255]*x[255] + b[11]
其中 w 和 b 是训练学来的参数(共 256×12 + 12 = 3084 个数字),x 是上一层传来的 256 维特征。
所以 output[0] 就是第一个轨迹点的 x 坐标,output[1] 是第一个轨迹点的 y 坐标,以此类推。每个坐标值都是 256 个特征的加权求和——哪些特征权重大,取决于训练过程中哪些特征和"正确轨迹"的相关性最强。
距离数值不是"推算"出来的,是"拟合"出来的。 模型通过调整几千个权重参数,让这个加权求和的结果尽可能接近真实轨迹的坐标值。
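"加权求和"这件事可以直接验证:`nn.Linear` 的输出就是 W·x + b,参数量也正好是上文算的 3084 个:

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 12)         # 256 个输入 → 12 个输出
x = torch.randn(256)

auto_out = layer(x)                          # PyTorch 自动计算
manual_out = layer.weight @ x + layer.bias   # 手动做加权求和

# 两种算法结果一致(数值误差内)
print(torch.allclose(auto_out, manual_out, atol=1e-6))

# 参数量:256×12 个权重 + 12 个偏置 = 3084
print(sum(p.numel() for p in layer.parameters()))
```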
六、和 Tesla FSD 的关系
Tesla FSD 的核心思路和这段代码是一脉相承的,只是规模和复杂度完全不同:
| 维度 | 这段代码 | Tesla FSD |
|---|---|---|
| 输入 | 1个摄像头 | 8个摄像头+时序 |
| 视觉模型 | Qwen2.5-VL (通用VLM) | 定制视觉骨干网络 |
| 指令 | 3种文字指令 | 导航路径+交通规则 |
| 轨迹输出 | 6个点 | 连续轨迹+速度+加速度 |
| 训练数据 | 100条 mini | 数十亿帧真实驾驶 |
| 安全机制 | 无 | 碰撞检测+规则约束 |
但本质都是:看到场景 → 理解场景 → 输出轨迹。这段代码用不到 200 行展示了这个核心 pipeline,这就是它作为教学示例的价值。
代码
使用方法:
# 1. 先安装 bitsandbytes(4-bit 量化依赖)
pip install bitsandbytes
# 2. 修改代码中的数据集路径
# NUSCENES_ROOT = "/home/lionsking/data/nuscenes"
# 3. 训练
python mini_vla_nuscenes_fixed.py
# 4. 训练完成后,推理演示
python mini_vla_nuscenes_fixed.py demo
mini_vla_nuscenes_fixed.py 代码:
"""
╔══════════════════════════════════════════════════════════════════════════╗
║ MiniVLA:基于 Qwen2.5-VL 的端到端自动驾驶轨迹预测 ║
║ Vision-Language-Action (VLA) 教学示例 ║
║ ║
║ 功能:用视觉语言模型理解道路场景 + 驾驶指令 → 预测未来行驶轨迹 ║
║ 适配:8GB 显卡(RTX 5060 / RTX 3060 / RTX 4060 等) ║
║ ║
║ 核心修复(相对原版): ║
║ 1. 4-bit 量化加载 VLM,解决 8GB 显存 OOM ║
║ 2. 动态获取 hidden_size,不再硬编码错误维度 ║
║ 3. 修复轨迹加载 bug(原版 token 查找逻辑错误导致轨迹全零) ║
║ 4. 修复指令生成逻辑(改用相邻帧航向差而非绝对航向角) ║
║ 5. 添加速度估算(基于相邻帧位移,不再硬编码 5.0) ║
║ 6. 添加 GPU/CPU 自适应的 device_map 策略 ║
╚══════════════════════════════════════════════════════════════════════════╝
"""
import os
# =====================================================================
# 【第零步:环境变量 —— 必须在所有 import 之前设置】
#
# 为什么要放最前面?
# 因为 transformers 库在 import 时就会读取这些环境变量来决定
# 从哪个服务器下载模型。如果放在 import 之后,就来不及了。
# =====================================================================
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com' # 国内镜像,解决 HuggingFace 下载慢/断连
os.environ["TRANSFORMERS_OFFLINE"] = "0" # 允许在线下载(设为 "1" 则纯离线模式)
os.environ['HUGGINGFACE_HUB_CACHE'] = './model_cache' # 模型缓存目录,避免重复下载
os.environ['TRANSFORMERS_CACHE'] = './model_cache' # 同上,兼容旧版 transformers
os.environ['HF_HUB_DOWNLOAD_TIMEOUT'] = '600' # 下载超时 10 分钟,防止大模型下载中断
# =====================================================================
# 【第一步:导入依赖库】
# =====================================================================
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
Qwen2_5_VLForConditionalGeneration,
AutoProcessor,
BitsAndBytesConfig, # ← 新增:4-bit 量化配置
)
from nuscenes import NuScenes
from pyquaternion import Quaternion
import numpy as np
from PIL import Image
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
# =====================================================================
# 【第二步:数据集类 —— 从 nuScenes 加载训练数据】
#
# 核心职责:
# 给定一个索引 idx,返回 (图像, 驾驶指令, 车速, 真实轨迹)
# 这四样东西构成一条完整的训练样本
#
# 数据流:
# nuScenes 数据集(真实道路采集)
# ↓
# CAM_FRONT 前视摄像头图片 → 模型的"眼睛"
# ego_pose 自车位姿 → 提取指令/速度/轨迹
# =====================================================================
class NuScenesTrajDataset(Dataset):
"""
nuScenes 驾驶数据集封装
每条样本包含:
- image: 前视摄像头 RGB 图像(PIL Image)
- command: 驾驶指令字符串("drive straight" / "turn left" / "turn right")
- speed: 自车速度 (m/s)
- traj: 未来轨迹,形状 (seq_len, 2),单位:米,相对于当前位置
"""
def __init__(self, nusc_root, nusc_version='v1.0-mini', seq_len=6, max_samples=100):
"""
参数:
nusc_root: nuScenes 数据集根目录
nusc_version: 数据集版本('v1.0-mini' 只有 10 个场景,适合调试)
seq_len: 预测未来多少个轨迹点(默认 6 个,约 3 秒)
max_samples: 最多加载多少条数据(调试用,正式训练可设为 None)
"""
# ------- 加载 nuScenes 元数据(索引表,不是图片本身) -------
self.nusc = NuScenes(version=nusc_version, dataroot=nusc_root, verbose=True)
self.seq_len = seq_len
# ------- 构建 token 列表和快速查找集合 -------
# token 是 nuScenes 中每条数据的唯一 ID(32位十六进制字符串)
# 类似数据库的主键,所有数据通过 token 互相关联
self.sample_tokens = [s['token'] for s in self.nusc.sample]
# 【修复】构建 token 集合,用于 O(1) 快速查找
# 原版错误:用 `token in self.nusc.sample`(在 list of dict 中查字符串,永远 False)
self.token_set = set(self.sample_tokens)
# 【修复】构建 token → sample 字典,避免重复调用 nusc.get()
self.token_to_sample = {s['token']: s for s in self.nusc.sample}
# 限制数据量(调试用)
if max_samples is not None:
self.sample_tokens = self.sample_tokens[:max_samples]
print(f"✅ 数据集加载完成,共 {len(self.sample_tokens)} 条样本")
def __len__(self):
"""DataLoader 需要知道数据集总共有多少条"""
return len(self.sample_tokens)
def _get_ego_pose(self, sample_token):
"""
辅助方法:根据 sample token 获取自车位姿
返回:
ego_pose dict,包含:
- translation: [x, y, z] 全局坐标(米)
- rotation: [w, x, y, z] 四元数表示的朝向
什么是四元数(Quaternion)?
三维空间中表示旋转的方式之一。相比欧拉角(yaw/pitch/roll),
四元数没有万向节死锁(Gimbal Lock)问题,数学运算也更稳定。
你可以简单理解为:它用 4 个数字编码了"物体面朝哪个方向"。
"""
sample = self.token_to_sample[sample_token]
cam_data = self.nusc.get('sample_data', sample['data']['CAM_FRONT'])
ego_pose = self.nusc.get('ego_pose', cam_data['ego_pose_token'])
return ego_pose
def __getitem__(self, idx):
"""
核心方法:返回第 idx 条训练样本
PyTorch 的 DataLoader 会反复调用这个方法:
- 训练时,DataLoader 自动处理 batch、shuffle、多进程加载
- 每次调用返回一条样本,DataLoader 把多条样本拼成一个 batch
"""
sample_token = self.sample_tokens[idx]
sample = self.token_to_sample[sample_token]
# ================================================================
# 2.1 加载前视摄像头图像
# ================================================================
# nuScenes 数据组织方式:
# sample(一个时间戳的完整数据)
# └── sample_data(某个传感器在该时间戳的数据)
# └── filename(图片文件相对路径)
#
# CAM_FRONT 是前视摄像头,分辨率 1600×900
cam_data = self.nusc.get('sample_data', sample['data']['CAM_FRONT'])
cam_path = os.path.join(self.nusc.dataroot, cam_data['filename'])
image = Image.open(cam_path).convert('RGB')
# ================================================================
# 2.2 生成驾驶指令(直行 / 左转 / 右转)
# ================================================================
# 【修复】原版用当前帧的绝对 yaw 角判断指令,这是错误的——
# 绝对 yaw 表示车头在全局坐标系中的朝向,跟"当前是否在转弯"无关。
# 正确做法:用相邻帧的 yaw 差值(航向变化率)来判断。
#
# 举例:车头一直朝东(yaw≈0)在直行 → 原版判断为直行 ✓
# 车头一直朝北(yaw≈π/2)在直行 → 原版判断为左转 ✗
# 用 yaw 差值的话,两种情况差值都≈0,正确判断为直行 ✓
current_pose = self._get_ego_pose(sample_token)
current_yaw = Quaternion(current_pose['rotation']).yaw_pitch_roll[0]
# 检查下一帧是否存在
next_token = sample.get('next', '')
if next_token and next_token in self.token_set:
next_pose = self._get_ego_pose(next_token)
next_yaw = Quaternion(next_pose['rotation']).yaw_pitch_roll[0]
# 计算航向变化量(弧度)
# yaw_diff > 0 表示逆时针旋转(左转)
# yaw_diff < 0 表示顺时针旋转(右转)
yaw_diff = next_yaw - current_yaw
# 处理角度跨越 ±π 的情况(比如从 179° 转到 -179° 实际只转了 2°)
if yaw_diff > np.pi:
yaw_diff -= 2 * np.pi
elif yaw_diff < -np.pi:
yaw_diff += 2 * np.pi
if yaw_diff > 0.05:
command = "turn left"
elif yaw_diff < -0.05:
command = "turn right"
else:
command = "drive straight"
else:
command = "drive straight" # 最后一帧默认直行
# ================================================================
# 2.3 估算自车速度
# ================================================================
# 【修复】原版硬编码 speed = 5.0,这里用相邻帧的位移估算
#
# 速度 = 距离 / 时间
# nuScenes 的采样间隔约 0.5 秒(2Hz)
if next_token and next_token in self.token_set:
next_pos = np.array(next_pose['translation'][:2])
curr_pos = np.array(current_pose['translation'][:2])
distance = np.linalg.norm(next_pos - curr_pos) # 欧几里得距离
dt = 0.5 # nuScenes 采样间隔约 0.5 秒
speed = distance / dt
else:
speed = 0.0
# ================================================================
# 2.4 加载未来真实轨迹(标签 / Ground Truth)
# ================================================================
#
# 这是"正确答案"——人类司机实际走过的路线。
# 训练时,模型的预测轨迹会和这个真实轨迹对比,算出误差(loss)。
#
# 数据结构:(seq_len, 2) = (6, 2)
# 6 个未来时间步,每步一个 (x, y) 坐标
# 单位:米,相对于当前车辆位置
#
# nuScenes 的 sample 通过 'next' 字段形成链表:
# sample_0 --next--> sample_1 --next--> sample_2 --next--> ...
# 每个 sample 间隔约 0.5 秒,6 个点覆盖约 3 秒的未来
traj = np.zeros((self.seq_len, 2), dtype=np.float32)
walk_token = sample_token
for i in range(self.seq_len):
# 沿链表往后走一步
if walk_token not in self.token_set:
break # 到达数据集末尾,剩余轨迹点保持为零
walk_sample = self.token_to_sample[walk_token]
next_walk = walk_sample.get('next', '')
if next_walk and next_walk in self.token_set:
future_pose = self._get_ego_pose(next_walk)
traj[i] = [
future_pose['translation'][0],
future_pose['translation'][1]
]
walk_token = next_walk
else:
# 没有下一帧了:用最后一个已知位置(或当前位置)填充剩余轨迹点
# 【修复】若只填 traj[i] 就 break,后面的点保持为零(全局原点),
# 归一化后会变成 (-x, -y) 的巨大异常值
if i > 0:
traj[i:] = traj[i - 1]
else:
traj[i:] = current_pose['translation'][:2]
break
# ------- 轨迹归一化:全局坐标 → 相对坐标 -------
#
# 为什么要归一化?
# 全局坐标可能是 (1035.2, 567.8) 这样的大数
# 但模型不需要知道"我在地球上的哪个位置"
# 它只需要知道"接下来往哪走"(相对位移)
#
# 做法:每个轨迹点减去当前位置,再按当前航向角旋转到自车坐标系
# 旋转之后 x 轴指向车头前方、y 轴指向车身左侧,
# (1.2, 0.1) 才真正表示"往前 1.2 米,往左偏 0.1 米"
# (只减位置不旋转的话,得到的是全局坐标系下的位移,方向与车头朝向无关)
current_xy = np.array(current_pose['translation'][:2], dtype=np.float32)
rel = traj - current_xy[None, :] # None 增加一个维度用于广播
cos_y, sin_y = np.cos(current_yaw), np.sin(current_yaw)
traj = np.stack([
rel[:, 0] * cos_y + rel[:, 1] * sin_y, # 前向分量
-rel[:, 0] * sin_y + rel[:, 1] * cos_y, # 侧向分量(左为正)
], axis=1).astype(np.float32)
return image, command, np.float32(speed), torch.from_numpy(traj)
# =====================================================================
# 【第三步:模型定义 —— MiniVLA】
#
# 架构:
# ┌─────────────────────────────────────────────────┐
# │ Qwen2.5-VL(冻结,4-bit 量化) │
# │ ↓ 理解图像 + 文字指令 │
# │ ↓ 输出 hidden_size 维的特征向量 │
# ├─────────────────────────────────────────────────┤
# │ 轨迹预测头 traj_head(可训练的 MLP) │
# │ ↓ hidden_size → 512 → 256 → 12 │
# │ ↓ 12 个数字 = 6 个轨迹点 × 2 个坐标 (x, y) │
# └─────────────────────────────────────────────────┘
#
# 为什么要冻结 VLM?
# 1. 省显存:不需要存储 VLM 参数的梯度(几个 GB 的差距)
# 2. 防遗忘:VLM 在海量数据上学到的视觉理解能力不会被破坏
# 3. 收敛快:只训练几千个参数的轨迹头,比训练几十亿参数快得多
# =====================================================================
class MiniVLA(nn.Module):
def __init__(self, vlm_name='Qwen/Qwen2.5-VL-3B-Instruct',
num_waypoints=6, freeze_vlm=True, use_4bit=True):
"""
参数:
vlm_name: HuggingFace 上的模型名称
num_waypoints: 预测的轨迹点数量
freeze_vlm: 是否冻结 VLM 参数
use_4bit: 是否使用 4-bit 量化(8GB 显卡必须开启)
"""
super().__init__()
print(f"✅ 加载模型: {vlm_name}")
print(f" 量化模式: {'4-bit (省显存)' if use_4bit else 'fp16 (需要大显存)'}")
# ---------------------------------------------------------------
# 3.1 加载预训练 VLM
# ---------------------------------------------------------------
if use_4bit:
# 【核心修复】4-bit 量化配置
#
# 什么是量化?
# 正常模型参数用 float16 存储,每个参数占 2 字节
# 3B 参数 × 2 字节 = 6GB 显存
# 4-bit 量化:每个参数只用 0.5 字节
# 3B 参数 × 0.5 字节 = 1.5GB 显存
#
# NF4 (NormalFloat 4-bit):
# 专门为神经网络设计的 4-bit 数据类型
# 因为神经网络的权重通常呈正态分布,NF4 针对这种分布优化了精度
#
# double_quant(双重量化):
# 对量化参数本身再做一次量化,进一步节省显存
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # 启用 4-bit 量化
bnb_4bit_compute_dtype=torch.float16, # 计算时用 fp16(精度和速度的平衡)
bnb_4bit_use_double_quant=True, # 双重量化,额外省约 0.4GB
bnb_4bit_quant_type="nf4", # NF4 量化类型
)
self.vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
vlm_name,
quantization_config=quantization_config,
device_map="auto", # 自动分配到 GPU/CPU
trust_remote_code=True,
)
else:
# 不量化,直接 fp16 加载(需要 >= 16GB 显存)
self.vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
vlm_name,
torch_dtype=torch.float16,
trust_remote_code=True,
)
# 加载对应的数据预处理器(Processor)
# Processor 负责:
# - 图片:缩放、归一化像素值到 [0,1]
# - 文字:分词(tokenize),把自然语言变成数字 ID 序列
# - 对齐:把图片 token 和文字 token 拼在一起
self.processor = AutoProcessor.from_pretrained(
vlm_name, trust_remote_code=True
)
# ---------------------------------------------------------------
# 3.2 冻结 VLM 参数
# ---------------------------------------------------------------
if freeze_vlm:
for p in self.vlm.parameters():
p.requires_grad = False
print(" VLM 参数已冻结,只训练轨迹头")
# ---------------------------------------------------------------
# 3.3 构建轨迹预测头(这是唯一需要训练的部分)
# ---------------------------------------------------------------
# 【修复】动态获取 hidden_size,而不是硬编码
#
# 原版的 bug:
# 代码注释写着 "3B = 2048",但实际设置的是 hidden = 3584(7B 的维度)
# 加载 3B 模型 + 3584 维度 = 维度不匹配,运行时报错
#
# 正确做法:从模型配置中自动读取
# - Qwen2.5-VL-3B: hidden_size = 2048
# - Qwen2.5-VL-7B: hidden_size = 3584
hidden = self.vlm.config.hidden_size
print(f" VLM hidden_size = {hidden}")
# MLP(多层感知机)结构:
# hidden → 512:降维,把高维特征压缩
# ReLU:激活函数,引入非线性(没有非线性的话,多层线性等价于一层)
# Dropout(0.1):训练时随机丢弃 10% 的神经元,防止过拟合
# 512 → 256:进一步降维
# 256 → num_waypoints*2:输出层,6个点×2个坐标 = 12 个数字
self.traj_head = nn.Sequential(
nn.Linear(hidden, 512),
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, num_waypoints * 2), # 输出 12 个数字
)
self.num_waypoints = num_waypoints
# 统计可训练参数量
trainable = sum(p.numel() for p in self.traj_head.parameters())
total = sum(p.numel() for p in self.parameters())
print(f" 可训练参数: {trainable:,} / 总参数: {total:,}")
print(f" 训练比例: {trainable/total*100:.4f}%")
def forward(self, images, driving_commands, ego_speeds):
"""
前向传播:图像 + 指令 → 预测轨迹
参数:
images: PIL Image 列表,长度 = batch_size
driving_commands: 字符串列表,如 ["drive straight", "turn left"]
ego_speeds: 速度数组,形状 (batch_size,)
返回:
pred_traj: 预测轨迹,形状 (batch_size, 6, 2)
数据流:
图片 + 文字 → Processor → token IDs
token IDs → VLM → 每层的隐藏状态
最后一层最后一个 token → 轨迹头 → 12 个数字 → reshape 成 (6, 2)
"""
# ---- 构建文本 prompt ----
# 把驾驶指令和速度拼成自然语言,这就是 VLA 中 "Language" 的部分
# VLM 会同时理解这段文字和图片,产生融合的特征
prompts = [
f'Command: {cmd}. Speed: {sp:.1f}m/s.'
for cmd, sp in zip(driving_commands, ego_speeds)
]
# ---- 数据预处理 ----
# Processor 把图片和文字统一转换成模型能接受的数字格式
# - padding=True:批次中不同长度的序列补齐到相同长度
# - return_tensors='pt':返回 PyTorch tensor
device = next(self.vlm.parameters()).device
inputs = self.processor(
text=prompts,
images=images,
return_tensors='pt',
padding=True,
).to(device)
# ---- VLM 前向推理(不计算梯度) ----
# output_hidden_states=True:返回每层 Transformer 的隐藏状态
# 我们需要最后一层的特征,所以必须开启
with torch.no_grad():
out = self.vlm(**inputs, output_hidden_states=True)
# ---- 提取特征向量 ----
# out.hidden_states 是一个元组,长度 = 层数+1
# [-1] 取最后一层(包含最丰富的语义信息)
# [:, -1, :] 取序列中最后一个 token(在自回归模型中,
# 最后一个 token 通过 attention 机制聚合了前面所有信息;
# 注意:batch>1 且右填充时最后一个位置可能是 pad token,
# 应改取各序列最后一个非 pad token。本文 batch=1,无此问题)
# .float() 从 fp16 转 fp32(轨迹头用 fp32 更稳定)
h = out.hidden_states[-1][:, -1, :].float()
# h 的形状: (batch_size, hidden_size),如 (1, 2048)
# ---- 轨迹预测 ----
# 特征向量 → MLP → 12 个数字 → reshape 成 (batch, 6, 2)
traj = self.traj_head(h)
return traj.view(-1, self.num_waypoints, 2)
# =====================================================================
# 【第四步:训练函数】
#
# 深度学习训练的标准流程:
# 1. 准备数据(Dataset + DataLoader)
# 2. 构建模型
# 3. 定义损失函数(衡量"错了多少")
# 4. 定义优化器(决定"怎么改参数")
# 5. 循环训练:前向 → 算损失 → 反向传播 → 更新参数
# =====================================================================
def train():
"""完整的训练流程"""
# ======================== 配置参数 ========================
# 【修改这里】把路径改成你自己的 nuScenes 数据集路径
NUSCENES_ROOT = "/home/lionsking/data/nuscenes"
BATCH_SIZE = 1 # 每次送 1 张图给模型(8GB 显存只能 batch=1)
EPOCHS = 20 # 完整遍历数据集的次数
LR = 1e-3 # 学习率:每次更新参数时的步长
MAX_SAMPLES = 100 # 只用前 100 条数据(调试用,正式训练可增大)
USE_4BIT = True # 是否使用 4-bit 量化(8GB 显卡必须 True)
# 自动选择设备
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\n{'='*60}")
print(f"设备: {DEVICE}")
if DEVICE == "cuda":
gpu_name = torch.cuda.get_device_name(0)
gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {gpu_name} ({gpu_mem:.1f} GB)")
print(f"{'='*60}\n")
# ======================== 加载数据集 ========================
print("📦 加载 nuScenes 数据集...")
dataset = NuScenesTrajDataset(
nusc_root=NUSCENES_ROOT,
max_samples=MAX_SAMPLES,
)
# DataLoader:自动处理 batch、shuffle、多进程加载
# shuffle=True:每个 epoch 打乱数据顺序(防止模型记住顺序而非学习规律)
# num_workers=0:数据加载的子进程数(0 = 主进程加载,最稳定)
# pin_memory=True:锁页内存,加速 CPU→GPU 数据传输
# collate_fn:自定义 batch 组装方式(因为 PIL Image 不能直接 stack)
dataloader = DataLoader(
dataset,
batch_size=BATCH_SIZE,
shuffle=True,
num_workers=0, # 避免多进程和 nuScenes 的冲突
pin_memory=(DEVICE == "cuda"),
collate_fn=custom_collate, # 自定义 batch 拼接(见下方定义)
)
# ======================== 构建模型 ========================
print("\n🧠 构建 MiniVLA 模型...")
model = MiniVLA(freeze_vlm=True, use_4bit=USE_4BIT)
# 4-bit 量化模型用了 device_map="auto",VLM 已经在 GPU 上了
# 只需要把轨迹头也搬到 GPU
model.traj_head = model.traj_head.to(DEVICE)
# ======================== 损失函数和优化器 ========================
# SmoothL1Loss(也叫 Huber Loss):
# - 当预测和真实差距大时,用 L1(|pred - true|),梯度恒定,不会爆炸
# - 当差距小时,用 L2((pred - true)²),更平滑,有利于精细调整
# - 对轨迹预测这种回归任务来说是最佳选择之一
loss_fn = nn.SmoothL1Loss()
# Adam 优化器:只优化轨迹头的参数
# Adam = Adaptive Moment Estimation
# 它为每个参数自适应调整学习率:
# - 变化大的参数:自动降低学习率(避免震荡)
# - 变化小的参数:自动提高学习率(加速收敛)
optimizer = torch.optim.Adam(model.traj_head.parameters(), lr=LR)
# ======================== 训练循环 ========================
print(f"\n🚀 开始训练({EPOCHS} 个 epoch,每 epoch {len(dataloader)} 个 batch)...\n")
model.traj_head.train() # 轨迹头设为训练模式(启用 Dropout)
best_loss = float('inf')
for epoch in range(EPOCHS):
total_loss = 0.0
pbar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}")
for batch_idx, (images, commands, speeds, gt_traj) in enumerate(pbar):
# gt_traj: 真实轨迹 (Ground Truth),形状 (batch, 6, 2)
gt_traj = gt_traj.to(DEVICE)
# ---- 前向传播:图像+指令 → 预测轨迹 ----
pred_traj = model(images, commands, speeds)
# ---- 计算损失:预测轨迹 vs 真实轨迹 ----
loss = loss_fn(pred_traj, gt_traj)
# ---- 反向传播三步曲 ----
optimizer.zero_grad() # 1. 清空旧梯度(PyTorch 默认累加梯度)
loss.backward() # 2. 计算每个参数对 loss 的贡献(梯度)
optimizer.step() # 3. 按梯度方向更新参数(让 loss 变小)
total_loss += loss.item()
pbar.set_postfix(loss=f"{loss.item():.4f}")
# 本 epoch 的平均损失
avg_loss = total_loss / len(dataloader)
print(f" Epoch {epoch+1} 平均损失: {avg_loss:.4f}")
# 保存最优模型
if avg_loss < best_loss:
best_loss = avg_loss
torch.save(model.traj_head.state_dict(), 'vla_traj_head_best.pth')
print(f" 💾 最优模型已保存 (loss={best_loss:.4f})")
# 保存最终模型
torch.save(model.traj_head.state_dict(), 'vla_traj_head_final.pth')
print(f"\n✅ 训练完成!最终模型: vla_traj_head_final.pth")
print(f" 最佳模型: vla_traj_head_best.pth (loss={best_loss:.4f})")
def custom_collate(batch):
"""
自定义 batch 组装函数
为什么需要这个?
默认的 collate_fn 会尝试把所有数据 stack 成 tensor,
但 PIL Image 不能直接 stack(大小可能不同、类型不是 tensor)。
所以我们手动处理:
- 图像:保持为 PIL Image 列表(Processor 会统一处理)
- 指令:保持为字符串列表
- 速度和轨迹:正常 stack 成 tensor
"""
images, commands, speeds, trajs = zip(*batch)
return (
list(images), # PIL Image 列表
list(commands), # 字符串列表
torch.tensor(speeds), # (batch,)
torch.stack(trajs), # (batch, 6, 2)
)
# =====================================================================
# 【第五步:推理演示 —— 用训练好的模型预测轨迹】
# =====================================================================
def demo_inference():
"""
加载训练好的模型,对单张图片做轨迹预测(演示用)
"""
NUSCENES_ROOT = "/home/lionsking/data/nuscenes"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# 加载模型
model = MiniVLA(freeze_vlm=True, use_4bit=True)
model.traj_head = model.traj_head.to(DEVICE)
# 加载训练好的轨迹头权重
model.traj_head.load_state_dict(
torch.load('vla_traj_head_best.pth', map_location=DEVICE)
)
model.traj_head.eval() # 设为推理模式(关闭 Dropout)
# 加载一张测试图片
dataset = NuScenesTrajDataset(nusc_root=NUSCENES_ROOT, max_samples=10)
image, command, speed, gt_traj = dataset[0]
print(f"\n指令: {command}")
print(f"速度: {speed:.1f} m/s")
# 预测
with torch.no_grad():
pred = model([image], [command], [speed])
pred = pred.cpu().numpy()[0] # (6, 2)
print(f"\n预测轨迹(相对坐标,单位:米):")
for i, (x, y) in enumerate(pred):
print(f" 第 {i+1} 个点: 前方 {x:.2f}m, 侧向 {y:.2f}m")
gt = gt_traj.numpy()
print(f"\n真实轨迹:")
for i, (x, y) in enumerate(gt):
print(f" 第 {i+1} 个点: 前方 {x:.2f}m, 侧向 {y:.2f}m")
# =====================================================================
# 【入口】
# =====================================================================
if __name__ == "__main__":
import sys
if len(sys.argv) > 1 and sys.argv[1] == "demo":
# 推理演示:python mini_vla_nuscenes_fixed.py demo
demo_inference()
else:
# 默认:训练
train()
上边的代码执行报错,报错信息如下:
model.visual.blocks.{0...31}.mlp.down_proj.weight | CONVERSION |
Traceback (most recent call last):
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/core_model_loading.py", line 1005, in log_conversion_errors
yield
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/core_model_loading.py", line 787, in convert
collected_tensors = self.quantization_operation.convert(
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 56, in convert
new_value = bnb.nn.Params4bit(value, requires_grad=False, **old_value.__dict__).to(value.device)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 345, in to
return self._quantize(device)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 300, in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/bitsandbytes/functional.py", line 924, in quantize_4bit
_out, _absmax = torch.ops.bitsandbytes.quantize_4bit.default(
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/torch/_ops.py", line 865, in __call__
return self._op(*args, **kwargs)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/torch/_compile.py", line 54, in inner
return disable_fn(*args, **kwargs)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
return fn(*args, **kwargs)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/torch/library.py", line 751, in func_no_dynamo
return func(*args, **kwargs)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/bitsandbytes/backends/cuda/ops.py", line 339, in _
lib.cquantize_blockwise_bf16_nf4(*args)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 269, in throw_on_call
raise RuntimeError(f"{self.formatted_error}Native code method attempted to call: lib.{name}()")
RuntimeError:
🚨 CUDA SETUP ERROR: Missing dependency: libnvJitLink.so.13 🚨
CUDA 13.x runtime libraries were not found in the LD_LIBRARY_PATH.
To fix this, make sure that:
1. You have installed CUDA 13.x toolkit on your system
2. The CUDA runtime libraries are in your LD_LIBRARY_PATH
You can add them with (and persist the change by adding the line to your .bashrc):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cuda-13.x/lib64
Original error: libnvJitLink.so.13: cannot open shared object file: No such file or directory
🔍 Run this command for detailed diagnostics:
python -m bitsandbytes
If you've tried everything and still have issues:
1. Include ALL version info (operating system, bitsandbytes, pytorch, cuda, python)
2. Describe what you've tried in detail
3. Open an issue with this information:
https://github.com/bitsandbytes-foundation/bitsandbytes/issues
Native code method attempted to call: lib.cquantize_blockwise_bf16_nf4()
Error: Bnb4bitQuantize on tensors destined for model.visual.blocks.31.mlp.down_proj.weight. Ckpt contains: 1
model.visual.blocks.{0...31}.mlp.gate_proj.weight | CONVERSION |
(同样的 libnvJitLink.so.13 报错重复出现,此处省略)
Error: Bnb4bitQuantize on tensors destined for model.visual.blocks.31.mlp.up_proj.weight. Ckpt contains: 1
model.visual.merger.mlp.{0, 2}.weight | CONVERSION |
(同样的报错重复出现,此处省略)
Notes:
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
- CONVERSION: originate from the conversion scheme
Traceback (most recent call last):
File "/home/lionsking/Code/auto_self/mini_vla_nusce_claude.py", line 639, in <module>
train()
File "/home/lionsking/Code/auto_self/mini_vla_nusce_claude.py", line 504, in train
model = MiniVLA(freeze_vlm=True, use_4bit=USE_4BIT)
File "/home/lionsking/Code/auto_self/mini_vla_nusce_claude.py", line 323, in __init__
self.vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4211, in from_pretrained
loading_info = cls._finalize_model_loading(model, load_config, loading_info)
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4382, in _finalize_model_loading
log_state_dict_report(
File "/home/lionsking/miniconda3/envs/carla/lib/python3.10/site-packages/transformers/utils/loading_report.py", line 273, in log_state_dict_report
raise RuntimeError(
RuntimeError: We encountered some issues during automatic conversion of the weights. For details look at the `CONVERSION` entries of the above report!
(carla) lionsking@ai-dev:~/Code/auto_self$
这个报错的分析与解决方法,请看下一篇。
为者常成,行者常至
自由转载-非商用-非衍生-保持署名(创意共享3.0许可证)