Drifting Field Policy (DFP) is a one-step generative policy framework for reinforcement learning. Instead of relying on iterative diffusion or ODE-based sampling, DFP represents the policy as a direct pushforward map from noise to actions, which avoids the trajectory-level credit assignment that diffusion policies require in RL fine-tuning. Policy improvement is formulated as a Wasserstein gradient flow that moves the action distribution toward high-value regions under a critic. This enables fast, multimodal action generation while keeping the training objective directly aligned with reward-guided policy improvement.
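To make the pushforward and gradient-flow picture concrete, here is a minimal toy sketch. It is not the DFP training code. It assumes a pure potential-energy objective (maximize the expected critic value with no regularization), for which the Wasserstein gradient flow moves each action particle along the critic gradient. The critic `toy_q`, the map `pushforward`, and all constants are hypothetical and chosen only for illustration.

```python
import random


def toy_q(a):
    """Hypothetical 1-D critic with two equally good modes at a = -1 and a = +1."""
    return -((a * a - 1.0) ** 2)


def toy_q_grad(a):
    """Analytic derivative of toy_q with respect to the action."""
    return -4.0 * a * (a * a - 1.0)


def pushforward(z, scale=0.5):
    """Hypothetical one-step policy: a single map from noise to an action."""
    return scale * z


def drift(actions, step_size=0.05, num_steps=200):
    """Explicit Euler steps of the particle flow da/dt = dQ/da."""
    for _ in range(num_steps):
        actions = [a + step_size * toy_q_grad(a) for a in actions]
    return actions


def mean_q(actions):
    """Average critic value over a set of action particles."""
    return sum(toy_q(a) for a in actions) / len(actions)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
initial_actions = [pushforward(z) for z in noise]  # one network-free "forward pass"
drifted_actions = drift(initial_actions)

print("mean Q before drift:", mean_q(initial_actions))
print("mean Q after drift: ", mean_q(drifted_actions))
```

In this toy setting the particles split between the two modes instead of collapsing to one, which is the multimodal behavior the description above refers to. How DFP turns such a drift into a training target for the one-step map is defined by the loss in this repository, not by this sketch.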
This release contains:
- drift: Drifting Field Policy.
- meanflow: Mean Velocity Policy comparison backbone.
- acfql: QC/FQL baseline retained from the original action-chunking codebase.
- Python: 3.10
- CUDA: 12.x
- Benchmarks: Robomimic, OGBench
conda env create -f environment.yml
conda activate dfp

Or install the pip dependencies manually:
conda create -n dfp python=3.10 pip -y
conda activate dfp
pip install -r requirements.txt

Place the low-dimensional Robomimic datasets under the standard Robomimic directory:
~/.robomimic/lift/mh/low_dim_v15.hdf5
~/.robomimic/can/mh/low_dim_v15.hdf5
~/.robomimic/square/mh/low_dim_v15.hdf5
If your datasets live elsewhere, set:
export ROBOMIMIC_DATASET_DIR=/path/to/robomimic

The datasets can be downloaded from the Robomimic dataset page: https://robomimic.github.io/docs/datasets/robomimic_v0.1.html
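Before launching a run, it can help to check that the three dataset files are where the code will look for them. The sketch below is a hypothetical helper, not part of this release. It assumes that `ROBOMIMIC_DATASET_DIR` replaces `~/.robomimic` as the root and that the `<task>/mh/low_dim_v15.hdf5` layout stays the same under the override.

```python
import os
from pathlib import Path

# Tasks whose low-dimensional datasets are listed above.
TASKS = ("lift", "can", "square")


def robomimic_dataset_path(task, env=None):
    """Return the expected dataset path for one task.

    Assumption: ROBOMIMIC_DATASET_DIR, when set, replaces ~/.robomimic as the root.
    """
    env = os.environ if env is None else env
    root = Path(env.get("ROBOMIMIC_DATASET_DIR", Path.home() / ".robomimic"))
    return root / task / "mh" / "low_dim_v15.hdf5"


def missing_datasets(env=None):
    """List the expected dataset files that do not exist on disk."""
    paths = [robomimic_dataset_path(task, env) for task in TASKS]
    return [path for path in paths if not path.exists()]


if __name__ == "__main__":
    for path in missing_datasets():
        print(f"missing: {path}")
```

Running the script prints one line per missing file and nothing when all three datasets are in place.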
For cube-quadruple, we use the 100M-size offline dataset:
wget -r -np -nH --cut-dirs=2 -A "*.npz" \
https://rail.eecs.berkeley.edu/datasets/ogbench/cube-quadruple-play-100m-v0/

Pass the downloaded directory with:
--ogbench_dataset_dir=/path/to/cube-quadruple-play-100m-v0

MVP (Mean Velocity Policy) is our main comparison baseline. Since no official implementation was available, we implemented it ourselves for reproduction. Most hyperparameters follow the MVP paper, but for the cube-triple experiments we set ivc_lambda=0 because it gave the strongest performance in our runs.

We could not fully reproduce the performance reported in the paper across all settings, so the MVP results in our experiments use the best-performing configuration we found.
The main results are offline-to-online runs. Each command first trains on the offline dataset and then continues online fine-tuning in the same run.
# DFP
MUJOCO_GL=egl python main.py --agent_config=drift --run_group=reproduce --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# MVP
MUJOCO_GL=egl python main.py --agent_config=meanflow --run_group=reproduce --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# QC-BFN
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.actor_type=best-of-n --agent.actor_num_samples=32 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# QC-FQL
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.alpha=100 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
# BFN
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.actor_type=best-of-n --agent.actor_num_samples=4 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=1
# FQL
MUJOCO_GL=egl python main.py --run_group=reproduce --agent.alpha=100 --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=1

The default agent is acfql, so the QC-BFN, QC-FQL, BFN, and FQL commands do not need an explicit --agent_config=acfql. Override the environment when needed:
MUJOCO_GL=egl python main.py \
--agent_config=drift \
--run_group=reproduce \
--env_name=cube-quadruple-play-100m-singletask-task3-v0 \
--ogbench_dataset_dir=/path/to/cube-quadruple-play-100m-v0 \
--seed=42

To skip offline training and start online fine-tuning from a saved offline checkpoint, pass the checkpoint and set restore_epoch to the offline training horizon:
MUJOCO_GL=egl python main.py \
--agent_config=drift \
--run_group=reproduce \
--env_name=cube-triple-play-singletask-task3-v0 \
--restore_path=/path/to/params_offline_final.pkl \
--restore_epoch=1000000 \
--seed=42

| Path | Description |
|---|---|
| agents/ | DFP, MVP, and QC/FQL baseline agents |
| config/ | Main, evaluation, optimizer, and agent configs |
| envs/ | Robomimic, OGBench, and D4RL environment utilities |
| utils/ | Datasets, networks, drifting loss, logging, and Flax utilities |
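The six offline-to-online reproduction commands listed earlier differ only in a few flags, so they can be assembled from a small table. The sketch below is a hypothetical convenience script, not part of this release. The flag values are copied from those commands, and the sketch assumes that `main.py` does not depend on flag order.

```python
import shlex

ENTRY_POINT = ["python", "main.py"]

# Flags shared by every reproduction command.
SHARED_FLAGS = [
    "--run_group=reproduce",
    "--env_name=cube-triple-play-singletask-task2-v0",
    "--sparse=False",
]

# Per-method flags; methods without --agent_config use the default acfql agent.
METHOD_FLAGS = {
    "DFP": ["--agent_config=drift", "--horizon_length=5"],
    "MVP": ["--agent_config=meanflow", "--horizon_length=5"],
    "QC-BFN": ["--agent.actor_type=best-of-n", "--agent.actor_num_samples=32", "--horizon_length=5"],
    "QC-FQL": ["--agent.alpha=100", "--horizon_length=5"],
    "BFN": ["--agent.actor_type=best-of-n", "--agent.actor_num_samples=4", "--horizon_length=1"],
    "FQL": ["--agent.alpha=100", "--horizon_length=1"],
}


def build_command(method, seed=None):
    """Return the argument list for one method, appending --seed when given."""
    args = ENTRY_POINT + METHOD_FLAGS[method] + SHARED_FLAGS
    if seed is not None:
        args = args + [f"--seed={seed}"]
    return args


if __name__ == "__main__":
    # Print shell-ready commands; MUJOCO_GL=egl is set as in the commands above.
    for method in METHOD_FLAGS:
        print("MUJOCO_GL=egl " + shlex.join(build_command(method, seed=42)))
```

Printing the commands instead of executing them keeps the script free of side effects; the output can be pasted into a shell or a job scheduler.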
If you find our work useful, please consider citing:
@article{koo2026drifting,
title={Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow},
author={Koo, Juil and Park, Mingue and Choi, Jiwon and Min, Yunhong and Sung, Minhyuk},
journal={arXiv preprint arXiv:2605.07727},
year={2026}
}

This repository builds on the Q-chunking/FQL codebase.