FlowBender: Feedback-Aware Training
for Self-Correcting Conditional Flows

Daniel Gilo¹, Sven Elflein²,³,⁴, Ido Sobol¹, Or Litany¹,²
¹Technion, ²NVIDIA, ³University of Toronto, ⁴Vector Institute
TL;DR

FlowBender trains conditional flow models to use their own alignment error as a first-class input — improving condition fidelity and sample plausibility simultaneously across 2D and 3D tasks, rather than trading one for the other.

Abstract

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator — the depth predictor defining the constraint — is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample.

We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity.

We propose several variants, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other.

The Problem, Step by Step

A 2D toy example demonstrates where existing conditional generation paradigms fall short — and what a closed-loop correction policy achieves instead.

Figure: ground-truth spiral distribution.

The Goal

The target is a 2D Archimedean spiral, partitioned into four arcs — one per quadrant — each spanning a distinct radius range. Each arc is a class, distinguished by color. Given a class label, a conditional flow model must generate samples that lie on the spiral and fall within the correct arc.
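
A minimal sketch of such a toy dataset is given below. It is an illustration, not the authors' data code: the single-turn spiral `r = a + b*theta`, the constants `a` and `b`, and the helper names `sample_spiral` and `radius_range` are all assumptions made here.

```python
import numpy as np

def sample_spiral(n, rng, a=0.5, b=0.5, noise=0.0):
    """Draw n points from one turn of the Archimedean spiral r = a + b*theta.

    Returns (points, labels). Label k in {0, 1, 2, 3} is the quadrant that
    contains the point; because r grows with theta, each quadrant also
    covers its own radius range. All constants are illustrative assumptions.
    """
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = a + b * theta
    pts = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    pts = pts + noise * rng.standard_normal(pts.shape)
    labels = np.minimum((theta // (np.pi / 2)).astype(int), 3)
    return pts, labels

def radius_range(label, a=0.5, b=0.5):
    """Radius interval spanned by the arc of class `label` (noise-free)."""
    lo = a + b * label * np.pi / 2
    hi = a + b * (label + 1) * np.pi / 2
    return lo, hi

rng = np.random.default_rng(0)
points, labels = sample_spiral(1000, rng)
```

A sample is on target only if it lies on the curve (plausibility) and its radius falls inside `radius_range(label)` (fidelity), which is what makes the two failure modes easy to tell apart in this toy setting.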

Standard Conditional FM Fails on Both Counts

A model trained on supervised pairs approximates the conditional distribution but acts as a black box at inference, never verifying its output. The failures are twofold: samples, whose color marks the requested class, frequently land in the wrong quadrant (violating the class constraint) and often miss the spiral altogether, so the model fails on both fidelity and plausibility.

Figure: standard conditional FM results.

Inference-Time Guidance Trades One Failure for Another

Guidance uses the gradient of a radial penalty (measuring how far each sample strays from the target arc) to steer unconditional trajectories toward the correct quadrant. But the correction strength α must be hand-tuned, and no setting satisfies both criteria: weak guidance leaves samples on the wrong arc; strong guidance satisfies the radial constraint but pulls them entirely off the spiral. Choosing α means choosing which failure to accept.
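
The linear update described above can be sketched as follows. The exact penalty is not specified on this page, so the squared distance of the radius to the target interval `[lo, hi]`, the point at which the gradient is evaluated, and the function names are assumptions for illustration only.

```python
import numpy as np

def radial_penalty_grad(x, lo, hi, eps=1e-8):
    """Gradient of 0.5 * dist(|x|, [lo, hi])**2 with respect to x.

    Assumed penalty: zero inside the target radius range, quadratic outside.
    x has shape (n, 2).
    """
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    excess = r - np.clip(r, lo, hi)   # signed distance to the interval
    return excess * x / (r + eps)     # chain rule: d|x|/dx = x / |x|

def guided_velocity(v, x, lo, hi, alpha):
    """Hand-tuned linear guidance: base velocity minus alpha times the gradient."""
    return v - alpha * radial_penalty_grad(x, lo, hi)
```

The only control is the scalar `alpha`: the update always pushes along the radial direction, regardless of where the spiral actually lies, which is consistent with the trade-off described in the text.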

α = 0.5 — too weak, effectively unconditional

α = 2.0 — radial constraint met, but off-manifold

α = 4.0 — structure collapses entirely

Figure: FlowBender results.

FlowBender: Both, Not Either

Trained to read its own alignment error at each step, FlowBender learns a nonlinear correction policy rather than a scalar guidance weight. Samples land in the correct arc and remain on the spiral, so both constraints are satisfied: fidelity and plausibility simultaneously, with no guidance strength to tune.

Method

FlowBender transforms conditional generation into a closed-loop system by training the model to act on its own alignment error as a first-class input.

Figure: FlowBender method diagram (two-pass feedback-aware training loop).

Feedback-Aware Training

FlowBender uses a two-pass strategy at each denoising step. An unguided look-ahead pass estimates the clean signal; the forward operator is then applied to this estimate to measure its deviation from the condition signal, yielding the feedback. A second, refinement pass consumes this feedback to produce the corrected velocity, closing the loop.
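
The two passes can be sketched as a single Euler step. This is a hedged illustration rather than the released implementation: the `model(x, t, cond, feedback)` signature, the use of an all-zero feedback tensor for the unguided pass, the residual form of the feedback, and the time convention (t = 0 is noise, t = 1 is data, so the clean estimate is `x + (1 - t) * v`) are all assumptions.

```python
import numpy as np

def two_pass_step(model, forward_op, x, t, dt, cond):
    """One closed-loop Euler step.

    model(x, t, cond, feedback) -> velocity   (hypothetical signature)
    forward_op(x_clean) -> measurement in the same space as cond
    """
    no_feedback = np.zeros_like(cond)
    v_look = model(x, t, cond, no_feedback)   # pass 1: unguided look-ahead
    x1_hat = x + (1.0 - t) * v_look           # estimate of the clean signal
    feedback = forward_op(x1_hat) - cond      # deviation from the condition
    v = model(x, t, cond, feedback)           # pass 2: refinement
    return x + dt * v, feedback
```

During training, the same two passes would be run on noised ground-truth samples so that the network sees realistic feedback; how the loss is applied across the passes is described in the paper and is not reproduced here.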

Feedback Variants

We propose two feedback variants. For differentiable operators, the first-order variant uses the gradient of the alignment loss, whose negative points in the direction that reduces the error. For non-differentiable or black-box operators (e.g., JPEG compression), the zero-order variant uses the measurement residual directly, requiring no gradients. A hybrid option combines both.
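
To make the difference between the two signals concrete, the sketch below computes both for a linear forward operator `A`. The squared-error alignment loss `0.5 * ||A x - y||^2` and the function names are assumptions chosen for illustration; the actual loss depends on the task.

```python
import numpy as np

def zero_order_feedback(A, x1_hat, cond):
    """Measurement residual: lives in the space of the condition."""
    return A @ x1_hat - cond

def first_order_feedback(A, x1_hat, cond):
    """Gradient of 0.5 * ||A x - y||^2 w.r.t. x: lives in the sample space."""
    return A.T @ (A @ x1_hat - cond)

A = np.array([[1.0, 2.0], [0.0, 1.0]])
x1_hat = np.array([1.0, 1.0])
cond = np.array([2.0, 0.0])
residual = zero_order_feedback(A, x1_hat, cond)   # array([1., 1.])
gradient = first_order_feedback(A, x1_hat, cond)  # array([1., 3.])
```

The residual only needs the operator's output, so it is available for black-box operators, while the gradient additionally needs the operator's transpose (or backpropagation through it).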

Inference Prior-Step Shortcut

The two-pass strategy doubles the inference cost to 2N function evaluations (NFEs) for an N-step trajectory. To alleviate this, we exploit the temporal correlation of feedback signals, which grows stronger as sampling progresses. Above a time threshold, the look-ahead pass is skipped and the prior step's estimate is reused instead, cutting the cost from 2N to as few as N+1 evaluations while retaining the closed-loop corrective benefits.
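
The evaluation count can be checked with a small sampler sketch. It is one possible reading of the shortcut, not the authors' code: the model signature, the zero tensor used for the unguided pass, the choice to reuse the clean estimate implied by the previous step's corrected velocity, and the convention that time runs from 0 (noise) to 1 (data) are assumptions.

```python
import numpy as np

def sample_with_shortcut(model, forward_op, x, cond, n_steps, t_skip):
    """Euler sampler that skips the look-ahead pass once t >= t_skip.

    Returns the final sample and the number of network evaluations (NFEs).
    """
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    x1_hat, nfe = None, 0
    for t, t_next in zip(ts[:-1], ts[1:]):
        if x1_hat is None or t < t_skip:
            # Full two-pass step: run a fresh unguided look-ahead.
            v_look = model(x, t, cond, np.zeros_like(cond))
            nfe += 1
            x1_hat = x + (1.0 - t) * v_look
        # Otherwise x1_hat is the estimate left over from the previous step.
        feedback = forward_op(x1_hat) - cond
        v = model(x, t, cond, feedback)
        nfe += 1
        x1_hat = x + (1.0 - t) * v   # clean estimate kept for possible reuse
        x = x + (t_next - t) * v
    return x, nfe
```

With `n_steps = 10`, setting `t_skip` above 1 never skips and costs 20 evaluations, while `t_skip = 0.0` performs the look-ahead only on the first step and costs 11, matching the 2N and N+1 bounds.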

Results: Image-to-Image Tasks

We fine-tune Stable Diffusion 3.5 with ControlNet for depth-to-RGB and edge-to-RGB translation. FlowBender consistently outperforms supervised baselines and inference-time guidance on both fidelity and plausibility. Super-resolution results and quantitative comparisons are available in the paper.

Figure: qualitative comparison on two examples. Columns: Input Depth | Ours (Zero-Order) | Ours (First-Order) | Standard FT | FT + ℒ_align.

Depth → RGB. Both FlowBender variants closely follow the input depth condition, while baselines show structural misalignments highlighted by red boxes.

Figure: qualitative comparison on two examples. Columns: Input Edges | Ours (Zero-Order) | Ours (First-Order) | Standard FT | FT + ℒ_align.

Edge → RGB. Both FlowBender variants adhere to the input edge structure, while baselines show misalignments with the given edges, highlighted by red boxes.

JPEG Restoration

JPEG compression is non-differentiable, making gradient-based guidance inapplicable. FlowBender's zero-order variant uses the pixel-space residual between the re-compressed estimate and the input as feedback — no gradients required. It consistently outperforms standard fine-tuning.
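
The sketch below illustrates why a residual remains usable when gradients are not. It uses a simple grid quantizer as a stand-in for a lossy codec; this is not real JPEG, and the function names and step size are assumptions for illustration.

```python
import numpy as np

def quantize(x, step=0.25):
    """Toy non-differentiable operator: snap values to a grid (NOT real JPEG)."""
    return np.round(x / step) * step

def residual_feedback(x1_hat, cond, step=0.25):
    """Zero-order feedback: re-degraded estimate minus the degraded input."""
    return quantize(x1_hat, step) - cond

def finite_difference_slope(x, step=0.25, h=1e-4):
    """Numerical derivative of the quantizer; zero away from rounding boundaries."""
    return (quantize(x + h, step) - quantize(x - h, step)) / (2 * h)

x1_hat = np.array([0.1, 0.6, 0.9])   # current clean-signal estimate
cond = np.array([0.0, 0.75, 1.0])    # observed degraded input
feedback = residual_feedback(x1_hat, cond)   # array([ 0.  , -0.25,  0.  ])
slope = finite_difference_slope(x1_hat)      # array([0., 0., 0.])
```

The operator is piecewise constant, so its derivative carries no signal, yet the residual still shows that the second entry is inconsistent with the observation.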

Figure: qualitative comparison on five examples. Columns: Ground Truth | JPEG Condition | Ours (Zero-Order) | Standard FT.

JPEG Restoration. FlowBender's zero-order variant recovers clean images from heavily compressed inputs, reducing color banding and quantization artifacts that standard fine-tuning cannot resolve.

Results: 3D Mesh Texturing

We fine-tune TRELLIS-2's texture transformer with LoRA for 3D mesh texturing. FlowBender outperforms supervised baselines and inference-time guidance on both condition fidelity and multi-view plausibility; red boxes highlight regions where baselines diverge from the provided condition. Videos show 360° rotations of each asset across all methods.

Figure: qualitative comparison on eight assets (OXL 1 to 4, T4K 1 to 4). Columns: Condition | Ours | Standard FT | FT + ℒ_align | IT Guidance.

BibTeX

@misc{gilo2026flowbenderfeedbackawaretrainingselfcorrecting,
  title         = {FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows},
  author        = {Daniel Gilo and Sven Elflein and Ido Sobol and Or Litany},
  year          = {2026},
  eprint        = {2606.20404},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.20404},
}