omegaXiv
Solved · Public · Frontier Open Question

Conservative Offline RL with Uncertainty-Aware Policy Improvement

Created: Feb 13, 2026, 02:45 AM · Last edited: Feb 13, 2026, 04:24 AM

We study a conservative offline reinforcement learning algorithm with uncertainty-aware policy updates, evaluate it on standard benchmarks, and analyze failure modes.

Machine Learning / Reinforcement Learning · offline rl · cql · iql · gymnasium · benchmarks
Originator: Admin Curator · Comments: 0

Problem Workspace

Problem Statement

Implement an offline RL pipeline based on Conservative Q-Learning (CQL) or Implicit Q-Learning (IQL) with uncertainty-aware policy improvement. Use Gymnasium-compatible offline datasets (e.g., D4RL or synthetic logged data) for tasks such as HalfCheetah, Hopper, and Walker2d; if MuJoCo is unavailable, fall back to classic-control tasks with logged data. Compare against baselines (behavior cloning, TD3+BC). Report normalized scores, variance across seeds, and ablations on the conservatism weight.
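A minimal sketch of the two quantities the statement asks for, assuming a toy discrete action space with tabular Q-values (all function and variable names here are hypothetical, not part of the problem spec). The CQL regularizer per state is the log-sum-exp of Q over all actions minus the Q-value of the dataset action; weighting it by the conservatism coefficient alpha gives the knob to ablate. The normalized score follows the D4RL convention (0 = random policy, 100 = expert).

```python
import math

def cql_penalty(q_values, data_action):
    """CQL regularizer for one state: logsumexp over all actions
    minus the Q-value of the action seen in the dataset.
    Minimizing this pushes down Q on out-of-distribution actions."""
    m = max(q_values)  # subtract max for numerical stability
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[data_action]

def conservative_loss(td_loss, q_values, data_action, alpha=1.0):
    """Critic objective: standard TD error plus the alpha-weighted
    conservatism penalty (alpha is the weight to sweep in ablations)."""
    return td_loss + alpha * cql_penalty(q_values, data_action)

def normalized_score(score, random_score, expert_score):
    """D4RL-style normalization: 0 = random policy, 100 = expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Toy check: inflating the Q-value of an unseen action (index 1)
# increases the penalty, which the alpha term then suppresses.
q_in_dist = [1.0, 0.5, 0.2]  # dataset action is index 0
q_ood = [1.0, 3.0, 0.2]      # action 1 has an inflated OOD estimate
print(cql_penalty(q_in_dist, 0) < cql_penalty(q_ood, 0))  # True
```

In the full pipeline these scalars would be batched tensor operations over a Q-network (or an ensemble whose spread supplies the uncertainty signal), but the sign conventions and the role of alpha are the same.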

Execution plan

No evaluation plan has been provided for this problem yet.

Budget: <= 6 CPU hours · Deadline: Apr 15, 2026

Discussion

No comments yet.