Curiosity-Conditioned Goal-Optimal Reinforcement Learning (CCGO-RL)
We propose a reinforcement learning methodology that combines intrinsic novelty-driven exploration with explicit goal conditioning and extrinsic reward optimization. The core motivation is to avoid myopic exploitation while preserving path optimality for objective goals. The approach uses a dual-value design: one stream for task-optimal return and one for curiosity, with adaptive weighting that decays as goal confidence improves. We expect this to improve sample efficiency, robustness, and discovery of better trajectories in sparse- and deceptive-reward settings. The impact is a practical framework for agents that remain purpose-driven while retaining productive curiosity.
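The dual-value mixing described above can be sketched minimally in Python. This is an illustrative assumption about how the adaptive weighting might work, not the proposal's specified rule: the names `mixed_reward`, `goal_confidence`, and `beta0` are hypothetical, and the curiosity weight here simply decays linearly as goal confidence grows.

```python
def mixed_reward(r_ext: float, r_int: float,
                 goal_confidence: float, beta0: float = 1.0) -> float:
    """Blend extrinsic and intrinsic rewards.

    A hypothetical sketch: the curiosity weight `beta` starts at `beta0`
    and decays toward zero as `goal_confidence` (in [0, 1]) improves,
    so the agent shifts from exploration to strict extrinsic optimization.
    """
    beta = beta0 * (1.0 - goal_confidence)
    return r_ext + beta * r_int
```

With full goal confidence the intrinsic term vanishes entirely, which is one simple way to realize the "anneal to near zero" behavior tested in the acceptance criteria.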
Problem Workspace
Problem Statement
This project develops a new RL algorithmic framework where intrinsic motivation is not a standalone objective but a controlled exploration mechanism inside a goal-conditioned optimal control loop. The method will encode goals explicitly (e.g., goal vectors or target-conditioned policies), estimate novelty through state-action visitation uncertainty or prediction error, and combine intrinsic and extrinsic signals through an adaptive scheduler. The scheduler must prioritize curiosity during uncertainty and shift toward strict extrinsic optimization as confidence and performance stabilize. Scope includes single-agent continuous and discrete control, sparse-reward navigation/manipulation-style environments, and deceptive local-optima tasks where exploration quality is decisive. Key constraints are preserving asymptotic task optimality, avoiding intrinsic-reward hacking, and preventing instability from non-stationary reward mixing. The design must remain compatible…
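One of the novelty estimators mentioned above, state-action visitation counting, can be sketched as follows. This is a minimal illustrative example, not the project's implementation: the class name `CountNovelty` and the tuple-key discretization are assumptions, and the `1/sqrt(N(s, a))` bonus is the standard count-based form, which would need a pseudo-count or density model for continuous control.

```python
from collections import defaultdict
import math


class CountNovelty:
    """Count-based novelty bonus: reward 1 / sqrt(N(s, a)).

    A hypothetical sketch assuming states are discretizable into
    hashable tuples; rarely visited state-action pairs earn larger
    intrinsic rewards, and the bonus decays with repeated visits.
    """

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state, action) -> float:
        key = (tuple(state), action)
        self.counts[key] += 1  # update visitation count first
        return 1.0 / math.sqrt(self.counts[key])
```

A prediction-error variant would replace the count table with a learned forward model and use its loss on (s, a) as the bonus; both plug into the same adaptive scheduler slot.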
Execution plan
- Metrics: final success rate, normalized episodic return, sample efficiency (return vs. environment steps), first-success time, state coverage, regret, and stability variance across seeds.
- Baselines: goal-conditioned RL without intrinsic reward, curiosity-only exploration methods, entropy-regularized RL, and a static weighted intrinsic+extrinsic combination.
- Data/splits: fixed train/validation/test environment seeds; evaluate on both seen and held-out seeds, plus held-out task variants with altered layouts/dynamics for robustness.
- Acceptance criteria: final return at least non-inferior to the best goal-conditioned baseline; a 10-20% or greater improvement in sample efficiency on sparse/deceptive tasks; statistically significant gains in first-success time and coverage across at least 5 seeds; and no performance collapse when the intrinsic reward is annealed to near zero.
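Two of the metrics above, first-success time and state coverage, are straightforward to compute from per-episode logs. The helper names and input formats below are assumptions for illustration, not a prescribed evaluation harness.

```python
def first_success_time(success_flags):
    """Index of the first successful episode, or None if never achieved.

    `success_flags` is an assumed per-episode boolean log.
    """
    for t, ok in enumerate(success_flags):
        if ok:
            return t
    return None


def state_coverage(visited_states, num_states: int) -> float:
    """Fraction of distinct (discretized) states visited.

    Assumes states have been binned into hashable identifiers.
    """
    return len(set(visited_states)) / num_states
```

In practice each metric would be aggregated over the 5+ evaluation seeds before testing for statistical significance against the baselines.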
Related work
- Goal-conditioned reinforcement learning
- Intrinsic motivation and curiosity-driven exploration
- Count-based and prediction-error novelty methods
- Maximum-entropy RL
- Hierarchical RL for exploration