ropedia-xperience-10m-task-baselines / results /omni_finetune /MULTI_EPISODE_ACCESS_STATUS.md
cy0307's picture
Publish Ropedia Xperience-10M task baseline cards
eeac43c verified

Multi-Episode Access Status

Current status: access to the gated full ropedia-ai/xperience-10m dataset is granted, and a metadata-only Hugging Face audit has been completed. A selected 128-episode pilot has produced a verified diagnostic Qwen3-Omni LoRA package with held-out evaluation. The result is useful as a pipeline and error-analysis baseline, not as a strong final model.

This file records the public data-access status and pilot requirements. It does not include local-machine aliases, private paths, SSH hosts, or token locations.

Selection Plan

Item Value
Dataset ropedia-ai/xperience-10m
Minimum pilot gate 32 complete leaf episodes
Strategy stratified round-robin across top-level session UUIDs
Metadata-audited visible complete episodes 12,102
Metadata-audited complete sessions 802
Current selected pilot 128 source-balanced episodes
Recommended split 96 train / 16 val / 16 test
Recommended estimated download 277.71 GiB excluding visualization.rrd
Representative 32-episode estimate ~70.5 GiB at median episode size
Smallest one-per-session 32-episode estimate 35.35 GiB
Excluded file visualization.rrd

Current Stage

The current Qwen3-Omni artifacts include a verified validation-monitored diagnostic held-out run: 96/16/16 selected train/val/test episodes, 3,808 exported windows, 2,848 train examples, 512 validation examples, and 448 held-out test predictions from 14 exported test episodes. Training used eight distributed accelerator processes for one epoch with LoRA rank 16 and recorded a final train loss of 0.4130 plus a validation loss of 0.0331. The result verifies the multi-episode pipeline and gives a real error-analysis baseline; it is still not a strong final model.

A stronger model-quality pilot should be claimed only after:

  • selected valid episodes are available locally,
  • the manifest builder confirms complete held-out episode splits,
  • training finishes with recorded metadata and progress logs,
  • evaluation runs on held-out test episodes,
  • predictions, metrics, confusion matrices, and a run report are committed.
  • JSON validity and action/subtask metrics improve beyond the current diagnostic baseline.

Current diagnostic metrics:

  • JSON validity: 87.50%
  • action macro-F1: 0.0027
  • subtask accuracy: 0.0067
  • transition accuracy: 0.8504
  • next-action accuracy: 0.0246
  • contact accuracy: 0.6451
  • object micro-F1: 0.2230

The public data access summary is:

results/omni_finetune/DATA_ACCESS_STATUS.md

The current metadata-only full dataset audit is:

results/omni_finetune/FULL_DATASET_METADATA_AUDIT.md

The current 128-episode source-balanced download plan is:

results/omni_finetune/XPERIENCE10M_128_EPISODE_SELECTION.md

The current verified diagnostic package is:

docs/data/omni_finetune_verified_result.json

results/omni_finetune/verified_public/

The older machine-generated source discovery report remains a pre-access local planning record:

results/omni_finetune/DATA_BLOCKER_REPORT.md