Handling Failures#

notata makes it easy to track failed runs explicitly — including crash reasons, partial outputs, and logs. This ensures reproducibility even when a run doesn’t complete successfully.

Automatic Failure Detection#

When using a context manager:

with Logbook("unstable") as log:
    raise RuntimeError("Something broke")

The run will automatically be marked as failed in metadata.json if any exception is raised. No need to call mark_failed() manually.

Manual Failure Capture#

If you’re not using a context manager — or want to catch and report failures explicitly — use:

try:
    log = Logbook("run_may_fail")
    simulate()
    log.mark_complete()
except Exception as e:
    log.mark_failed(str(e))

This records the failure reason and prevents partial runs from being marked as complete.

Failure Metadata#

When a run fails, metadata.json includes:

{
  "status": "failed",
  "start_time": "...",
  "end_time": "...",
  "runtime_sec": ...,
  "failure_reason": "Something broke",
  "run_id": "unstable"
}

This allows you to grep or filter failed runs easily.

Capturing Tracebacks#

You can log a full traceback for postmortem analysis:

import traceback

try:
    ...
except Exception as e:
    tb = traceback.format_exc()
    log.text("artifacts/debug/traceback", tb)
    log.mark_failed(str(e))

This saves:

artifacts/debug/traceback.txt

Searching for Failed Runs#

Use standard tools to identify failed runs:

grep -l '"status": "failed"' outputs/log_*/metadata.json

Or extract failure reasons:

jq '.failure_reason' outputs/log_*/metadata.json

Use Cases#

Simulation crashes
Numerical instability (e.g., nan, inf)
Invalid inputs or out-of-bounds parameters
Runtime exceptions from third-party libraries

Best Practices#

Always call mark_failed() when catching exceptions manually
Include a meaningful reason message
Save a traceback or diagnostic artifact if debugging is needed
Never mark_complete() if results are invalid

Next Steps#

For saving diagnostics and tracebacks: see Saving Artifacts
For programmatically resuming failed runs: see Parameter Sweeps

Handling Failures

Contents