Signal handling and sweep runs - Weights & Biases Documentation

This page describes how W&B Sweeps handle system signals and process exit codes. Use this information to run sweeps reliably in preemptible environments such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. The following sections explain how to interrupt runs cleanly from the keyboard and how to predict run requeue behavior. This page is for users who run sweeps on preemptible infrastructure or who need fine-grained control over run lifecycle and cleanup. For details about how W&B requeues preempted runs, see Resume preemptible Sweeps runs.

Exit status and signals

W&B uses the exit status of the training process to decide whether a run is requeued and how run state is recorded.

Exit code contract:

Exit code 0: W&B considers the run to have completed successfully and doesn’t requeue it.
Non-zero exit code: W&B treats the run as failed or preempted. When you use mark_preempting(), W&B requeues the run so another agent (or the same agent after restart) can resume it.

This contract applies whether the process exits from a signal handler, from an exception, or from an explicit sys.exit() call, so you can rely on it in preemptible or cluster environments.

When the process exits due to a catchable signal, your handler can run. In the handler, call wandb.run.mark_preempting() if you want the run requeued, perform cleanup (for example, save a checkpoint), then exit with a non-zero code. A common convention for termination by signal is sys.exit(128 + signum). W&B records that exit code, and the same requeue rules apply.

When the operating system kills the process with SIGKILL, the process can’t run exit hooks, so W&B doesn’t write a final summary and the run might appear as crashed or killed. The agent still starts the next run.
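The 128 + signum convention can be decoded after the fact, for example when you read exit codes in scheduler logs. The following is a minimal sketch that uses only the Python standard library. The helper names exit_code_for_signal and describe_exit_code are illustrative and aren't part of the W&B API:

```python
import signal


def exit_code_for_signal(signum: int) -> int:
    """Return the conventional exit code for termination by a signal."""
    return 128 + signum


def describe_exit_code(code: int) -> str:
    """Give a human-readable description of a process exit code."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # Not a known signal number; fall through.
    return f"failed with exit code {code}"
```

For example, describe_exit_code(128 + signal.SIGTERM) reports termination by SIGTERM, while describe_exit_code(1) reports an ordinary failure.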

Stale runs and server-side timeouts

Exit codes aren’t the only way W&B determines run state. The W&B server also infers run state from activity. If a run neither finishes nor posts new metrics for about 5 minutes, the W&B server marks the run as crashed. That often happens when the training process becomes unresponsive, stops logging, or terminates without a clean exit (for example, after SIGKILL). Logging metrics on a steady cadence or exiting with a defined code helps keep run state aligned with what happened.
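One way to keep metrics flowing during long epochs is to log from the inner loop but throttle how often a log call is made. The sketch below is an illustration, not a W&B feature: SteadyLogger and its parameters are hypothetical names, and log_fn stands in for whatever logging call you use (for example, wandb.log).

```python
import time


class SteadyLogger:
    """Call log_fn at most once per interval_seconds (hypothetical helper)."""

    def __init__(self, log_fn, interval_seconds=60.0, clock=time.monotonic):
        self._log_fn = log_fn
        self._interval = interval_seconds
        self._clock = clock
        self._last = None  # Time of the most recent log call, if any.

    def maybe_log(self, metrics):
        """Log metrics if the interval has elapsed; return True if logged."""
        now = self._clock()
        if self._last is None or now - self._last >= self._interval:
            self._log_fn(metrics)
            self._last = now
            return True
        return False
```

Calling maybe_log() on every batch then produces roughly one log entry per interval, well inside the approximately 5-minute window described above, without flooding the server.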

Catchable signals and preemption

Most signals you’ll encounter in preemptible environments are catchable, meaning your training script can intercept them and shut down cleanly. You can register custom signal handlers in your training script. When the system delivers a catchable signal, your handler runs. W&B preserves metrics that were already sent, and the agent detects the process exit and starts the next run.

Best practices:

Register handlers early (for example, before entering the main training loop).
In the handler, call wandb.run.mark_preempting() when you intend the run to be requeued after preemption, perform cleanup (for example, save a checkpoint), then exit with a non-zero code.

The following example registers handlers for SIGUSR1 (a typical cluster preemption signal) and SIGTERM. It leaves SIGINT free for interactive use (for example, manual cancellation from the terminal). The handler calls wandb.run.mark_preempting() and exits using 128 + signum:

import signal
import sys
import wandb


def signal_handler(signum, frame):
    if wandb.run is not None:
        # Optional: save a model checkpoint, flush buffers, and so on.
        print(f"Preempted with signal: {signal.Signals(signum).name}.")
        wandb.run.mark_preempting()
    sys.exit(128 + signum)


def train():
    signal.signal(signal.SIGUSR1, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    with wandb.init() as run:
        config = wandb.config
        for epoch in range(100):
            # Training step; wandb.log(...) as needed
            pass


if __name__ == "__main__":
    train()
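A variation on the example above is to have the handler only record the signal and let the training loop react at a safe point, such as the end of a step. This avoids doing checkpoint I/O inside the handler. The sketch below uses only the standard library. PreemptionFlag, train, and on_preempt are hypothetical names, and on_preempt is where a real script would save a checkpoint and call wandb.run.mark_preempting():

```python
import signal


class PreemptionFlag:
    """Signal handler that only records which signal arrived."""

    def __init__(self):
        self.signum = None

    def __call__(self, signum, frame):
        self.signum = signum

    @property
    def requested(self):
        return self.signum is not None


def train(num_epochs, flag, on_preempt):
    """Run a placeholder loop; return the exit code the process should use."""
    for epoch in range(num_epochs):
        # Training step would go here.
        if flag.requested:
            # Safe point: checkpoint and mark the run as preempting here.
            on_preempt(epoch)
            return 128 + flag.signum
    return 0
```

In a real script you would register the flag with signal.signal(signal.SIGTERM, flag) and pass the return value of train() to sys.exit(). The trade-off is that the process reacts only after the current step finishes, which matters if your cluster's grace period is short.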

`SIGKILL` (uncatchable)

SIGKILL can’t be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B can’t write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. Use SIGKILL only as a last resort. Prefer SIGTERM or SIGINT when you need graceful shutdown.
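You can confirm from Python that the operating system refuses a handler for SIGKILL. The following sketch assumes a POSIX system (signal.SIGKILL doesn't exist on Windows), and can_install_handler is an illustrative helper name:

```python
import signal


def can_install_handler(signum) -> bool:
    """Return True if a Python handler can be registered for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The OS rejects handlers for uncatchable signals such as SIGKILL.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # Restore the original handler.
    return True
```

On Linux and macOS, can_install_handler(signal.SIGTERM) succeeds while can_install_handler(signal.SIGKILL) doesn't, which is why checkpointing can't rely on a handler when a process is killed with SIGKILL.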

Signal forwarding from agent to child

When you use the wandb agent CLI, the agent runs your training script as a child process. By default, when you interrupt the agent (for example, with Ctrl+C, or when a scheduler sends SIGTERM to the job), the child training process doesn’t receive the signal, so the training script can’t run its handler or call mark_preempting(). For more information, see wandb GitHub issue #3667.

To let the child shut down gracefully and call wandb.run.mark_preempting() in a handler, run the CLI agent with --forward-signals:

wandb agent --forward-signals entity/project/sweep_ID

When the CLI agent receives SIGINT or SIGTERM with forwarding enabled, it relays the signal to the child. Your training script’s handler can then run, call wandb.run.mark_preempting(), optionally call wandb.finish(exit_code=...) with a non-zero code, and exit with a non-zero code. If you press Ctrl+C twice on the agent process, the agent receives SIGTERM by default. With --forward-signals, the agent can forward SIGINT to the child so your handler runs. For more information, see the wandb agent CLI reference.

W&B doesn’t support signal forwarding for wandb.agent() in the Python API. That path runs your training function in a thread, not as a separate child process, so the same forwarding behavior doesn’t apply.
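Conceptually, forwarding means that a parent process installs handlers that relay incoming signals to its child. The sketch below illustrates that idea with the standard library on a POSIX system. It isn't how the wandb agent is implemented, and CHILD_CODE, forward_signals_to, and demo are hypothetical names:

```python
import signal
import subprocess
import sys

# Stand-in for a training script (hypothetical): it exits with 128 + signum
# on SIGTERM and prints "ready" once its handler is installed.
CHILD_CODE = """
import signal, sys, time

def handler(signum, frame):
    sys.exit(128 + signum)

signal.signal(signal.SIGTERM, handler)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def forward_signals_to(child, signals):
    """Relay the given signals to child; return a function that undoes this."""

    def forward(signum, frame):
        if child.poll() is None:  # Only signal a child that is still running.
            child.send_signal(signum)

    previous = {sig: signal.signal(sig, forward) for sig in signals}

    def restore():
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)

    return restore


def demo():
    """Start the child, signal the parent, and return the child's exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
    )
    restore = forward_signals_to(child, [signal.SIGTERM])
    try:
        child.stdout.readline()  # Wait until the child's handler is installed.
        signal.raise_signal(signal.SIGTERM)  # Simulate a signal to the parent.
        return child.wait(timeout=30)
    finally:
        restore()
        if child.poll() is None:
            child.kill()
        child.stdout.close()
```

Because the child's own handler runs, demo() returns the child's chosen exit code (128 + SIGTERM) rather than the negative return code that subprocess reports for a child killed by an unhandled signal.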

Preemptible clusters like SLURM

This section describes how to configure sweeps so that runs survive preemption on clusters such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. On preemption, the training process must receive the signal, mark the run as preempting, and exit with a non-zero code so W&B requeues the run. A new agent (or the same agent after the job is requeued) can then resume the run.

Ensure the training process receives the signal:

When the scheduler signals the agent: Run the agent with wandb agent --forward-signals so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child’s handler can then call wandb.run.mark_preempting(), wandb.finish(exit_code=...) with a non-zero code, and sys.exit(128 + signum) (or another non-zero exit code).
When the scheduler signals the launch script (not the agent directly): Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file. The launch script traps the cluster signal (for example, SIGUSR1) and runs kill -SIGUSR1 $(cat $PID_FILE) so the training process’s handler runs.
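The PID-file approach can also be expressed in Python if your launcher is a Python script instead of a shell script. This is a minimal sketch under that assumption, for POSIX systems. The function names and the PID file location are illustrative:

```python
import os
import signal
from pathlib import Path


def write_pid_file(path):
    """Called by the training process at startup: record its own PID."""
    Path(path).write_text(str(os.getpid()))


def signal_pid_from_file(path, signum):
    """Called by the launcher: send signum to the PID recorded in path."""
    pid = int(Path(path).read_text().strip())
    os.kill(pid, signum)
    return pid
```

The training script would call write_pid_file() before entering the training loop, and the launcher would call signal_pid_from_file(path, signal.SIGUSR1) from its own handler for the cluster's preemption signal.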

In the training script: Register a handler for the signal your cluster uses (for example, SIGTERM or SIGUSR1). In the handler, call wandb.run.mark_preempting() if a run is active, then finish the run with a non-zero exit code and call sys.exit(128 + signum) (or exit with another non-zero code) so W&B requeues the run. For more information about when W&B requeues runs and how that interacts with mark_preempting(), see Resume preemptible Sweeps runs.

Sweep state: Run wandb sweep --resume entity/project/sweep_ID before starting the agent so the sweep is in resume mode and hands out requeued runs.

Multi-agent coordination: When many agents run at once (such as SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. To work around it, stagger agent startup or use external coordination mechanisms such as locks.
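Staggering agent startup can be as simple as sleeping for a task-dependent delay before launching the agent. The sketch below assumes a SLURM array job, where SLURM sets the SLURM_ARRAY_TASK_ID environment variable. The function name and the default delays are arbitrary illustrative choices:

```python
import os
import random


def startup_delay_seconds(env=None, step_seconds=15.0, max_jitter_seconds=5.0):
    """Return how long this array task should wait before starting its agent."""
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    # Seed with the task ID so the jitter is reproducible per task.
    jitter = random.Random(task_id).uniform(0.0, max_jitter_seconds)
    return task_id * step_seconds + jitter
```

A launcher could call time.sleep(startup_delay_seconds()) before starting wandb agent. This reduces the chance that agents poll at the same moment but doesn't eliminate the race.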

`wandb sweep --cancel`

This section describes how the --cancel command interacts with signals and child processes, since cancellation behaves differently from sending an OS signal directly.

You cancel a sweep using the W&B API, not an OS signal. Run a command such as wandb sweep --cancel entity/project/sweep_ID. The server tells the agent to exit, and the agent then terminates running child processes and stops. A short delay (on the order of the agent’s API polling interval) can occur before cancellation takes effect. Cancellation delivers SIGKILL to runs. Child processes have no chance to run user-defined signal handlers. The same applies when you use the Cancel control in the Sweeps UI.

Use --cancel when you want to stop the entire sweep and mark it canceled. For graceful shutdown of the current run, send a catchable signal to the run (or use --forward-signals with the CLI agent and signal the agent). For graceful sweep completion, use wandb sweep --stop instead of --cancel. For more information about pause, resume, stop, and cancel options, see Manage sweeps.

Signals to the agent versus signals to the run

Understanding the distinction between signaling the agent and signaling the training run helps you avoid orphaned processes and unexpected behavior. If you send a signal to the agent process (not the child training process), the agent might exit while the child continues running as an orphan. The orphan might keep printing to your terminal, and the shell might not show a new prompt until you press Enter. Unless you use --forward-signals with the CLI agent, stopping the agent doesn’t guarantee the child training process stops. To confirm the agent has exited, use an OS command like ps -p [AGENT-PID] or pgrep -f "wandb agent" instead of relying on prompt appearance.
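The same check can be done from Python on a POSIX system by sending signal 0, which performs error checking without delivering a signal. This is a minimal sketch; pid_is_running is an illustrative helper name:

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # Signal 0 checks for existence; nothing is delivered.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True
```

Note that a zombie process that hasn't been reaped by its parent still counts as existing, so this check is best used alongside ps or pgrep, not as a replacement for them.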

Reference: `mark_preempting()` and final run state

The following table summarizes how run state depends on when you call mark_preempting() and how the process exits. It assumes you use the wandb agent CLI with your training program as a subprocess.

Scenario	No `mark_preempting()`	Signal handler calls `mark_preempting()` and exits non-zero	`mark_preempting()` always called right after `init()`
Run completes normally with exit code 0	FINISHED	FINISHED	FINISHED
Run fails with non-zero exit code	FAILED	FAILED	PREEMPTED
Run receives `SIGKILL`	CRASHED after about 5 minutes	CRASHED after about 5 minutes (uncatchable)	PREEMPTED after about 5 minutes
Run receives `SIGINT`	KILLED	PREEMPTED (with a `SIGINT` handler)	PREEMPTED
Run receives another signal (for example, `SIGTERM` or `SIGUSR1`)	CRASHED after about 5 minutes	PREEMPTED (with a matching handler)	PREEMPTED after about 5 minutes

If you only call mark_preempting() inside a signal handler, you don’t cover cases where the handler never runs, such as SIGKILL. If you always call mark_preempting() immediately after wandb.init(), W&B can treat any failure as preemption and might requeue the run repeatedly, including for bugs or bad configuration. For environments with a well-defined preemption signal, the usual approach is a signal handler that calls mark_preempting() and exits non-zero, not an unconditional call after init().
