Add W&B (wandb) to your code - Weights & Biases Documentation

This guide provides recommendations on how to integrate W&B into your Python training script or notebook for hyperparameter search optimization. By following these recommendations, you can use W&B Sweeps to explore hyperparameter values, log training and validation metrics, and identify the configuration that produces strong model performance. This guide is for machine learning practitioners who already have a Python training script and want to add hyperparameter sweep support. The following sections walk through an example training script, then show how to update it to work with W&B Sweeps.

Original training script

Suppose you have a Python script that trains a model (see the following code). Your goal is to find the hyperparameters that maximize the validation accuracy (val_acc). In your Python script, you define two functions: train_one_epoch and evaluate_one_epoch. The train_one_epoch function simulates training for one epoch and returns the training accuracy and loss. The evaluate_one_epoch function simulates evaluation of the model on the validation data set and returns the validation accuracy and loss. You define a configuration dictionary (config) that contains hyperparameter values such as the learning rate (lr), batch size (batch_size), and number of epochs (epochs). The values in the configuration dictionary control the training process. Next, you define a function called main that mimics a typical training loop. For each epoch, the script computes the accuracy and loss on the training and validation data sets.

This code is a mock training script. It doesn’t train a model, but simulates the training process by generating random accuracy and loss values. The purpose of this code is to demonstrate how to integrate W&B into your training script.

import random
import numpy as np

def train_one_epoch(epoch, lr, batch_size):
    acc = 0.25 + ((epoch / 30) + (random.random() / 10))
    loss = 0.2 + (1 - ((epoch - 1) / 10 + random.random() / 5))
    return acc, loss

def evaluate_one_epoch(epoch):
    acc = 0.1 + ((epoch / 20) + (random.random() / 10))
    loss = 0.25 + (1 - ((epoch - 1) / 10 + random.random() / 6))
    return acc, loss

# config variable with hyperparameter values
config = {"lr": 0.0001, "batch_size": 16, "epochs": 5}

def main():
    lr = config["lr"]
    batch_size = config["batch_size"]
    epochs = config["epochs"]

    # Epochs are numbered 1 through `epochs`, inclusive
    for epoch in np.arange(1, epochs + 1):
        train_acc, train_loss = train_one_epoch(epoch, lr, batch_size)
        val_acc, val_loss = evaluate_one_epoch(epoch)

        print("epoch: ", epoch)
        print("training accuracy:", train_acc, "training loss:", train_loss)
        print("validation accuracy:", val_acc, "validation loss:", val_loss)

if __name__ == "__main__":
    main()

The following section shows how to add W&B to your Python script to track hyperparameters and metrics during training. You want to use W&B to find the best hyperparameters that maximize the validation accuracy (val_acc).
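
Before adding W&B, it can help to see what a sweep automates. The following is a minimal sketch, not part of the original script and not how W&B implements sweeps, of a manual random search that reuses the mock evaluate_one_epoch function. The SEARCH_SPACE dictionary and the sample_config and manual_random_search helpers are hypothetical names introduced for illustration, and the sketch assumes each trial is scored by the validation accuracy of its last epoch.

```python
import random


def evaluate_one_epoch(epoch):
    # Same mock validation step as in the original script
    acc = 0.1 + ((epoch / 20) + (random.random() / 10))
    loss = 0.25 + (1 - ((epoch - 1) / 10 + random.random() / 6))
    return acc, loss


# Hypothetical search space, chosen for illustration
SEARCH_SPACE = {
    "batch_size": [16, 32, 64],
    "epochs": [5, 10, 15],
    "lr": (0.0001, 0.1),  # (min, max)
}


def sample_config(rng):
    # Draw one hyperparameter combination at random
    return {
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "epochs": rng.choice(SEARCH_SPACE["epochs"]),
        "lr": rng.uniform(*SEARCH_SPACE["lr"]),
    }


def manual_random_search(num_trials, seed=0):
    rng = random.Random(seed)
    random.seed(seed)  # the mock function uses the module-level generator
    best = None
    for _ in range(num_trials):
        config = sample_config(rng)
        val_acc = 0.0
        for epoch in range(1, config["epochs"] + 1):
            val_acc, _ = evaluate_one_epoch(epoch)
        # Assumption: score a trial by its last-epoch validation accuracy
        if best is None or val_acc > best["val_acc"]:
            best = {"config": config, "val_acc": val_acc}
    return best


if __name__ == "__main__":
    print(manual_random_search(num_trials=5))
```

A sweep replaces this hand-written loop: W&B chooses the hyperparameter values for each run and records the metrics, so you can compare runs without writing your own bookkeeping.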

Add W&B to your training script

This section shows how to modify the original training script so that the sweep agent can pass hyperparameter values into each run and W&B can record the resulting metrics. How you integrate W&B into your Python script or notebook depends on how you manage sweeps. To use the W&B Python SDK to start, stop, and manage sweeps, follow the instructions in the Python script or notebook tab. To use the W&B CLI instead, follow the instructions in the CLI tab.

CLI

Create a YAML configuration file with your sweep configuration. The configuration file contains the hyperparameters you want the sweep to explore. In the following example, the sweep varies the batch size (batch_size), epochs (epochs), and learning rate (lr) hyperparameters from run to run.

# config.yaml
program: train.py
method: random
name: sweep
metric:
  goal: maximize
  name: val_acc
parameters:
  batch_size:
    values: [16, 32, 64]
  lr:
    min: 0.0001
    max: 0.1
  epochs:
    values: [5, 10, 15]
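
A malformed configuration is easier to catch before you start a sweep. The following sketch is a hypothetical helper, check_sweep_config, that inspects a dictionary with the same structure as the YAML file above. It checks only the keys used in this example and reflects an assumption about what is worth checking, not W&B's own validation; W&B accepts more options than the ones shown here.

```python
# Dictionary equivalent of the example config.yaml
SWEEP_CONFIG = {
    "program": "train.py",
    "method": "random",
    "name": "sweep",
    "metric": {"goal": "maximize", "name": "val_acc"},
    "parameters": {
        "batch_size": {"values": [16, 32, 64]},
        "lr": {"min": 0.0001, "max": 0.1},
        "epochs": {"values": [5, 10, 15]},
    },
}


def check_sweep_config(config):
    """Return a list of problems found in a sweep configuration dictionary.

    Hypothetical helper: it covers only the keys used in this example.
    """
    problems = []
    if not config.get("program"):
        problems.append("missing 'program' (the training script to run)")
    if not config.get("method"):
        problems.append("missing 'method'")

    metric = config.get("metric", {})
    if not metric.get("name"):
        problems.append("missing 'metric.name'")
    if metric.get("goal") not in ("maximize", "minimize"):
        problems.append("'metric.goal' should be 'maximize' or 'minimize'")

    parameters = config.get("parameters", {})
    if not parameters:
        problems.append("missing 'parameters'")
    for name, spec in parameters.items():
        if "values" in spec:
            if not spec["values"]:
                problems.append(f"'{name}' has an empty 'values' list")
        elif "min" in spec and "max" in spec:
            if spec["min"] >= spec["max"]:
                problems.append(f"'{name}' needs 'min' smaller than 'max'")
        else:
            problems.append(f"'{name}' needs 'values' or both 'min' and 'max'")
    return problems


if __name__ == "__main__":
    print(check_sweep_config(SWEEP_CONFIG))
```

An empty list means the sketch found nothing to flag. If you keep the configuration in YAML, you could load it into a dictionary first and pass the result to the same helper.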

For more information, see Define sweep configuration. You must provide the name of your Python script for the program key in your YAML file.

Next, add the following to the code example:

Import the W&B Python SDK (wandb) and PyYAML (yaml). Use PyYAML to read in your YAML configuration file.
Read in the configuration file.
Use wandb.init() to start a background process to sync and log data as a W&B Run. Pass the config object to the config parameter.
Define hyperparameter values from wandb.Run.config instead of using hardcoded values.
Log the metric you want to optimize with wandb.Run.log(). You must log the metric defined in your configuration. In the YAML configuration file (config.yaml in this example), you define the sweep to maximize the val_acc value.

import wandb
import yaml
import random
import numpy as np


def train_one_epoch(epoch, lr, batch_size):
    acc = 0.25 + ((epoch / 30) + (random.random() / 10))
    loss = 0.2 + (1 - ((epoch - 1) / 10 + random.random() / 5))
    return acc, loss


def evaluate_one_epoch(epoch):
    acc = 0.1 + ((epoch / 20) + (random.random() / 10))
    loss = 0.25 + (1 - ((epoch - 1) / 10 + random.random() / 6))
    return acc, loss


def main():
    # Read the YAML configuration file to use as the run's default config
    with open("./config.yaml") as file:
        config = yaml.safe_load(file)

    with wandb.init(config=config) as run:
        # Epochs are numbered 1 through run.config['epochs'], inclusive
        for epoch in np.arange(1, run.config['epochs'] + 1):
            train_acc, train_loss = train_one_epoch(epoch, run.config['lr'], run.config['batch_size'])
            val_acc, val_loss = evaluate_one_epoch(epoch)
            run.log(
                {
                    "epoch": epoch,
                    "train_acc": train_acc,
                    "train_loss": train_loss,
                    "val_acc": val_acc,
                    "val_loss": val_loss,
                }
            )

# Call the main function when the sweep agent runs this script.
if __name__ == "__main__":
    main()

After you update your training script, initialize and start the sweep from your CLI:

Optionally, set a maximum number of runs for the sweep agent to try. This example sets the maximum to five:
```
NUM=5
```
Initialize the sweep with the wandb sweep command. Provide the name of the YAML file. Optionally, provide the name of the project for the project flag (--project):
```
wandb sweep --project sweep-demo-cli config.yaml
```
This returns a sweep ID. For more information, see Initialize sweeps.
Copy the sweep ID and replace [SWEEP-ID] in the following command to start the sweep job with the wandb agent command. Replace [YOUR-ENTITY] with your W&B entity name:
```
wandb agent --count $NUM [YOUR-ENTITY]/sweep-demo-cli/[SWEEP-ID]
```

The sweep agent runs your training script repeatedly, each time with a different combination of hyperparameter values from your YAML configuration, and logs the results to W&B. For more information, see Start sweep jobs.
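
If you launch agents from Python tooling rather than typing the command by hand, the steps above can be sketched as a small helper that assembles the same wandb agent command. The build_agent_command function and the entity and sweep ID values are hypothetical placeholders; only the --count flag and the entity/project/sweep-ID path format come from the command shown above.

```python
import shlex


def build_agent_command(entity, project, sweep_id, count=None):
    """Return the argument list for a `wandb agent` invocation."""
    command = ["wandb", "agent"]
    if count is not None:
        # Limit the number of runs this agent tries
        command += ["--count", str(count)]
    command.append(f"{entity}/{project}/{sweep_id}")
    return command


if __name__ == "__main__":
    # "my-team" and "abc123" are placeholder values
    args = build_agent_command("my-team", "sweep-demo-cli", "abc123", count=5)
    print(shlex.join(args))
```

Returning a list rather than a single string lets you pass the result to subprocess.run() without shell quoting concerns.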

Python script or notebook

Follow these steps to add W&B to your Python script:

Create a dictionary object where the key-value pairs define a sweep configuration. The sweep configuration defines the hyperparameters you want W&B to explore on your behalf along with the metric you want to optimize. Continuing from the previous example, the batch size (batch_size), epochs (epochs), and the learning rate (lr) are the hyperparameters to vary from run to run. You want to maximize validation accuracy, so set "goal": "maximize" and set the metric name to the variable you want to optimize, in this case val_acc ("name": "val_acc").
Pass the sweep configuration dictionary to wandb.sweep(). This initializes the sweep and returns a sweep ID (sweep_id). For more information, see Initialize sweeps.
At the top of your script, import the W&B Python SDK (wandb).
Within your main function, use wandb.init() to generate a background process to sync and log data as a W&B Run. Pass the project name as a parameter to the wandb.init() method. If you don’t pass a project name, W&B uses the default project name.
Fetch the hyperparameter values from the wandb.Run.config object. This lets you use the hyperparameter values defined in the sweep configuration dictionary instead of hardcoded values.
Log the metric you’re optimizing for to W&B using wandb.Run.log(). You must log the metric defined in your configuration. For example, if you define the metric to optimize as val_acc, you must log val_acc. If you don’t log the metric, W&B can’t perform optimization. Within the configuration dictionary (sweep_configuration in this example), you define the sweep to maximize the val_acc value.
Start the sweep with wandb.agent(). Provide the sweep ID and the name of the function the sweep executes (function=main), and set the maximum number of runs to four (count=4).

To put this together, your script might look similar to the following:

import wandb # Import the W&B Python SDK
import numpy as np
import random
import argparse

def train_one_epoch(epoch, lr, batch_size):
    acc = 0.25 + ((epoch / 30) + (random.random() / 10))
    loss = 0.2 + (1 - ((epoch - 1) / 10 + random.random() / 5))
    return acc, loss

def evaluate_one_epoch(epoch):
    acc = 0.1 + ((epoch / 20) + (random.random() / 10))
    loss = 0.25 + (1 - ((epoch - 1) / 10 + random.random() / 6))
    return acc, loss

def main(args=None):
    # When called by sweep agent, args is None,
    # so use the project from sweep config
    project = args.project if args else None
    
    with wandb.init(project=project) as run:
        # Fetches the hyperparameter values from `wandb.Run.config` object
        lr = run.config["lr"]
        batch_size = run.config["batch_size"]
        epochs = run.config["epochs"]

        # Execute the training loop and log the performance values to W&B
        # Epochs are numbered 1 through `epochs`, inclusive
        for epoch in np.arange(1, epochs + 1):
            train_acc, train_loss = train_one_epoch(epoch, lr, batch_size)
            val_acc, val_loss = evaluate_one_epoch(epoch)
            run.log(
                {
                    "epoch": epoch,
                    "train_acc": train_acc,
                    "train_loss": train_loss,
                    "val_acc": val_acc, # Metric optimized
                    "val_loss": val_loss,
                }
            )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", type=str, default="sweep-example", help="W&B project name")
    args = parser.parse_args()

    # Define a sweep config dictionary
    sweep_configuration = {
        "method": "random",
        "name": "sweep",
        # Metric that you want to optimize
        # For example, if you want to maximize validation
        # accuracy set "goal": "maximize" and the name of the variable 
        # you want to optimize for, in this case "val_acc"
        "metric": {
            "goal": "maximize",
            "name": "val_acc"
            },
        "parameters": {
            "batch_size": {"values": [16, 32, 64]},
            "epochs": {"values": [5, 10, 15]},
            "lr": {"max": 0.1, "min": 0.0001},
        },
    }

    # Initialize the sweep by passing in the config dictionary
    sweep_id = wandb.sweep(sweep=sweep_configuration, project=args.project)

    # Start the sweep job
    wandb.agent(sweep_id, function=main, count=4)

When you run this script, W&B starts the sweep, executes the main function up to four times with different hyperparameter combinations, and logs each run’s metrics so you can compare results in the W&B App.

Logging metrics to W&B in a sweep

The metric you optimize must appear in two places: in your sweep configuration and in the data you log with wandb.Run.log(). For example, if you define the metric to optimize as val_acc within your sweep configuration, you must also log val_acc to W&B. If you don’t log the metric, W&B can’t perform optimization. The following example logs val_acc correctly, as a top-level key:

with wandb.init() as run:
    # train() stands in for your own training function
    val_loss, val_acc = train()
    run.log(
        {
            "val_loss": val_loss,
            "val_acc": val_acc,
        }
    )

The following is an incorrect example of logging the metric to W&B. The sweep configuration optimizes for val_acc, but the code logs val_acc within a nested dictionary under the key validation. You must log the metric directly, not within a nested dictionary.

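with wandb.init() as run:
    # train() stands in for your own training function
    val_loss, val_acc = train()
    run.log(
        {
            # Incorrect: val_acc is nested, not a top-level key
            "validation": {
                "val_loss": val_loss,
                "val_acc": val_acc,
            }
        }
    )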
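
One way to catch this mistake early is to inspect the dictionary before you pass it to run.log(). The following is a minimal sketch using two hypothetical helpers, find_metric and check_logged_metric, that are not part of the W&B SDK. It only looks at the structure of a plain Python dictionary and assumes, as described above, that the optimized metric must be a top-level key.

```python
def find_metric(payload, metric_name, path=()):
    """Yield the key path of every occurrence of metric_name in payload."""
    for key, value in payload.items():
        current = path + (key,)
        if key == metric_name:
            yield current
        if isinstance(value, dict):
            # Search nested dictionaries as well
            yield from find_metric(value, metric_name, current)


def check_logged_metric(payload, metric_name):
    """Describe whether metric_name is logged as a top-level key."""
    if metric_name in payload:
        return "ok"
    nested = list(find_metric(payload, metric_name))
    if nested:
        # Report the parent keys of the first nested occurrence
        return "nested under " + " > ".join(nested[0][:-1])
    return "missing"


if __name__ == "__main__":
    good = {"val_loss": 0.4, "val_acc": 0.8}
    bad = {"validation": {"val_loss": 0.4, "val_acc": 0.8}}
    print(check_logged_metric(good, "val_acc"))
    print(check_logged_metric(bad, "val_acc"))
```

Calling a check like this in a unit test for your training script can flag a nested or missing metric before you spend compute on a sweep.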
