spine.main

Main functions that call the Driver class.

This is the first module invoked when launching one of the executable scripts in the bin directory. It sets up the environment and the Driver object(s) used to run or train ML models, post-processors, analysis scripts, writers and profilers.

Functions

inference_single(cfg)

Execute a model in inference mode in a single process.

process_world(base)

Check the number of available GPUs against what has been requested.

run(cfg)

Execute a model in one or more processes.

run_single(cfg)

Execute a model in a single process.

setup_ddp(rank, world_size[, backend])

Set up the DistributedDataParallel environment.

train_single(rank, cfg[, distributed, ...])

Train a model in a single process.

spine.main.run(cfg: dict) -> None

Execute a model in one or more processes.

Parameters:

cfg (dict) – Full driver/trainer configuration

spine.main.run_single(cfg: dict) -> None

Execute a model in a single process.

Parameters:

cfg (dict) – Full driver/trainer configuration

spine.main.train_single(rank: int | None, cfg: dict, distributed: bool = False, world_size: int | None = None, torch_sharing: str | None = None) -> None

Train a model in a single process.

Parameters:
  • rank (int, optional) – Process rank

  • cfg (dict) – Full driver/trainer configuration

  • distributed (bool, default False) – If True, distribute the training process

  • world_size (int, optional) – Number of devices to use in the distributed training process

  • torch_sharing (str or None, optional) – File sharing strategy for torch distributed training
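The argument order above (rank first, then the configuration and the distributed settings) suits launchers that pass the process index as the first positional argument, as torch.multiprocessing.spawn does. The following is a minimal, dependency-free sketch of that calling pattern. `launch` and `fake_train_single` are hypothetical stand-ins, not part of spine.main, and the real dispatch logic may differ; ranks are run sequentially here rather than in separate processes.

```python
from typing import Callable, Optional


def launch(train_fn: Callable[..., None], cfg: dict, distributed: bool,
           world_size: Optional[int] = None,
           torch_sharing: Optional[str] = None) -> None:
    """Hypothetical launcher: call train_fn once per rank.

    Ranks run one after another to keep the sketch self-contained; a real
    launcher would start one process per rank instead.
    """
    if not distributed:
        # Single process: no rank is assigned
        train_fn(None, cfg)
        return

    if world_size is None or world_size < 1:
        raise ValueError("a distributed launch needs world_size >= 1")

    for rank in range(world_size):
        # Same positional layout as the train_single signature
        train_fn(rank, cfg, distributed, world_size, torch_sharing)


# Stand-in for train_single that only records how it was called
calls = []


def fake_train_single(rank, cfg, distributed=False, world_size=None,
                      torch_sharing=None):
    calls.append((rank, distributed, world_size))


launch(fake_train_single, {"hypothetical": "config"},
       distributed=True, world_size=2)
print(calls)  # [(0, True, 2), (1, True, 2)]
```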

spine.main.inference_single(cfg: dict) -> None

Execute a model in inference mode in a single process.

Parameters:

cfg (dict) – Full driver configuration

spine.main.process_world(base: dict) -> Tuple[bool, int, str | None]

Check the number of available GPUs against what has been requested.

Parameters:

base (dict) – Base driver configuration dictionary

Returns:

  • distributed (bool) – If True, distribute the training process

  • world_size (int) – Number of devices to use in the distributed training process

  • torch_sharing (str or None) – File sharing strategy for torch distributed training
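To illustrate the shape of the returned tuple, here is a minimal sketch of a GPU check of this kind. It is not the actual process_world implementation: `resolve_world` is a hypothetical helper, it takes plain counts rather than a configuration dictionary, and the rule it applies (distribute only when more than one GPU is requested) is an assumption.

```python
from typing import Optional, Tuple


def resolve_world(requested: int, available: int,
                  torch_sharing: Optional[str] = None
                  ) -> Tuple[bool, int, Optional[str]]:
    """Hypothetical check returning (distributed, world_size, torch_sharing)."""
    if requested < 0:
        raise ValueError("requested GPU count must be non-negative")
    if requested > available:
        raise ValueError(
            f"requested {requested} GPUs but only {available} are available")

    # Assumption: distribute only when more than one device is requested
    distributed = requested > 1
    return distributed, requested, torch_sharing


print(resolve_world(2, 4))  # (True, 2, None)
print(resolve_world(1, 4))  # (False, 1, None)
```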

spine.main.setup_ddp(rank: int, world_size: int, backend: str = 'nccl') -> None

Set up the DistributedDataParallel environment.

Parameters:
  • rank (int) – Global rank of this process (0 to world_size-1)

  • world_size (int) – Total number of processes across all nodes

  • backend (str, default "nccl") – Distributed backend to use

Notes

For multi-node training, set these environment variables:

  • MASTER_ADDR: IP address of the master node

  • MASTER_PORT: Free port on the master node

  • RANK: Global rank (0 to world_size-1)

  • WORLD_SIZE: Total number of processes

  • LOCAL_RANK (optional): Local rank on this node
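As an illustration of the variables listed in these notes, the sketch below builds them for one process and exports them to the current environment. `ddp_env` is a hypothetical helper, not part of spine.main, and the address, port and rank values are placeholders. Launchers such as torchrun normally set these variables themselves, so code like this is only needed when starting processes by hand.

```python
import os
from typing import Dict, Optional


def ddp_env(master_addr: str, master_port: int, rank: int, world_size: int,
            local_rank: Optional[int] = None) -> Dict[str, str]:
    """Hypothetical helper: build the multi-node environment variables."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size - 1}], got {rank}")

    # Environment variables are always strings
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }
    if local_rank is not None:
        # LOCAL_RANK is optional: the rank of this process on its own node
        env["LOCAL_RANK"] = str(local_rank)
    return env


# Placeholder values: process 3 of 8, second process on its node
env = ddp_env("10.0.0.1", 29500, rank=3, world_size=8, local_rank=1)
os.environ.update(env)
print(env["RANK"], env["WORLD_SIZE"], env["LOCAL_RANK"])  # 3 8 1
```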