spine.main

Main functions that call the Driver class.

This is the first module called when launching a binary script under the bin directory. It takes care of setting up the environment and the Driver object(s) used to execute/train ML models, post-processors, analysis scripts, writers and profilers.

Functions

`inference_single`(cfg)	Execute a model in inference mode in a single process.
`process_world`(base)	Check on the number of available GPUs and what has been requested.
`run`(cfg)	Execute a model in one or more processes.
`run_single`(cfg)	Execute a model on a single process.
`setup_ddp`(rank, world_size[, backend])	Sets up the DistributedDataParallel environment.
`train_single`(rank, cfg[, distributed, ...])	Train a model in a single process.

spine.main.run(cfg: dict) → None

Execute a model in one or more processes.

Parameters:

cfg (dict) – Full driver/trainer configuration

spine.main.run_single(cfg: dict) → None

Execute a model on a single process.

Parameters:

cfg (dict) – Full driver/trainer configuration

spine.main.train_single(rank: int | None, cfg: dict, distributed: bool = False, world_size: int | None = None, torch_sharing: str | None = None) → None

Train a model in a single process.

Parameters:

rank (int, optional) – Process rank
cfg (dict) – Full driver/trainer configuration
distributed (bool, default False) – If True, distribute the training process
world_size (int, optional) – Number of devices to use in the distributed training process
torch_sharing (str or None, optional) – File sharing strategy for torch distributed training

spine.main.inference_single(cfg: dict) → None

Execute a model in inference mode in a single process.

Parameters:

cfg (dict) – Full driver configuration

spine.main.process_world(base: dict) → Tuple[bool, int, str | None]

Check on the number of available GPUs and what has been requested.

Parameters:

base (dict) – Base driver configuration dictionary

Returns:

distributed (bool) – If True, distribute the training process
world_size (int) – Number of devices to use in the distributed training process
torch_sharing (str or None) – File sharing strategy for torch distributed training
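
Example

The check described above can be illustrated with a minimal sketch. It is not the actual body of `process_world`: the helper name `resolve_world`, the explicit `available_gpus` argument, and the configuration keys `"world_size"`, `"distributed"` and `"torch_sharing"` are all assumptions made for illustration. It only shows one plausible way to reconcile a requested device count with what is available and to return the same three values.

```python
from typing import Optional, Tuple


def resolve_world(base: dict, available_gpus: int) -> Tuple[bool, int, Optional[str]]:
    """Hypothetical stand-in for a GPU availability check.

    The keys "world_size", "distributed" and "torch_sharing" are
    assumptions made for this sketch, not documented configuration keys.
    """
    # Number of devices requested by the configuration (0 means CPU only)
    world_size = int(base.get("world_size", 0))
    if world_size < 0:
        raise ValueError("world_size must be non-negative")

    # Refuse to proceed if more devices are requested than are present
    if world_size > available_gpus:
        raise ValueError(
            f"Requested {world_size} GPU(s) but only {available_gpus} available"
        )

    # More than one device implies distributed execution; a single
    # device can still opt in explicitly
    distributed = bool(base.get("distributed", False)) or world_size > 1

    # Optional sharing strategy, passed through unchanged
    torch_sharing = base.get("torch_sharing")

    return distributed, world_size, torch_sharing
```

In a real setup the available count would typically be obtained from `torch.cuda.device_count()`; it is passed in explicitly here so the sketch runs without PyTorch.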

spine.main.setup_ddp(rank: int, world_size: int, backend: str = 'nccl') → None

Sets up the DistributedDataParallel environment.

Parameters:

rank (int) – Global rank of this process (0 to world_size-1)
world_size (int) – Total number of processes across all nodes
backend (str, default "nccl") – Distributed backend to use

Notes

For multi-node training, set these environment variables:

- MASTER_ADDR: IP address of the master node
- MASTER_PORT: Free port on the master node
- RANK: Global rank (0 to world_size-1)
- WORLD_SIZE: Total number of processes
- LOCAL_RANK (optional): Local rank on this node
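
Example

As a minimal sketch, the environment variables named in the notes above could be assembled and exported from Python before the distributed setup is invoked. The helper `ddp_env` and the address, port and rank values are hypothetical and chosen only for illustration; in practice a job launcher or scheduler often sets these variables itself.

```python
import os
from typing import Dict, Optional


def ddp_env(
    master_addr: str,
    master_port: int,
    rank: int,
    world_size: int,
    local_rank: Optional[int] = None,
) -> Dict[str, str]:
    """Hypothetical helper: build the multi-node environment variables."""
    # The global rank must lie in the range 0 to world_size-1
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size - 1}]")

    # Environment variables are always strings
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }

    # LOCAL_RANK is optional and only set when provided
    if local_rank is not None:
        env["LOCAL_RANK"] = str(local_rank)

    return env


# Illustrative values: process 3 of 8, second device on its node
env = ddp_env("10.0.0.1", 29500, rank=3, world_size=8, local_rank=1)
os.environ.update(env)
```

Every process in the job needs the same MASTER_ADDR, MASTER_PORT and WORLD_SIZE, while RANK (and LOCAL_RANK, if used) differs per process.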