spine.main
Main functions that call the Driver class.
This is the first module called when launching a binary script under the bin directory. It takes care of setting up the environment and the Driver object(s) used to execute/train ML models, post-processors, analysis scripts, writers and profilers.
Functions
| inference_single | Execute a model in inference mode in a single process. |
| process_world | Check on the number of available GPUs and what has been requested. |
| run | Execute a model in one or more processes. |
| run_single | Execute a model on a single process. |
| setup_ddp | Sets up the DistributedDataParallel environment. |
| train_single | Train a model in a single process. |
- spine.main.run(cfg: dict) → None
Execute a model in one or more processes.
- Parameters:
cfg (dict) – Full driver/trainer configuration
- spine.main.run_single(cfg: dict) → None
Execute a model on a single process.
- Parameters:
cfg (dict) – Full driver/trainer configuration
- spine.main.train_single(rank: int | None, cfg: dict, distributed: bool = False, world_size: int | None = None, torch_sharing: str | None = None) → None
Train a model in a single process.
- Parameters:
rank (int, optional) – Process rank
cfg (dict) – Full driver/trainer configuration
distributed (bool, default False) – If True, distribute the training process
world_size (int, optional) – Number of devices to use in the distributed training process
torch_sharing (str or None, optional) – File sharing strategy for torch distributed training
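The rank-first parameter order matches the calling convention of per-process launchers such as `torch.multiprocessing.spawn`, which invoke the target as `fn(i, *args)` for each process index `i`. Whether spine launches `train_single` this way is not stated here, so the following is only a sketch of that convention. `train_single_stub` and `launch_sequential` are hypothetical stand-ins that run sequentially and need neither torch nor spine.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def train_single_stub(rank: Optional[int], cfg: dict, distributed: bool = False,
                      world_size: Optional[int] = None,
                      torch_sharing: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical stand-in with the documented parameter order.

    It does no training; it only reports the arguments it received.
    """
    return {"rank": rank, "distributed": distributed, "world_size": world_size}


def launch_sequential(fn: Callable[..., Any], args: Tuple[Any, ...],
                      nprocs: int) -> List[Any]:
    """Hypothetical launcher: call fn(i, *args) once per process index.

    A real launcher would start one process per index instead of looping.
    """
    return [fn(i, *args) for i in range(nprocs)]


cfg: dict = {}  # placeholder; the real configuration layout is not shown here
world_size = 2

# Distributed case: the launcher supplies the rank as the first argument.
results = launch_sequential(
    train_single_stub, (cfg, True, world_size, None), world_size)

# Single-process case: no rank is needed, so None is passed explicitly.
single = train_single_stub(None, cfg)
```

In the distributed case each call receives a distinct rank from 0 to `world_size - 1`; in the single-process case the defaults leave `distributed` as False and `world_size` as None.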
- spine.main.inference_single(cfg: dict) → None
Execute a model in inference mode in a single process.
- Parameters:
cfg (dict) – Full driver configuration
- spine.main.process_world(base: dict) → Tuple[bool, int, str | None]
Check on the number of available GPUs and what has been requested.
- Parameters:
base (dict) – Base driver configuration dictionary
- Returns:
distributed (bool) – If True, distribute the training process
world_size (int) – Number of devices to use in the distributed training process
torch_sharing (str or None) – File sharing strategy for torch distributed training
- spine.main.setup_ddp(rank: int, world_size: int, backend: str = 'nccl') → None
Sets up the DistributedDataParallel environment.
- Parameters:
rank (int) – Global rank of this process (0 to world_size-1)
world_size (int) – Total number of processes across all nodes
backend (str, default "nccl") – Distributed backend to use
Notes
For multi-node training, set these environment variables:
- MASTER_ADDR: IP address of the master node
- MASTER_PORT: Free port on the master node
- RANK: Global rank (0 to world_size-1)
- WORLD_SIZE: Total number of processes
- LOCAL_RANK (optional): Local rank on this node
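As an illustration of the variables listed in the note, the sketch below assembles them for one process and exports them before the distributed environment is set up. `build_ddp_env` is a hypothetical helper, not part of spine, and the address, port and rank values are examples only.

```python
import os
from typing import Dict, Optional


def build_ddp_env(master_addr: str, master_port: int, rank: int,
                  world_size: int,
                  local_rank: Optional[int] = None) -> Dict[str, str]:
    """Hypothetical helper: collect the multi-node variables as strings."""
    # The global rank must lie in the range 0 to world_size - 1.
    if not 0 <= rank < world_size:
        raise ValueError(
            f"rank must be in [0, {world_size - 1}], got {rank}")
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }
    # LOCAL_RANK is optional, so it is only set when provided.
    if local_rank is not None:
        env["LOCAL_RANK"] = str(local_rank)
    return env


# Example values only: process 3 of 8, second process on its node.
env = build_ddp_env("10.0.0.1", 29500, rank=3, world_size=8, local_rank=1)
os.environ.update(env)
```

Environment variables are always strings, which is why the integer values are converted before being exported. In practice a job scheduler or launcher often sets these variables for each process, in which case no manual step is needed.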