spine.io.write.HDF5Writer

class spine.io.write.HDF5Writer(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', keys: list[str] | None = None, skip_keys: list[str] | None = None, dummy_ds: dict[str, str] | None = None, overwrite: bool = False, append: bool = False, split: bool = False, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None)[source]

Writes data to an HDF5 file.

Builds an HDF5 file to store the input and/or the output of the reconstruction chain. It can also append information produced by the analysis tools to an existing HDF5 file.

A typical configuration looks like:

io:
  ...
  writer:
    name: hdf5
    file_name: output.h5
    keys:
      - input_data
      - segmentation
      - ...

Methods

`DataFormat`([dtype, class_name, width, ...])	Data structure to hold writing parameters.
`__call__`(data[, cfg])	Append the content of a batch to the HDF5 file.
`append_entry`(out_file, data, batch_id)	Stores one entry.
`append_key`(out_file, event, data, key, batch_id)	Stores data key in a specific dataset of an HDF5 file.
`close`()	Close any persistent HDF5 output handles owned by this writer.
`create`(data[, cfg, append])	Initialize the output file structure based on the data dictionary.
`finalize`()	Mark initialized output files as complete and flush metadata.
`flush`()	Flush all persistent HDF5 output handles to disk.
`get_data_type`(data, key)	Identify the dtype and shape objects to be dealt with.
`get_data_types`(data, keys)	Get the data type information for each key.
`get_file_names`([file_name, prefix, suffix, ...])	Build output file name(s) from an explicit name or input prefix(es).
`get_object_dtype`(obj)	Loop over the attributes of a class to figure out what to store.
`get_stored_keys`(data)	Get the list of data product keys to store.
`initialize_datasets`(out_file, type_dict)	Create placeholders for all the datasets to be filled.
`store`(out_file, event, key, array)	Stores an ndarray in the file and stores its mapping in the event dataset.
`store_flat`(out_file, event, key, array_list)	Stores a concatenated list of arrays in the file and stores its index mapping in the event dataset so the arrays can be split apart again.
`store_jagged`(out_file, event, key, array_list)	Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.
`store_objects`(out_file, event, key, array, ...)	Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.
`with_source_provenance`(data)	Return a data dictionary augmented with persisted source provenance.

__init__(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', keys: list[str] | None = None, skip_keys: list[str] | None = None, dummy_ds: dict[str, str] | None = None, overwrite: bool = False, append: bool = False, split: bool = False, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None) → None[source]

Initializes the basics of the output file.

Parameters:

file_name (str, optional) – Name of the output HDF5 file
directory (str, optional) – Output directory. When provided, all generated file names are relocated into this directory while preserving their resolved base names.
prefix (str or List[str], optional) – Input file prefix. It is used to form the output file name when no file_name is explicitly provided. Must be a list with one prefix per input file when split is True.
suffix (str, default "spine") – Suffix to add to the output file name if it is built from the input
keys (List[str], optional) – List of data product keys to store. If not specified, store everything
skip_keys (List[str], optional) – List of data product keys to skip
dummy_ds (Dict[str, str], optional) – Keys for which to create placeholder datasets. For each key, specify the object type it is supposed to represent as a string.
overwrite (bool, default False) – If True, overwrite the output file if it already exists
append (bool, default False) – If True, add new values to the end of an existing file
split (bool, default False) – If True, split the output to produce one file per input file
lite (bool, default False) – If True, the lite version of objects is stored (drop point indexes)
keep_open (bool, default True) – If True, keep one append handle open per output file and per process. This reduces HDF5 open/close churn when writing many batches. If False, open and close the file on each write call.
flush_frequency (int, optional) – If specified, flush each output file after this many appended entries. If None, only flush when explicitly requested or when the file handle is closed.
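
The flush_frequency rule can be pictured as a per-file counter of appended entries. The sketch below is not SPINE code: FlushCounter and record_entry are hypothetical names, and it only illustrates the documented behavior that a file is flushed after flush_frequency appended entries, and never automatically when flush_frequency is None.

```python
class FlushCounter:
    """Hypothetical helper: tracks when one output file is due for a flush."""

    def __init__(self, flush_frequency=None):
        self.flush_frequency = flush_frequency
        self.pending = 0  # entries appended since the last flush

    def record_entry(self):
        """Count one appended entry and return True when a flush is due."""
        self.pending += 1
        if (self.flush_frequency is not None
                and self.pending >= self.flush_frequency):
            self.pending = 0
            return True
        return False


# With flush_frequency=3, every third appended entry triggers a flush
counter = FlushCounter(flush_frequency=3)
due = [counter.record_entry() for _ in range(7)]
```

With keep_open set to False the file is closed after each write call, so a counter like this would only matter for handles that stay open.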

Methods

`__init__`([file_name, directory, prefix, ...])	Initializes the basics of the output file.
`append_entry`(out_file, data, batch_id)	Stores one entry.
`append_key`(out_file, event, data, key, batch_id)	Stores data key in a specific dataset of an HDF5 file.
`close`()	Close any persistent HDF5 output handles owned by this writer.
`create`(data[, cfg, append])	Initialize the output file structure based on the data dictionary.
`finalize`()	Mark initialized output files as complete and flush metadata.
`flush`()	Flush all persistent HDF5 output handles to disk.
`get_data_type`(data, key)	Identify the dtype and shape objects to be dealt with.
`get_data_types`(data, keys)	Get the data type information for each key.
`get_file_names`([file_name, prefix, suffix, ...])	Build output file name(s) from an explicit name or input prefix(es).
`get_object_dtype`(obj)	Loop over the attributes of a class to figure out what to store.
`get_stored_keys`(data)	Get the list of data product keys to store.
`initialize_datasets`(out_file, type_dict)	Create placeholders for all the datasets to be filled.
`store`(out_file, event, key, array)	Stores an ndarray in the file and stores its mapping in the event dataset.
`store_flat`(out_file, event, key, array_list)	Stores a concatenated list of arrays in the file and stores its index mapping in the event dataset so the arrays can be split apart again.
`store_jagged`(out_file, event, key, array_list)	Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.
`store_objects`(out_file, event, key, array, ...)	Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.
`with_source_provenance`(data)	Return a data dictionary augmented with persisted source provenance.

Attributes

`name`
`source_index_keys`

name = 'hdf5'

source_index_keys = {'file_entry_index': 'source_file_entry_index', 'file_index': 'source_file_index'}

close() → None[source]

Close any persistent HDF5 output handles owned by this writer.

This only affects handles cached in the current process. It is safe to call repeatedly.

flush() → None[source]

Flush all persistent HDF5 output handles to disk.

This is useful when the writer keeps files open for a long time and the caller wants to force buffered metadata and dataset updates to disk.

finalize() → None[source]

Mark initialized output files as complete and flush metadata.

This method should only be called once the caller knows writing completed successfully for the relevant files.

static get_file_names(file_name: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', split: bool = False, directory: str | None = None) → list[str][source]

Build output file name(s) from an explicit name or input prefix(es).

Logic is as follows:

If split is False and file_name is provided, use file_name
If split is False and file_name is not provided, build the file name from the input prefix by adding a suffix
If split is True and file_name is not provided, build the file names from the input prefix by adding a suffix
If split is True and file_name is provided, build the file names from file_name by adding an index, unless there is only one input prefix, in which case use file_name as is

Parameters:

file_name (str, optional) – Name of the output HDF5 file. If not provided, it will be built from the input prefix(es).
prefix (str or List[str], optional) – Input file prefix(es).
suffix (str, default "spine") – Suffix to add to the output file name if it is built from the input
split (bool, default False) – If True, split the output to produce one file per input file.
directory (str, optional) – Output directory. When provided, the resolved output file base name is placed under this directory regardless of the directory encoded in file_name or prefix.

Returns:

List of output file names.

Return type:

List[str]
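
The four cases listed above can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: sketch_file_names is a hypothetical name, and the "<prefix>_<suffix>.h5" pattern, the "_<index>" naming used for split files, and the use of the first prefix when split is False are all assumptions made for the example.

```python
import os


def sketch_file_names(file_name=None, prefix=None, suffix="spine",
                      split=False, directory=None):
    """Illustrative version of the documented name-resolution logic."""
    # Normalize the prefix argument to a list of prefixes
    prefixes = [prefix] if isinstance(prefix, str) else list(prefix or [])

    if file_name is None:
        # No explicit name: build the name(s) from the input prefix(es).
        # ASSUMPTION: the "<prefix>_<suffix>.h5" pattern is a guess.
        if not prefixes:
            raise ValueError("Provide either file_name or prefix")
        used = prefixes if split else prefixes[:1]  # ASSUMPTION for no split
        names = [f"{p}_{suffix}.h5" for p in used]
    elif not split or len(prefixes) <= 1:
        # Explicit name, no split (or a single input): use it as is
        names = [file_name]
    else:
        # Explicit name with split: one indexed name per input file.
        # ASSUMPTION: the "_<index>" naming is a guess.
        stem, ext = os.path.splitext(file_name)
        names = [f"{stem}_{i}{ext}" for i in range(len(prefixes))]

    if directory is not None:
        # Keep only the base name and relocate it under the directory
        names = [os.path.join(directory, os.path.basename(n)) for n in names]

    return names
```

For instance, sketch_file_names(file_name="out.h5", prefix=["a", "b"], split=True) yields two indexed names under these assumptions, while the same call with a single prefix returns file_name unchanged.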

class DataFormat(dtype: type | list[tuple[str, type]] | None = None, class_name: str | None = None, width: int | list[int] = 0, merge: bool = False, scalar: bool = False)[source]

Data structure to hold writing parameters.

dtype

Data type

Type: type or list[tuple[str, type]], optional

class_name

Name of the class the information comes from

Type: str, optional

width

Width of the tensor to store, if it is a tensor

Type: int or list[int], default 0

merge

Whether to merge lists of arrays into a single dataset

Type: bool, default False

scalar

Whether the data is a scalar object or not

Type: bool, default False

Attributes:

class_name
dtype

dtype: type | list[tuple[str, type]] | None = None

class_name: str | None = None

width: int | list[int] = 0

merge: bool = False

scalar: bool = False
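
The fields above map naturally onto a dataclass. The following sketch mirrors the documented signature and defaults. DataFormatSketch is a hypothetical stand-in, the example values are made up, and whether the real class is implemented as a dataclass is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class DataFormatSketch:
    """Hypothetical stand-in mirroring the documented DataFormat fields."""
    dtype: Optional[Union[type, List[Tuple[str, type]]]] = None
    class_name: Optional[str] = None
    width: Union[int, List[int]] = 0
    merge: bool = False
    scalar: bool = False


# A tensor of width 3 (values chosen for illustration only)
tensor_fmt = DataFormatSketch(dtype=float, width=3)

# A composite object described by (attribute name, type) pairs
object_fmt = DataFormatSketch(
    dtype=[("id", int), ("energy", float)], class_name="ToyParticle"
)
```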

create(data: dict[str, Any], cfg: dict[str, Any] | None = None, append: bool = False) → None[source]

Initialize the output file structure based on the data dictionary.

Parameters:

data (Dict[str, Any]) – Dictionary of data products
cfg (Dict[str, Any], optional) – Dictionary containing the complete SPINE configuration
append (bool, default False) – If True, load existing files if present and create missing files

with_source_provenance(data: dict[str, Any]) → dict[str, Any][source]

Return a data dictionary augmented with persisted source provenance.

When upstream products carry file_index and/or file_entry_index, preserve those values under explicit source_* names so they survive round-tripping through HDF5 without colliding with the reader-owned runtime index fields of the produced HDF5 file.

Parameters:

data (dict) – Dictionary of data products to be written

Returns:

Shallow copy of the data dictionary with source_* aliases added when the corresponding upstream index fields are present.

Return type:

dict
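
A minimal sketch of the aliasing described above, using the source_index_keys mapping listed under Attributes. The function name add_source_aliases is hypothetical, and how the real method treats an alias that is already present in the input is not documented here; this sketch simply overwrites it.

```python
# Mapping copied from the documented source_index_keys attribute
SOURCE_INDEX_KEYS = {
    "file_entry_index": "source_file_entry_index",
    "file_index": "source_file_index",
}


def add_source_aliases(data):
    """Return a shallow copy of data with source_* aliases added."""
    out = dict(data)  # shallow copy: the values themselves are shared
    for key, alias in SOURCE_INDEX_KEYS.items():
        if key in data:
            # ASSUMPTION: an existing alias is overwritten
            out[alias] = data[key]
    return out


batch = {"file_index": [0, 0], "input_data": "placeholder"}
augmented = add_source_aliases(batch)
```

The input dictionary is left untouched, so callers can keep using the original index fields.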

get_stored_keys(data: dict[str, Any]) → set[str][source]

Get the list of data product keys to store.

Parameters:

data (Dict[str, Any]) – Dictionary of data products

Returns:

keys – Set of data keys to store to file

Return type:

Set[str]
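
The keys and skip_keys constructor arguments suggest a selection along the following lines. This is a sketch under stated assumptions, not the library's logic: select_keys is a hypothetical name, and the real method may validate that requested keys exist or always keep certain bookkeeping keys.

```python
def select_keys(data, keys=None, skip_keys=None):
    """Pick which data product keys to store."""
    # Store everything when no explicit list of keys is given
    selected = set(data) if keys is None else set(keys)
    # ASSUMPTION: skipped keys are removed whether or not keys was given
    selected -= set(skip_keys or [])
    return selected


data = {"input_data": 1, "segmentation": 2, "index": 3}
everything = select_keys(data)
subset = select_keys(data, keys=["input_data", "segmentation"])
without_index = select_keys(data, skip_keys=["index"])
```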

get_data_types(data: dict[str, Any], keys: set[str]) → tuple[dict[str, DataFormat], list[list[tuple[str, type]]]][source]

Get the data type information for each key.

Parameters:

data (Dict[str, Any]) – Dictionary of data products
keys (Set[str]) – Set of data product keys to store

Returns:

type_dict (Dict[str, DataFormat]) – Dictionary containing the data type information for each key
object_dtypes (List[List[Tuple[str, type]]]) – List of composite object dtypes found in the data

get_data_type(data: dict[str, Any], key: str) → DataFormat[source]

Identify the dtype and shape objects to be dealt with.

Parameters:

data (Dict[str, Any]) – Dictionary containing the information to be stored
key (str) – Dictionary key name

Returns:

DataFormat object containing the data type information for the key

Return type:

DataFormat

get_object_dtype(obj: Any) → list[tuple[str, type]][source]

Loop over the attributes of a class to figure out what to store.

This function assumes that the class only possesses getters that return either a scalar, string or np.ndarray.

Parameters:

obj (object) – Instance of a class used to identify attribute types

Returns:

List of (key, dtype) pairs

Return type:

List[Tuple[str, type]]
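
The idea of looping over attributes to build (key, dtype) pairs can be illustrated with a toy class. ToyParticle and sketch_object_dtype are hypothetical, and the sketch records plain Python types only; the real method presumably also handles np.ndarray attributes and maps types to HDF5-compatible dtypes, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ToyParticle:
    """Hypothetical object with scalar and string attributes only."""
    id: int = 0
    energy: float = 0.0
    label: str = ""


def sketch_object_dtype(obj):
    """Return (attribute name, Python type) pairs for an instance."""
    # vars() exposes instance attributes in the order they were assigned
    return [(name, type(value)) for name, value in vars(obj).items()]


pairs = sketch_object_dtype(ToyParticle(id=1, energy=2.5, label="muon"))
```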

initialize_datasets(out_file: File, type_dict: dict[str, DataFormat]) → None[source]

Create placeholders for all the datasets to be filled.

Parameters:

out_file (h5py.File) – HDF5 file instance
type_dict (Dict[str, DataFormat]) – Dictionary containing the data type information for each key

append_entry(out_file: File, data: dict[str, Any], batch_id: int) → None[source]

Stores one entry.

Parameters:

out_file (h5py.File) – HDF5 file instance
data (Dict[str, Any]) – Dictionary of data products
batch_id (int) – Batch ID to be stored

append_key(out_file: File, event: ndarray, data: dict[str, Any], key: str, batch_id: int) → None[source]

Stores data key in a specific dataset of an HDF5 file.

Parameters:

out_file (h5py.File) – HDF5 file instance
event (np.ndarray) – Array representing the event to which the data corresponds
data (dict) – Dictionary of data products
key (str) – Dictionary key name
batch_id (int) – Batch ID to be stored

static store(out_file: File, event: ndarray, key: str, array: ndarray) → None[source]

Stores an ndarray in the file and stores its mapping in the event dataset.

Parameters:

out_file (h5py.File) – HDF5 file instance
event (np.ndarray) – Array representing the event to which the data corresponds
key (str) – Name of the dataset in the file
array (np.ndarray) – Array to be stored

static store_jagged(out_file: File, event: ndarray, key: str, array_list: list[ndarray]) → None[source]

Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.

Parameters:

out_file (h5py.File) – HDF5 file instance
event (np.ndarray) – Array representing the event to which the data corresponds
key (str) – Name of the dataset in the file
array_list (List[np.ndarray]) – List of arrays to be stored

static store_flat(out_file: File, event: ndarray, key: str, array_list: list[ndarray]) → None[source]

Stores a concatenated list of arrays in the file and stores its index mapping in the event dataset so the arrays can be split apart again.

Parameters:

out_file (h5py.File) – HDF5 file instance
event (np.ndarray) – Array representing the event to which the data corresponds
key (str) – Name of the dataset in the file
array_list (List[np.ndarray]) – List of arrays to be stored
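
The general idea of storing a concatenated list together with an index mapping can be shown with plain Python lists. This is only an illustration of the technique: flatten_with_offsets and split_by_offsets are hypothetical names, and the actual on-disk layout used by store_flat (for example, how the mapping is encoded in the event dataset) is not described in this page.

```python
from itertools import accumulate


def flatten_with_offsets(array_list):
    """Concatenate the arrays and record where each one starts and ends."""
    flat = [value for array in array_list for value in array]
    # offsets[i]:offsets[i + 1] is the slice of flat holding array i
    offsets = [0] + list(accumulate(len(array) for array in array_list))
    return flat, offsets


def split_by_offsets(flat, offsets):
    """Recover the original list of arrays from the flat storage."""
    return [flat[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


arrays = [[1, 2, 3], [], [4, 5]]
flat, offsets = flatten_with_offsets(arrays)
restored = split_by_offsets(flat, offsets)
```

Empty arrays are preserved because they show up as two equal consecutive offsets.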

static store_objects(out_file: File, event: ndarray, key: str, array: ndarray, obj_dtype: list[tuple[str, type]], lite: bool) → None[source]

Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.

Parameters:

out_file (h5py.File) – HDF5 file instance
event (np.ndarray) – Array representing the event to which the data corresponds
key (str) – Name of the dataset in the file
array (np.ndarray) – Array of objects or dictionaries to be stored
obj_dtype (list) – List of (key, dtype) pairs which specify what to store
lite (bool) – If True, store the lite version of objects