spine.io.write.HDF5Writer

class spine.io.write.HDF5Writer(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', keys: list[str] | None = None, skip_keys: list[str] | None = None, dummy_ds: dict[str, str] | None = None, overwrite: bool = False, append: bool = False, split: bool = False, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None)

Writes data to an HDF5 file.

Builds an HDF5 file to store the input and/or the output of the reconstruction chain. It can also be used to append information produced by the analysis tools to an existing HDF5 file.

A typical configuration looks like:

io:
  ...
  writer:
    name: hdf5
    file_name: output.h5
    keys:
      - input_data
      - segmentation
      - ...

Methods

DataFormat([dtype, class_name, width, ...])

Data structure to hold writing parameters.

__call__(data[, cfg])

Append the content of a batch to the HDF5 file.

append_entry(out_file, data, batch_id)

Stores one entry.

append_key(out_file, event, data, key, batch_id)

Stores data key in a specific dataset of an HDF5 file.

close()

Close any persistent HDF5 output handles owned by this writer.

create(data[, cfg, append])

Initialize the output file structure based on the data dictionary.

finalize()

Mark initialized output files as complete and flush metadata.

flush()

Flush all persistent HDF5 output handles to disk.

get_data_type(data, key)

Identify the dtype and shape objects to be dealt with.

get_data_types(data, keys)

Get the data type information for each key.

get_file_names([file_name, prefix, suffix, ...])

Build output file name(s) from an explicit name or input prefix(es).

get_object_dtype(obj)

Loop over the attributes of a class to figure out what to store.

get_stored_keys(data)

Get the set of data product keys to store.

initialize_datasets(out_file, type_dict)

Create placeholders for all the datasets to be filled.

store(out_file, event, key, array)

Stores an ndarray in the file and stores its mapping in the event dataset.

store_flat(out_file, event, key, array_list)

Stores a concatenated list of arrays in the file and stores its index mapping in the event dataset so that the arrays can be split apart again.

store_jagged(out_file, event, key, array_list)

Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.

store_objects(out_file, event, key, array, ...)

Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.

with_source_provenance(data)

Return a data dictionary augmented with persisted source provenance.

__init__(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', keys: list[str] | None = None, skip_keys: list[str] | None = None, dummy_ds: dict[str, str] | None = None, overwrite: bool = False, append: bool = False, split: bool = False, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None) -> None

Initializes the basics of the output file.

Parameters:
  • file_name (str, optional) – Name of the output HDF5 file

  • directory (str, optional) – Output directory. When provided, all generated file names are relocated into this directory while preserving their resolved base names.

  • prefix (str or List[str], optional) – Input file prefix. It is used to form the output file name when no file_name is explicitly provided. Must be a list with one prefix per input file when split is True.

  • suffix (str, default "spine") – Suffix to add to the output file name if it is built from the input

  • keys (List[str], optional) – List of data product keys to store. If not specified, store everything

  • skip_keys (List[str], optional) – List of data product keys to skip

  • dummy_ds (Dict[str, str], optional) – Keys for which to create placeholder datasets. For each key, specify the object type it is supposed to represent as a string.

  • overwrite (bool, default False) – If True, overwrite the output file if it already exists

  • append (bool, default False) – If True, add new values to the end of an existing file

  • split (bool, default False) – If True, split the output to produce one file per input file

  • lite (bool, default False) – If True, store the lite version of objects (point indexes are dropped)

  • keep_open (bool, default True) – If True, keep one append handle open per output file and per process. This reduces HDF5 open/close churn when writing many batches. If False, open and close the file on each write call.

  • flush_frequency (int, optional) – If specified, flush each output file after this many appended entries. If None, only flush when explicitly requested or when the file handle is closed.
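The flush_frequency policy described above can be pictured with a small counter. This is only a sketch, not the SPINE implementation: the FlushCounter class and its method names are hypothetical, and it counts entries instead of touching an HDF5 file.

```python
class FlushCounter:
    """Hypothetical stand-in that mimics the documented flush_frequency policy."""

    def __init__(self, flush_frequency=None):
        self.flush_frequency = flush_frequency
        self.pending = 0      # entries appended since the last flush
        self.num_flushes = 0  # number of flushes triggered so far

    def append(self):
        """Record one appended entry and flush once the threshold is reached."""
        self.pending += 1
        if (self.flush_frequency is not None
                and self.pending >= self.flush_frequency):
            self.flush()

    def flush(self):
        """Pretend to flush the file and reset the pending counter."""
        self.num_flushes += 1
        self.pending = 0


# With flush_frequency=3, seven appended entries trigger two flushes
counter = FlushCounter(flush_frequency=3)
for _ in range(7):
    counter.append()

# With flush_frequency=None, nothing is flushed until explicitly requested
lazy = FlushCounter()
for _ in range(7):
    lazy.append()
```

In the sketch, the one entry left over after the second flush stays pending until an explicit flush, which mirrors the documented behavior of flushing on request or on close.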

Attributes

name

source_index_keys

name = 'hdf5'

source_index_keys = {'file_entry_index': 'source_file_entry_index', 'file_index': 'source_file_index'}

close() -> None

Close any persistent HDF5 output handles owned by this writer.

This only affects handles cached in the current process. It is safe to call repeatedly.

flush() -> None

Flush all persistent HDF5 output handles to disk.

This is useful when the writer keeps files open for a long time and the caller wants to force buffered metadata and dataset updates to disk.

finalize() -> None

Mark initialized output files as complete and flush metadata.

This method should only be called once the caller knows writing completed successfully for the relevant files.
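One way to respect the contracts of finalize() and close() is a try/finally pattern. The sketch below uses a hypothetical StandInWriter rather than the real HDF5Writer, so it only illustrates the call order: finalize() is reached only when every write succeeded, while close() always runs.

```python
class StandInWriter:
    """Hypothetical stand-in exposing the same call, finalize and close names."""

    def __init__(self):
        self.entries = []
        self.complete = False
        self.closed = False

    def __call__(self, data):
        self.entries.append(data)

    def finalize(self):
        self.complete = True

    def close(self):
        # Safe to call repeatedly, like the documented close()
        self.closed = True


def write_all(writer, batches):
    """Write every batch, finalize on success, and always close."""
    try:
        for batch in batches:
            writer(batch)
        writer.finalize()  # only reached if no write raised
    finally:
        writer.close()


writer = StandInWriter()
write_all(writer, [{"index": 0}, {"index": 1}])
```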

static get_file_names(file_name: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'spine', split: bool = False, directory: str | None = None) -> list[str]

Build output file name(s) from an explicit name or input prefix(es).

Logic is as follows:

  • If split is False and file_name is provided, use file_name

  • If split is False and file_name is not provided, build the file name from the input prefix by adding a suffix

  • If split is True and file_name is not provided, build the file names from the input prefix by adding a suffix

  • If split is True and file_name is provided, build the file names from file_name by adding an index, unless there is only one input prefix, in which case use file_name as is

Parameters:
  • file_name (str, optional) – Name of the output HDF5 file. If not provided, it will be built from the input prefix(es).

  • prefix (str or List[str], optional) – Input file prefix(es).

  • suffix (str, default "spine") – Suffix to add to the output file name if it is built from the input

  • split (bool, default False) – If True, split the output to produce one file per input file.

  • directory (str, optional) – Output directory. When provided, the resolved output file base name is placed under this directory regardless of the directory encoded in file_name or prefix.

Returns:

List of output file names.

Return type:

List[str]
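The four naming rules above can be restated as a short function. This is a sketch under assumptions, not the library code: sketch_file_names is a hypothetical name, and the "{prefix}_{suffix}.h5" pattern, the "_{index}" numbering, and the use of the first prefix when split is False are all guesses made for illustration.

```python
import os


def sketch_file_names(file_name=None, prefix=None, suffix="spine",
                      split=False, directory=None):
    """Hypothetical restatement of the documented naming rules."""
    # Normalize the prefix argument to a list
    prefixes = [prefix] if isinstance(prefix, str) else list(prefix or [])
    if file_name is None and not prefixes:
        raise ValueError("Provide either file_name or prefix.")

    if not split:
        if file_name is not None:
            names = [file_name]
        else:
            # Assumption: a single name is built from the first prefix
            names = [f"{prefixes[0]}_{suffix}.h5"]
    elif file_name is None:
        # One output per input prefix (naming pattern is an assumption)
        names = [f"{p}_{suffix}.h5" for p in prefixes]
    elif len(prefixes) <= 1:
        # A single input: use file_name as is
        names = [file_name]
    else:
        # Several inputs: add an index to file_name (format is an assumption)
        base, ext = os.path.splitext(file_name)
        names = [f"{base}_{i}{ext}" for i in range(len(prefixes))]

    if directory is not None:
        # Keep only the base name and relocate it under the directory
        names = [os.path.join(directory, os.path.basename(n)) for n in names]
    return names
```

For the real naming patterns, call HDF5Writer.get_file_names directly; it is a static method, so no writer instance is needed.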

class DataFormat(dtype: type | list[tuple[str, type]] | None = None, class_name: str | None = None, width: int | list[int] = 0, merge: bool = False, scalar: bool = False)

Data structure to hold writing parameters.

dtype

Data type

Type:

type or list[tuple[str, type]], optional

class_name

Name of the class the information comes from

Type:

str, optional

width

Width of the tensor to store, if it is a tensor

Type:

int or list[int], default 0

merge

Whether to merge lists of arrays into a single dataset

Type:

bool, default False

scalar

Whether the data is a scalar object or not

Type:

bool, default False

Attributes:

dtype: type | list[tuple[str, type]] | None = None
class_name: str | None = None
width: int | list[int] = 0
merge: bool = False
scalar: bool = False

create(data: dict[str, Any], cfg: dict[str, Any] | None = None, append: bool = False) -> None

Initialize the output file structure based on the data dictionary.

Parameters:
  • data (Dict[str, Any]) – Dictionary of data products

  • cfg (Dict[str, Any], optional) – Dictionary containing the complete SPINE configuration

  • append (bool, default False) – If True, load existing files if present and create missing files

with_source_provenance(data: dict[str, Any]) -> dict[str, Any]

Return a data dictionary augmented with persisted source provenance.

When upstream products carry file_index and/or file_entry_index, preserve those values under explicit source_* names so they survive round-tripping through HDF5 without colliding with the reader-owned runtime index fields of the produced HDF5 file.

Parameters:

data (dict) – Dictionary of data products to be written

Returns:

Shallow copy of the data dictionary with source_* aliases added when the corresponding upstream index fields are present.

Return type:

dict
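The aliasing described above can be sketched with a plain dictionary. The mapping is copied from the source_index_keys attribute; the function name add_source_aliases is hypothetical, and overwriting an alias that already exists is an assumption rather than documented behavior.

```python
# Mapping copied from the documented source_index_keys attribute
SOURCE_INDEX_KEYS = {
    "file_entry_index": "source_file_entry_index",
    "file_index": "source_file_index",
}


def add_source_aliases(data):
    """Return a shallow copy of data with source_* aliases added."""
    result = dict(data)  # shallow copy: values are shared, not duplicated
    for key, alias in SOURCE_INDEX_KEYS.items():
        if key in data:
            # Assumption: an alias that already exists is overwritten
            result[alias] = data[key]
    return result


data = {"file_index": [0], "points": [1, 2]}
augmented = add_source_aliases(data)
```

Because the copy is shallow, the input dictionary keeps its original keys while the stored values are shared between the two dictionaries.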

get_stored_keys(data: dict[str, Any]) -> set[str]

Get the set of data product keys to store.

Parameters:

data (Dict[str, Any]) – Dictionary of data products

Returns:

keys – Set of data keys to store to file

Return type:

Set[str]
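Combined with the keys and skip_keys constructor parameters, the selection can be pictured as set arithmetic. This is a sketch only: select_keys is a hypothetical name, and raising KeyError for a requested key that is absent from the data is an assumption.

```python
def select_keys(data, keys=None, skip_keys=None):
    """Hypothetical sketch: pick which data product keys to store."""
    available = set(data)
    if keys is None:
        # No explicit list: store everything
        selected = available
    else:
        missing = set(keys) - available
        if missing:
            # Assumption: a requested key absent from the data is an error
            raise KeyError(f"Keys not found in data: {sorted(missing)}")
        selected = set(keys)
    # Drop the keys explicitly marked as skipped
    return selected - set(skip_keys or [])


products = {"input_data": 1, "segmentation": 2, "index": 3}
```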

get_data_types(data: dict[str, Any], keys: set[str]) -> tuple[dict[str, DataFormat], list[list[tuple[str, type]]]]

Get the data type information for each key.

Parameters:
  • data (Dict[str, Any]) – Dictionary of data products

  • keys (Set[str]) – Set of data product keys to store

Returns:

  • type_dict (Dict[str, DataFormat]) – Dictionary containing the data type information for each key

  • object_dtypes (List[List[Tuple[str, type]]]) – List of composite object dtypes found in the data

get_data_type(data: dict[str, Any], key: str) -> DataFormat

Identify the dtype and shape objects to be dealt with.

Parameters:
  • data (Dict[str, Any]) – Dictionary containing the information to be stored

  • key (str) – Dictionary key name

Returns:

DataFormat object containing the data type information for the key

Return type:

DataFormat

get_object_dtype(obj: Any) -> list[tuple[str, type]]

Loop over the attributes of a class to figure out what to store.

This function assumes that the class only possesses getters that return either a scalar, a string, or an np.ndarray.

Parameters:

obj (object) – Instance of a class used to identify attribute types

Returns:

List of (key, dtype) pairs

Return type:

List[Tuple[str, type]]
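The attribute loop can be illustrated with a dataclass. This is a simplified sketch: the Particle class and sketch_object_dtype function are hypothetical, and the sketch reports plain Python types, whereas the real method presumably maps attributes to types suitable for HDF5 storage.

```python
from dataclasses import dataclass, fields


@dataclass
class Particle:
    """Hypothetical object with scalar and string attributes."""
    id: int = -1
    energy: float = 0.0
    label: str = ""


def sketch_object_dtype(obj):
    """Return (key, type) pairs, one per attribute of the object."""
    pairs = []
    for field in fields(obj):
        value = getattr(obj, field.name)
        pairs.append((field.name, type(value)))
    return pairs


pairs = sketch_object_dtype(Particle())
```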

initialize_datasets(out_file: File, type_dict: dict[str, DataFormat]) -> None

Create placeholders for all the datasets to be filled.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • type_dict (Dict[str, DataFormat]) – Dictionary containing the data type information for each key

append_entry(out_file: File, data: dict[str, Any], batch_id: int) -> None

Stores one entry.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • data (Dict[str, Any]) – Dictionary of data products

  • batch_id (int) – Batch ID to be stored

append_key(out_file: File, event: ndarray, data: dict[str, Any], key: str, batch_id: int) -> None

Stores data key in a specific dataset of an HDF5 file.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • event (np.ndarray) – Array representing the event to which the data corresponds

  • data (dict) – Dictionary of data products

  • key (string) – Dictionary key name

  • batch_id (int) – Batch ID to be stored

static store(out_file: File, event: ndarray, key: str, array: ndarray) -> None

Stores an ndarray in the file and stores its mapping in the event dataset.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • event (np.ndarray) – Array representing the event to which the data corresponds

  • key (str) – Name of the dataset in the file

  • array (np.ndarray) – Array to be stored

static store_jagged(out_file: File, event: ndarray, key: str, array_list: list[ndarray]) -> None

Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • event (np.ndarray) – Array representing the event to which the data corresponds

  • key (str) – Name of the dataset in the file

  • array_list (list(np.ndarray)) – List of arrays to be stored

static store_flat(out_file: File, event: ndarray, key: str, array_list: list[ndarray]) -> None

Stores a concatenated list of arrays in the file and stores its index mapping in the event dataset so that the arrays can be split apart again.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • event (np.ndarray) – Array representing the event to which the data corresponds

  • key (str) – Name of the dataset in the file

  • array_list (list(np.ndarray)) – List of arrays to be stored
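The idea of concatenating arrays while keeping an index mapping to split them again can be sketched with plain lists. The function names and the offset-based mapping below are assumptions for illustration; the layout that store_flat actually writes to the HDF5 file is not specified here.

```python
def flatten_with_offsets(array_list):
    """Concatenate the arrays and record where each one starts and stops."""
    flat = []
    offsets = [0]
    for array in array_list:
        flat.extend(array)
        offsets.append(len(flat))  # end of this array, start of the next
    return flat, offsets


def split_by_offsets(flat, offsets):
    """Recover the original list of arrays from the flat data and offsets."""
    return [flat[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


flat, offsets = flatten_with_offsets([[1, 2], [], [3]])
restored = split_by_offsets(flat, offsets)
```

Storing one more offset than there are arrays keeps empty arrays recoverable, since an empty array shows up as two equal consecutive offsets.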

static store_objects(out_file: File, event: ndarray, key: str, array: ndarray, obj_dtype: list[tuple[str, type]], lite: bool) -> None

Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.

Parameters:
  • out_file (h5py.File) – HDF5 file instance

  • event (np.ndarray) – Array representing the event to which the data corresponds

  • key (str) – Name of the dataset in the file

  • array (np.ndarray) – Array of objects or dictionaries to be stored

  • obj_dtype (list) – List of (key, dtype) pairs specifying which attributes to store

  • lite (bool) – If True, store the lite version of objects