Zum Hauptinhalt springen

🎉 We released Spotlight 1.6.0 check it out

Version: 1.6.0

Loading Data from Spotlight HDF5

Spotlight provides a dataset class that is backed by the HDF5 format. This format is very flexible and supports a wide range of unstructured multimodal data.

The Spotlight dataset is the best choice, if your data types are not well supported by Pandas or Huggingface datasets (e.g. geometric or simulation data).

Writing and loading a HDF5 file

In this example, we save an image dataset as Spotlight H5:

from sklearn import datasets
import numpy as np
from renumics import spotlight

OUTPUT_DATASET = "image_example_dataset.h5"

digits = datasets.load_digits()
with spotlight.Dataset(OUTPUT_DATASET, "w") as dataset:
dataset.append_int_column("index", order=1)
dataset.append_int_column("label", order=0)
dataset.append_image_column("image")
for i, (image, label) in enumerate(zip(digits.images, digits.target)):
# in the sample dataset the value 0 means white and 16 means black
# in order to display it correctly in the browser, we need to switch that
# to have 0 as black and 255 as white
image = (255 * (1 - image / 16)).round().astype("uint8")
# scale image by 32 along each dimension in order to display it in the browser
image = np.repeat(image, 32, axis=1)
image = np.repeat(image, 32, axis=0)

dataset.append_row(index=i, label=label, image=image)

We can add data enrichments:

from sklearn.decomposition import PCA

pca_embeddings = PCA(8).fit_transform(digits.data)
with spotlight.Dataset(OUTPUT_DATASET, "a") as dataset:
dataset.append_embedding_column("pca", pca_embeddings)

The HDF5 files can be easily loaded through the Spotlight Python API or directly from the file system within the Spotlight file browser:

spotlight.show(OUTPUT_DATASET)

Supported data types

As discussed in the data types section, the HDF5 format supports a wide variety of tabular and unstructured data:

Here is an example that includes different kinds of tabular data:

with spotlight.Dataset("example.h5", "w") as dataset:
dataset.append_bool_column("boolean", [True, False, False, True])
dataset.append_int_column("integer", range(4))
dataset.append_float_column("float", 1.0)
dataset.append_string_column("string", "foo")
dataset.append_categorical_column(
"categorical", ["test", "train", "test", "train"]
)
dataset.append_datetime_column("datetime", datetime.now())

spotlight.show("example.h5")
Spotlight

H5 scalars

Appending data to the dataset

With the help of the Dataset wrapper, you can also add columns or rows to an already created Dataset by opening the Dataset in append mode.

with spotlight.Dataset("example.h5", "w") as dataset:
dataset.append_bool_column("boolean", [True, True, True, False], default=False)
dataset.append_int_column("integer", range(4), default=-1)
dataset.append_float_column("float", 1.0, optional=True)
dataset.append_string_column("string", "foo", optional=True)
dataset.append_categorical_column(
"categorical", ["test", "train", "test", "train"], optional=True
)
dataset.append_datetime_column("datetime", datetime.now(), optional=True)

with spotlight.Dataset("example.h5", "a") as dataset:
dataset[1] = {key: None for key in dataset.keys()}

spotlight.show("example.h5")
Spotlight

H5 scalars

Detailed examples

Take a look at our use case section to find more detailed examples for different modalities.