Loading Data into Spotlight
Spotlight is designed to work with your existing data management workflows and tools.
We are continuously improving Spotlight's integration into common data workflows. If you find the support for your tooling to be lacking, please shoot us an email or join our Discord.
Currently, there are two different ways to load your unstructured datasets into Spotlight:
You can load your data directly from an in-memory Pandas dataframe. In this case, unstructured data such as images is represented as file references. One big advantage of this approach is that many popular data management tools (e.g. Huggingface datasets) offer a Pandas interface.
You can write your data to an HDF5 file. In this case, the unstructured data is stored directly in the HDF5 file, which lets you neatly store different kinds of multimodal data, including images, audio and geometric data.
Loading data from a Pandas dataframe
Loading unstructured data from file references
You can store file references that point either to the local filesystem or to an S3-compatible storage. This is the recommended way to handle heavy data such as images or audio.
In this example, we assume you have a computer vision dataset where the images are stored as file paths in a CSV file:
import pandas as pd
from renumics import spotlight
df = pd.read_csv("https://spotlight.renumics.com/data/mnist/mnist.csv")
spotlight.show(df, dtype={"image": spotlight.Image})
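The references do not have to come from a CSV file. If you build the dataframe yourself, the column simply holds path or URL strings; the following sketch uses hypothetical local and S3 paths to illustrate the idea:
import pandas as pd
from renumics import spotlight
# hypothetical file references; replace them with paths from your own storage
df = pd.DataFrame(
    {
        "image": [
            "data/images/sample_0001.png",  # local filesystem
            "s3://my-bucket/images/sample_0002.png",  # S3-compatible storage
        ],
        "label": ["cat", "dog"],
    }
)
spotlight.show(df, dtype={"image": spotlight.Image})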
Storing unstructured data in the dataframe
You can store your data directly in the dataframe cells as a Python list. This is useful for enrichments such as embeddings.
In this example, we append an embedding column to the dataset from the previous example. We assume you have a 2D NumPy array embeddings that contains one embedding (a 1D array) per data point:
from renumics import spotlight
df["embedding"] = [emb.tolist() for emb in embeddings]
spotlight.show(df, dtype={"image": spotlight.Image, "embedding": spotlight.Embedding})
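If you do not yet have embeddings from a model, you can define a placeholder array of the expected shape before running the snippet above; the values below are purely illustrative:
import numpy as np
# illustrative stand-in: one random 8-dimensional embedding per row of df
embeddings = np.random.rand(len(df), 8)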
Full API documentation
Please refer to the Python API documentation for the viewer and the layout for detailed usage. You can also browse our workflow examples to see typical integrations in action.
Huggingface example
The Huggingface dataset class has a convenient Pandas interface:
from renumics import spotlight
from datasets import load_dataset
train_dset = load_dataset("beans", split='train')
df = train_dset.to_pandas()
spotlight.show(df, dtype={'image_url': spotlight.Image})
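As an optional step, you can make the class indices easier to read in Spotlight. Assuming the converted dataframe keeps the integer labels column of the beans dataset, a small sketch could map it to the class names stored in the dataset features:
# map integer class indices to readable class names (optional)
df["labels"] = df["labels"].apply(train_dset.features["labels"].int2str)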
SQLite example
You can export an SQLite table to a dataframe:
import sqlite3
import pandas as pd
from renumics import spotlight
cnx = sqlite3.connect('sample.db')
df = pd.read_sql_query("SELECT * FROM table_name", cnx)
spotlight.show(df, dtype={'image_url': spotlight.Image})
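If you want to try the snippet without an existing database, you can first create a small dummy table; the schema and file paths below are hypothetical:
import sqlite3
# hypothetical demo table so the snippet above has something to read
cnx = sqlite3.connect("sample.db")
cnx.execute("CREATE TABLE IF NOT EXISTS table_name (id INTEGER, image_url TEXT)")
cnx.execute("INSERT INTO table_name VALUES (0, 'data/images/sample_0001.png')")
cnx.commit()
cnx.close()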
Writing and loading an HDF5 file
Spotlight provides a dataset class to store unstructured multimodal data. Please consult the API documentation for the supported data types and for a detailed description of the dataset class.
Code snippets with examples are also available in the API documentation.
Computer vision example
Load the digits dataset from sklearn and write it to an HDF5 file:
from sklearn import datasets
import numpy as np
from renumics import spotlight
OUTPUT_DATASET = "image_example_dataset.h5"
digits = datasets.load_digits()
with spotlight.Dataset(OUTPUT_DATASET, "w") as dataset:
    dataset.append_int_column("index", order=1)
    dataset.append_int_column("label", order=0)
    dataset.append_image_column("image")
    for i, (image, label) in enumerate(zip(digits.images, digits.target)):
        # in the sample dataset, 0 means white and 16 means black;
        # invert and rescale so that 0 is black and 255 is white for correct display in the browser
        image = (255 * (1 - image / 16)).round().astype("uint8")
        # scale the image up by a factor of 32 along each dimension so it is large enough to display in the browser
        image = np.repeat(image, 32, axis=1)
        image = np.repeat(image, 32, axis=0)
        dataset.append_row(index=i, label=label, image=image)
Enrich it with a simple similarity measure based on PCA:
from sklearn.decomposition import PCA
pca_embeddings = PCA(8).fit_transform(digits.data)
with spotlight.Dataset(OUTPUT_DATASET, "a") as dataset:
    dataset.append_embedding_column("pca", pca_embeddings)
Load it into Spotlight:
spotlight.show(OUTPUT_DATASET)
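As shown above, spotlight.show also accepts the path to the HDF5 file directly, so the viewer opens the dataset with all of its columns, including the newly added pca embedding.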