Version: 1.1.0

💡 Data-centric AI glossary

We define important terms and concepts for data-centric AI workflows.

Data types

Image data

A 2D array of image pixels. Image data is typically represented as a link to an image file (on disk or in object storage) or as an in-memory array.

Audio data

High-frequency time series (1D array) data. It is typically represented as a link to a (possibly compressed) audio file or as an in-memory array. Often, an image representation (a spectrogram) is used as the input feature for ML models.
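As an illustration of the image representation mentioned above, the following is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform in NumPy. The function name `spectrogram`, the frame and hop lengths, and the synthetic 1 kHz tone are assumptions chosen for the example; real pipelines often use a dedicated audio library and a mel or log scale instead.

```python
import numpy as np

def spectrogram(signal, frame_length=256, hop_length=128):
    """Magnitude spectrogram of a 1-D signal via a short-time Fourier transform."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (frequency bins, frames)

sample_rate = 8000                            # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate      # one second of audio
audio = np.sin(2 * np.pi * 1000 * t)          # synthetic 1 kHz tone
spec = spectrogram(audio)                     # 2D array that can be treated like an image
```

Each column of `spec` describes the frequency content of one short time frame, so the result can be fed to models that expect image-like input.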

Decision boundary

The decision boundary is a hypersurface in the embedding space that separates different classes. Inspecting data samples near the decision boundary is useful for finding critical edge cases.
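One way to find samples near the decision boundary is to rank them by how close the model's two most likely classes are. The sketch below uses the difference between the two largest class probabilities as a proxy for distance to the boundary; this proxy, the helper name `boundary_margin`, and the example values are assumptions for illustration, not a method prescribed by this glossary.

```python
import numpy as np

def boundary_margin(probabilities):
    """Difference between the two largest class probabilities per sample."""
    top_two = np.sort(probabilities, axis=1)[:, -2:]   # ascending: [second, largest]
    return top_two[:, 1] - top_two[:, 0]

# Hypothetical probability vectors for three samples and three classes.
probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.48, 0.47, 0.05],   # nearly tied between two classes
    [0.20, 0.70, 0.10],
])
margins = boundary_margin(probs)
closest_first = np.argsort(margins)   # sample indices, most ambiguous first
```

Samples at the front of `closest_first` have a small margin and are candidates for manual inspection as edge cases.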

Embedding

Many model architectures (in particular neural networks) inherently transform the input space into a low-dimensional representation. This embedding of each data sample into a latent space is very useful for understanding both data traits and model behavior. In practice, the embedding is a dense vector (typical sizes range from 64 to 2048) which is obtained by saving the output of a hidden layer of the model.
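The idea of saving a hidden layer can be sketched with a tiny two-layer network in NumPy. The network, its random (untrained) weights, the layer sizes, and the function name `forward` are all hypothetical and serve only to show where the embedding comes from; with a deep learning framework, the same effect is usually achieved by reading out an intermediate layer of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer classifier with random (untrained) weights.
W1, b1 = rng.normal(size=(784, 64)), np.zeros(64)   # input -> hidden layer
W2, b2 = rng.normal(size=(64, 10)), np.zeros(10)    # hidden layer -> class logits

def forward(x):
    """Return the class logits together with the hidden-layer activations."""
    hidden = np.tanh(x @ W1 + b1)    # hidden layer output, used as the embedding
    logits = hidden @ W2 + b2
    return logits, hidden

batch = rng.normal(size=(5, 784))    # five flattened 28x28 inputs
logits, embeddings = forward(batch)  # embeddings: one 64-dimensional vector per sample
```

Here each 784-dimensional input is mapped to a 64-dimensional dense vector, which can be stored alongside the sample for later analysis.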

Error patterns on computer vision data (images, videos)

Image data

darkness

lightness

blur

Signal-to-noise ratio

Aspect ratio

Label

Nearest neighbor & k-th nearest neighbor

Probabilities

In a classification problem, the ML model output is usually given as a probability vector. The predicted class is then the index of the maximum value in this vector. Depending on the model, a softmax function has to be applied to the output logits in order to obtain the probabilities.
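The step from logits to probabilities to a prediction can be sketched as follows. The logit values are made up for the example, and subtracting the row maximum before exponentiating is a common numerical-stability choice rather than something required by the definition.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for two samples and three classes.
logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.0],
])
probabilities = softmax(logits)               # each row sums to 1
predictions = probabilities.argmax(axis=-1)   # index of the most likely class
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same prediction; the probabilities are needed when the confidence values themselves are of interest.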