Version: Next

💡 Glossary

We define important terms and concepts for data-centric AI workflows.

Data types

Image data

2D array of image pixels. Image data is typically represented as a link to an image file (on disk or object storage) or as an in-memory array.

Audio data

High-frequency time series data (1D array). It is typically represented as a link to a (possibly compressed) audio file or as an in-memory array. Often, an image representation (a spectrogram) is used as the input feature for ML models.
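To illustrate how a 1D audio array becomes an image-like representation, here is a minimal sketch of a magnitude spectrogram using only the Python standard library. The function name `spectrogram`, the frame size, the hop length, and the synthetic test tone are assumptions chosen for illustration; in practice a dedicated signal-processing library with a fast Fourier transform would be used instead of the naive transform shown here.

```python
import cmath
import math


def spectrogram(signal, frame_size, hop):
    """Magnitude spectrogram: one row of frequency-bin magnitudes per frame."""
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive discrete Fourier transform, keeping non-negative frequencies only.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n, x in enumerate(frame))
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames


# Synthetic test tone (assumed values): 0.1 s of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
spec = spectrogram(tone, frame_size=64, hop=32)
```

Each row of `spec` is one time frame and each column one frequency bin, so the result can be treated like a single-channel image. With these settings a bin spans 8000 / 64 = 125 Hz, which places the 1 kHz tone in bin 8.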

Decision boundary

The decision boundary is a hypersurface in the embedding space that separates different classes. Inspecting data samples near the decision boundary is useful for finding critical edge cases.
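One common proxy for "near the decision boundary" is a small gap between the two highest class probabilities of a sample. The sketch below ranks samples by that margin. The helper names `margin` and `closest_to_boundary` and the example probabilities are hypothetical, and the margin is only a heuristic, not an exact distance to the boundary.

```python
def margin(probabilities):
    """Gap between the two highest class probabilities of one sample."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second


def closest_to_boundary(batch, k):
    """Indices of the k samples with the smallest margin."""
    order = sorted(range(len(batch)), key=lambda i: margin(batch[i]))
    return order[:k]


# Hypothetical probability vectors for four samples and three classes.
batch = [
    [0.90, 0.05, 0.05],  # confident prediction, far from the boundary
    [0.48, 0.47, 0.05],  # nearly tied between classes 0 and 1
    [0.40, 0.30, 0.30],  # weak preference for class 0
    [0.10, 0.80, 0.10],  # confident prediction
]
edge_cases = closest_to_boundary(batch, k=2)
```

Here `edge_cases` holds the indices of the two most ambiguous samples, which are natural candidates for manual inspection.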

Embedding

Many model architectures (in particular neural networks) inherently transform the input space into a low-dimensional representation. This embedding of each data sample into a latent space is very useful for understanding both data traits and model behavior. In practice, the embedding is a dense vector (typical sizes range from 64 to 2048) that is obtained by saving the output of a hidden layer of the model.
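The idea of saving a hidden layer can be sketched with a toy two-layer network written in plain Python. All weights, layer sizes, and function names below are made up for illustration; a real model would be far larger, and deep learning frameworks provide their own mechanisms for reading intermediate activations.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: one output value per row of weights."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


# Hypothetical weights for a tiny 4 -> 3 -> 2 network.
W_HIDDEN = [[0.5, -0.2, 0.1, 0.0],
            [0.3, 0.8, -0.5, 0.2],
            [-0.6, 0.1, 0.4, 0.9]]
B_HIDDEN = [0.0, 0.1, -0.1]
W_OUT = [[1.0, -1.0, 0.5],
         [-0.5, 0.7, 1.2]]
B_OUT = [0.0, 0.0]


def forward(sample):
    """Return (logits, embedding); the embedding is the hidden activation."""
    embedding = relu(dense(sample, W_HIDDEN, B_HIDDEN))
    logits = dense(embedding, W_OUT, B_OUT)
    return logits, embedding


logits, embedding = forward([1.0, 0.5, -1.0, 2.0])
```

The 3-element `embedding` is the vector that would be stored next to each data sample, while `logits` continues on to produce the prediction.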

Error patterns on computer vision data (images, videos)

Image data

Darkness

Lightness

Blur

Signal-to-noise ratio

Aspect ratio

Label

Nearest neighbor & k-th nearest neighbor

Prediction

Probabilities

In a classification problem, the ML model output is usually given as a probability vector. The prediction is then the class with the highest value in this vector. Depending on the model, a softmax function must first be applied to the output logits to obtain the probabilities.
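The relationship between logits, probabilities, and the prediction can be sketched in a few lines of standard-library Python. The example logits are hypothetical, and `softmax` and `predict` are illustrative helper names rather than part of any particular library.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum logit avoids overflow and does not change the result.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(probabilities):
    """Return the index of the most probable class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


logits = [2.0, 0.5, -1.0]  # hypothetical model output for three classes
probabilities = softmax(logits)
prediction = predict(probabilities)
```

Because softmax preserves the ordering of its inputs, the predicted class is the same whether the maximum is taken over the logits or over the probabilities. The probabilities are still needed when a confidence value is required.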

Features

Metadata