Data-centric AI glossary
We define important terms and concepts for data-centric AI workflows.
Data types
Image data
A 2D array of image pixels. Image data is typically represented as a link to an image file (on disk or in object storage) or as an in-memory array.
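A minimal sketch of loading an image file into an in-memory array, assuming Pillow and NumPy are available; "example.png" is a placeholder path, not a file from this documentation:

```python
# Minimal sketch: load an image file into an in-memory pixel array.
import numpy as np
from PIL import Image

image = np.asarray(Image.open("example.png").convert("RGB"))
print(image.shape)  # (height, width, 3) for an RGB image
```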
Audio data
High-frequency time series (1D array) data. It is typically represented as a link to a (possibly compressed) audio file or as an in-memory array. Often, an image representation (a spectrogram) is used as the input feature for ML models.
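A minimal sketch of the spectrogram idea, assuming NumPy and SciPy are available; the signal here is a synthetic sine tone rather than a real recording:

```python
# Minimal sketch: turn a 1-D audio signal into an image-like spectrogram.
import numpy as np
from scipy.signal import spectrogram

sample_rate = 16_000                      # samples per second
t = np.linspace(0, 1, sample_rate, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)      # 1-second 440 Hz tone

frequencies, times, sxx = spectrogram(signal, fs=sample_rate)
print(sxx.shape)  # 2-D array (frequency bins x time frames), usable as an image-like feature
```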
Decision boundary
The decision boundary is a hypersurface in the embedding space that separates different classes. Inspecting data samples near the decision boundary is useful for finding critical edge cases.
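One simple way to approximate "near the boundary" in practice is to look at the margin between the two highest predicted class probabilities. A minimal sketch, using random probabilities as placeholders for real model outputs:

```python
# Minimal sketch: flag samples close to the decision boundary via the margin
# between the two highest predicted class probabilities.
import numpy as np

rng = np.random.default_rng(0)
probabilities = rng.dirichlet(np.ones(3), size=100)  # (n_samples, n_classes), placeholder

sorted_probs = np.sort(probabilities, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]   # small margin = near the boundary

boundary_candidates = np.argsort(margin)[:10]        # 10 most ambiguous samples
print(boundary_candidates)
```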
Embedding
Many model architectures (in particular neural networks) inherently transform the input space into a low-dimensional representation. This embedding of each data sample into a latent space is very useful for understanding both data traits and model behavior. In practice, the embedding is a dense vector (typical sizes range from 64 to 2048) which is obtained by saving the activations of a hidden layer of the model.
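A minimal sketch of saving such a hidden-layer output with a PyTorch forward hook, assuming PyTorch and torchvision are available and using ResNet-18 purely as a stand-in architecture (its global average pooling layer is one common choice of hidden layer):

```python
# Minimal sketch: extract a dense embedding from a hidden layer via a forward hook.
# weights=None keeps the stand-in model randomly initialized; in practice you would
# use your own trained model.
import torch
from torchvision.models import resnet18

model = resnet18(weights=None)
model.eval()

embeddings = []

def save_embedding(module, inputs, output):
    # The avgpool output has shape (batch, 512, 1, 1); flatten it to (batch, 512).
    embeddings.append(output.flatten(start_dim=1).detach())

# Hook the hidden layer whose output we treat as the embedding vector.
handle = model.avgpool.register_forward_hook(save_embedding)

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # placeholder batch of images
    model(batch)

handle.remove()
print(embeddings[0].shape)  # torch.Size([4, 512])
```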
Error patterns on computer vision data (images, videos)
Image data
darkness
lightness
blur
Signal-to-noise ratio
Aspect ratio
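A minimal sketch of how heuristics for these error patterns might be computed for a single image, assuming NumPy and SciPy are available; the formulas are illustrative choices, not fixed definitions:

```python
# Minimal sketch: simple per-image statistics for common error patterns,
# given an image as a NumPy array with pixel values in [0, 255].
import numpy as np
from scipy.ndimage import laplace

def image_quality_stats(image: np.ndarray) -> dict:
    gray = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    return {
        "brightness": gray.mean(),                 # low -> darkness, high -> lightness
        "blur": laplace(gray).var(),               # low Laplacian variance -> blurry
        "snr": gray.mean() / (gray.std() + 1e-8),  # crude signal-to-noise proxy
        "aspect_ratio": gray.shape[1] / gray.shape[0],
    }

stats = image_quality_stats(np.random.randint(0, 256, (480, 640, 3)).astype(float))
print(stats)
```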
Label
Nearest neighbor & k-th nearest neighbor
Prediction
Probabilities
In a classification problem, the ML model output is usually given as a probability vector. The prediction is then obtained by taking the index of the maximum value in this vector. Depending on the model, a softmax function has to be applied to the output logits in order to obtain the probabilities.
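A minimal sketch of this pipeline with NumPy, using placeholder logits:

```python
# Minimal sketch: turn raw logits into probabilities via softmax, then take the
# argmax as the prediction.
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # placeholder model outputs

# Numerically stable softmax: subtract the maximum before exponentiating.
shifted = logits - logits.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()

prediction = int(np.argmax(probabilities))
print(probabilities, prediction)
```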