💡 Data-centric AI glossary
We define important terms and concepts for data-centric AI workflows.
2D-array of image pixels. Image data is typically represented as a link to an image file (on disk or object storage) or as an in-memory array.
High-frequency time series (1-D array) data. It is typically represented as a link to a (possibly compressed) audio file or as an in-memory array. Often, image representations (spectrograms) are used as input features for ML models.
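As an illustration of the spectrogram representation mentioned above, here is a minimal NumPy sketch. The function name `spectrogram`, the frame length, the hop size and the synthetic 1 kHz test tone are assumptions chosen for the example. Audio libraries usually provide more complete implementations.

```python
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Return a power spectrogram of shape (n_freq_bins, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the 1-D signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Real-valued FFT per frame, then squared magnitude (power).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power.T

# Hypothetical input: one second of a 1 kHz sine tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(signal)
# Frequency bin with the highest average power, converted to Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256
```

The resulting 2-D array can be treated like an image, with frequency on one axis and time on the other.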
The decision boundary is a hypersurface in the embedding space that separates different classes. Inspecting data samples near the decision boundary is useful for finding critical edge cases.
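One common way to find samples near the decision boundary, without computing the hypersurface itself, is to rank samples by the margin between their two highest class probabilities. The sketch below assumes this proxy. The function name `top2_margin` and the example probabilities are hypothetical.

```python
import numpy as np

def top2_margin(probs: np.ndarray) -> np.ndarray:
    """Difference between the largest and second-largest probability per sample."""
    # Sort each row in ascending order; the last two columns hold the top two values.
    sorted_probs = np.sort(probs, axis=-1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

# Hypothetical model outputs for three samples and three classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # confident prediction, far from the boundary
    [0.48, 0.47, 0.05],  # two classes nearly tied, close to the boundary
    [0.10, 0.30, 0.60],
])

margins = top2_margin(probs)
# Sample indices ordered from smallest margin (closest to the boundary) to largest.
closest_first = np.argsort(margins)
```

Samples at the start of `closest_first` are candidates for manual inspection as edge cases.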
Many model architectures (in particular neural networks) inherently transform the input space into a low-dimensional representation. This embedding of each data sample into a latent space is very useful for understanding both data traits and model behavior. In practice, the embedding is a dense vector (typical sizes range from 64 to 2048) which is obtained by saving the output of a hidden layer of the model.
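The idea of saving a hidden layer as the embedding can be sketched with a toy two-layer network in NumPy. The random weights, layer sizes and the `forward` function are assumptions for illustration only. With a real framework, the same effect is typically achieved by capturing the output of an intermediate layer of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 256 input features, 64-dimensional hidden layer, 3 classes.
n_features, n_hidden, n_classes = 256, 64, 3
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x: np.ndarray):
    """Return the class logits together with the hidden-layer activations."""
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation of the hidden layer
    logits = hidden @ W2 + b2
    return logits, hidden

# Five random input samples stand in for real data.
x = rng.normal(size=(5, n_features))
logits, embeddings = forward(x)
# `embeddings` holds one dense 64-dimensional vector per input sample.
```

Storing `embeddings` alongside each sample makes it possible to run similarity searches or dimensionality reduction later without re-running the model.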
Error patterns on computer vision data (images, videos)
Nearest neighbor & k-th nearest neighbor
In a classification problem, the ML model output is usually given as a probability vector. The prediction is then the class with the maximum value in this vector. Depending on the model, a softmax function has to be applied to the output logits in order to obtain the probabilities.
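The step from logits to probabilities to a prediction can be sketched as follows. The logit values are made up for the example, and the `softmax` helper is a plain NumPy version written for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits of shape (n_samples, n_classes) into probabilities."""
    # Subtracting the row-wise maximum avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for two samples and three classes.
logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.0],
])

probs = softmax(logits)
# The prediction is the index of the largest probability in each row.
predictions = probs.argmax(axis=-1)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly yields the same prediction. The softmax step matters when the probabilities themselves are needed, for example as a confidence score.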