🏀 Playbook
This is a collection of useful data curation workflows (plays) for unstructured data. We distinguish between basic data enrichment workflows (rookie plays), established data curation solutions (veteran plays), and current state-of-the-art techniques (all-star plays).
Rookie plays
- Create embeddings with Hugging Face
- Create embeddings with Towhee
- Extract decision boundary based on probability ratios
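
To give a feel for the last rookie play, here is a minimal sketch of one way to flag samples near a decision boundary from class probabilities. It assumes the "probability ratio" is the second-highest class probability divided by the highest; that definition, the function names, the `min_ratio` threshold, and the sample predictions are illustrative assumptions and not necessarily what the play itself uses.

```python
def probability_ratio(probs):
    """Ratio of the second-highest to the highest class probability.

    Assumed definition: values near 1 mean the top two classes are almost
    tied, i.e. the sample lies close to the decision boundary.
    """
    if len(probs) < 2:
        raise ValueError("need at least two class probabilities")
    top, second = sorted(probs, reverse=True)[:2]
    if top <= 0:
        raise ValueError("probabilities must contain a positive value")
    return second / top


def near_decision_boundary(prob_rows, min_ratio=0.8):
    """Return indices of samples whose ratio is at least min_ratio."""
    return [
        i for i, row in enumerate(prob_rows)
        if probability_ratio(row) >= min_ratio
    ]


# Hypothetical classifier outputs: four samples, three classes.
predictions = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.48, 0.45, 0.07],  # top two classes nearly tied
    [0.34, 0.33, 0.33],  # almost uniform
    [0.10, 0.70, 0.20],  # confident prediction
]

boundary_indices = near_decision_boundary(predictions)
print(boundary_indices)
```

The flagged indices can then be stored as an extra column next to the data so that borderline samples are easy to filter and inspect.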
Veteran plays
- Detect duplicates with Annoy
- Detect leakage with Annoy
- Detect data drift
- Detect label errors
- Detect outliers
- Detect image error patterns
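
The duplicate and leakage plays both come down to finding pairs of samples whose embeddings are very close. The sketch below illustrates that idea with an exhaustive cosine-similarity comparison in plain Python. Note that it swaps Annoy's approximate nearest-neighbor index for a brute-force search, which is only practical for small datasets; the function names, the `threshold` value, and the toy embeddings are assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(embeddings, threshold=0.99):
    """Brute-force search for index pairs with similarity >= threshold.

    This is O(n^2); an approximate index such as Annoy replaces this step
    on larger datasets.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold
    ]


def leakage_pairs(pairs, splits):
    """Keep only near-duplicate pairs that cross dataset splits."""
    return [(i, j) for i, j in pairs if splits[i] != splits[j]]


# Hypothetical 3-dimensional embeddings with their dataset split.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # near-duplicate of sample 0
    [0.0, 1.0, 0.0],
    [0.0, 0.998, 0.02],  # near-duplicate of sample 2, but in the test split
    [0.0, 0.0, 1.0],
]
splits = ["train", "train", "train", "test", "test"]

duplicates = near_duplicate_pairs(embeddings)
leaks = leakage_pairs(duplicates, splits)
print(duplicates)
print(leaks)
```

A pair that stays within one split is an ordinary duplicate, while a pair that spans the train and test splits indicates leakage, since the model may be evaluated on data it has effectively already seen.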