Version: 1.6.0

Find typical issues in image datasets with Cleanvision

We extract typical issues (brightness, blur, aspect ratio, signal-to-noise ratio, and duplicates) from image datasets with the Cleanvision library. We then identify critical data segments with Spotlight.

Use Chrome to run Spotlight in Colab. Due to Colab restrictions (e.g. no websocket support), the performance is limited. Run the notebook locally for the full Spotlight experience.

Inputs

df['image'] contains the paths to the images in the dataset.

Outputs

df['dark_score'] contains a score in [0, 1] for the darkness of the image sample.
df['light_score'] contains a score in [0, 1] for the lightness of the image sample.
df['blurry_score'] contains a score in [0, 1] for the blurriness of the image sample.
df['low_information_score'] contains a score in [0, 1] for the signal-to-noise ratio of the image sample.
df['odd_aspect_ratio_score'] contains a score in [0, 1] for anomalies in the aspect ratio of the image sample.

Parameters

image_name is the name of the column that holds the image paths (default: 'image').
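
The score columns can be turned into a shortlist of suspicious images with a simple threshold. The following is a minimal sketch on made-up data, assuming Cleanvision's convention that lower scores indicate more severe issues (worth verifying against the installed version). The toy scores, the `THRESHOLD` value, and the helper `flag_low_scores` are illustrative assumptions and not part of the playbook.

```python
import pandas as pd

# Hypothetical scores for four images; real values come from Cleanvision.
df = pd.DataFrame(
    {
        "image": ["a.png", "b.png", "c.png", "d.png"],
        "dark_score": [0.95, 0.12, 0.88, 0.40],
        "blurry_score": [0.30, 0.91, 0.85, 0.20],
    }
)

SCORE_COLUMNS = ["dark_score", "blurry_score"]
THRESHOLD = 0.35  # assumed cutoff, tune per dataset


def flag_low_scores(frame, score_columns, threshold):
    """Return a boolean frame that is True where a score is below the threshold."""
    return frame[score_columns] < threshold


flags = flag_low_scores(df, SCORE_COLUMNS, THRESHOLD)
# Keep the rows where at least one score falls below the cutoff.
suspects = df[flags.any(axis=1)]
print(suspects["image"].tolist())
```

A fixed cutoff is only a starting point; inspecting the flagged images in Spotlight shows whether it is too strict or too lenient for a given dataset.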

[Screenshot: outliers shown in Spotlight]

Imports and play as copy-n-paste functions

# Install dependencies

#@title Install required packages with PIP

!pip install renumics-spotlight cleanvision datasets

# Play as copy-n-paste functions

#@title Play as copy-n-paste functions

import json

import datasets
import pandas as pd
import requests
from cleanvision.imagelab import Imagelab
from renumics import spotlight

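def cv_issues_cleanvision(df, image_name='image'):
    """Run Cleanvision on the images listed in df[image_name]."""
    # Collect the image file paths from the chosen column.
    image_paths = df[image_name].to_list()
    imagelab = Imagelab(filepaths=image_paths)
    imagelab.find_issues()

    # Turn the index of the issues frame into a regular column.
    df_cv = imagelab.issues.reset_index()

    return df_cv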

Step-by-step example on CIFAR-100

Load CIFAR-100 from the Hugging Face Hub and convert it to a pandas DataFrame

dataset = datasets.load_dataset("renumics/cifar100-enriched", split="train")
df = dataset.to_pandas()

Compute heuristics for typical image data error scores with Cleanvision

df_cv = cv_issues_cleanvision(df)
df = pd.concat([df, df_cv], axis=1)
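
Note that `pd.concat(..., axis=1)` aligns rows by index label, so the step above relies on both frames sharing the same default index and row order. If that is in doubt, a join on the file path is more robust. The sketch below uses toy stand-ins for both frames and assumes that `reset_index()` yields a column named `index` holding the file paths; check the actual column name before reusing it.

```python
import pandas as pd

# Toy stand-ins: df mimics the dataset, df_cv mimics the Cleanvision result.
df = pd.DataFrame({"image": ["a.png", "b.png", "c.png"], "label": [0, 1, 2]})
df_cv = pd.DataFrame(
    {
        "index": ["c.png", "a.png", "b.png"],  # file paths in a different order
        "dark_score": [0.3, 0.9, 0.5],
    }
)

# Join on the file path instead of relying on row position.
merged = df.merge(
    df_cv.rename(columns={"index": "image"}),
    on="image",
    how="left",
    validate="one_to_one",  # raises if a path occurs more than once
)
print(merged)
```

The `validate` argument makes pandas raise an error when a path is duplicated on either side, which would otherwise silently multiply rows.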

Inspect errors and detect problematic data segments with Spotlight

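# Drop the large columns that are not needed for the inspection.
df_show = df.drop(columns=['embedding', 'probabilities'])
# Download the prepared Spotlight layout for this play.
layout_url = "https://raw.githubusercontent.com/Renumics/spotlight/playbook_initial_draft/playbook/rookie/cv_issues.json"
response = requests.get(layout_url)
layout = spotlight.layout.nodes.Layout(**json.loads(response.text))
# Open Spotlight with explicit data types for the image and embedding columns.
spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding}, layout=layout)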
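
The layout is downloaded from a branch URL, and a failed download returns a short error text instead of JSON, which makes `json.loads` raise. A minimal sketch of a defensive parser follows. The helper name `parse_layout` and the sample key `children` are illustrative assumptions, not part of the Spotlight API.

```python
import json


def parse_layout(text):
    """Return the layout as a dict, or None if the text is not a JSON object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Typical for an error page such as "404: Not Found".
        return None
    return data if isinstance(data, dict) else None


# Illustrative inputs: a tiny JSON object and an error message.
print(parse_layout('{"children": []}'))
print(parse_layout("404: Not Found"))
```

When the helper returns None, calling `spotlight.show` without a custom layout should still open the data with the default layout, assuming the `layout` argument is optional in the installed version.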
