DoubtLab helps you find bad labels.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that it makes it easier for folks to quickly check their own datasets before they invest too much time and compute in grid search.


You can install the tool via pip or conda.

Install with pip

python -m pip install doubtlab

Install with conda

conda install -c conda-forge doubtlab

Getting Started

If you want to get started, we recommend starting here.
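
To give a rough feel for the kind of trick involved, here is a minimal sketch in plain Python. It is not doubtlab's API: the function names, the threshold, and the toy data are all hypothetical, chosen only for illustration. It assumes you already have per-class probabilities from some model, and it flags rows where the model is unsure or disagrees with the given label.

```python
# Illustrative sketch only: plain Python, NOT doubtlab's API.
# Every name, the threshold, and the toy data below are hypothetical.

def low_confidence(probas, threshold=0.55):
    """Flag rows whose most likely class has a probability below `threshold`."""
    return [max(row) < threshold for row in probas]


def wrong_prediction(probas, labels):
    """Flag rows whose most likely class differs from the given label."""
    return [row.index(max(row)) != label for row, label in zip(probas, labels)]


def doubt_indices(*flag_lists):
    """Return indices of flagged rows, most-flagged first; unflagged rows are dropped."""
    counts = [sum(flags) for flags in zip(*flag_lists)]
    flagged = [i for i, count in enumerate(counts) if count > 0]
    # sorted() is stable, so ties keep their original row order.
    return sorted(flagged, key=lambda i: -counts[i])


# Toy model output: one row of class probabilities per example.
probas = [
    [0.95, 0.05],  # confident and agrees with the label
    [0.52, 0.48],  # agrees with the label, but barely
    [0.10, 0.90],  # confident, but disagrees with the label
    [0.51, 0.49],  # unsure and disagrees with the label
]
labels = [0, 0, 0, 1]

suspect = doubt_indices(
    low_confidence(probas),
    wrong_prediction(probas, labels),
)
print(suspect)  # [3, 1, 2]
```

Rows flagged by more than one heuristic come first, which is a reasonable order in which to re-check labels by hand. A flag is only a reason for doubt, not proof that the label is wrong.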

  • The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection, but I wanted a library that implements many heuristics. Be sure to check out their work on the project.
  • My former employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check the Conversation Driven Development approach and Rasa X if you're working on virtual assistants.
  • My current employer, Explosion, has a neat labelling tool called Prodigy. I'm currently investigating how tools like doubtlab might lead to better labels when combined with this (very likeable) annotation tool.