DoubtLab helps you find bad labels.
This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that this repository makes it easier for folks to quickly check their own datasets before they invest too much time and compute on gridsearch.
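To give a flavor of what such a trick can look like, here is a minimal sketch in plain Python. It is not doubtlab's API: the function name `find_doubtful_rows` and the `0.6` threshold are made up for illustration, and the probabilities are hand-written stand-ins for the output of any classifier. A row is flagged when the model disagrees with the given label, or when the model is not confident about any class.

```python
def find_doubtful_rows(probas, labels, threshold=0.6):
    """Return indices of rows whose label deserves a second look.

    probas: list of per-class probability lists from any classifier.
    labels: list of integer class labels as given in the dataset.
    threshold: hypothetical cut-off below which a prediction counts as unsure.
    """
    doubtful = []
    for i, (row, label) in enumerate(zip(probas, labels)):
        confidence = max(row)
        predicted = row.index(confidence)
        # Reason 1: the model disagrees with the assigned label.
        wrong_prediction = predicted != label
        # Reason 2: the model is unsure about every class.
        low_confidence = confidence < threshold
        if wrong_prediction or low_confidence:
            doubtful.append(i)
    return doubtful


# Hand-written example probabilities for a two-class problem.
probas = [
    [0.95, 0.05],  # confident, agrees with label 0
    [0.10, 0.90],  # confident, but the label says 0
    [0.55, 0.45],  # agrees with label 0, yet low confidence
    [0.20, 0.80],  # confident, agrees with label 1
]
labels = [0, 0, 0, 1]

print(find_doubtful_rows(probas, labels))  # → [1, 2]
```

Flagged rows are candidates for a manual check, not confirmed errors. Combining several cheap reasons like these tends to surface more suspicious rows than relying on a single heuristic.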
You can install the tool via pip or conda.
Install with pip
python -m pip install doubtlab
Install with conda
conda install -c conda-forge doubtlab
If you want to get started, we recommend starting here.
- The cleanlab project was an inspiration for this one. They have a great heuristic for bad label detection, but I wanted a library that implements many. Be sure to check out their work on the labelerrors.com project.
- My former employer, Rasa, has always had a focus on data quality. Some of that attitude is bound to have seeped in here. Be sure to check out the Conversation Driven Development approach and Rasa X if you're working on virtual assistants.
- My current employer, Explosion, has a neat labelling tool called prodigy. I'm currently investigating how tools like doubtlab might lead to better labels when combined with this (very likeable) annotation tool.