Deduplication

An interesting use-case for simsity is exploring deduplication.

Example

Let's consider the voters dataset.

from simsity.datasets import fetch_voters

df = fetch_voters()

name	suburb	postcode
khimerc thomas	charlotte	2826g
lucille richardst	kannapolis	28o81
reb3cca bauerboand	raleigh	27615
maleda mccloud	goldsboro	2753o
belida stovall	morrisvill	27560

This dataset contains information about "voters", and the concern is that some of these rows may represent the same person. A person's name might occur in different spellings and the postcodes may contain typos, but the rows could still refer to the same person. In other words, there may be duplicates in this dataframe that we cannot remove with .drop_duplicates(), because that method only removes rows that match exactly. So how might we go about finding these?
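To see why exact matching falls short, here is a minimal sketch. It assumes only pandas and uses three rows copied by hand from the tables on this page; the variable names are made up for illustration.

```python
import pandas as pd

# Three rows copied from the voters data shown on this page.
# They plausibly describe one person, but no two rows are identical.
rows = [
    {"name": "khimerc thomas", "suburb": "charlotte", "postcode": "2826g"},
    {"name": "chimerc thomas", "suburb": "charlotte", "postcode": "28269"},
    {"name": "chimerc thmas", "suburb": "chaflotte", "postcode": "28269"},
]
sample = pd.DataFrame(rows)

# drop_duplicates only removes rows that are equal in every column.
deduped = sample.drop_duplicates()
print(len(sample), len(deduped))
```

All three rows survive, since each one differs from the others by at least one character.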

Similarity Service!

Let's build a similarity service, this time with encoders from the dirty_cat package. These encoders are designed to handle dirty categorical data, which makes them a good fit for our use-case here.
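For some intuition about why substring-based encodings cope with typos, the sketch below compares names by the character trigrams they share. This is not what GapEncoder computes; it is a simplified stand-in, and the helper names `char_ngrams` and `jaccard` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams that occur in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Share of n-grams that two strings have in common (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Two spellings that likely refer to the same person.
near = jaccard("khimerc thomas", "chimerc thomas")
# Two names that clearly refer to different people.
far = jaccard("khimerc thomas", "angelique deas")

print(f"near={near:.2f} far={far:.2f}")
```

A single typo only changes a few trigrams, so misspelled names still overlap heavily, while unrelated names share almost nothing.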

from simsity.service import Service
from simsity.indexer import AnnoyIndexer
from dirty_cat import GapEncoder

# Set up
indexer = AnnoyIndexer(n_trees=50)
encoder = GapEncoder().fit(df)
service = Service(indexer=indexer, encoder=encoder)

# Index
service.index(df)

Query

To construct a query, we need to send a pandas row. The encoder assumes pandas input, so we need to make sure our query is compatible.

import pandas as pd

# Query as a dictionary (with typos on purpose)
dict_in = dict(name="khimerc thmas", suburb="chariotte", postcode="28273")
# Single row from dataframe
q_in = pd.DataFrame([dict_in]).iloc[0]

# Fetch the 10 nearest neighbours and attach their distances
idx, dists = service.query(q_in, n_neighbors=10)
df.iloc[idx].assign(dist=dists)

This is the dataframe that we get out.

name	suburb	postcode	dist
chimerc thmas	chaflotte	28269	3.14833
chimerc thomas	charlotte	28269	3.95177
khimerc thomas	charlotte	2826g	3.98925
angelique deas	charlotte	28278	4.76251
barbara dambrosio	charlotte	28277	5.46748
kendel beachum	charlotte	28226	5.6414
mariq simpsony	charlotte	28269	6.76645
herber oxendine	charlotte	28247	8.22691
steven twamley	chapel hill	27514	8.97374
herbert oxendin	chsrlotte	28277	9.53476

It certainly seems like we have some duplicates in here! So we may be able to use retrieval/embedding tricks for this use-case.
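One way to turn such a result into a shortlist of candidate duplicates is to keep only the neighbours below a distance cutoff. The sketch below assumes pandas and rebuilds the first five result rows by hand; the cutoff `MAX_DIST` is a hypothetical value picked by eye, not something the library suggests.

```python
import pandas as pd

# First five rows of the query result shown above, copied by hand.
result = pd.DataFrame(
    {
        "name": [
            "chimerc thmas",
            "chimerc thomas",
            "khimerc thomas",
            "angelique deas",
            "barbara dambrosio",
        ],
        "suburb": ["chaflotte", "charlotte", "charlotte", "charlotte", "charlotte"],
        "postcode": ["28269", "28269", "2826g", "28278", "28277"],
        "dist": [3.14833, 3.95177, 3.98925, 4.76251, 5.46748],
    }
)

# Hypothetical cutoff: smaller distances mean closer matches.
MAX_DIST = 4.0
candidates = result[result["dist"] < MAX_DIST]
print(candidates)
```

A suitable cutoff depends on the encoder and index, so it is worth inspecting a few queries by hand before settling on a value.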

It deserves mentioning, once again, that the quality of our retrieval depends a lot on our choice of index and encoding. But experimenting with this is exactly what this library makes easy.