textprep

from tokenwiser.textprep import *

In the textprep submodule you can find scikit-learn compatible components that transform text into another type of text. The idea is that these can be combined in interesting ways with a CountVectorizer.
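
A minimal sketch of that idea, assuming the Cleaner component documented below feeds its cleaned strings into a scikit-learn CountVectorizer:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from tokenwiser.textprep import Cleaner

# The textprep step turns raw text into cleaned text, which the
# CountVectorizer then turns into a sparse bag-of-words matrix.
pipe = make_pipeline(Cleaner(), CountVectorizer())
X = pipe.fit_transform(["$$$5 dollars", "#hashtag!"])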

Cleaner (TextPrep, BaseEstimator)

Applies lowercasing and removes non-alphanumeric characters.

Usage:

from tokenwiser.textprep import Cleaner

single = Cleaner().encode_single("$$$5 dollars")
assert single == "5 dollars"
multi = Cleaner().transform(["$$$5 dollars", "#hashtag!"])
assert multi == ["5 dollars", "hashtag"]

Identity (TextPrep, BaseEstimator)

Keeps the text as is. Can be used as a placeholder in a pipeline.

Usage:

from tokenwiser.textprep import Identity

text = ["hello", "world"]
example = Identity().transform(text)

assert example == ["hello", "world"]

The main use-case is as a placeholder.

from tokenwiser.pipeline import make_concat
from sklearn.pipeline import make_pipeline, make_union

from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep

pipe = make_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
)
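
To sketch what this buys you (assuming make_concat joins the string outputs of its transformers into one string, the way make_union stacks feature matrices), the cleaned text shows up both unchanged and hyphenated:

pipe.transform(["Geology!"])
# expected to look roughly like ['geology geo logy'], i.e. the Identity
# output followed by the hyphenated output in the same string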

HyphenTextPrep (TextPrep, BaseEstimator)

Hyphenate the text going in.

Usage:

from tokenwiser.textprep import HyphenTextPrep

multi = HyphenTextPrep().transform(["geology", "astrology"])
assert multi == ['geo logy', 'as tro logy']

SentencePiecePrep (TextPrep, BaseEstimator)

The SentencePiecePrep object splits text into subtokens based on a pre-trained model.

You can find many pre-trained subtokenizers via the bpemb project. For example, on the English sub-site you can find models for several different vocabulary sizes. Note that the project provides pre-trained subword tokenizers for 275 languages.

You can also train your own sentencepiece tokenizer:

import sentencepiece as spm

# This saves a file named `mod.model` which can be read in later.
spm.SentencePieceTrainer.train('--input=tests/data/nlp.txt --model_prefix=mod --vocab_size=2000')
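
After training, the saved mod.model file can be loaded straight back into a SentencePiecePrep transformer; a minimal sketch, assuming the file above was written to the working directory:

from tokenwiser.textprep import SentencePiecePrep

# Load the freshly trained model and split text into its subtokens.
sp_tfm = SentencePiecePrep(model_file="mod.model")
subtokens = sp_tfm.transform(["talking about geology"])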

Parameters:

Name         Type   Description              Default
model_file          pre-trained model file   required

Usage:

from tokenwiser.textprep import SentencePiecePrep
sp_tfm = SentencePiecePrep(model_file="tests/data/en.vs5000.model")

texts = ["talking about geology"]
example = sp_tfm.transform(texts)
assert example == ['▁talk ing ▁about ▁ge ology']

download(lang, vocab_size, filename=None) classmethod

Download a pre-trained model from the bpemb project.

You can see some examples of pre-trained models on the English sub-site. There are many languages available, but you should take care that you pick the right vocabulary size.

Parameters:

Name         Type   Description                                                          Default
lang         str    language code                                                        required
vocab_size   int    vocab size, can be 1000, 3000, 5000, 10000, 25000, 50000, 100000,    required
                    200000

Source code in tokenwiser/textprep/_sentpiece.py
@classmethod
def download(self, lang: str, vocab_size: int, filename: str = None):
    """
    Download a pre-trained model from the bpemb project.

    You can see some examples of pre-trained models on the [English](https://nlp.h-its.org/bpemb/en/) sub-site.
    There are many languages available, but you should take care that you pick the right 
    vocabulary size. 

    Arguments:
        lang: language code
        vocab_size: vocab size, can be 1000, 3000, 5000, 10000, 25000, 50000, 100000, 200000
    """
    url = f"https://bpemb.h-its.org/{lang}/{lang}.wiki.bpe.vs{vocab_size}.model"
    if not filename:
        filename = f"{lang}.wiki.bpe.vs{vocab_size}.model"
    try:
        urllib.request.urlretrieve(url=url, filename=filename)
    except HTTPError:
        raise ValueError(f"Double check if the language ({lang}) and vocab size ({vocab_size}) combo exists.")
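
A short usage sketch for the classmethod above; the default filename follows the {lang}.wiki.bpe.vs{vocab_size}.model pattern visible in the source:

from tokenwiser.textprep import SentencePiecePrep

# Download the English model with a vocabulary of 5000 subword tokens,
# then load the resulting file into a transformer.
SentencePiecePrep.download(lang="en", vocab_size=5000)
sp_tfm = SentencePiecePrep(model_file="en.wiki.bpe.vs5000.model")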

PhoneticTextPrep (TextPrep, BaseEstimator)

The PhoneticTextPrep object prepares strings by encoding them phonetically.

Parameters:

Name   Type   Description                                                    Default
kind          type of encoding, either "soundex", "metaphone" or "nysiis"    required

Usage:

from tokenwiser.textprep import PhoneticTextPrep
example1 = PhoneticTextPrep(kind="soundex").transform(["dinosaurus book"])
example2 = PhoneticTextPrep(kind="metaphone").transform(["dinosaurus book"])
example3 = PhoneticTextPrep(kind="nysiis").transform(["dinosaurus book"])

assert example1[0] == 'D526 B200'
assert example2[0] == 'TNSRS BK'
assert example3[0] == 'DANASAR BAC'

YakeTextPrep (TextPrep, BaseEstimator)

Remove all text except meaningful key-phrases. Uses yake.

Parameters:

Name     Type   Description                                         Default
top_n           number of key-phrases to select                     required
unique          only return unique keywords from the key-phrases    required

Usage:

from tokenwiser.textprep import YakeTextPrep

text = ["Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning"]
example = YakeTextPrep(top_n=3, unique=False).transform(text)

assert example[0] == 'hosts data science acquiring kaggle google is acquiring'
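
The unique flag is described as keeping only unique keywords from the key-phrases; a hedged sketch, without asserting the exact output since that depends on yake's ranking:

example_unique = YakeTextPrep(top_n=3, unique=True).transform(text)
# With unique=True each keyword is assumed to appear at most once in the result.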

SpacyMorphTextPrep (TextPrep, BaseEstimator)

Adds morphological information to the tokens in a text.

Usage:

import spacy
from tokenwiser.textprep import SpacyMorphTextPrep

nlp = spacy.load("en_core_web_sm")
example1 = SpacyMorphTextPrep(nlp).encode_single("quick! duck!")
example2 = SpacyMorphTextPrep(nlp).encode_single("hey look a duck")

assert example1 == "quick|Degree=Pos !|PunctType=Peri duck|Number=Sing !|PunctType=Peri"
assert example2 == "hey| look|VerbForm=Inf a|Definite=Ind|PronType=Art duck|Number=Sing"

SpacyPosTextPrep (TextPrep, BaseEstimator)

Adds part of speech information per token using spaCy.

Parameters:

Name           Type   Description                         Default
model                 the spaCy model to use              required
lemma                 also lemmatize the text             required
fine_grained          use fine grained parts of speech    required

Usage:

import spacy
from tokenwiser.textprep import SpacyPosTextPrep

nlp = spacy.load("en_core_web_sm")
example1 = SpacyPosTextPrep(nlp).encode_single("we need to duck")
example2 = SpacyPosTextPrep(nlp).encode_single("hey look a duck")

assert example1 == "we|PRON need|VERB to|PART duck|VERB"
assert example2 == "hey|INTJ look|VERB a|DET duck|NOUN"
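
The lemma and fine_grained parameters from the table above change what gets attached to each token; a hedged sketch, with no exact output asserted because the tags depend on the loaded spaCy model:

example3 = SpacyPosTextPrep(nlp, fine_grained=True).encode_single("we need to duck")
# fine_grained=True is assumed to attach the model's fine-grained tags
# (e.g. PRP, VBP) instead of the coarse universal tags shown above.
example4 = SpacyPosTextPrep(nlp, lemma=True).encode_single("we are running")
# lemma=True is assumed to lemmatize each token before attaching its tag.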

SpacyLemmaTextPrep (TextPrep, BaseEstimator)

Turns each token into its lemmatized version using spaCy.

Usage:

import spacy
from tokenwiser.textprep import SpacyLemmaTextPrep

nlp = spacy.load("en_core_web_sm")
example1 = SpacyLemmaTextPrep(nlp).encode_single("we are running")
example2 = SpacyLemmaTextPrep(nlp).encode_single("these are dogs")

assert example1 == 'we be run'
assert example2 == 'these be dog'

SnowballTextPrep (TextPrep, BaseEstimator)

Applies the snowball stemmer to the text.

There are 26 languages supported; for the full list, check the lefthand side of the package's PyPI page.

Usage:

from tokenwiser.textprep import SnowballTextPrep

single = SnowballTextPrep(language='english').encode_single("Dogs like running")
assert single == "Dog like run"
multi = SnowballTextPrep(language='english').transform(["Dogs like running", "Cats like sleeping"])
assert multi == ["Dog like run", "Cat like sleep"]
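
If you need the full list of supported languages programmatically, and assuming the stemming is backed by the snowballstemmer package on PyPI (which the language count suggests), you can list them directly:

import snowballstemmer

# Prints the names of all stemming algorithms/languages the package ships with.
print(snowballstemmer.algorithms())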