textprep
from tokenwiser.textprep import *
In the textprep
submodule you can find scikit-learn compatible
components that transform text into another type of text. The idea
is that these can be combined in interesting ways with CountVectorizers.
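For example, here is a minimal sketch of that idea; the specific combination of Cleaner and HyphenTextPrep (both documented below) with a CountVectorizer is only illustrative, not prescribed by the library.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from tokenwiser.textprep import Cleaner, HyphenTextPrep

# Clean the text, split words into hyphenation-based subtokens,
# then count those subtokens with a CountVectorizer.
pipe = make_pipeline(Cleaner(), HyphenTextPrep(), CountVectorizer())
X = pipe.fit_transform(["Geology rocks!", "Astrology is not geology."])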
Cleaner (TextPrep, BaseEstimator)
Lowercases the text and removes non-alphanumeric characters.
Usage:
from tokenwiser.textprep import Cleaner
single = Cleaner().encode_single("$$$5 dollars")
assert single == "5 dollars"
multi = Cleaner().transform(["$$$5 dollars", "#hashtag!"])
assert multi == ["5 dollars", "hashtag"]
Identity (TextPrep, BaseEstimator)
Keeps the text as is. Can be used as a placeholder in a pipeline.
Usage:
from tokenwiser.textprep import Identity
text = ["hello", "world"]
example = Identity().transform(text)
assert example == ["hello", "world"]
The main use-case is as a placeholder.
from tokenwiser.pipeline import make_concat
from sklearn.pipeline import make_pipeline, make_union
from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep
pipe = make_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
)
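A rough sketch of calling this placeholder pipeline; the exact hyphenation depends on the underlying dictionary, so the comment below is only indicative.
# For "geology" the concatenated output is expected to look roughly like
# "geology geo logy": the cleaned text followed by its hyphenated variant.
pipe.fit_transform(["geology"])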
HyphenTextPrep (TextPrep, BaseEstimator)
Hyphenate the text going in.
Usage:
from tokenwiser.textprep import HyphenTextPrep
multi = HyphenTextPrep().transform(["geology", "astrology"])
assert multi == ['geo logy', 'as tro logy']
SentencePiecePrep (TextPrep, BaseEstimator)
The SentencePiecePrep object splits text into subtokens based on a pre-trained model.
You can find many pre-trained subword tokenizers via the bpemb project. For example, the English sub-site lists models for several vocabulary sizes. Note that the project hosts pre-trained subword tokenizers for 275 languages.
Note that you can train your own sentencepiece tokenizer as well.
import sentencepiece as spm
# This saves a file named `mod.model` which can be read in later.
spm.SentencePieceTrainer.train('--input=tests/data/nlp.txt --model_prefix=mod --vocab_size=2000')
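Assuming the training call above produced a file named mod.model, that file can be passed straight to SentencePiecePrep. The resulting subtokens depend on the training corpus, so no exact output is shown here.
from tokenwiser.textprep import SentencePiecePrep

# Load the freshly trained model and split text into its subtokens.
sp_tfm = SentencePiecePrep(model_file="mod.model")
sp_tfm.transform(["talking about geology"])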
Parameters:
Name | Type | Description | Default
---|---|---|---
model_file | | pre-trained model file | required
Usage:
from tokenwiser.textprep import SentencePiecePrep
sp_tfm = SentencePiecePrep(model_file="tests/data/en.vs5000.model")
texts = ["talking about geology"]
example = sp_tfm.transform(texts)
assert example == ['▁talk ing ▁about ▁ge ology']
download(lang, vocab_size, filename=None)
classmethod
Download a pre-trained model from the bpemb project.
You can see some examples of pre-trained models on the English sub-site. There are many languages available, but you should take care that you pick the right vocabulary size.
Parameters:
Name | Type | Description | Default
---|---|---|---
lang | str | language code | required
vocab_size | int | vocab size, can be 1000, 3000, 5000, 10000, 25000, 50000, 100000, 200000 | required
Source code in tokenwiser/textprep/_sentpiece.py
@classmethod
def download(self, lang: str, vocab_size: int, filename: str = None):
    """
    Download a pre-trained model from the bpemb project.

    You can see some examples of pre-trained models on the [English](https://nlp.h-its.org/bpemb/en/) sub-site.
    There are many languages available, but you should take care that you pick the right
    vocabulary size.

    Arguments:
        lang: language code
        vocab_size: vocab size, can be 1000, 3000, 5000, 10000, 25000, 50000, 100000, 200000
    """
    url = f"https://bpemb.h-its.org/{lang}/{lang}.wiki.bpe.vs{vocab_size}.model"
    if not filename:
        filename = f"{lang}.wiki.bpe.vs{vocab_size}.model"
    try:
        urllib.request.urlretrieve(url=url, filename=filename)
    except HTTPError:
        raise ValueError(f"Double check if the language ({lang}) and vocab size ({vocab_size}) combo exist.")
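As a sketch of how the classmethod and the transformer fit together; the file name below follows the default shown in the source above.
from tokenwiser.textprep import SentencePiecePrep

# Fetch the English model with vocabulary size 5000; by default it is saved
# as `en.wiki.bpe.vs5000.model` in the current working directory.
SentencePiecePrep.download(lang="en", vocab_size=5000)
sp_tfm = SentencePiecePrep(model_file="en.wiki.bpe.vs5000.model")
sp_tfm.transform(["talking about geology"])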
PhoneticTextPrep (TextPrep, BaseEstimator)
The PhoneticTextPrep object prepares strings by encoding them phonetically.
Parameters:
Name | Type | Description | Default
---|---|---|---
kind | | type of encoding, either `soundex`, `metaphone` or `nysiis` | required
Usage:
from tokenwiser.textprep import PhoneticTextPrep
example1 = PhoneticTextPrep(kind="soundex").transform(["dinosaurus book"])
example2 = PhoneticTextPrep(kind="metaphone").transform(["dinosaurus book"])
example3 = PhoneticTextPrep(kind="nysiis").transform(["dinosaurus book"])
assert example1[0] == 'D526 B200'
assert example2[0] == 'TNSRS BK'
assert example3[0] == 'DANASAR BAC'
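One way to use this, sketched here as an illustration rather than a prescribed recipe, is to place the phonetic encoding in front of a CountVectorizer so that words that sound alike, such as common misspellings, are more likely to share columns.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from tokenwiser.textprep import PhoneticTextPrep

# Phonetically similar spellings tend to map to the same code, which makes
# the bag-of-words representation more forgiving of typos.
pipe = make_pipeline(PhoneticTextPrep(kind="soundex"), CountVectorizer())
X = pipe.fit_transform(["dinosaurus book", "dinosawrus book"])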
YakeTextPrep (TextPrep, BaseEstimator)
Remove all text except meaningful key-phrases. Uses yake.
Parameters:
Name | Type | Description | Default
---|---|---|---
top_n | | number of key-phrases to select | required
unique | | only return unique keywords from the key-phrases | required
Usage:
from tokenwiser.textprep import YakeTextPrep
text = ["Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning"]
example = YakeTextPrep(top_n=3, unique=False).transform(text)
assert example[0] == 'hosts data science acquiring kaggle google is acquiring'
SpacyMorphTextPrep (TextPrep, BaseEstimator)
Adds morphological information to tokens in the text.
Usage:
import spacy
from tokenwiser.textprep import SpacyMorphTextPrep
nlp = spacy.load("en_core_web_sm")
example1 = SpacyMorphTextPrep(nlp).encode_single("quick! duck!")
example2 = SpacyMorphTextPrep(nlp).encode_single("hey look a duck")
assert example1 == "quick|Degree=Pos !|PunctType=Peri duck|Number=Sing !|PunctType=Peri"
assert example2 == "hey| look|VerbForm=Inf a|Definite=Ind|PronType=Art duck|Number=Sing"
SpacyPosTextPrep (TextPrep, BaseEstimator)
Adds part of speech information per token using spaCy.
Parameters:
Name | Type | Description | Default
---|---|---|---
model | | the spaCy model to use | required
lemma | | also lemmatize the text | required
fine_grained | | use fine grained parts of speech | required
Usage:
import spacy
from tokenwiser.textprep import SpacyPosTextPrep
nlp = spacy.load("en_core_web_sm")
example1 = SpacyPosTextPrep(nlp).encode_single("we need to duck")
example2 = SpacyPosTextPrep(nlp).encode_single("hey look a duck")
assert example1 == "we|PRON need|VERB to|PART duck|VERB"
assert example2 == "hey|INTJ look|VERB a|DET duck|NOUN"
SpacyLemmaTextPrep (TextPrep, BaseEstimator)
Turns each token into its lemmatized version using spaCy.
Usage:
import spacy
from tokenwiser.textprep import SpacyLemmaTextPrep
nlp = spacy.load("en_core_web_sm")
example1 = SpacyLemmaTextPrep(nlp).encode_single("we are running")
example2 = SpacyLemmaTextPrep(nlp).encode_single("these are dogs")
assert example1 == 'we be run'
assert example2 == 'these be dog'
SnowballTextPrep (TextPrep, BaseEstimator)
Applies the snowball stemmer to the text.
There are 26 supported languages; for the full list, check the lefthand side of the project's page on PyPI.
Usage:
from tokenwiser.textprep import SnowballTextPrep
single = SnowballTextPrep(language='english').encode_single("Dogs like running")
assert single == "Dog like run"
multi = SnowballTextPrep(language='english').transform(["Dogs like running", "Cats like sleeping"])
assert multi == ["Dog like run", "Cat like sleep"]