BytePairEncoder

embetter.text.BytePairEncoder
This encoder loads token-free pre-trained subword embeddings, originally created by Benjamin Heinzerling and Michael Strube.

The vectors are downloaded automatically via the BPEmb package. You can also pass "multi" as the language to download the multi-language embeddings. A full list of available languages can be found here, and the paper that accompanies this work can be found here. The available vocabulary sizes and dimensionalities can be verified on the project website; see here for an example link for English. Please credit the original authors if you use their work.
Parameters

Name | Type | Description | Default |
---|---|---|---|
`lang` | `str` | name of the model to load | required |
`vs` | `int` | vocabulary size of the byte pair model | `1000` |
`dim` | `int` | the embedding dimensionality | `25` |
`agg` | `str` | the aggregation method that reduces many subword vectors into a single one; can be `"max"`, `"mean"` or `"both"` | `'mean'` |
`cache_dir` | `Path` | the folder in which downloaded BPEmb files are cached; can be set to a custom folder | `None` |
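For a non-default configuration, the snippet below sketches how these parameters combine. It assumes that the chosen vocabulary size and dimensionality are offered by BPEmb for the chosen language, which you can verify on the project website.

```python
from pathlib import Path

from embetter.text import BytePairEncoder

# A larger English model: 10k byte-pair vocabulary, 100-dimensional
# vectors, max-pooled over subwords, cached in a custom folder.
enc = BytePairEncoder(
    lang="en",
    vs=10_000,
    dim=100,
    agg="max",
    cache_dir=Path("~/.cache/bpemb").expanduser(),
)

# Passing "multi" downloads the multi-language embeddings instead.
multi_enc = BytePairEncoder(lang="multi")
```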
Usage
```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import BytePairEncoder

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe,
# which is then fed into a small English model.
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    BytePairEncoder(lang="en")
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
```
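Once fitted, the classification pipeline accepts any dataframe with a `text` column, so it can also predict on unseen rows. A small sketch continuing from the snippet above (the example sentences are made up):

```python
# Continuing from the fitted `text_clf_pipeline` above.
new_dataf = pd.DataFrame({"text": ["what a great day", "this is awful"]})
print(text_clf_pipeline.predict(new_dataf))
```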
fit(self, X, y=None)

No-op. Merely checks for object inputs per sklearn standard.
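Because nothing is learned, `fit` only validates its input and the encoder can be used immediately afterwards (a minimal sketch):

```python
from embetter.text import BytePairEncoder

enc = BytePairEncoder(lang="en")
# No training happens here; the call merely checks
# that X holds objects (strings).
enc.fit(["any text at all"])
```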
transform(self, X, y=None)

Transforms the phrase text into a numeric representation.
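Since `fit` is a no-op, `transform` can also be called directly on a fresh encoder. A minimal sketch, assuming the defaults (`vs=1000`, `dim=25`, `agg="mean"`), which yield one 25-dimensional vector per input text:

```python
from embetter.text import BytePairEncoder

enc = BytePairEncoder(lang="en")
emb = enc.transform(["an example", "another example"])

# One row per input text, one column per embedding dimension.
print(emb.shape)  # expected: (2, 25)
```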