BytePairEncoder

embetter.text.BytePairEncoder

This component wraps token-free pre-trained subword embeddings, originally created by Benjamin Heinzerling and Michael Strube.

These vectors are automatically downloaded by the BPEmb package. You can also pass "multi" to download multilingual embeddings. A full list of available languages can be found here. The article that this work is based on can be found here. The available vocabulary sizes and dimensionalities can be verified on the project website; see here for an example link for English. Please credit the original authors if you use their work.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `lang` | str | name of the model to load | required |
| `vs` | int | vocabulary size of the byte pair model | `1000` |
| `dim` | int | the embedding dimensionality | `25` |
| `agg` | str | the aggregation method to reduce many subword vectors into a single one; can be `"max"`, `"mean"` or `"both"` | `"mean"` |
| `cache_dir` | Path | the folder in which downloaded BPEmb files will be cached; can be overwritten with a custom folder | `None` |
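To build intuition for how `vs`, `dim` and `agg` interact, here is a toy sketch (it does not use the real BPEmb model or download anything; the embedding table and subword ids are made up for illustration). Note that `agg="both"` concatenates the mean and max vectors, so the output dimensionality doubles.

```python
import numpy as np

# Hypothetical stand-in for a byte pair embedding model: a lookup table
# with `vs` rows (one per subword) and `dim` columns.
vs, dim = 1000, 25
rng = np.random.default_rng(0)
table = rng.normal(size=(vs, dim))

def embed_subwords(ids):
    # One row per subword id -> shape (n_subwords, dim)
    return table[ids]

def aggregate(vectors, agg="mean"):
    # Reduce (n_subwords, dim) down to a single vector per text.
    if agg == "mean":
        return vectors.mean(axis=0)  # shape (dim,)
    if agg == "max":
        return vectors.max(axis=0)   # shape (dim,)
    if agg == "both":                # shape (2 * dim,)
        return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])
    raise ValueError(f"Unknown agg: {agg}")

vecs = embed_subwords([3, 17, 256])
print(aggregate(vecs, "mean").shape)  # (25,)
print(aggregate(vecs, "both").shape)  # (50,)
```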

Usage

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import BytePairEncoder

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe
# which is then fed into a small English model
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    BytePairEncoder(lang="en")
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

fit(self, X, y=None)

Source code in text/_bpemb.py
    def fit(self, X, y=None):
        """No-op. Merely checks for object inputs per sklearn standard."""
        # Scikit-learn also expects this in the `.fit()` command.
        self._check_inputs(X)
        return self


transform(self, X, y=None)

Source code in text/_bpemb.py
    def transform(self, X, y=None):
        """Transforms the phrase text into a numeric representation."""
        self._check_inputs(X)
        if self.agg == "mean":
            return np.array([self.module.embed(x).mean(axis=0) for x in X])
        if self.agg == "max":
            return np.array([self.module.embed(x).max(axis=0) for x in X])
        if self.agg == "both":
            mean_arr = np.array([self.module.embed(x).mean(axis=0) for x in X])
            max_arr = np.array([self.module.embed(x).max(axis=0) for x in X])
            return np.concatenate([mean_arr, max_arr], axis=1)

Transforms the phrase text into a numeric representation.
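The batch behaviour of `transform` can be sketched with a mock in place of the real BPEmb module (the `MockEmbedder` below is hypothetical; it only mimics the `.embed()` interface, returning a variable number of 4-dimensional subword vectors per text):

```python
import numpy as np

# Hypothetical stand-in for the BPEmb module: `.embed()` returns one
# vector per subword, so texts yield arrays of different lengths.
class MockEmbedder:
    def embed(self, text):
        rng = np.random.default_rng(len(text))
        n_subwords = max(1, len(text.split()))
        return rng.normal(size=(n_subwords, 4))

module = MockEmbedder()
X = ["positive sentiment", "super negative", "ok"]

# Mirrors the transform logic above for agg="both": mean and max are
# computed per text, then concatenated column-wise.
mean_arr = np.array([module.embed(x).mean(axis=0) for x in X])
max_arr = np.array([module.embed(x).max(axis=0) for x in X])
out = np.concatenate([mean_arr, max_arr], axis=1)
print(out.shape)  # (3, 8): one row per text, mean and max side by side
```

Regardless of how many subwords a text produces, each text is reduced to a fixed-size row, which is what makes the output usable as a feature matrix in scikit-learn pipelines.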