SentenceEncoder

embetter.text.SentenceEncoder

Encoder that can numerically encode sentences.

Parameters

Name | Type | Description | Default
name | | name of model, see available options | 'all-MiniLM-L6-v2'
device | | manually override cpu/gpu device, tries to grab gpu automatically when available | None
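
As a rough illustration of the `device` default described above, the sketch below shows one way an explicit override could take precedence over automatic GPU detection. The helper `pick_device` and its `cuda_available` flag are hypothetical names invented for this example and are not part of embetter. In practice the flag would likely come from something like `torch.cuda.is_available()`.

```python
def pick_device(device=None, cuda_available=False):
    """Return the explicit device if one is given, otherwise prefer the GPU.

    Hypothetical helper: an assumption about how the fallback might
    behave, not the library's own code.
    """
    if device is not None:
        # A manual override always wins.
        return device
    # No override: use the GPU when one is reported as available.
    return "cuda" if cuda_available else "cpu"


print(pick_device())                            # cpu
print(pick_device(cuda_available=True))         # cuda
print(pick_device("cpu", cuda_available=True))  # cpu
```

Passing `device="cpu"` to `SentenceEncoder` would correspond to the last call: the override is used even when a GPU is present.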

The following model names should be supported:

  • all-mpnet-base-v2
  • multi-qa-mpnet-base-dot-v1
  • all-distilroberta-v1
  • all-MiniLM-L12-v2
  • multi-qa-distilbert-cos-v1
  • all-MiniLM-L6-v2
  • multi-qa-MiniLM-L6-cos-v1
  • paraphrase-multilingual-mpnet-base-v2
  • paraphrase-albert-small-v2
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-MiniLM-L3-v2
  • distiluse-base-multilingual-cased-v1
  • distiluse-base-multilingual-cased-v2

You can find more options and information on the sentence-transformers docs page.

Usage:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe,
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder('all-MiniLM-L6-v2')
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

transform(self, X, y=None)

Source code in text/_sbert.py
    def transform(self, X, y=None):
        """Transforms the text into a numeric representation."""
        # Convert pd.Series objects to an encode-compatible NumPy array
        if isinstance(X, pd.Series):
            X = X.to_numpy()

        return self.tfm.encode(X)

Transforms the text into a numeric representation.
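
To see the `pd.Series` guard in `transform` at work without downloading a model, here is a minimal sketch that swaps the real model for a stand-in. `FakeModel` and `SketchEncoder` are hypothetical names made up for this example. The fake `encode` just returns character and word counts, so the numbers are not real embeddings.

```python
import numpy as np
import pandas as pd


class FakeModel:
    """Hypothetical stand-in for a sentence-transformers model."""

    def encode(self, texts):
        # One row per text, two made-up features: character and word counts.
        return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


class SketchEncoder:
    """Mirrors the shape of the transform method shown above."""

    def __init__(self, tfm):
        self.tfm = tfm

    def transform(self, X, y=None):
        # Same guard as the source: a pd.Series becomes a NumPy array first.
        if isinstance(X, pd.Series):
            X = X.to_numpy()
        return self.tfm.encode(X)


texts = pd.Series(["positive sentiment", "super negative"])
emb = SketchEncoder(FakeModel()).transform(texts)
print(emb.shape)  # (2, 2)
```

The output has one row per input text. With a real model the second dimension would be the embedding size rather than 2.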