OpenAIEncoder

embetter.external.OpenAIEncoder

Encoder that can numerically encode sentences.

Note that this is an external embedding provider. If their API breaks, so will this component. We also assume that you've already imported openai upfront and run these commands:

import openai

openai.organization = OPENAI_ORG
openai.api_key = OPENAI_KEY
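The names `OPENAI_ORG` and `OPENAI_KEY` above are placeholders that you have to define yourself. One way to do that, sketched here under the assumption that you keep both values in environment variables of the same names (the variable names and the `load_openai_credentials` helper are hypothetical, not part of embetter or openai):

```python
import os


def load_openai_credentials(env=os.environ):
    """Return (organization, api_key) read from a mapping of environment variables."""
    # OPENAI_ORG and OPENAI_KEY are assumed names that mirror the placeholders above.
    missing = [name for name in ("OPENAI_ORG", "OPENAI_KEY") if not env.get(name)]
    if missing:
        # Fail early instead of sending requests without credentials.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["OPENAI_ORG"], env["OPENAI_KEY"]
```

The returned pair can then be assigned to `openai.organization` and `openai.api_key` before the encoder is used.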

Parameters

Name        Type  Description                                 Default
model             Name of the OpenAI embedding model to use.  'text-embedding-ada-002'
batch_size        Batch size to send to OpenAI.               25

Usage:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.external import OpenAIEncoder

import openai

# You must run this first!
openai.organization = OPENAI_ORG
openai.api_key = OPENAI_KEY

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into OpenAI's endpoint
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    OpenAIEncoder()
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

transform(self, X, y=None)

Show source code in external/_openai.py
    def transform(self, X, y=None):
        """Transforms the text into a numeric representation."""
        result = []
        for b in _batch(X, self.batch_size):
            # Send only the current batch, not the full input
            resp = openai.Embedding.create(input=b, model=self.model)  # fmt: off
            result.extend([_["embedding"] for _ in resp["data"]])
        return np.array(result)

Transforms the text into a numeric representation.
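The `transform` method iterates over `_batch(X, self.batch_size)`, but the helper itself is not shown on this page. As a rough illustration of the batching idea only, here is a minimal sketch that assumes the helper yields consecutive slices of the input. The names `batch_sketch`, `embed_in_batches`, and `fake_embed` are hypothetical, and `fake_embed` stands in for the remote call so the example runs without an API key:

```python
from typing import Callable, Iterator, List, Sequence


def batch_sketch(items: Sequence[str], n: int = 25) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `n` elements."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(0, len(items), n):
        yield items[start:start + n]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 25,
) -> List[List[float]]:
    """Call `embed_fn` once per batch and collect the vectors in input order."""
    result: List[List[float]] = []
    for chunk in batch_sketch(texts, batch_size):
        # Only the current chunk is passed on, never the full input.
        result.extend(embed_fn(chunk))
    return result


def fake_embed(chunk: Sequence[str]) -> List[List[float]]:
    """Stand-in for an embedding endpoint: one 2-dimensional vector per text."""
    return [[float(len(text)), 1.0] for text in chunk]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

With a batch size of 2 and three texts, `fake_embed` is called twice, and `vectors` holds one row per input text in the original order.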