
FAQ

Why can't I use normal Pipeline objects with the spaCy API?

Scikit-learn assumes that a model is trained via .fit(X, y) and then used via .predict(X). This works well when the dataset fits fully in memory, but not when it is too big to process in one go. This is one of the main reasons why spaCy has an .update() API for its trainable pipeline components. It's similar to .partial_fit(X) in scikit-learn: rather than training on the whole dataset in a single call, you iteratively train on subsets of it.
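To make the comparison concrete, here is a minimal sketch of incremental training with plain scikit-learn. The toy texts, the batch size of 2 and the `batches` helper are assumptions made for illustration. HashingVectorizer is used because it is stateless and never needs fitting.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data, assumed for illustration only.
texts = ["you are a nice person", "this is a great movie",
         "i do not like coffee", "i hate tea"]
labels = ["pos", "pos", "neg", "neg"]

vectorizer = HashingVectorizer()      # stateless: transform() works without fit()
clf = SGDClassifier(random_state=0)


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


for epoch in range(10):
    for text_batch, label_batch in zip(batches(texts, 2), batches(labels, 2)):
        X = vectorizer.transform(text_batch)
        # A single batch may not contain every label, so all classes
        # are passed explicitly to partial_fit.
        clf.partial_fit(X, label_batch, classes=["neg", "pos"])

predictions = clf.predict(vectorizer.transform(["this is a great movie"]))
```

Each call to partial_fit only sees one batch, which is the same idea as calling .update() repeatedly on small batches of spaCy examples.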

A big downside of scikit-learn's Pipeline API is that it cannot use .partial_fit(X). Even if all the components inside are compatible, it forces you to use .fit(X). That is why this library offers a PartialPipeline. It only allows components that implement .partial_fit, and it's these pipelines that can also comply with spaCy's .update() API.
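The idea behind such a pipeline can be sketched in a few lines of plain Python. This is not tokenwiser's implementation: ToyPartialPipeline, Lowercase and WordVoteClassifier are hypothetical names invented for this example. The sketch only shows how a pipeline might reject steps without partial_fit and pass each batch through the chain.

```python
from collections import Counter, defaultdict


class Lowercase:
    """Hypothetical stateless transformer; partial_fit has nothing to learn."""

    def partial_fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class WordVoteClassifier:
    """Hypothetical classifier that counts how often each word appears per label."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def partial_fit(self, X, y):
        for text, label in zip(X, y):
            self.counts[label].update(text.split())
        return self

    def predict(self, X):
        def score(label, text):
            return sum(self.counts[label][word] for word in text.split())
        return [max(self.counts, key=lambda label: score(label, text)) for text in X]


class ToyPartialPipeline:
    """Illustrative stand-in for a pipeline that only accepts partial_fit steps."""

    def __init__(self, steps):
        for name, step in steps:
            if not hasattr(step, "partial_fit"):
                raise ValueError(f"step {name!r} does not implement partial_fit")
        self.steps = steps

    def partial_fit(self, X, y=None):
        # Update every transformer on this batch, then feed its output onward.
        for _, step in self.steps[:-1]:
            X = step.partial_fit(X, y).transform(X)
        self.steps[-1][1].partial_fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = ToyPartialPipeline([("lower", Lowercase()), ("clf", WordVoteClassifier())])
pipe.partial_fit(["You are NICE", "great movie"], ["pos", "pos"])  # first batch
pipe.partial_fit(["I HATE tea"], ["neg"])                          # second batch
result = pipe.predict(["nice movie", "hate tea"])
```

Because every step exposes partial_fit, the pipeline as a whole can learn one batch at a time.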

Note that all scikit-learn components offered by this library are compatible with the PartialPipeline. This includes everything from the tokenwiser.textprep submodule.

Can I train spaCy with scikit-learn from Jupyter?

It's not our favorite way of doing things, but nobody is stopping you.

import spacy 
from spacy import registry
from spacy.training import Example
from spacy.language import Language

from tokenwiser.pipeline import PartialPipeline
from tokenwiser.model.sklearnmod import SklearnCat
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

@Language.factory("custom-sklearn-cat")
def make_sklearn_cat(nlp, name, sklearn_model, label, classes):
    return SklearnCat(nlp, name, sklearn_model, label, classes)

@registry.architectures("sklearn_model_basic_sgd.v1")
def make_sklearn_cat_basic_sgd():
    """This creates a *partial* pipeline. We can't use a standard pipeline from scikit-learn."""
    # scikit-learn >= 1.1 names this loss "log_loss"; older releases use "log".
    return PartialPipeline([("hash", HashingVectorizer()), ("lr", SGDClassifier(loss="log_loss"))])


nlp = spacy.load("en_core_web_sm")
config = {
    "sklearn_model": "@sklearn_model_basic_sgd.v1", 
    "label": "pos", 
    "classes": ["pos", "neg"]
}
nlp.add_pipe("custom-sklearn-cat", config=config)

texts = [
    "you are a nice person", 
    "this is a great movie", 
    "i do not like cofee", 
    "i hate tea"
]
labels = ["pos", "pos", "neg", "neg"]

# This is the training loop just for our categorizer model.
with nlp.select_pipes(enable="custom-sklearn-cat"):
    optimizer = nlp.resume_training()
    for loop in range(10):
        for t, lab in zip(texts, labels):
            doc = nlp.make_doc(t)
            example = Example.from_dict(doc, {"cats": {"pos": lab}})
            nlp.update([example], sgd=optimizer)

nlp("you are a nice person").cats # {'pos': 0.9979167909733176}
nlp("coffee i do not like").cats # {'neg': 0.990049724779963}