
Scikit-Learn

Scikit-Learn pipelines are amazing but they are not perfect for simple text use-cases.

  • The standard pipeline does not allow for interactive learning. You can apply .fit but that's it. Even if the components inside the pipeline have a .partial_fit available, the pipeline doesn't allow it.
  • The CountVectorizer is great, but we might need some more specialized text tricks at our disposal to make this object more effective.

Part of what this library does is give more tools that extend scikit-learn for simple text classification problems. In this document we will showcase some of the main features.

Text Preparation Tools

Let's first discuss a basic pipeline for text inside of scikit-learn.

Base Pipeline

The simplest text classification pipeline in scikit-learn looks like this:

from sklearn.pipeline import make_pipeline 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

pipe = make_pipeline(
    CountVectorizer(), 
    SGDClassifier()
)

This pipeline will encode words as sparse features before passing them on to the classifier. This pattern is very common and has proven to work well enough for many English text classification tasks.
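
As a quick sketch of how you'd use this pipeline (the example data below is made up):

# Made-up example data, just to show the fit/predict flow.
X = ["i like this post", "great comment", "this is bad", "do not like it"]
y = [1, 1, 0, 0]

pipe.fit(X, y)
preds = pipe.predict(["i like this comment"])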

The nice thing about using an SGDClassifier is that we're able to learn from our data even if the dataset does not fit in memory. We can call .partial_fit instead of .fit and learn in a more "online" setting.
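
Outside of a pipeline, that online pattern looks roughly like this. This is a minimal sketch with made-up mini-batches; it uses a HashingVectorizer because that vectorizer is stateless and therefore doesn't need to see all the data up front.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up stream of (texts, labels) mini-batches.
batches = [
    (["i like this post", "great comment"], [1, 1]),
    (["this is bad", "poorly written"], [0, 0]),
]

vec = HashingVectorizer()  # stateless: .transform works without .fit
clf = SGDClassifier()

for texts, labels in batches:
    # Pass classes so the model knows every label up front.
    clf.partial_fit(vec.transform(texts), labels, classes=[0, 1])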

That said, there are things we can do even to this pipeline to make it better.

Spelling Errors

When you are classifying online texts you are often confronted with spelling errors. To deal with this you'd typically use a CountVectorizer with a character-level analyzer such that you also encode subwords.

With all of these subwords around, we'll be more robust against spelling errors.
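
For example, with scikit-learn's built-in character analyzer:

from sklearn.feature_extraction.text import CountVectorizer

# char_wb creates character n-grams within word boundaries, so
# "astrology" and the misspelled "astrologie" share many features.
cv = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
cv.fit(["astrology", "astrologie"])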

The downside of this approach is that it generates a lot of subwords, and you might wonder whether we really need all of them. So how about this: let's add a step that turns our text into subwords by splitting words at their hyphenation points.

from tokenwiser.textprep import HyphenTextPrep

multi = HyphenTextPrep().transform(["geology", "astrology"])

assert multi == ['geo logy', 'as tro logy']

The HyphenTextPrep preprocessor is a TextPrep-object. For all intents and purposes these are scikit-learn compatible preprocessing components but they all output strings instead of arrays. What's nice about these though is that you can "retokenize" the original text. This allows you to use the subtokens as if they were tokens which might help keep your pipelines lightweight while still keeping them robust against certain spelling errors.
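
As a small sketch of what this retokenization buys you: feed the subword output into a CountVectorizer and the vocabulary is built out of subwords, so different words can still share features.

from sklearn.feature_extraction.text import CountVectorizer
from tokenwiser.textprep import HyphenTextPrep

subworded = HyphenTextPrep().transform(["geology", "astrology"])
cv = CountVectorizer().fit(subworded)

# Both words now share the "logy" feature.
print(cv.get_feature_names_out())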

Long Texts

There are some other tricks that you might want to apply to longer texts. Maybe you want to summarise a text before vectorizing it, in which case it'd be nice to have a transformer that keeps only the most important tokens.

A neat heuristic toolkit for this is yake (you can find a demo here). This library also features a scikit-learn compatible component for it.

from tokenwiser.textprep import YakeTextPrep

text = [
    "Sources tell us that Google is acquiring Kaggle, "
    "a platform that hosts data science and machine learning"
]
example = YakeTextPrep(top_n=3, unique=False).transform(text)

assert example[0] == 'hosts data science acquiring kaggle google is acquiring'

The idea here is to reduce the text down to only the most important words. Again, this trick might keep the algorithm lightweight, and it will go a lot further than most stopword lists.

Bag of Tricks!

The goal of this library is to host a few meaningful tricks that might be helpful. Here are some more:

  • Cleaner lowercases text and removes all non-alphanumeric characters.
  • Identity just keeps the text as-is, useful when constructing elaborate pipelines.
  • PhoneticTextPrep translates text into a phonetic encoding.
  • SpacyPosTextPrep adds part-of-speech information to the text using spaCy.
  • SpacyLemmaTextPrep lemmatizes the text using spaCy.

All of these tools are part of the textprep submodule and are documented in detail here.
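
To give a feel for the API, here is a minimal sketch using two of these. It assumes they follow the same transform-strings-to-strings pattern as HyphenTextPrep; the kind argument to PhoneticTextPrep is an assumption, so check the textprep docs for the exact signature.

from tokenwiser.textprep import Cleaner, PhoneticTextPrep

# Cleaner lowercases and strips non-alphanumeric characters.
print(Cleaner().transform(["Hello, World!!"]))  # e.g. ['hello world']

# The kind= argument here is an assumption; see the textprep docs.
print(PhoneticTextPrep(kind="soundex").transform(["dinosaur"]))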

Pipeline Tools

Pipeline components are certainly nice, but maybe we can go a step further and make better pipelines for text too!

Concatenate Text

In scikit-learn you would use FeatureUnion or make_union to concatenate features in a pipeline. It is assumed that transformers output arrays that need to be concatenated, so the result of a concatenation is always a 2D array. This can be a bit awkward if you're using text preprocessors.

The reason why we want to keep everything a string is so that the CountVectorizer from scikit-learn can properly encode it. That is why this library comes with a special union component: TextConcat. It concatenates the output of text-prep tools into a string instead of an array. Note that we also pack a convenient make_concat function.

from sklearn.pipeline import make_pipeline 

from tokenwiser.pipeline import make_concat
from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep

pipe = make_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
)

output = pipe.fit_transform(["hello astrology!!!!"])
assert output == ['hello astrology hel lo astro logy']

Again, we see that we're taking a text input and that we're generating a text output. The make_concat step makes sure that we concatenate strings, not arrays! This is great when we want to follow up with a CountVectorizer!

from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

from tokenwiser.pipeline import make_concat
from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep

pipe = make_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
    CountVectorizer(), 
    LogisticRegression()
)

The mental picture for this pipeline: the raw text is cleaned, the cleaned text and its hyphenated subwords are concatenated into a single string, and that string is vectorized before it reaches the classifier.

Partial Fit

We can go a step further though. The scikit-learn pipeline follows the fit/predict API, which means that we cannot use .partial_fit(), even if all the components in the pipeline are compatible with the partial_fit/predict API. That is why this library also introduces components for mini-batch learning: PartialPipeline and make_partial_pipeline.

In these scenarios you will need to swap out the CountVectorizer for a HashingVectorizer in order to be able to learn from new data coming in; the HashingVectorizer is stateless, so it does not need to see all the data up front.

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep
from tokenwiser.pipeline import make_concat, make_partial_pipeline

pipe = make_partial_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
    HashingVectorizer(), 
    SGDClassifier()
)

This pipeline is scikit-learn compatible for all intents and purposes, but it has the option of learning from batches of data via partial_fit. This is great because it means that you're able to classify text even when the dataset doesn't fit into memory!

Note that all of the TextPrep-components in this library allow for partial_fit.

To make partial_fit actually work you will need to supply the names of the classes at learning time. Otherwise you might accidentally get a batch that contains only one class and the algorithm would become numerically unstable.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep
from tokenwiser.pipeline import make_concat, make_partial_pipeline

pipe = make_partial_pipeline(
    Cleaner(),
    make_concat(Identity(), HyphenTextPrep()),
    HashingVectorizer(),
    SGDClassifier()
)

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 1, 1, 0, 0, 0])

for loop in range(3):
    # It might make sense to loop over the same dataset multiple times
    # if the dataset is small. For larger datasets this isn't recommended. 
    pipe.partial_fit(X, y, classes=[0, 1])

assert np.all(pipe.predict(X) == np.array([1, 1, 1, 0, 0, 0]))

Concatenate Features

The standard FeatureUnion from scikit-learn also does not allow for .partial_fit. So we've added a PartialFeatureUnion class and a make_partial_union function to this library as well.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from tokenwiser.textprep import Cleaner, Identity, HyphenTextPrep
from tokenwiser.pipeline import make_partial_pipeline, make_partial_union

pipe = make_partial_pipeline(
    Cleaner(),
    make_partial_union(
        make_partial_pipeline(Identity(), HashingVectorizer()),
        make_partial_pipeline(HyphenTextPrep(), HashingVectorizer())
    ),
    SGDClassifier()
)

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 1, 1, 0, 0, 0])

for loop in range(3):
    pipe.partial_fit(X, y, classes=[0, 1])

assert np.all(pipe.predict(X) == np.array([1, 1, 1, 0, 0, 0]))