whatlies.language.CountVectorLanguage

This object is used to lazily fetch Embeddings or EmbeddingSets from a countvector language backend. This object is meant for retrieval, not plotting.

This model will first train a scikit-learn CountVectorizer, after which it performs dimensionality reduction to turn the counts into a dense numeric vector. The reduction occurs via TruncatedSVD, also from scikit-learn.
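The two steps described above can be sketched with scikit-learn directly. This is a minimal sketch of what the backend does internally (assuming scikit-learn is installed); it mirrors the default `analyzer="char"` settings and is not the whatlies implementation itself:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

texts = ["pizza", "pizzas", "firehouse", "firehydrant"]

# Step 1: count character ngrams (the default analyzer is "char").
cv = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X_counts = cv.fit_transform(texts)

# Step 2: reduce the sparse count matrix to one dense vector per string.
svd = TruncatedSVD(n_components=2, random_state=42)
X_vec = svd.fit_transform(X_counts)

print(X_vec.shape)  # one 2-dimensional vector per input string
```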

Warning

This method does not implement a word embedding in the traditional sense, so the interpretation needs to be altered: the information captured here relates only to the words/characters used in the text. No notion of meaning should be inferred.

Also, in order to keep this system consistent with the rest of the API, the system is trained when you retrieve vectors via `__getitem__`. If you want to separate train and test, you need to call `fit_manual` yourself or use it in a scikit-learn pipeline.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `n_components` | `int` | Number of components that TruncatedSVD will reduce to. | required |
| `lowercase` | `bool` | If the tokens need to be lowercased beforehand. | `True` |
| `analyzer` | `str` | Which analyzer to use; can be `"word"`, `"char"`, `"char_wb"`. | `'char'` |
| `ngram_range` | `Tuple[int, int]` | The range that specifies how many ngrams to use. | `(1, 2)` |
| `min_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly lower than the given threshold. | `1` |
| `max_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly higher than the given threshold. | `1.0` |
| `binary` | `bool` | Determines if the counts are binary or if they can accumulate. | `False` |
| `strip_accents` | `str` | Remove accents and perform normalisation; can be set to `"ascii"` or `"unicode"`. | `None` |
| `random_state` | `int` | Random state for the SVD algorithm. | `42` |

For a more elaborate explanation of these arguments, check out the scikit-learn documentation.

Usage:

```python
> from whatlies.language import CountVectorLanguage
> lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
> lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```

__getitem__(self, query)

Show source code in language/_countvector_lang.py
    def __getitem__(self, query: Union[str, List[str]]):
        """
        Retrieve a set of embeddings.

        Arguments:
            query: list of strings

        **Usage**

        ```python
        > from whatlies.language import CountVectorLanguage
        > lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
        > lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
        ```
        """
        orig_str = isinstance(query, str)
        if orig_str:
            query = [query]
        if any([len(q) == 0 for q in query]):
            raise ValueError(
                "You've passed an empty string to the language model which is not allowed."
            )
        if self.fitted_manual:
            X = self.cv.transform(query)
            X_vec = self.svd.transform(X)
        else:
            X = self.cv.fit_transform(query)
            X_vec = self.svd.fit_transform(X)
        if orig_str:
            return Embedding(name=query[0], vector=X_vec[0])
        return EmbeddingSet(
            *[Embedding(name=n, vector=v) for n, v in zip(query, X_vec)]
        )

Retrieve a set of embeddings.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `query` | `Union[str, List[str]]` | list of strings | required |

Usage

```python
> from whatlies.language import CountVectorLanguage
> lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
> lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```

embset_similar(self, emb, n=10, lower=False, metric='cosine')

Show source code in language/_countvector_lang.py
    def embset_similar(
        self,
        emb: Union[str, Embedding],
        n: int = 10,
        lower=False,
        metric="cosine",
    ) -> EmbeddingSet:
        """
        Retrieve an [EmbeddingSet][whatlies.embeddingset.EmbeddingSet] containing the embeddings most similar to the passed query.
        Note that we will only consider words that were passed in the `.fit_manual()` step.

        Arguments:
            emb: query to use
            n: the number of items you'd like to see returned
            metric: metric to use to calculate distance, must be scipy or sklearn compatible
            lower: only fetch lower case tokens

        Returns:
            An [EmbeddingSet][whatlies.embeddingset.EmbeddingSet] containing the similar embeddings.
        """
        embs = [
            w[0] for w in self.score_similar(emb=emb, n=n, lower=lower, metric=metric)
        ]
        return EmbeddingSet({w.name: w for w in embs})

Retrieve an EmbeddingSet containing the embeddings most similar to the passed query. Note that we will only consider words that were passed in the .fit_manual() step.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | query to use | required |
| `n` | `int` | the number of items you'd like to see returned | `10` |
| `metric` | | metric to use to calculate distance; must be scipy or sklearn compatible | `'cosine'` |
| `lower` | | only fetch lower-case tokens | `False` |

Returns

| Type | Description |
|------|-------------|
| `EmbeddingSet` | An EmbeddingSet containing the similar embeddings. |

fit_manual(self, query)

Show source code in language/_countvector_lang.py
    def fit_manual(self, query):
        """
        Fit the model manually. This way you can call `__getitem__` independently of training.

        Arguments:
            query: list of strings

        **Usage**

        ```python
        > from whatlies.language import CountVectorLanguage
        > lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
        > lang.fit_manual(['pizza', 'pizzas', 'firehouse', 'firehydrant'])
        > lang[['piza', 'pizza', 'pizzaz', 'fyrehouse', 'firehouse', 'fyrehidrant']]
        ```
        """
        if any([len(q) == 0 for q in query]):
            raise ValueError(
                "You've passed an empty string to the language model which is not allowed."
            )
        X = self.cv.fit_transform(query)
        self.svd.fit(X)
        self.fitted_manual = True
        self.corpus = query
        return self

Fit the model manually. This way you can call __getitem__ independently of training.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `query` | | list of strings | required |

Usage

```python
> from whatlies.language import CountVectorLanguage
> lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
> lang.fit_manual(['pizza', 'pizzas', 'firehouse', 'firehydrant'])
> lang[['piza', 'pizza', 'pizzaz', 'fyrehouse', 'firehouse', 'fyrehidrant']]
```

score_similar(self, emb, n=10, metric='cosine', lower=False)

Show source code in language/_countvector_lang.py
    def score_similar(
        self,
        emb: Union[str, Embedding],
        n: int = 10,
        metric="cosine",
        lower=False,
    ) -> List:
        """
        Retrieve a list of (Embedding, score) tuples that are the most similar to the passed query.
        Note that we will only consider words that were passed in the `.fit_manual()` step.

        Arguments:
            emb: query to use
            n: the number of items you'd like to see returned
            metric: metric to use to calculate distance, must be scipy or sklearn compatible
            lower: only fetch lower case tokens

        Returns:
            A list of ([Embedding][whatlies.embedding.Embedding], score) tuples.
        """
        if isinstance(emb, str):
            emb = self[emb]

        queries = self._prepare_queries(lower=lower)
        distances = self._calculate_distances(emb=emb, queries=queries, metric=metric)
        by_similarity = sorted(zip(queries, distances), key=lambda z: z[1])

        if len(self.corpus) < n:
            raise ValueError(
                f"You're trying to retrieve {n} items while the corpus was only trained on {len(self.corpus)}."
            )

        if len(queries) < n:
            warnings.warn(
                f"We could only find {len(queries)} feasible words. Consider changing `top_n` or `lower`",
                UserWarning,
            )

        return [(self[q], float(d)) for q, d in by_similarity[:n]]

Retrieve a list of (Embedding, score) tuples that are the most similar to the passed query. Note that we will only consider words that were passed in the .fit_manual() step.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | query to use | required |
| `n` | `int` | the number of items you'd like to see returned | `10` |
| `metric` | | metric to use to calculate distance; must be scipy or sklearn compatible | `'cosine'` |
| `lower` | | only fetch lower-case tokens | `False` |

Returns

| Type | Description |
|------|-------------|
| `List` | A list of (Embedding, score) tuples. |
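The ranking performed by `score_similar` reduces to a cosine-distance sort over the fitted corpus. Here is a minimal sketch of that logic using scikit-learn directly; the variable names are illustrative and this is not the whatlies implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import pairwise_distances

corpus = ["pizza", "pizzas", "pizzazz", "firehouse", "firehydrant"]

# Fit the same CountVectorizer + TruncatedSVD combination as the backend.
cv = CountVectorizer(analyzer="char", ngram_range=(1, 2))
svd = TruncatedSVD(n_components=2, random_state=42)
vectors = svd.fit_transform(cv.fit_transform(corpus))

# Transform the query with the *fitted* models, mirroring the manual-fit path.
query_vec = svd.transform(cv.transform(["pizza"]))

# Cosine distance to every corpus word; sort ascending (closest first).
distances = pairwise_distances(query_vec, vectors, metric="cosine")[0]
ranked = sorted(zip(corpus, distances), key=lambda z: z[1])
top_3 = ranked[:3]
```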