whatlies.language.CountVectorLanguage
This object is used to lazily fetch Embeddings or EmbeddingSets from a count-vector language backend. This object is meant for retrieval, not plotting.
This model first trains a scikit-learn CountVectorizer, after which it performs dimensionality reduction to turn the sparse count matrix into a dense vector representation. The reduction is done via TruncatedSVD, also from scikit-learn.
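Conceptually, this is equivalent to a small scikit-learn pipeline. The sketch below uses plain scikit-learn (not the whatlies API itself) to illustrate the idea; the variable names are illustrative:

```python
# Conceptual sketch of what CountVectorLanguage does internally:
# count character n-grams, then reduce dimensionality with TruncatedSVD.
# This uses plain scikit-learn, not the whatlies API itself.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["pizza", "pizzas", "firehouse", "firehydrant"]
pipe = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    TruncatedSVD(n_components=2, random_state=42),
)
# Each text becomes a dense 2-dimensional vector.
vectors = pipe.fit_transform(texts)
print(vectors.shape)  # (4, 2)
```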
Warning
This method does not implement a word embedding in the traditional sense, so the interpretation needs to be adjusted: the information captured here relates only to the words/characters that are used in the text. No notion of semantic meaning should be inferred.
Also, in order to keep this system consistent with the rest of the API, the model is trained when you retrieve vectors via `__getitem__`. If you want to separate the train and test steps, you need to call `fit_manual` yourself or use the object in a scikit-learn pipeline.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `n_components` | `int` | Number of components that TruncatedSVD will reduce to. | required |
| `lowercase` | `bool` | Whether the tokens need to be lowercased beforehand. | `True` |
| `analyzer` | `str` | Which analyzer to use; can be `"word"`, `"char"` or `"char_wb"`. | `'char'` |
| `ngram_range` | `Tuple[int, int]` | The lower and upper bounds of the n-gram range. | `(1, 2)` |
| `min_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly lower than the given threshold. | `1` |
| `max_df` | `Union[int, float]` | Ignore terms that have a document frequency strictly higher than the given threshold. | `1.0` |
| `binary` | `bool` | Determines whether the counts are binary or whether they can accumulate. | `False` |
| `strip_accents` | `str` | Remove accents and perform normalisation; can be set to `"ascii"` or `"unicode"`. | `None` |
| `random_state` | `int` | Random state for the SVD algorithm. | `42` |
For a more elaborate explanation of these arguments, check out the scikit-learn documentation.
Usage:

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```
__getitem__(self, query)
Retrieve a set of embeddings.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Union[str, List[str]]` | A string or list of strings to fetch embeddings for. | required |
Usage

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang[['pizza', 'pizzas', 'firehouse', 'firehydrant']]
```
embset_similar(self, emb, n=10, lower=False, metric='cosine')
Retrieve an EmbeddingSet containing the embeddings most similar to the passed query.
Note that only words that were passed in the `.fit_manual()` step are considered.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | Query to use. | required |
| `n` | `int` | The number of items you'd like to see returned. | `10` |
| `metric` | | Metric used to calculate distance; must be scipy- or sklearn-compatible. | `'cosine'` |
| `lower` | | Only fetch lowercase tokens. | `False` |
Returns
| Type | Description |
|---|---|
| `EmbeddingSet` | An EmbeddingSet containing the similar embeddings. |
fit_manual(self, query)
Fit the model manually. This way you can call `__getitem__` independently of training.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | | A list of strings. | required |
Usage

```python
from whatlies.language import CountVectorLanguage

lang = CountVectorLanguage(n_components=2, ngram_range=(1, 2), analyzer="char")
lang.fit_manual(['pizza', 'pizzas', 'firehouse', 'firehydrant'])
lang[['piza', 'pizza', 'pizzaz', 'fyrehouse', 'firehouse', 'fyrehidrant']]
```
score_similar(self, emb, n=10, metric='cosine', lower=False)
Retrieve a list of (Embedding, score) tuples that are the most similar to the passed query.
Note that only words that were passed in the `.fit_manual()` step are considered.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| `emb` | `Union[str, whatlies.embedding.Embedding]` | Query to use. | required |
| `n` | `int` | The number of items you'd like to see returned. | `10` |
| `metric` | | Metric used to calculate distance; must be scipy- or sklearn-compatible. | `'cosine'` |
| `lower` | | Only fetch lowercase tokens. | `False` |
Returns
| Type | Description |
|---|---|
| `List` | A list of (Embedding, score) tuples. |