CountVectorizer

January 22, 2024 | seedling, permanent

Summary #

This technique counts the frequency of each word in a document and represents the document as a vector of word counts. Each word corresponds to a dimension, and the value in each dimension is the count of that word in the document.
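To make the idea concrete, here is a minimal hand-rolled sketch of the same counting scheme. It is not how scikit-learn implements it: the helper name `count_vectors`, the two sample sentences, and the plain whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def count_vectors(docs):
    """Return (vocabulary, vectors) for a list of strings.

    Hypothetical helper: lowercases each document and splits on whitespace,
    which is simpler than CountVectorizer's own tokenizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct word, in alphabetical order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # A Counter returns 0 for words that do not occur in this document
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectors(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each inner list is one document, and the `2` in the last position of the second vector is the count of "the" in that sentence.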

Example #

text = ["Hello my name is james, this is my python notebook"]

The text is transformed into a sparse matrix: one row per input string and one column per distinct word, with each cell holding that word's count.

text = ["Hello my name is james", "this is my python notebook"]

I have two text inputs. Each input is preprocessed, tokenized, and represented as one row of a sparse matrix. By default, CountVectorizer converts the text to lowercase and uses word-level tokenization.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Five short documents to vectorize
text = ["Hello my name is james",
        "james this is my python notebook",
        "james trying to create a big dataset",
        "james of words to try different",
        "features of count vectorizer"]
coun_vect = CountVectorizer()
# Learn the vocabulary and build the sparse document-term matrix
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(df)
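# With lowercase=False, "hello"/"Hello" and "james"/"James" get separate columns
text = ["hello my name is james",
        "Hello my name is James"]
coun_vect = CountVectorizer(lowercase=False)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(df)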
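# With the default lowercase=True, both sentences map to the same columns
text = ["hello my name is james",
        "Hello my name is James"]
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(df)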

