CountVectorizer

January 22, 2024 | seedling, permanent

tags :

Summary #

This technique counts the frequency of each word in a document and represents the document as a vector of word counts. Each word corresponds to a dimension, and the value in each dimension is the count of that word in the document.

Example #

ref

text = [‘Hello my name is james, this is my python notebook’]

The text is transformed to a sparse matrix as shown below.

text = [‘Hello my name is james' , ’this is my python notebook’]

I have 2 text inputs, what happens is that each input is preprocessed, tokenized, and represented as a sparse matrix. By default, Countvectorizer converts the text to lowercase and uses word-level tokenization.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
text = [‘Hello my name is james’,
‘james this is my python notebook’,
‘james trying to create a big dataset’,
‘james of words to try differnt’,
‘features of count vectorizer’]
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = coun_vect.get_feature_names())
print(df)

text = [‘hello my name is james’,
‘Hello my name is James’]
coun_vect = CountVectorizer(lowercase=False)
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = coun_vect.get_feature_names())
print(df)

text = [‘hello my name is james’,
‘Hello my name is James’]
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array,columns = coun_vect.get_feature_names())
print(df)

Links to this note

vectorization