Probably the most basic data processing step that any Natural Language Processing (NLP) project requires is converting text data to numeric data. As long as the data is in text form, we cannot perform any kind of computation on it.
There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer method in the scikit-learn library.
This method is very simple. It uses the frequency of occurrence of each word as the numeric value. An example will make it clear.
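Before the scikit-learn version, the counting idea itself can be sketched in plain Python. This is only an illustration using a made-up toy sentence (an assumption, not part of the tutorial's data) and simple whitespace splitting; it is not how CountVectorizer is implemented internally.

```python
from collections import Counter

# Hypothetical toy sentence, not taken from the tutorial's data.
sentence = "the cat sat on the mat"

# Count how many times each word occurs.
counts = Counter(sentence.split())

# Sort the distinct words so each one gets a fixed column position.
vocabulary = sorted(counts)

# The count vector lists each word's frequency in vocabulary order.
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```

The word "the" occurs twice, so its position in the vector holds 2, while every other word holds 1.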
In the next code block, we will:
- Import the CountVectorizer method.
- Call the method.
- Fit the text data to the CountVectorizer method and convert the result to an array.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]

# Create the vectorizer
cv = CountVectorizer()
# Learn the vocabulary and count the occurrences of each word
count_matrix = cv.fit_transform(text)
# Convert the sparse count matrix to a dense NumPy array
cnt_arr = count_matrix.toarray()
cnt_arr
Output:
array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
dtype=int64)
Here we have the numeric values representing the text data above.
How can we know which value represents which word in the text?
To make that clear, it will be helpful to convert the array into a DataFrame where the column names are the words themselves.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
cnt_df = pd.DataFrame(data=cnt_arr, columns=cv.get_feature_names_out())
cnt_df
Now the mapping is clear. The value for the word ‘also’ is 1, which means ‘also’ appeared just once in the text. The word ‘aunt’ appeared twice in the text, so the value for ‘aunt’ is 2.
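One detail worth noticing is that the array has 18 values, and there is no entry for "I" or for the "s" in "aunt's". The sketch below reproduces the counts in plain Python under the assumption that CountVectorizer is used with its default settings, which lowercase the text and keep only tokens of two or more word characters. It is a cross-check of the result above, not the library's actual implementation.

```python
import re
from collections import Counter

text = ("Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer.")

# Assumed CountVectorizer defaults: lowercase the text, then keep
# runs of two or more word characters.
token_pattern = r"(?u)\b\w\w+\b"
tokens = re.findall(token_pattern, text.lower())

# Single-character pieces such as "i" and the "s" of "aunt's" never match,
# so they do not become columns.
counts = Counter(tokens)
columns = sorted(counts)

print(len(columns))                         # 18
print([counts[word] for word in columns])   # same values as cnt_arr
```

The sorted words line up with the DataFrame columns, which is why the third value (for ‘aunt’) is 2.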
In the last example, all of the sentences were in a single string, so we got just one row of data for 4 sentences. Let’s rearrange the text and…