In my last post, we took an in-depth look at foundation models and large language models (LLMs). We tried to understand what they are, how they are used and what makes them special. We explored where they work well and where they may fall short. We discussed their applications in numerous areas like understanding text and generating content. These LLMs have been transformative in the field of Natural Language Processing (NLP).
When we think of an NLP pipeline, feature engineering (also known as feature extraction, text representation or text vectorization) is a very integral and essential step. This step involves techniques to represent text as numbers (feature vectors). We need to perform this step when working on an NLP problem because computers cannot understand text, they only understand numbers, and it is this numerical representation of text that is fed into machine learning algorithms for solving various text-based use cases such as language translation, sentiment analysis, summarization etc.
For those of us who are familiar with the machine learning pipeline in general, we understand that feature engineering is a very crucial step in generating good results from the model. The same concept applies in NLP as well. When we generate a numerical representation of textual data, one essential objective is that the representation should be able to capture the meaning of the underlying text. So in today's post we will not only discuss the various techniques available for this purpose but also evaluate how close each one gets to this objective.
Some of the prominent approaches for feature extraction are:
– One hot encoding
– Bag of Words (BOW)
– n-grams
– TF-IDF
– Word Embeddings
We'll start by understanding some basic terminology and how these terms relate to one another.
– Corpus — All the words in the dataset
– Vocabulary — The unique words in the dataset
– Document — Each individual record in the dataset
– Word — Each word in a document
E.g. For the sake of simplicity, let's assume that our dataset has only three sentences; the following table shows the difference between the corpus and the vocabulary.
Now each of the three records in the above dataset will be referred to as a document (D) and each word in a document is a word (W).
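To make these terms concrete, here is a minimal Python sketch that builds the corpus and the vocabulary for the three example sentences used later in this post:

```python
# Minimal sketch: corpus vs. vocabulary for a toy dataset of three documents
documents = ["Cat plays ball", "Dog plays ball", "Boy plays ball"]

# Corpus: every word occurrence across all documents
corpus = [word.lower() for doc in documents for word in doc.split()]

# Vocabulary: the unique words in the corpus
vocabulary = sorted(set(corpus))

print("Corpus:", corpus)          # 9 words in total
print("Vocabulary:", vocabulary)  # ['ball', 'boy', 'cat', 'dog', 'plays'] -> 5 unique words
```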
Let’s now start with the techniques.
One Hot Encoding is one of the most basic techniques to convert text to numbers.
We'll use the same dataset as above. Our dataset has three documents — we can call them D1, D2 and D3 respectively.
We know the vocabulary (V) is [Cat, plays, dog, boy, ball], which has 5 elements. In One Hot Encoding (OHE), we represent each word in each document based on the vocabulary of the dataset. A "1" appears at the position where there is a match.
We can then use the above to derive the One Hot Encoded representation of each of the documents.
What we are essentially doing here is converting each document into a 2-dimensional representation, where the first dimension is the number of words in the document and the second dimension is the vocabulary size (V = 5 in our case); each word becomes a vector of length V with a single 1 in it.
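Here is a minimal sketch of the idea in NumPy, assuming the vocabulary order [cat, plays, dog, boy, ball] used above:

```python
import numpy as np

# Vocabulary order assumed to match the discussion above
vocab = ["cat", "plays", "dog", "boy", "ball"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(document):
    """Return a (number of words x vocabulary size) matrix: one one-hot row per word."""
    words = document.lower().split()
    matrix = np.zeros((len(words), len(vocab)), dtype=int)
    for row, word in enumerate(words):
        matrix[row, word_to_index[word]] = 1  # raises KeyError for an out-of-vocabulary word
    return matrix

print(one_hot_encode("Cat plays ball"))
# [[1 0 0 0 0]
#  [0 1 0 0 0]
#  [0 0 0 0 1]]
```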
Though it is very easy to understand and implement, this technique has some drawbacks because of which it is not the preferred choice.
– Sparse representation (meaning there are many 0s and, for each word, only one position holds a 1). The larger the corpus, the greater the value of V and the greater the sparsity.
– Suffers from the Out of Vocabulary (OOV) problem — if a new word (a word not present in V during training) is introduced at inference time, the algorithm fails to work.
– Last and most important, it does not capture the semantic relationship between words (which is our primary objective, if you remember our discussion above).
That leads us to explore the next technique.
Bag of Words is a very popular and quite old technique.
The first step is again to create the Vocabulary (V) from the dataset. Then, for each document, we count the number of occurrences of each word of the vocabulary. The following demonstration using the earlier data will help us understand better.
In the first document, "cat" appears once, "plays" appears once and so does "ball". So, 1 is the count for each of those words and the other positions are marked 0. Similarly, we can arrive at the respective counts for each of the other two documents.
The BOW technique converts each document into a vector of size equal to the vocabulary V. Here we get three 5-dimensional vectors — [1,1,0,0,1], [0,1,1,0,1] and [0,1,0,1,1].
Bag of Words is used in classification tasks and has been found to perform quite well. And if you look at the table you can see that it also helps to capture the similarity between the sentences, at least a little. For example, "plays" and "ball" appear in all three documents and hence we see a 1 at those positions in all three vectors.
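A minimal sketch using scikit-learn's CountVectorizer reproduces these counts (note that scikit-learn sorts the vocabulary alphabetically, so the column order differs from the table above):

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Cat plays ball", "Dog plays ball", "Boy plays ball"]

vectorizer = CountVectorizer()             # unigram counts by default
bow = vectorizer.fit_transform(documents)  # sparse (3 x 5) count matrix

print(vectorizer.get_feature_names_out())  # ['ball' 'boy' 'cat' 'dog' 'plays']
print(bow.toarray())
# [[1 0 1 0 1]
#  [1 0 0 1 1]
#  [1 1 0 0 1]]
```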
Pros:
– Very easy to understand and implement
– Each document now becomes a fixed-length vector of size V regardless of how many words it contains, and since the counts are calculated only against the existing vocabulary, the Out of Vocabulary error identified in the earlier technique does not occur. If a new word appears in the data at inference time, the calculation is simply done on the existing words and the new word is ignored.
Cons:
– This is still a sparse representation, which is computationally expensive
– Though we no longer get an error when a new word arrives (since only the existing vocabulary is considered), we are actually losing information by ignoring the new words
– It does not consider the sequence of words, and word order can be very important in understanding the text concerned
– When documents share most of their words but a small change conveys the opposite meaning, BOW fails to work.
E.g.: Suppose there are two sentences –
1. I like when it rains.
2. I don’t like when it rains.
With the way BOW is calculated, both sentences will be considered similar, as all words except "don't" are present in both, but that single word completely changes the meaning of the second sentence compared to the first.
This technique is similar to the BOW approach we just learnt, but this time, instead of single words, our vocabulary is built using n-grams (2 words together are known as bigrams, 3 words together as trigrams, or, in a generic manner, n words together as "n-grams").
The n-grams technique is an improvement on top of Bag of Words as it helps to capture the semantic meaning of the sentences, again at least to some extent. Let's consider the example used above.
1. I like when it rains.
2. I don’t like when it rains.
These two sentences are completely opposite in meaning, and their vector representations should therefore be far apart.
When we use only single words, i.e. BOW with n=1 or unigrams, their vector representations will be as follows.
D1 can be represented as [1,1,1,1,1,0] while D2 can be represented as [1,1,1,1,1,1].
D1 and D2 look very similar and differ in only one dimension. Therefore, they would be plotted quite close to each other in the vector space.
Now, let's see how using bigrams may be more useful in such a situation.
With this approach we can see the values do not match across three dimensions, which helps to represent the dissimilarity of the sentences much better in the vector space compared to the earlier technique.
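Here is a minimal sketch of both views with scikit-learn's CountVectorizer; the custom token_pattern is an assumption so that one-letter words ("I") and the apostrophe in "don't" are kept, as in the discussion above:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I like when it rains.", "I don't like when it rains."]

for n in (1, 2):  # unigrams first, then bigrams
    vectorizer = CountVectorizer(ngram_range=(n, n), token_pattern=r"[a-zA-Z']+")
    vectors = vectorizer.fit_transform(sentences).toarray()
    print(vectorizer.get_feature_names_out())
    print(vectors)

# Unigrams: the two vectors differ in a single dimension ("don't")
# Bigrams:  they differ in three dimensions ("i like", "i don't", "don't like")
```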
Pros:
· It's a very simple and intuitive approach that is easy to understand and implement
· Helps to capture the semantic meaning of the text, at least to some extent
Cons:
· Computationally more expensive, as instead of single tokens we are now using combinations of tokens. Using n-grams significantly increases the size of the feature space. In practical cases we won't be dealing with just a few words; rather, our vocabulary may contain thousands of words, so the number of possible bigrams will be very high.
· This type of representation is still sparse. As n-grams capture specific sequences of words, many n-grams may not appear frequently in the corpus, resulting in a sparse matrix.
· The OOV problem still exists. If a new sentence comes in, its unseen words are ignored, as, similar to the BOW technique, only the existing words/n-grams present in the vocabulary are considered.
It is possible to use n-grams (bigrams, trigrams etc.) together with unigrams, and this can help to achieve good results in certain use cases, as shown in the brief sketch below.
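For instance, a brief sketch (same assumptions as above) of how scikit-learn combines unigrams and bigrams in one vocabulary via ngram_range:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) builds a vocabulary of unigrams AND bigrams together
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"[a-zA-Z']+")
X = vectorizer.fit_transform(["I like when it rains.", "I don't like when it rains."])
print(vectorizer.get_feature_names_out())  # mix of single words and word pairs
```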
For the techniques we have discussed so far, the value at each position was based either on the presence or absence of a particular word/n-gram or on its frequency.
The TF-IDF technique instead employs a specific formula to calculate the weight of each word based on two aspects:
Term Frequency (TF) — an indicator of how frequently a word appears in a document. It is the ratio of the number of times a word appears in a document to the total number of words in the document.
Inverse Document Frequency (IDF), on the other hand, indicates the importance of a term with respect to the entire corpus.
Formula of TF-IDF:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
IDF(t, D) = log(N / n_t), where N is the total number of documents in the corpus D and n_t is the number of documents containing the term t
TF-IDF(t, d) = TF(t, d) × IDF(t, D)
Let's go through the calculation now using our corpus with three documents.
1. Cat plays ball (D1)
2. Dog plays ball (D2)
3. Boy plays ball (D3)
We can see from the above table that the effect of the words occurring in all the documents ("plays" and "ball") has been reduced to 0, since their IDF is log(3/3) = 0.
Calculating the TF-IDF using the formula TF-IDF(t, d) = TF(t, d) × IDF(t, D):
We can see how the common words "plays" and "ball" get dropped and more distinctive words such as "Cat", "Dog" and "Boy" are identified.
Thus, the approach assigns higher weights to words which appear frequently in a given document but appear in fewer documents across the corpus. TF-IDF is very useful in machine learning tasks such as text classification, information retrieval etc.
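Here is a minimal sketch that reproduces this calculation with the plain formula above (library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalisation by default, so their numbers will differ slightly):

```python
import math

documents = [["cat", "plays", "ball"],
             ["dog", "plays", "ball"],
             ["boy", "plays", "ball"]]
N = len(documents)
vocabulary = sorted({word for doc in documents for word in doc})

def tf(term, doc):
    return doc.count(term) / len(doc)            # term frequency within one document

def idf(term):
    n_t = sum(term in doc for doc in documents)  # number of documents containing the term
    return math.log(N / n_t)                     # inverse document frequency

for doc in documents:
    print({term: round(tf(term, doc) * idf(term), 3) for term in vocabulary})
# "plays" and "ball" get weight 0.0 (log(3/3) = 0); "cat", "dog" and "boy" get non-zero weights
```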
We'll now move on to learn about a more advanced vectorization technique.
I'll start this topic by quoting the definition of word embeddings explained beautifully at this link:
"Word embeddings are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words."