Although transformer-based LLMs currently dominate many NLP tasks, classical NLP methods still have practical benefits in certain areas, such as lexical search and word2vec embeddings. Whereas a transformer-based model typically takes the entire text as input to preserve its semantic meaning, traditional NLP methods require pre-processing to clean the dataset first. The code snippets below were tested in a Colab notebook.
Normalize Text
When text contains words in mixed capitalization styles (camel case, title case, sentence case, and so on) or mis-capitalized words (e.g., pYthOn), it is advisable to normalize the text to lowercase[2].
    text = 'Python PROGRAMMING LanGUage.'
    print(text.lower())  # python programming language.
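For text that is not purely English, `str.casefold()` is a more aggressive alternative to `lower()`. The following is a minimal sketch; the sample strings are made up for illustration.

```python
# Compare lower() with casefold() on strings containing the German sharp s.
samples = ["Python PROGRAMMING LanGUage.", "Straße", "STRASSE"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print(lowered)  # ['python programming language.', 'straße', 'strasse']
print(folded)   # ['python programming language.', 'strasse', 'strasse']
```

With `casefold()`, both spellings of the German word end up as the same token, which is usually what a caseless comparison wants.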
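Remove HTML Tags
    import re  # the standard-library module is sufficient for this pattern

    doc = '<document><p> Food is very good and <b>cheap</b>.</p></document>'
    # Non-greedy match of everything between "<" and the next ">".
    res = re.sub(r'<.*?>', '', doc)
    print(res)  # " Food is very good and cheap." (the leading space remains)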
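A regular expression is enough for simple markup, but it does not decode entities such as `&amp;`. As an alternative sketch, the standard library's `html.parser` can collect only the text nodes. The `TagStripper` class and `strip_tags` function are hypothetical helper names, and the sample document is adapted from the one above.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


doc = '<document><p> Food is very good &amp; <b>cheap</b>.</p></document>'
print(strip_tags(doc).strip())  # Food is very good & cheap.
```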
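Remove Emails
    import re

    doc = 'you can contact me on my emails abcsw46545677@gmail.com or abcsw46545677@qq.com for any queries.'
    # The pattern matches lowercase addresses only; add re.IGNORECASE for mixed-case input.
    res = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', "", doc)
    print(res)  # "you can contact me on my emails  or  for any queries." (doubled spaces are left behind)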
Remove URLs
A generic URL may contain a protocol, subdomain, domain name, top-level domain, and directory path, among other parts[2].
    import re

    doc = 'follow my medium profile at https://test.com/@test12345 and subscribe to my email list at https://test.medium.com/subscribe'
    # Groups: protocol, host (at least one dot), optional path/query part.
    res = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', doc)
    print(res)  # "follow my medium profile at  and subscribe to my email list at " (extra spaces are left behind)
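To inspect the URL components listed above rather than delete the URL, the standard library's `urllib.parse.urlsplit` can be used. This is a small sketch; splitting the host on dots is a naive assumption that breaks for multi-part top-level domains such as `co.uk`.

```python
from urllib.parse import urlsplit

url = "https://test.medium.com/subscribe"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # test.medium.com
print(parts.path)      # /subscribe

# Naive split of the host into subdomain, domain name, and top-level domain.
labels = parts.hostname.split(".")
print(labels)  # ['test', 'medium', 'com']
```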
Extract Mentions
    import re

    doc = 'thanks @test12 and @test34 for sharing the testing result.'
    # "@" followed by one or more word characters.
    res = re.findall(r'@\w+', doc)
    print(res)  # ['@test12', '@test34']
Extract Emojis
Emojis can convey emotion and sentiment in a tweet, and can therefore be a useful feature.
    # pip install emoji
    import re
    import emoji

    doc = "😊 Life's journey is a rollercoaster 🎢 filled with ups and downs. 🌦️"
    # Replace each emoji with its ":name:" alias, then pull the names out.
    demojized_doc = emoji.demojize(doc)
    res = re.findall(r':(.*?):', demojized_doc)
    print(res)  # ['smiling_face_with_smiling_eyes', 'roller_coaster', 'sun_behind_rain_cloud']
    # Convert the aliases back into emoji characters.
    print([emoji.emojize(f":{word}:") for word in res])  # ['😊', '🎢', '🌦️']
Convert Accented Characters
Accent marks[2] are symbols placed over letters, especially vowels, to indicate the pronunciation of a word. Accented characters can cause problems in analysis by inflating the vocabulary size unnecessarily: for example, résumé and resume are counted as two different words.
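    import unicodedata

    doc = 'résumé length is good. resume font is bad.'
    # NFKD splits each accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors ignored then drops the marks.
    processed_doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    print(processed_doc)  # "resume length is good. resume font is bad."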
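The ASCII round trip above also discards every other non-ASCII character, such as CJK text. A gentler variant, sketched here with a hypothetical helper `strip_accents`, removes only the combining marks produced by the NFKD decomposition.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks but keep all other characters (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))            # resume
print(strip_accents("naïve café, 東京"))  # naive cafe, 東京
```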
Remove Special Symbols
Special symbols[2] are characters that are neither letters nor digits; they include various symbols, punctuation, and accent marks.
    import re

    doc = 'Congrats!, David You have won $1,000.00.'
    # Remove everything except word characters and spaces.
    processed_doc = re.sub(r'[^\w ]+', "", doc)
    print(processed_doc)  # "Congrats David You have won 100000"
    # Note: this breaks the currency representation.
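If monetary amounts matter downstream, the pattern can be narrowed so that a dollar sign and the separators between digits survive. The sketch below is one possible approach; the pattern and the function name `remove_symbols_keep_amounts` are assumptions for illustration, not part of the original snippet.

```python
import re

# Remove "." or "," unless it sits between two digits, plus any other
# character that is not a word character, whitespace, or "$".
PATTERN = re.compile(r"(?<!\d)[.,]|[.,](?!\d)|[^\w\s$.,]")


def remove_symbols_keep_amounts(text):
    return PATTERN.sub("", text)


doc = 'Congrats!, David You have won $1,000.00.'
print(remove_symbols_keep_amounts(doc))  # Congrats David You have won $1,000.00
```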
Remove Stopwords
Stopwords are high-frequency words that carry little meaning on their own. Some of the most common ones are: the, is, for, when, to, and at[2].
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords

    doc = 'this is one of the best action movie i have ever watched.'
    english_stopwords = set(stopwords.words('english'))
    # Keep only the words that are not in the stopword set.
    cleaned_doc = ' '.join([word for word in doc.split() if word not in english_stopwords])
    print(cleaned_doc)  # "one best action movie ever watched."
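Splitting on whitespace leaves punctuation attached to words, so a stopword followed by a comma or period would slip through the filter. The sketch below lowercases and tokenizes with a regular expression first. The tiny `STOPWORDS` set is a made-up stand-in for a real stopword list, and `remove_stopwords` is a hypothetical helper.

```python
import re

# Hypothetical, hand-picked stopword set used only for this illustration.
STOPWORDS = {"this", "is", "of", "the", "i", "have"}


def remove_stopwords(text, stopwords=STOPWORDS):
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


doc = 'This is one of the best action movie I have ever watched.'
print(remove_stopwords(doc))
# ['one', 'best', 'action', 'movie', 'ever', 'watched']
```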
Stemming
Stemming[2] is the process of reducing a word to its root by stripping affixes, in practice mostly suffixes. This is often done to group different forms of a word so that they can be analyzed as a single item. For example, stemming reduces ‘Learning’, ‘Learns’, and ‘Learned’ to their root word ‘Learn’.
    import nltk
    nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    doc = 'learning learn learned learns learnt'
    text = " ".join([ps.stem(word) for word in word_tokenize(doc)])
    print(text)  # "learn learn learn learn learnt"
    # Note: learnt is NOT stemmed to learn by this stemmer.
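To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Toy suffix stripper: removes the first matching suffix if enough of the
# word remains. Real stemmers apply many more rules and conditions.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word, min_stem_len=3):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["learning", "learn", "learned", "learns", "learnt", "running"]
print([toy_stem(w) for w in words])
# ['learn', 'learn', 'learn', 'learn', 'learnt', 'runn']
```

The last item shows why stems are not always dictionary words: blind suffix removal turns running into runn.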
Lemmatization
Lemmatization[2] is similar to stemming, but it takes the morphological analysis of each word into account and returns a dictionary form (the lemma). This allows it to differentiate between present, past, and indefinite tense.
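    import nltk
    nltk.download('wordnet')  # WordNet data used by the lemmatizer
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize  # relies on the 'punkt' data downloaded earlier

    doc = 'history always repeat itself.'
    lemmatizer = WordNetLemmatizer()
    text = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(doc)])
    print('Lemmatization:', text)  # "Lemmatization: history always repeat itself ."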
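The role of part of speech in lemmatization can be illustrated with a toy dictionary lookup. This is only a sketch of the idea, not how WordNet works internally; the `LEMMA_TABLE` entries and the name `toy_lemmatize` are hypothetical.

```python
# Hypothetical, hand-written lookup table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("repeats", "v"): "repeat",
    ("went", "v"): "go",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for this part of speech.
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("went", pos="v"))  # go
print(toy_lemmatize("went"))           # went (no noun entry, so unchanged)
print(toy_lemmatize("mice"))           # mouse
```

Unlike a stemmer, the lookup can map irregular forms such as went to go, but only when it is told the right part of speech.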
Tokenization
Tokenization[3] is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements, known as tokens.
    import nltk
    nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
    from nltk.tokenize import word_tokenize

    text = "This is an example of tokenization."
    tokens = word_tokenize(text)
    print(tokens)  # ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
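For quick experiments without any downloaded models, a regular expression can approximate word-level tokenization. This is a rough sketch, and `simple_tokenize` is a hypothetical helper; purpose-built tokenizers treat contractions and abbreviations more carefully.

```python
import re


def simple_tokenize(text):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("This is an example of tokenization."))
# ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
print(simple_tokenize("Don't panic!"))
# ['Don', "'", 't', 'panic', '!']
```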
Sentiment Analysis
Sentiment Analysis[3] is the process of determining the emotional tone behind a piece of text, whether it is positive, negative, or neutral.
    import nltk
    nltk.download('vader_lexicon')
    from nltk.sentiment import SentimentIntensityAnalyzer

    text = "I love this product! It's amazing."
    sia = SentimentIntensityAnalyzer()
    score = sia.polarity_scores(text)
    # Prints a dict with 'neg', 'neu', 'pos', and 'compound' scores, e.g.:
    # {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.6369}
    # Exact values depend on the text and the lexicon version.
    print(score)
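The general idea behind lexicon-based scoring can be sketched in a few lines: sum per-word valences and squash the total into the range (-1, 1). This is not VADER's implementation; the `LEXICON` values, the `alpha` constant, the thresholds, and the name `toy_sentiment` are all assumptions for illustration.

```python
import math
import re

# Hypothetical valence scores, made up for this example.
LEXICON = {"love": 3.0, "amazing": 3.0, "good": 2.0, "bad": -2.0, "terrible": -3.0}


def toy_sentiment(text, alpha=15.0):
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    # Squash the raw sum into (-1, 1); alpha controls how quickly it saturates.
    compound = raw / math.sqrt(raw * raw + alpha)
    if compound > 0.05:
        label = "positive"
    elif compound < -0.05:
        label = "negative"
    else:
        label = "neutral"
    return compound, label


print(toy_sentiment("I love this product! It's amazing.")[1])  # positive
print(toy_sentiment("This is terrible.")[1])                   # negative
print(toy_sentiment("It is a table.")[1])                      # neutral
```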
Part-of-Speech Tagging
Part-of-speech (POS) tags provide information about the structure and function of the words in a given chunk of text; tagging[3] is the process of marking each word in a text with its corresponding POS tag.
    import nltk
    nltk.download('averaged_perceptron_tagger')
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize

    text = "I am learning NLP techniques in Python."
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)  # list of (token, tag) pairs
    print(pos_tags)
    # [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
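Once the text is tagged, the (token, tag) pairs are easy to filter or aggregate. The sketch below hard-codes the tagger output shown above, so it runs without NLTK; the variable names are illustrative.

```python
from collections import Counter

# Tagger output copied from the example above.
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'),
            ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in pos_tags)
print(tag_counts["NNP"])  # 2

# Keep only nouns: all Penn Treebank noun tags start with "NN".
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
print(nouns)  # ['NLP', 'techniques', 'Python']
```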
Named Entity Recognition
Named Entity Recognition (NER)[3] is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
    import nltk
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    from nltk import ne_chunk, pos_tag
    from nltk.tokenize import word_tokenize

    text = "Barack Obama was born in Hawaii."
    tokens = word_tokenize(text)
    tagged_tokens = pos_tag(tokens)      # the chunker expects POS-tagged tokens
    ner_tree = ne_chunk(tagged_tokens)   # returns a tree with labeled entity subtrees
    print(ner_tree)
    # Output:
    # (S
    #   (PERSON Barack/NNP)
    #   (PERSON Obama/NNP)
    #   was/VBD
    #   born/VBN
    #   in/IN
    #   (GPE Hawaii/NNP)
    #   ./.)
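In the output above, Barack and Obama are labeled as two separate PERSON entities. One simple post-processing step, sketched here, is to merge adjacent tokens that share a label. The flattened `labeled` list is a hand-written stand-in for the tree, and `merge_entities` is a hypothetical helper.

```python
from itertools import groupby

# (token, entity label) pairs transcribed by hand from the tree above;
# None marks tokens outside any entity.
labeled = [("Barack", "PERSON"), ("Obama", "PERSON"), ("was", None),
           ("born", None), ("in", None), ("Hawaii", "GPE"), (".", None)]


def merge_entities(pairs):
    merged = []
    # groupby yields runs of consecutive pairs that share the same label.
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label is not None:
            merged.append((" ".join(tok for tok, _ in group), label))
    return merged


print(merge_entities(labeled))
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```

Note that this heuristic would also glue together two different people mentioned back to back, so it is only a starting point.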