ML Data Processing

NLP Practice

Although transformer-based LLMs currently dominate many NLP tasks, classical NLP methods still have practical benefits in certain areas, such as lexical search and word2vec embeddings. Whereas a transformer-based model can take the entire text as input to preserve its semantic meaning, traditional NLP methods require pre-processing to clean the dataset first. The code snippets below were tested in a Colab notebook.

Normalize Text

When the text mixes capitalization styles such as camel case, title case, or sentence case, or contains mis-capitalized words (e.g., pYthOn), it is advisable to normalize it to lowercase^[2].

text = 'Python PROGRAMMING LanGUage.'
print(text.lower()) # python programming language.
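For text that is not purely English, `str.casefold()` may be a better fit than `str.lower()`, because it applies more aggressive Unicode case folding. The sketch below compares the two on a few sample strings chosen only for illustration.

```python
# Compare str.lower() with str.casefold() on a few illustrative strings.
samples = ["Python PROGRAMMING LanGUage.", "Straße", "STRASSE"]

lowered = [s.lower() for s in samples]
folded = [s.casefold() for s in samples]

print(lowered)  # ['python programming language.', 'straße', 'strasse']
print(folded)   # ['python programming language.', 'strasse', 'strasse']
```

With `casefold()`, the two German spellings compare equal, which `lower()` alone does not achieve.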

Remove Unnecessary Whitespaces^[2]

import regex as re
doc = 'Multiple white    spaces in this    sentence       .     '
res = re.sub(r"\s+", " ", doc).strip() # raw string for the pattern; strip() drops the trailing space
print(res) # Multiple white spaces in this sentence .

Remove HTML Tags^[2]

import regex as re
doc = '<document><p> Food is very good and <b>cheap</b>.</p></document>'
res=re.sub('<.*?>','',doc)
print(res) # Food is very good and cheap.
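A regular expression is usually enough for simple markup, but it does not decode character references such as `&amp;`. As a hedged alternative, the sketch below uses the standard library's `html.parser` to collect only the text nodes. The `TextExtractor` class and `strip_tags` function are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; automatically
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text content of html_text with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()


doc = '<document><p> Food is very good &amp; <b>cheap</b>.</p></document>'
print(strip_tags(doc))  # Food is very good & cheap.
```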

Remove Emails^[2]

import regex as re
doc = 'you can contact me on my emails abcsw46545677@gmail.com or abcsw46545677@qq.com for any queries.'
res=re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", doc)
print(res) # you can contact me on my emails  or  for any queries.
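Note that the pattern above lists only lowercase letters, so an address containing uppercase characters would not be matched in full. One possible adjustment, sketched here with a made-up sample sentence, is to compile the same pattern with `re.IGNORECASE`.

```python
import re

# Same pattern as above, compiled case-insensitively so that uppercase
# letters in an address are matched as well.
EMAIL_RE = re.compile(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+', re.IGNORECASE)

doc = 'Contact John.Doe@Example.COM or abcsw46545677@qq.com for any queries.'
found = EMAIL_RE.findall(doc)
cleaned = EMAIL_RE.sub('', doc)

print(found)    # ['John.Doe@Example.COM', 'abcsw46545677@qq.com']
print(cleaned)  # Contact  or  for any queries.
```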

Remove URLs

A generic URL may contain a protocol, subdomain, domain name, top-level domain, directory path, and so on^[2].

import regex as re
doc = 'follow my medium profile at https://test.com/@test12345 and subscribe to my email list at https://test.medium.com/subscribe'
res=re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , doc)
print(res) # "follow my medium profile at  and subscribe to my email list at " (note the trailing space)
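To make the URL components mentioned above concrete, the following sketch splits one of the sample URLs with the standard library's `urllib.parse.urlparse`. Taking the last dot-separated label as the top-level domain is a simplifying assumption that does not hold for multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse

url = 'https://test.medium.com/subscribe'
parts = urlparse(url)

# Naive split of the host name into labels: subdomain, domain, top-level domain
host_labels = parts.hostname.split('.')

print(parts.scheme)     # https
print(parts.hostname)   # test.medium.com
print(parts.path)       # /subscribe
print(host_labels)      # ['test', 'medium', 'com']
```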

Find Hashtags^[1]

import regex as re
doc = '#funnyanimal #animal #goldenretriever'
res=re.findall(r'#\w+', doc)
print(res) # ['#funnyanimal', '#animal', '#goldenretriever']

Find Mentions^[1]

import regex as re
doc = 'thanks @test12 and @test34 for the sharing the testing result.'
res=re.findall(r'@\w+', doc)
print(res) # ['@test12', '@test34']
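The plain `#\w+` and `@\w+` patterns also match inside other tokens, for example the domain part of an email address. A possible refinement, shown here as a sketch on a made-up sentence, is a negative lookbehind that rejects matches preceded by a word character.

```python
import re

doc = 'mail bob@example.com about issue#42, thanks @test12 #nlp'

# Plain patterns, as in the snippets above
naive_mentions = re.findall(r'@\w+', doc)
naive_hashtags = re.findall(r'#\w+', doc)

# Same patterns, but only when not directly preceded by a word character
mentions = re.findall(r'(?<!\w)@\w+', doc)
hashtags = re.findall(r'(?<!\w)#\w+', doc)

print(naive_mentions)  # ['@example', '@test12']
print(naive_hashtags)  # ['#42', '#nlp']
print(mentions)        # ['@test12']
print(hashtags)        # ['#nlp']
```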

Find Emoji^[1]

Emojis can convey emotion and sentiment in a tweet, and can therefore be a useful feature.

# pip install emoji
import re
import emoji

doc = "😊 Life's journey is a rollercoaster 🎢 filled with ups and downs. 🌦️"
# Replace each emoji with its ':name:' shortcode, then extract the names
demojized_doc = emoji.demojize(doc)
res = re.findall(r':(.*?):', demojized_doc)
print(res) # ['smiling_face_with_smiling_eyes', 'roller_coaster', 'sun_behind_rain_cloud']
# Convert the shortcodes back into emoji characters
print([emoji.emojize(f":{word}:") for word in res]) # ['😊', '🎢', '🌦️']

Convert Accented Characters

Accent marks^[2] are symbols placed over letters, especially vowels, to indicate how a word is pronounced. Accented and unaccented spellings of the same word (e.g., résumé and resume) are counted as different tokens, which inflates the vocabulary size unnecessarily.

import unicodedata
doc = 'résumé length is good. resume font is bad.'
processed_doc=unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
print(processed_doc) # "resume length is good. resume font is bad."
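Encoding to ASCII also discards every other non-ASCII character, which may be undesirable for multilingual text. As an alternative sketch, the hypothetical helper `strip_accents` below removes only the combining marks produced by NFKD normalization and leaves other characters intact.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks but keep all other characters."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


doc = 'résumé for a café in 東京'

# Approach from the snippet above: everything outside ASCII is dropped
ascii_only = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('ascii')

print(strip_accents(doc))  # resume for a cafe in 東京
print(repr(ascii_only))    # 'resume for a cafe in '
```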

Remove Special Symbols

Special symbols^[2] are characters that are neither letters nor digits, such as symbols, punctuation, and accent marks.

import regex as re
doc = 'Congrats!, David You have won $1,000.00.'
processed_doc=re.sub(r'[^\w ]+', "", doc) 
print(processed_doc) # "Congrats David You have won 100000" # Note, it breaks the currency representation.
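If numeric values such as prices need to survive the cleanup, one option is to keep the dollar sign and to keep `.` and `,` only when they sit between digits. The pattern below is a sketch under that assumption and is not meant to cover every currency or number format.

```python
import re

doc = 'Congrats!, David You have won $1,000.00.'

# Remove: any character that is not a word character, whitespace, '$', '.' or ','
#         plus any '.' or ',' that is not surrounded by digits on both sides.
pattern = r"[^\w\s$.,]|[.,](?!\d)|(?<!\d)[.,]"
cleaned = re.sub(pattern, '', doc)

print(cleaned)  # Congrats David You have won $1,000.00
```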

Remove Stopwords

Some of the most common stopwords are: the, is, for, when, to, at, etc^[2].

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

doc = 'this is one of the best action movie i have ever watched.'
english_stopwords = set(stopwords.words('english'))
cleaned_doc = ' '.join([word for word in doc.split() if word not in english_stopwords])
print(cleaned_doc) # "one best action movie ever watched."
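Splitting on whitespace leaves punctuation attached to words and is case-sensitive, so a token such as `This` or `is.` would slip past the stopword check. The sketch below lowercases and tokenizes with a regular expression first. `STOPWORDS` is a small hypothetical subset used for illustration, not the NLTK list.

```python
import re

# Small hypothetical stopword subset, for illustration only
STOPWORDS = {'this', 'is', 'of', 'the', 'i', 'have'}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


doc = 'This is one of the best action movie I have ever watched.'
print(remove_stopwords(doc))  # ['one', 'best', 'action', 'movie', 'ever', 'watched']
```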

Stemming

Stemming^[2] is the process of reducing a word to its root by removing affixes (mostly suffixes; the Porter stemmer used below strips suffixes only). This is often done to group together different forms of a word so they can be analyzed as a single item. For example, stemming reduces ‘Learning’, ‘Learns’, and ‘Learned’ to their root word ‘Learn’.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()
doc = 'learning learn learned learns learnt'
text = " ".join([ps.stem(word) for word in word_tokenize(doc)])
print(text) # "learn learn learn learn learnt" # Note, learnt is NOT stemmed to learn with the module.
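To illustrate the idea of suffix stripping, here is a deliberately naive sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules. The suffix list and the minimum stem length of three characters are assumptions chosen for this example.

```python
# Naive suffix stripping, for illustration only (this is NOT the Porter algorithm)
SUFFIXES = ('ing', 'ed', 's')


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ['Learning', 'Learns', 'Learned', 'learnt', 'is']
print([naive_stem(w) for w in words])  # ['learn', 'learn', 'learn', 'learnt', 'is']
```

As with the Porter stemmer above, the irregular form `learnt` is left unchanged because no suffix rule applies to it.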

Lemmatization

Lemmatization^[2] is similar to stemming, but it takes the morphological analysis of words into account. This allows it to relate different inflected forms, such as present and past tense, to a single dictionary form (the lemma) instead of simply cutting off endings.

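import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

doc = 'history always repeat itself.'
lemmatizer = WordNetLemmatizer()
# lemmatize() treats each token as a noun unless a pos argument is given
text = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(doc)])
print('Lemmatization: ',text) # "Lemmatization:  history always repeat itself ." # Note, the tokenizer splits off the final period.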

Tokenization

Tokenization^[3] is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements, known as tokens.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "This is an example of tokenization."
tokens = word_tokenize(text)
print(tokens) # ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
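For quick experiments without downloading NLTK data, a rough tokenizer can be sketched with a single regular expression. The hypothetical `simple_tokenize` helper below treats every run of word characters as a token and every other non-space character as its own token. It will differ from `word_tokenize` on contractions and other edge cases.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "This is an example of tokenization."
print(simple_tokenize(text))  # ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
```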

Sentiment Analysis

Sentiment Analysis^[3] is the process of determining the emotional tone behind a piece of text, whether it is positive, negative, or neutral.

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
text = "I love this product! It's amazing."
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores(text)
print(score) # {'neg': 0.0, 'neu': 0.266, 'pos': 0.734, 'compound': 0.6369}
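VADER is a lexicon- and rule-based analyzer. The toy sketch below shows only the bare lexicon idea: sum per-word scores and map the total to a label. The `LEXICON` entries and their weights are made up for illustration and are not taken from the VADER lexicon, and none of VADER's rules (negation, punctuation emphasis, and so on) are modeled.

```python
import re

# Hypothetical word scores, for illustration only (not the VADER lexicon)
LEXICON = {'love': 3.0, 'amazing': 3.0, 'good': 2.0, 'bad': -2.0, 'terrible': -3.0}


def toy_sentiment(text):
    """Label text as positive, negative, or neutral by summing word scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0.0) for token in tokens)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'


print(toy_sentiment("I love this product! It's amazing."))  # positive
print(toy_sentiment("The service was terrible."))           # negative
print(toy_sentiment("The package arrived on Monday."))      # neutral
```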

Part-of-Speech Tagging

Part-of-speech (POS) tags provide information about the structure and function of the words in a given text chunk; tagging^[3] is the process of marking each word in a text with its corresponding POS tag.

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "I am learning NLP techniques in Python."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags) # [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]

Named Entity Recognition

Named Entity Recognition (NER)^[3] is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
text = "Barack Obama was born in Hawaii."
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)
ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)
----------------------------------------------------
(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  ./.)

Word Embedding

Word2Vec embeddings^[3]

from gensim.models import Word2Vec
# Define a dataset
sentences = [['This', 'is', 'a', 'positive', 'text'],
            ['This', 'is', 'a', 'negative', 'text'],
            ['This', 'is', 'a', 'neutral', 'text']]
# Train the model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Access the trained model's word vector
word_vector = model.wv['positive']
print(word_vector) # a 100-dimensional float32 array, e.g. [ 8.13227147e-03 -4.45733406e-03 -1.06835726e-03 ...]; exact values vary between runs
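Word vectors are typically compared with cosine similarity. The sketch below implements it in plain Python on short made-up vectors so the arithmetic is easy to follow. With a trained gensim model, `model.wv.similarity(word1, word2)` should give the same kind of score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # same direction as v1
v3 = [-1.0, -2.0, -3.0] # opposite direction

print(round(cosine_similarity(v1, v2), 6))  # 1.0
print(round(cosine_similarity(v1, v3), 6))  # -1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0
```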

GloVe embeddings^[3]

from gensim.models import KeyedVectors
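# Load the pre-trained GloVe model, e.g., 100d embeddings
# GloVe text files have no word2vec header line, so no_header=True is needed (gensim 4.0+)
model = KeyedVectors.load_word2vec_format('path/to/glove.6B.100d.txt', binary=False, no_header=True)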
# Access the word vector
word_vector = model['word']
print(word_vector)
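GloVe's plain-text files store one token per line followed by its vector components, without the header line that the word2vec text format starts with. As a rough sketch of that layout, the hypothetical `load_glove_text` helper below parses such lines into a dictionary. The two sample lines are made up and stand in for a real file.

```python
import io


def load_glove_text(handle):
    """Parse GloVe-style lines ('word v1 v2 ...') into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample standing in for a real GloVe file
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove_text(sample)

print(vectors['cat'])  # [0.4, 0.5, 0.6]
print(len(vectors))    # 2
```

For a real file, the same function could be called on `open(path, encoding='utf-8')`, although loading through gensim is likely to be more memory-efficient for large vocabularies.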