During 2018-19, I spent many weekends studying Machine Learning and Deep Learning. As a small side project, I even built a conversational AI chatbot out of curiosity (using TensorFlow Seq2Seq, NumPy and Python) to see how things work behind the scenes. What fascinated me most were the capabilities of some of the well-known contextual AI chatbots of that period: Meena, DialoGPT, Cleverbot, Mitsuku and XiaoIce, to name a few. None of them made it as big as ChatGPT, but they all had one thing in common – they were introducing the concept of contextual conversation to the field of AI.
There was a small “AI Winter” after that, during the pandemic, when people were talking less about innovations and more about survival.
Fast forward to today – ChatGPT and other implementations of Generative AI have changed how software is envisioned and built, and how problems are solved. The core concept behind technologies like ChatGPT is Natural Language Processing (NLP). In simple words, it means manipulating and analysing the natural-language text used by humans.
In my opinion, every inquisitive software engineer should know how these mind-boggling capabilities come into existence – starting from the very basic steps and working up to the powerful engines they have become. This is not to ask people to move into the Data Science field but to encourage them to utilise the power of the latest research work to solve problems in their respective fields. The steps one should take to start learning NLP are, in order:
– Text cleaning and Text Preprocessing techniques (Parsing, Tokenization, Stemming, Stopwords, Lemmatization, Word2Vec, Bag of words, Word embeddings, Unigrams, Bigrams, N-grams)
– ANN (Artificial Neural Network) and RNN (Recurrent Neural Network)
– LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit)
– Encoding and Decoding
– Attention Models
– Transformer architecture and Language models
– Use cases like BERT, ChatGPT
In this article, I will explain a few of the initial steps: cleaning and preprocessing natural-language text before it is sent on to the later stages of an NLP project pipeline.
A little bit of NLP history
To give you all some context – the whole idea of working with NLP began back in the 1950s at the intersection of AI and linguistics. At that time, another field was making huge strides: automated Text Information Retrieval (TIR), whose primary purpose was to index, search and extract text from huge volumes of data. Later on, the study of NLP and TIR merged under the broader umbrella term “NLP”. After that, some major work happened in this field:
– Word-for-word machine translation, which was tripped up by homographs (words spelled the same but with different meanings)
– Backus-Naur Form (BNF) notation for context-free grammars (CFGs), which is used to represent the syntax of programming languages. This proved inadequate for NLP problems
– Lexical analyzer (lexer) generators and parser generators (more on their implementation in my future articles)
None of the above, nor other parsing techniques, was sufficient to extract “semantics” (meaning) from text. This led to the birth of “Statistical NLP”, where a statistical parser determines the “most likely” (context-dependent) parse of a sentence. This is the approach that has driven major progress in NLP, and its applications include natural-language text processing, summarization, cross-language information retrieval and speech recognition.
With that info, let’s go through the cleaning and processing phases.
Most of the time, when people collect text data through web scraping, crowdsourcing, existing datasets or language resources (e.g. dictionaries, ontologies), the data arrives in a raw and unstructured format. Collected linguistic text data of this kind (known as a corpus in the NLP world) is usually not very useful, as it stands, for the NLP use case it was gathered for. To convert it into a usable form, text cleaning needs to be done. There are several ways of cleaning data, and the right choice depends on factors like the business domain, the use case, the business context and the preferred outcome. Based on these factors, it is up to the engineers to apply suitable cleaning techniques to remove inconsistencies or correct errors. Some frequently applied data-cleaning techniques are:
– Removing emojis or emoticons (not advisable for use cases like sentiment analysis, where they carry meaning)
– Removing punctuation and numbers
– Removing extra spaces
– Converting the whole corpus to lowercase
– Removing non-English words
… and many more. The list is not exhaustive and depends on the factors mentioned before.
Let’s move on to the coding side and see how this can be done. Most data engineers use Python as the preferred language for these NLP tasks.
Once you take the corpus, you can use code to:
– remove punctuation
– convert to lower case
– remove extra spaces
– remove emojis and emoticons
– remove non-English words
```python
import re
import string

import emoji
import nltk


class TextCleaning:
    def __init__(self):
        # Fetch NLTK's English word list, used by remove_non_english_words
        nltk.download("words")

    def remove_punctuation(self, corpus: str) -> str:
        translator = str.maketrans("", "", string.punctuation)
        return corpus.translate(translator)

    def convert_to_lowercase(self, corpus: str) -> str:
        return corpus.lower()

    def remove_extra_spaces(self, corpus: str) -> str:
        # split() with no arguments collapses any run of whitespace
        output_data = " ".join(corpus.split())
        return output_data

    def remove_emojis_and_emoticons(self, corpus: str) -> str:
        corpus = emoji.replace_emoji(corpus)
        emoticon_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # emoticons
            "\U0001F300-\U0001F5FF"  # symbols and pictographs
            "\U0001F680-\U0001F6FF"  # transport and map symbols
            "\U0001F1E0-\U0001F1FF"  # flags
            "\U00002702-\U000027B0"  # dingbats
            "\U000024C2-\U0001F251"  # a very wide range; also covers many non-emoji characters
            "]+",
            flags=re.UNICODE,
        )
        corpus = emoticon_pattern.sub("", corpus)
        return corpus

    def remove_non_english_words(self, corpus: str) -> str:
        english_words = set(nltk.corpus.words.words())
        words = corpus.split()
        filtered_words = [word for word in words if word.lower() in english_words]
        corpus = " ".join(filtered_words)
        return corpus


if __name__ == "__main__":
    text_cleaning = TextCleaning()
    corpus = """I just got back from a trip to España! 🇪🇸 It was amazing. The food 🍴 was incredible, and I tried so many new dishes like paella and churros. 🤤 The architecture in Barcelona was breathtaking, with towering cathedrals and colorful buildings. 🏰🌈 I also had the chance to practice my español with some locals, which was a bit intimidating at first but really fun in the end. 🗣️ Overall, it was an unforgettable experience and I can't wait to travel to more countries and learn new languages in the future! 🌍🧳"""
    corpus = text_cleaning.remove_punctuation(corpus)
    corpus = text_cleaning.convert_to_lowercase(corpus)
    corpus = text_cleaning.remove_extra_spaces(corpus)
    corpus = text_cleaning.remove_emojis_and_emoticons(corpus)
    corpus = text_cleaning.remove_non_english_words(corpus)
    print(corpus)
```
… and get the resulting text.
Once the data has been cleaned to suit our needs, we can move on to the next stages:
Text Preprocessing (Tokenization)
For any text analysis or text generation using NLP, it is important to identify and separate out the basic units (e.g. words or phrases), called “tokens”. But how do we recognize these units and break the corpus down into them in the first place? Different languages have different tokenization rules, which makes the process more complex. Take “New Delhi” and “isn’t” as examples. Although “New Delhi” consists of two words, they should be kept together as one unit. On the other hand, “isn’t” needs to be broken into two separate words, “is” and “not”, to be meaningful. We can tokenize at different levels, e.g. at the sentence level and at the word level.
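To make the two levels concrete, here is a minimal hand-rolled sketch using only Python's standard library. It is an illustration under assumptions, not how any particular library does it: the lookup tables `MULTIWORD_UNITS` and `CONTRACTIONS` and the function names are hypothetical, and the sentence splitter is naive (it would break on abbreviations such as “e.g.”).

```python
import re
from typing import List

# Hypothetical lookup tables for this sketch; real tokenizers ship far larger ones.
MULTIWORD_UNITS = {("new", "delhi"): "New Delhi"}
CONTRACTIONS = {"isn't": ["is", "not"]}


def sentence_tokenize(text: str) -> List[str]:
    """Sentence level: split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence: str) -> List[str]:
    """Word level: split off punctuation, expand contractions, merge known units."""
    # A word (optionally with an internal apostrophe) or a single punctuation mark
    raw = re.findall(r"\w+(?:['’]\w+)?|[^\w\s]", sentence)

    tokens: List[str] = []
    for tok in raw:
        key = tok.lower().replace("’", "'")
        tokens.extend(CONTRACTIONS.get(key, [tok]))

    merged: List[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD_UNITS:
            merged.append(MULTIWORD_UNITS[pair])  # keep "New Delhi" as one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    text = "New Delhi isn't small. It is the capital of India!"
    for sentence in sentence_tokenize(text):
        print(word_tokenize(sentence))
```

Even this small sketch needs language-specific tables to get “New Delhi” and “isn’t” right, which is why tokenization is usually left to a library.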
There are popular libraries in different languages that do most of this heavy lifting for us:
– nltk (Natural Language Toolkit), spaCy, keras, scikit-learn, gensim (in Python)
– Stanford CoreNLP, OpenNLP (in Java)
– tidytext, text2vec (in R)
Text Preprocessing (Stemming)
The tokens derived in the previous “Tokenization” step now need to be processed further to reduce them to their root forms. Usually, this is done with a stemming algorithm that applies a set of rules/heuristics to strip prefixes/suffixes before producing its output. Consider the words “finale”, “final”, “finally” and “finalize”. After stemming, all of these are transformed to their common base form, “final”, and the subsequent steps are applied to that. However, there is no guarantee that the derived root form will be a meaningful word. Take the words “history” and “historical”. After stemming, they come out as truncated forms such as “histori” (the exact output depends on the stemmer), which are not real words. The main aim of stemming is to reduce words to a common root quickly, without checking whether the result is a proper, meaningful word. So it has some limitations.
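The rule-based idea can be sketched with a toy suffix stripper in plain Python. This is deliberately simplistic and is not the Porter algorithm or any library's implementation; the rule list `SUFFIX_RULES`, the `MIN_STEM_LENGTH` guard and the function name `toy_stem` are assumptions made up for this illustration. For real work, use a library stemmer such as NLTK's `PorterStemmer`.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Illustrative only: real stemmers use many more rules and conditions.
SUFFIX_RULES = [
    ("ization", ""),
    ("ize", ""),
    ("ing", ""),
    ("ly", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
    ("e", ""),
    ("y", "i"),
]

MIN_STEM_LENGTH = 3  # do not strip a suffix if too little of the word would remain


def toy_stem(word: str) -> str:
    """Apply the first suffix rule that matches and leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["finale", "final", "finally", "finalize", "history"]:
        print(w, "->", toy_stem(w))
    # finale -> final
    # final -> final
    # finally -> final
    # finalize -> final
    # history -> histori
```

The last line shows the limitation discussed above: the rules never check the result against a dictionary, so “histori” is accepted as a stem.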
Text Preprocessing (Lemmatization)
One important shortcoming of “Stemming” is that it may produce an approximate root form that is not a valid word in the language at all (as discussed above). The “Lemmatization” technique overcomes this disadvantage by always producing valid words. It uses more advanced algorithms that consider each word’s part of speech and other grammatical structure, and gives results that have some level of contextual meaning attached to them. It is more computationally intensive than stemming but gives better results.
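The role of the part of speech can be illustrated with a tiny dictionary-lookup lemmatizer. This is only a sketch: the mini-lexicon `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical, and real lemmatizers rely on large lexical resources (NLTK's `WordNetLemmatizer`, for instance, also takes a `pos` argument).

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
# "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("histories", "n"): "history",
    ("better", "a"): "good",
    ("was", "v"): "be",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


if __name__ == "__main__":
    print(toy_lemmatize("meeting", pos="n"))    # meeting
    print(toy_lemmatize("meeting", pos="v"))    # meet
    print(toy_lemmatize("histories", pos="n"))  # history
    print(toy_lemmatize("better", pos="a"))     # good
```

The same surface form (“meeting”) maps to different lemmas depending on its part of speech, and every result is a real word, which is exactly what a suffix-stripping stemmer cannot guarantee.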
Most of the libraries (mentioned before) have support for both “stemming” and “lemmatization”.
The following example shows how tokenization and lemmatization of a corpus paragraph can be done using the nltk library. It also uses nltk’s “stopwords” collection to remove words that carry little or no meaning in the context of the supplied corpus paragraph.
```python
# text_preprocessing.py
# Needs NLTK's tokenizer, WordNet and stopwords resources, fetched once with
# nltk.download(...), e.g. "punkt", "wordnet" and "stopwords"
# (resource names can differ between NLTK versions).
import nltk
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords


class TextPreprocessing:
    def tokenization(self, corpus):
        lemmatizer = WordNetLemmatizer()
        # build the stopword set once rather than once per word
        stop_words = set(stopwords.words("english"))
        # this generates a list of sentences from the corpus
        sentence_list = nltk.sent_tokenize(corpus)
        # this will hold the sentences after lemmatization of their words
        new_list = []
        for sentence in sentence_list:
            words = nltk.word_tokenize(sentence)
            words = [lemmatizer.lemmatize(word=word, pos="n")
                     for word in words if word not in stop_words]
            new_list.append(" ".join(words))
        for sentence in new_list:
            print(sentence)
```

```python
# caller script (a separate file)
from text_preprocessing import TextPreprocessing

if __name__ == "__main__":
    corpus = """When learning is purposeful, creativity blossoms. When creativity blossoms, thinking emanates. When thinking emanates, knowledge is fully lit. When knowledge is lit, economy flourishes."""
    text_preprocessing = TextPreprocessing()
    text_preprocessing.tokenization(corpus)
```
Output:
```
When learning purposeful , creativity blossom .
When creativity blossom , thinking emanates .
When thinking emanates , knowledge fully lit .
When knowledge lit , economy flourish .
```
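One detail in this output is worth a second look: the capitalized “When” survives even though “is” was removed. The stopword membership check is case-sensitive, and NLTK's English stopword entries are lowercase, so “When” does not match “when”. The sketch below illustrates the difference with a small inline stopword set; the set `STOPWORDS` and the function `remove_stopwords` are assumptions for this illustration, not part of nltk.

```python
from typing import List

# Small illustrative stopword set (an assumption for this sketch).
STOPWORDS = {"when", "is", "the", "a", "an", "and", "of"}


def remove_stopwords(tokens: List[str], case_sensitive: bool = True) -> List[str]:
    """Drop tokens found in STOPWORDS, optionally ignoring case."""
    kept = []
    for token in tokens:
        candidate = token if case_sensitive else token.lower()
        if candidate not in STOPWORDS:
            kept.append(token)
    return kept


if __name__ == "__main__":
    tokens = ["When", "learning", "is", "purposeful", ",", "creativity", "blossoms", "."]
    print(remove_stopwords(tokens))
    # ['When', 'learning', 'purposeful', ',', 'creativity', 'blossoms', '.']
    print(remove_stopwords(tokens, case_sensitive=False))
    # ['learning', 'purposeful', ',', 'creativity', 'blossoms', '.']
```

Whether to lowercase before filtering depends on the use case; if the corpus has already been through the lowercase cleaning step described earlier, the issue does not arise.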