Natural Language Processing for Newbies

“Don’t mind me…I’m not REALLY reading, I am just looking for patterns in the data…”

As promised in my previous post, I shall be explaining AI-related terminology for non-experts. This post is aimed at explaining Natural Language Processing, or NLP for short, for those who have a general idea of what it means but would like to understand it better. NLP is a sub-field of AI which focuses on communication between computers and humans via natural languages. More specifically, NLP is concerned with programming computers to decipher, analyse and produce natural language data, such as text. This is not to be confused with Speech Recognition (SR), which involves the processing of sounds (spoken language) as opposed to text (written language). Although you will find many programs in which SR and NLP are used in conjunction with one another, as in the case of Apple’s Siri or Amazon’s Alexa, they are still two separate problem domains which require their own individual pipelines.

We come into contact with data in the form of text on a daily basis: in textbooks, novels, signs, websites, instruction manuals, recipes and so on. This type of data is referred to as unstructured data because it cannot easily be stored in a tabular format such as a spreadsheet or database table. Unstructured data is estimated to make up as much as 79% of all online data, and more is constantly being generated as we communicate on Facebook, Twitter, Messenger, Reddit, WhatsApp and the like. Understanding the content of unstructured data with computers, in an automated fashion, is a tricky task because it requires human-level intelligence: the intelligence that we use to derive meaning from text or spoken language every day.

This is where NLP comes in. At a high level, an NLP algorithm cannot read text in the way humans do; instead, it looks for patterns within the data. It does this by first eliminating unimportant words from the text, such as conjunctions (“but”, “however”, etc.), articles (“a”, “the”, etc.) or any other word that is irrelevant for the task at hand. Following this step, the algorithm splits the data into small groups of words or characters known as tokens, and then counts how many times each token appears in a given document. It stores this information in a matrix that describes how the tokens are distributed across the document. A common application of NLP is “learning” a foreign language by comparing two versions of the same book written in different languages. An NLP-based program first breaks down each book into matrix form as described above. Using deep learning (a machine learning method that learns to identify and match patterns across large collections of data), the program then compares tokens from the two books and thereby learns translations for a multitude of words. This type of NLP is known as Natural Language Understanding (NLU), as it deals with extracting meaning, or information, from text. The same program can then be used to translate a previously unseen phrase from one language to the other; because this involves producing new content, it falls under Natural Language Generation (NLG).
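To make the tokenise-and-count idea above concrete, here is a minimal sketch in plain Python. The tiny stop-word list and the two toy documents are my own illustrative choices; real pipelines rely on proper tokenisers and far larger stop-word lists, so treat this as a sketch of the idea rather than a production implementation.

```python
from collections import Counter

# A tiny, illustrative stop-word list; real pipelines use far larger ones.
STOP_WORDS = {"a", "an", "the", "and", "but", "however", "on", "was", "not"}

def tokenise(text):
    """Lowercase the text and split it into word tokens, stripping punctuation."""
    return [word.strip(".,!?\"'") for word in text.lower().split()]

def count_tokens(text):
    """Drop stop words, then count how often each remaining token appears."""
    return Counter(t for t in tokenise(text) if t and t not in STOP_WORDS)

# Two toy documents standing in for two pieces of unstructured text.
docs = [
    "The cat sat on the mat, but the cat was not happy.",
    "The dog sat on the mat and the dog was happy.",
]
counts = [count_tokens(doc) for doc in docs]

# Build a matrix: one row per document, one column per vocabulary token.
vocabulary = sorted(set().union(*counts))
matrix = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)  # ['cat', 'dog', 'happy', 'mat', 'sat']
print(matrix)      # [[2, 0, 1, 1, 1], [0, 2, 1, 1, 1]]
```

Each row of the resulting matrix describes one document and each column counts one token; a representation like this is often called a document-term matrix, and it is what many downstream NLP algorithms actually operate on.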

Work on NLP is generally thought to have started in the early 1950s, with the aim of translating text between different languages. One of the scientists heavily involved in this field was Alan Turing, who established the Turing test: a test, in the form of a game (the “imitation game”, introduced in a paper that opens with the question “Can machines think?”), used to assess a machine’s ability to exhibit intelligent behaviour indistinguishable from a human’s. As part of the test, a human observer is presented with two sets of answers to a list of questions; one set contains answers produced by a machine, whilst the other contains answers produced by a human. The observer must then work out which set of answers belongs to the machine and which belongs to the human. Progress in NLP has not been as fast as anticipated: one of the first programs claimed to have passed the Turing test is a chatbot named Eugene Goostman, a simulation of a 13-year-old Ukrainian boy, which on 7th June 2014 convinced 33% of the judges at the Royal Society in London that it was human.

There is an abundance of applications for NLP. Here are some introductory examples to give you a better idea of its power: grouping accident reports on oil rigs, translating books, and performing sentiment analysis on the Harry Potter books.
