In this lab, we will introduce tools for natural language processing (NLP), from basic data preparation through to some exploration and building a simple machine learning model. We are only scratching the surface of what is possible with NLP methods in this lab. See the tidytext website for further examples.
You’ll need several packages for the lab, including:
tidytext: a library for cleaning and processing text data
SnowballC
spacyr
textstem
word2vec
uwot
textdata
install.packages(c("tidytext", "SnowballC", "spacyr", "textstem",
                   "word2vec", "uwot", "textdata",
                   "wordcloud", "caret", "ranger"))
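One note on spacyr: it wraps the Python spaCy library, so it needs a working spaCy backend. If you don’t already have one set up, the package provides a one-time installer (this downloads Python components and can take a few minutes):
library(spacyr)
spacy_install()   # one-time setup of the spaCy backend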
Now load the first few packages:
library(tidyverse)
library(tidytext)
library(textstem)
We’ll use a set of tweets related to climate change from 2015 to 2018, taken from:
https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
The data are held in the file twitter_sentiment_data.csv, which you can download from the GitHub repository. Read these in and take a quick look. There are three columns: a sentiment estimate, the tweet (message), and a tweet id. The sentiment estimates were provided by a group of experts and are tagged as follows:
2 (News): the tweet links to factual news about climate change
1 (Pro): the tweet supports the belief of man-made climate change
0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
-1 (Anti): the tweet does not believe in man-made climate change
dat <- read.csv("./datafiles/twitter_sentiment_data.csv")
head(dat)
## sentiment
## 1 -1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
## message
## 1 @tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change https://t.co/LkDehj3tNn htt…
## 3 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. https://t.co/7rV6BrmxjW via @youtube
## 4 RT @Mick_Fanning: Just watched this amazing documentary by leonardodicaprio on climate change. We all think this… https://t.co/kNSTE8K8im
## 5 RT @cnalive: Pranita Biswasi, a Lutheran from Odisha, gives testimony on effects of climate change & natural disasters on the po…
## 6 Unamshow awache kujinga na iko global warming https://t.co/mhIflU7M1X
## tweetid
## 1 7.929274e+17
## 2 7.931242e+17
## 3 7.931244e+17
## 4 7.931246e+17
## 5 7.931252e+17
## 6 7.931254e+17
Our basic plan here is to clean and tokenize the tweets, explore the results with word clouds and sentiment scores, convert the text to numeric embeddings, and finally use those embeddings for clustering and a simple classification model.
Processing text data into a usable form can be one of the most time-consuming parts of an analysis. Basically, we want to remove any characters or words that are irrelevant to the analysis. In addition, we should try to simplify and standardize the language used. For example, a computer will not necessarily recognize that ‘see’ and ‘seen’ are related to each other.
First, we’ll remove any retweets from the dataset (indicated by RT at the start of the message). While there are some applications where the number of retweets is of interest, we will treat them as duplicates for this exercise.
dat = dat %>%
filter(str_starts(message, "RT", negate = TRUE))
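A quick optional sanity check that the retweets are gone (this should return 0):
# count any messages still starting with "RT"
sum(str_starts(dat$message, "RT"))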
To illustrate the next steps, we’ll extract the fourth tweet from the dataset:
tweet = dat[4, ]
print(tweet$message)
## [1] "#BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change... https://t.co/HCIZrPUhLF"
This is a typical tweet, and it has several issues for text processing: a shortened URL (https://t.co/...), a user tag (@...), and hashtags (#...). We’ll use several steps to clean these up. To illustrate, we’ll walk through the individual steps for the first 5 tweets.
tidy_dat <- dat %>%
slice_head(n = 5)
First, we remove URLs and some leftover HTML entities (the regex also strips any remaining "RT" text):
tidy_dat <- tidy_dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", ""))
head(tidy_dat)
## sentiment
## 1 -1
## 2 1
## 3 0
## 4 1
## 5 1
## message
## 1 @tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. via @youtube
## 3 Unamshow awache kujinga na iko global warming
## 4 #BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change...
## 5 Bangladesh did not cause climate change, so the country does not need “aidâ€ï†\u009d; instead it needs compensation for the…
## tweetid
## 1 7.929274e+17
## 2 7.931244e+17
## 3 7.931254e+17
## 4 7.931273e+17
## 5 7.931297e+17
Next, we remove the user tags (@...):
tidy_dat <- tidy_dat %>%
  mutate(message = str_replace_all(message, "@\\w+", ""))
head(tidy_dat)
## sentiment
## 1 -1
## 2 1
## 3 0
## 4 1
## 5 1
## message
## 1 climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. via
## 3 Unamshow awache kujinga na iko global warming
## 4 #BeforeTheFlood Watch #BeforeTheFlood right here, as travels the world to tackle climate change...
## 5 Bangladesh did not cause climate change, so the country does not need “aidâ€ï†\u009d; instead it needs compensation for the…
## tweetid
## 1 7.929274e+17
## 2 7.931244e+17
## 3 7.931254e+17
## 4 7.931273e+17
## 5 7.931297e+17
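Note that these substring removals can leave stray whitespace behind. The tokenizer in the next step handles this automatically, but if you want the messages themselves tidy, stringr’s str_squish() collapses repeated spaces (an optional step):
# optional: collapse the extra spaces left by the removals
tidy_dat <- tidy_dat %>%
  mutate(message = str_squish(message))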
Next, we split the messages into individual words (tokens) using unnest_tokens(). This also converts everything to lowercase and strips most punctuation:
tidy_dat <- tidy_dat %>%
  unnest_tokens(word, message)
head(tidy_dat)
## sentiment tweetid word
## 1 -1 7.929274e+17 climate
## 2 -1 7.929274e+17 change
## 3 -1 7.929274e+17 is
## 4 -1 7.929274e+17 an
## 5 -1 7.929274e+17 interesting
## 6 -1 7.929274e+17 hustle
We then remove stop words (common words such as ‘the’ or ‘and’ that carry little meaning, listed in tidytext’s stop_words data frame), and keep only tokens containing at least one letter:
tidy_dat <- tidy_dat %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
head(tidy_dat)
## sentiment tweetid word
## 1 -1 7.929274e+17 climate
## 2 -1 7.929274e+17 change
## 3 -1 7.929274e+17 hustle
## 4 -1 7.929274e+17 global
## 5 -1 7.929274e+17 warming
## 6 -1 7.929274e+17 planet
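As an aside, stop_words is a data frame bundled with tidytext that combines three standard stop word lexicons (onix, SMART, and snowball), so you can inspect or subset it like any other data frame:
# take a look at the bundled stop word list
head(stop_words)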
The last thing we’ll need to do is match words with similar meanings. There are a couple of approaches to this: stemming and lemmatization. Stemming strips words back to their core stem, using stem_words() from the textstem library. For example, here are 5 different words related to programming. The stemmer converts them all to ‘program’:
words <- c("program","programming","programer","programs","programmed")
stem_words(words)
## [1] "program" "program" "program" "program" "program"
One disadvantage of this is that the stems may no longer be actual words. For example, the stem of ‘climate’ is ‘climat’:
stem_words("climate")
## [1] "climat"
The second issue is that stemming does not account for context: words with different meanings may be spelled the same, and can only be distinguished by the surrounding sentence.
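To see both problems in one place, try stemming some forms of ‘see’ (assuming textstem is loaded). The stemmer has no dictionary, so the irregular forms come back essentially unchanged, and the verb ‘saw’ is indistinguishable from the noun:
# Porter-style stemming cannot relate irregular forms to a base word,
# so these are returned essentially unchanged
stem_words(c("see", "saw", "seen"))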
Lemmatization attempts to avoid these issues by converting words to a standard form, accounting for the meaning of the surrounding words. Here we’ll use the spacyr package to perform lemmatization. Use this to compare the conversion of ‘saw’ in these two phrases:
library(spacyr)
spacy_parse("Owen saw a rabbit")
## doc_id sentence_id token_id token lemma pos entity
## 1 text1 1 1 Owen Owen PROPN
## 2 text1 1 2 saw see VERB
## 3 text1 1 3 a a DET
## 4 text1 1 4 rabbit rabbit NOUN
spacy_parse("Owen cut a plank with a saw")
## doc_id sentence_id token_id token lemma pos entity
## 1 text1 1 1 Owen Owen PROPN PERSON_B
## 2 text1 1 2 cut cut VERB
## 3 text1 1 3 a a DET
## 4 text1 1 4 plank plank NOUN
## 5 text1 1 5 with with ADP
## 6 text1 1 6 a a DET
## 7 text1 1 7 saw saw NOUN
We can now put all of these steps together and process the full dataset.
Step 1: clean
tidy_dat <- dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>%
  mutate(message = str_replace_all(message, "@\\w+", "")) %>%
  unnest_tokens(word, message) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
Step 2: lemmatize
Note: to keep things running quickly in this lab, we’ll use textstem’s lemmatize_words() function. This dictionary-based approach is not quite as robust as the spacyr library, but it is substantially faster.
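As a quick illustration (exact results may vary slightly with your textstem/lexicon versions), the dictionary maps inflected forms back to a common base word:
# each of these should map back to "see" in the default lemma dictionary
lemmatize_words(c("sees", "seeing", "seen"))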
tidy_dat$clean_word <- lemmatize_words(tidy_dat$word)
We can now use the cleaned text data to do some exploration. We’ll
start by making some word clouds. These are a very common visualization
of text data, where words are randomly placed on a figure and scaled
according to their frequency. We’ll use the wordcloud
package to make plots, and create a data frame of the counts of
individual words for use in the cloud.
tidy_count <- tidy_dat %>%
count(clean_word) %>%
arrange(-n)
head(tidy_count)
## clean_word n
## 1 climate 14478
## 2 change 14118
## 3 global 5401
## 4 warm 5138
## 5 trump 1460
## 6 real 793
First, let’s plot all the data. This is, not surprisingly, dominated by the words ‘climate’ and ‘change’.
library(wordcloud)
wordcloud(tidy_count$clean_word, tidy_count$n, max.words = 100)
For the next plot, we’ll extract only the ‘pro’ tweets, and skip plotting the dominant words (‘climate’, ‘change’, ‘global’, ‘warm’) by filtering them out:
tidy_count_pos <- tidy_dat %>%
filter(sentiment == 1,
!clean_word %in% c("climate", "change", "global", "warm")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)
And the same for the ‘anti’ tweets:
tidy_count_neg <- tidy_dat %>%
filter(sentiment == -1,
!clean_word %in% c("climate", "change", "global", "warm")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words = 100)
Next, we’ll estimate the sentiment of the tweets. The data already have a column labeled sentiment, which is a category describing whether the tweet was for or against climate change (or neutral). Sentiment analysis is a little different: it attempts to score a piece of text on whether its words are overall positive, neutral, or negative, irrespective of any belief for or against climate change. There are several different lexicons for sentiment analysis, some of which provide more fine-grained detail. Here we’ll use the AFINN lexicon, loaded with tidytext’s get_sentiments() function, which scores each word between -5 (negative) and 5 (positive). You may be prompted to download the AFINN lexicon (via the textdata package) the first time you run this.
get_sentiments("afinn") %>%
head()
## # A tibble: 6 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
We merge this with the cleaned data by joining on the cleaned word:
tidy_sentiment <- inner_join(tidy_dat, get_sentiments("afinn"), by = c("clean_word" = "word"))
head(tidy_sentiment)
## sentiment tweetid word clean_word value
## 1 -1 7.929274e+17 warming warm 1
## 2 -1 7.929274e+17 stopped stop -1
## 3 -1 7.929274e+17 warming warm 1
## 4 1 7.931244e+17 fabulous fabulous 4
## 5 1 7.931244e+17 brilliant brilliant 4
## 6 0 7.931254e+17 warming warm 1
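Before plotting, it can be useful to sanity-check the scores. For example, a quick sketch comparing the average AFINN value across the expert-assigned classes (your exact numbers will depend on the cleaning steps above):
# mean word-level AFINN score within each expert-assigned class
tidy_sentiment %>%
  group_by(sentiment) %>%
  summarize(mean_afinn = mean(value), n_words = n())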
And we can make a word cloud of the positive terms used in conjunction with climate change (I am well aware of the irony of trump being considered positive here, so I’m going to remove it):
tidy_count_pos <- tidy_sentiment %>%
filter(value > 1,
!clean_word %in% c("climate", "change", "global", "warm", "trump")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)
To go further in the analysis of text data, we need to use a text embedding. This converts the text to a numeric representation in a high-dimensional space. The simplest form of this is one-hot encoding, which creates a binary matrix with one column per word and one row per tweet. If the word occurs in that tweet, it’s labeled with a 1, and a 0 if not. One-hot encoding works well with a small number of words, but scales poorly with richer text.
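To make this concrete, here is a minimal one-hot encoding sketch using a tiny invented example (the toy data frame is hypothetical, mimicking the tokenized one-word-per-row format above):
# two 'tweets' already tokenized to one word per row
toy <- data.frame(id = c(1, 1, 2, 2),
                  word = c("climate", "change", "warm", "climate"))
toy %>%
  distinct(id, word) %>%               # count each word once per tweet
  mutate(present = 1) %>%
  pivot_wider(names_from = word,       # one 0/1 column per word
              values_from = present, values_fill = 0)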
Embeddings are more complex representations of text, usually created by analyzing which words are likely to occur in similar contexts. This has a lot of similarities to principal component analysis for numeric data, in which complex data can be represented by a small number of components that capture correlations between the variables. For text, this means that the embeddings for ‘dog’ and ‘cat’ will be similar, but those for ‘dog’ and ‘car’ will be dissimilar. This can then be used to explore the similarity between pieces of text, or (as we’ll see below) to use text in machine learning models. These embeddings are a key part of large language models (e.g. ChatGPT), where they are used to relate prompts or questions to the appropriate text that makes up a response.
While it’s possible to create your own embedding (which is useful for specific projects), this can be quite time consuming, and can require a substantial amount of text. In the example we’ll use below, we’ll use an embedding that was created using a model called Word2Vec and trained using Google news articles. You can download the file that contains the embedding weights from the Google Drive folder:
https://drive.google.com/drive/folders/1GMEY1fYEj1YMI__u3hU4y6agnrz3ekna?usp=drive_link
A good selection of alternative, pre-trained embeddings can be found at Hugging Face:
https://huggingface.co/models?other=text-embedding
Load the word2vec package, which we’ll use to find the embeddings for different pieces of text. We also need to read in the embeddings file:
library(word2vec)
model <- read.word2vec(file = "./wgts/GoogleNews-vectors-negative300.bin", normalize = TRUE)
As an example, here is the embedding for the word ‘cat’ (I’ve just printed the first 50 values):
predict(model, "cat", type = "embedding")[1, 1:50]
## [1] 0.07029951 1.16377008 -1.62593722 1.23615777 0.67376167 0.47330365
## [7] 0.28398219 -0.05429071 1.25843084 -0.71830785 0.45938295 -3.34096694
## [13] -0.02540527 -1.69275653 -0.07482374 -0.47608778 0.28815839 0.86308312
## [19] -2.56140804 -0.07725986 1.22502124 -0.84081000 1.28070402 -0.71273959
## [25] -0.55404365 1.41991091 -1.64821029 2.08253598 2.34981346 -0.49000847
## [31] -0.44824639 -1.12479222 -0.51784986 -0.80740035 -0.58466923 0.74614930
## [37] -0.01974999 0.41205257 0.25196460 1.97117043 0.42597327 -0.64035201
## [43] 0.38421118 0.64035201 0.11275763 -0.70438719 1.19717979 -0.41205257
## [49] -0.15869592 0.31599978
It’s pretty meaningless to us mortals, but this is a representation of the word ‘cat’ that a computer can work with. To follow the example given above, we can extract the embeddings for ‘cat’, ‘dog’ and ‘car’, and explore the correlations between them:
cat_wv = predict(model, "cat", type = "embedding")[1, ]
car_wv = predict(model, "car", type = "embedding")[1, ]
dog_wv = predict(model, "dog", type = "embedding")[1, ]
cor(cat_wv, dog_wv)
## [1] 0.7607611
cor(car_wv, dog_wv)
## [1] 0.3075351
plot(cat_wv, dog_wv, xlab = "cat", ylab = "dog")
plot(cat_wv, car_wv, xlab = "cat", ylab = "car")
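Correlation gives a reasonable picture here, but embedding similarity is more often measured with cosine similarity. A minimal hand-rolled sketch (written out in full to avoid assuming any package helper):
# cosine similarity between two embedding vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(cat_wv, dog_wv)   # should be relatively high
cosine(cat_wv, car_wv)   # should be lower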
Now we’ll get the embedding for every word in the cleaned dataset:
vectorized_words = predict(model, tidy_dat$clean_word,
                           type = "embedding")
The result (vectorized_words) is a numeric array with 300 columns and the same number of rows as the cleaned words. We’ll now collapse the values into a mean embedding for each tweet. To do this, we have to add (and subsequently remove) the tweet id from the cleaned data.
vectorized_words = as.data.frame(vectorized_words)
vectorized_words$id = tidy_dat$tweetid
vectorized_docs <- vectorized_words %>%
  drop_na() %>%                         # words missing from the vocabulary give NA embeddings
  group_by(id) %>%                      # group the word embeddings by tweet
  summarise_all(mean, na.rm = TRUE) %>% # average each of the 300 dimensions within a tweet
  select(-id)
We can now use any of the usual tools for exploring and modeling numeric data. We’ll first use a k-means cluster function to group the tweets into 4 sets:
tweet_km <- kmeans(vectorized_docs, 4)
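Note that kmeans() starts from random centers, so the cluster labels and memberships will vary between runs. For more stable results you might fix the seed and use multiple starts (the values below are illustrative):
set.seed(1234)                                    # illustrative seed
tweet_km <- kmeans(vectorized_docs, centers = 4,  # same 4 clusters as above
                   nstart = 10)                   # keep the best of 10 random starts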
We can also visualize the embeddings using other dimension reduction techniques. Here we use UMAP, a non-linear, efficient way of collapsing high-dimensional data down to a small number of dimensions (usually 2):
library(uwot)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
viz <- umap(vectorized_docs, n_neighbors = 15,
min_dist = 0.001, spread = 4, n_threads = 2)
This can be plotted - each point here represents an individual tweet, and the colors are the clusters we created in the previous step. Note there are quite a lot of outliers that could potentially be removed, and that one cluster is very distinct from the others. This may suggest a group of tweets that deal with a different aspect of climate change. (You could plot the word cloud for these tweets to see if that shows some differences; a sketch of how to do this follows the figure below.)
library(ggplot2)
df <- data.frame(x = viz[, 1], y = viz[, 2],
cluster = as.factor(tweet_km$cluster),
stringsAsFactors = FALSE)
ggplot(df, aes(x = x, y = y, col = cluster)) +
geom_point() + theme_void()
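Following up on that suggestion, here is a hedged sketch of linking the clusters back to words for a per-cluster word cloud. It relies on group_by() + summarise() returning one row per tweet id in sorted order, which matches the row order of vectorized_docs:
# recover the tweet ids in the order used to build vectorized_docs
cluster_ids <- vectorized_words %>%
  drop_na() %>%
  group_by(id) %>%
  summarise(n_words = n()) %>%
  mutate(cluster = tweet_km$cluster)   # attach the k-means labels

# word counts for (say) cluster 1, then a cloud as before
cluster1_count <- tidy_dat %>%
  filter(tweetid %in% cluster_ids$id[cluster_ids$cluster == 1]) %>%
  count(clean_word) %>%
  arrange(-n)
wordcloud(cluster1_count$clean_word, cluster1_count$n, max.words = 100)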
As a last step, we’ll briefly look at using these embeddings in a machine learning model. We’ll build a model to try to predict the sentiment of a tweet (positive or negative) from its content. We’ll use a random forest model with the embeddings as features and the sentiment value as a label. We’ll first need to integrate our embedding data with the sentiment score we generated earlier. First, we’ll remake the average embedding values per tweet, but this time we’ll keep the tweet id.
vectorized_docs_ml <- vectorized_words %>%
drop_na() %>%
group_by(id) %>%
summarise_all(mean, na.rm = TRUE)
Next, we generate a mean sentiment score for each tweet, and convert it to a binary label (0 = negative, 1 = positive):
tidy_sentiment <- tidy_sentiment %>%
group_by(tweetid) %>%
summarize(value = mean(value)) %>%
mutate(sentiment = ifelse(value > 0, 1, 0))
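It’s worth a quick look at the class balance before modeling (the exact counts will depend on the cleaning steps above):
# how many tweets fall in each binary class?
table(tidy_sentiment$sentiment)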
Then we merge these two datasets together using the tweet id, and remove any columns we do not want to use in the ML model:
vectorized_docs_ml = inner_join(vectorized_docs_ml,
tidy_sentiment,
by = c("id" = "tweetid"))
vectorized_docs_ml = vectorized_docs_ml %>%
select(-id, -value) %>%
mutate(sentiment = as.factor(sentiment))
Now we’ll load the caret package. As the dataset is relatively large, we’ll use a different, more efficient package (ranger) to build the random forest model:
library(caret)
library(ranger)
Now form a training and test set (80/20 split):
train_id = createDataPartition(vectorized_docs_ml$sentiment, p = 0.8)
train = vectorized_docs_ml[train_id[[1]], ]
test = vectorized_docs_ml[-train_id[[1]], ]
Train the model:
fit_rf = ranger(sentiment ~ ., train)
Predict for the test dataset:
y_pred = predict(fit_rf, test)$predictions
And get the performance metrics:
confusionMatrix(test$sentiment, y_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1357 183
## 1 333 770
##
## Accuracy : 0.8048
## 95% CI : (0.7891, 0.8197)
## No Information Rate : 0.6394
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5907
##
## Mcnemar's Test P-Value : 5.404e-11
##
## Sensitivity : 0.8030
## Specificity : 0.8080
## Pos Pred Value : 0.8812
## Neg Pred Value : 0.6981
## Prevalence : 0.6394
## Detection Rate : 0.5134
## Detection Prevalence : 0.5827
## Balanced Accuracy : 0.8055
##
## 'Positive' Class : 0
##
There’s a lot of output here, but the key metric we’ll use is the accuracy, which is roughly 80%. This suggests that, given a tweet, we’d be able to predict its sentiment fairly well. This could be improved by tuning the model, using the continuous sentiment score rather than the 0/1 indicator and, of course, including more data.
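As a hedged sketch of the tuning idea, caret’s train() can search over ranger’s main hyperparameters with cross-validation. The grid values and fold count below are illustrative, and this can take a while to run on the full dataset:
# small illustrative tuning grid for ranger via caret
tune_grid <- expand.grid(mtry = c(5, 15, 30),
                         splitrule = "gini",
                         min.node.size = c(1, 5))
fit_tuned <- train(sentiment ~ ., data = train,
                   method = "ranger",
                   tuneGrid = tune_grid,
                   trControl = trainControl(method = "cv", number = 3))
fit_tuned$bestTune   # the best combination found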