In this lab, we will introduce tools for natural language processing (NLP), from basic data preparation through to some exploration and building a simple machine learning model. We are only scratching the surface of what is possible with NLP methods in this lab. See the tidytext website for further examples.
You’ll need several packages for the lab, including:
tidytext: a library for cleaning and processing text data
SnowballC
spacyr
textstem
word2vec
uwot
textdata
install.packages(c("tidytext", "SnowballC", "spacyr", "textstem",
                   "word2vec", "uwot", "textdata",
                   "wordcloud", "caret", "ranger"))
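One note on spacyr: it wraps the Python spaCy library, so it needs a working spaCy backend. If you don’t already have one set up, the package provides a one-time installer (this downloads Python components and can take a few minutes):
library(spacyr)
spacy_install()   # one-time setup of the spaCy backend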
Now load the first few packages:
library(tidyverse)
library(tidytext)
library(textstem)
We’ll use a set of tweets related to climate change from 2015 to 2018, taken from:
https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
The data are held in the file twitter_sentiment_data.csv, which you can download from the GitHub repository. Read these in and take a quick look. There are three columns: a sentiment estimate, the tweet (message), and a tweet id. The sentiment estimates were provided by a group of experts and are tagged as follows:
2 (News): the tweet links to factual news about climate change
1 (Pro): the tweet supports the belief of man-made climate change
0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
-1 (Anti): the tweet does not believe in man-made climate change
dat <- read.csv("./datafiles/twitter_sentiment_data.csv")
head(dat)
## sentiment
## 1 -1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
## message
## 1 @tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change https://t.co/LkDehj3tNn htt…
## 3 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. https://t.co/7rV6BrmxjW via @youtube
## 4 RT @Mick_Fanning: Just watched this amazing documentary by leonardodicaprio on climate change. We all think this… https://t.co/kNSTE8K8im
## 5 RT @cnalive: Pranita Biswasi, a Lutheran from Odisha, gives testimony on effects of climate change & natural disasters on the po…
## 6 Unamshow awache kujinga na iko global warming https://t.co/mhIflU7M1X
## tweetid
## 1 7.929274e+17
## 2 7.931242e+17
## 3 7.931244e+17
## 4 7.931246e+17
## 5 7.931252e+17
## 6 7.931254e+17
Our basic plan here is to clean and tokenize the tweets, explore the results with word clouds and sentiment scores, convert the text to numeric embeddings, and finally use those embeddings for clustering and a simple classification model.
Processing text data into a usable form can be one of the most time-consuming parts of an analysis. Basically, we want to remove any characters or words that are irrelevant to the analysis. In addition, we should try to simplify and standardize the language used. For example, a computer will not necessarily recognize that ‘see’ and ‘seen’ are related to each other.
First, we’ll remove any retweets from the dataset (indicated by RT at the start of the message). While there are some applications where the number of retweets is of interest, we will treat them as duplicates for this exercise.
dat = dat %>%
filter(str_starts(message, "RT", negate = TRUE))
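A quick optional sanity check that the retweets are gone (this should return 0):
# count any messages still starting with "RT"
sum(str_starts(dat$message, "RT"))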
To illustrate the next steps, we’ll extract the fourth tweet from the dataset:
tweet = dat[4, ]
print(tweet$message)
## [1] "#BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change... https://t.co/HCIZrPUhLF"
This is a typical tweet, and it has several issues for text processing: a shortened URL (https://t.co/...), a user tag (@...), and hashtags (#...). We’ll use several steps to clean these up. To illustrate, we’ll walk through the individual steps for the first 5 tweets.
tidy_dat <- dat %>%
slice_head(n = 5)
First, we remove URLs and some leftover HTML entities (the regex also strips any remaining "RT" text):
tidy_dat <- tidy_dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", ""))
head(tidy_dat)
## sentiment
## 1 -1
## 2 1
## 3 0
## 4 1
## 5 1
## message
## 1 @tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. via @youtube
## 3 Unamshow awache kujinga na iko global warming
## 4 #BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change...
## 5 Bangladesh did not cause climate change, so the country does not need “aidâ€ï†\u009d; instead it needs compensation for the…
## tweetid
## 1 7.929274e+17
## 2 7.931244e+17
## 3 7.931254e+17
## 4 7.931273e+17
## 5 7.931297e+17
Next, we remove the user tags (@...):
tidy_dat <- tidy_dat %>%
  mutate(message = str_replace_all(message, "@\\w+", ""))
head(tidy_dat)
## sentiment
## 1 -1
## 2 1
## 3 0
## 4 1
## 5 1
## message
## 1 climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom
## 2 Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. via
## 3 Unamshow awache kujinga na iko global warming
## 4 #BeforeTheFlood Watch #BeforeTheFlood right here, as travels the world to tackle climate change...
## 5 Bangladesh did not cause climate change, so the country does not need “aidâ€ï†\u009d; instead it needs compensation for the…
## tweetid
## 1 7.929274e+17
## 2 7.931244e+17
## 3 7.931254e+17
## 4 7.931273e+17
## 5 7.931297e+17
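Note that these substring removals can leave stray whitespace behind. The tokenizer in the next step handles this automatically, but if you want the messages themselves tidy, stringr’s str_squish() collapses repeated spaces (an optional step):
# optional: collapse the extra spaces left by the removals
tidy_dat <- tidy_dat %>%
  mutate(message = str_squish(message))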
Next, we split the messages into individual words (tokens) using unnest_tokens(). This also converts everything to lowercase and strips most punctuation:
tidy_dat <- tidy_dat %>%
  unnest_tokens(word, message)
head(tidy_dat)
## sentiment tweetid word
## 1 -1 7.929274e+17 climate
## 2 -1 7.929274e+17 change
## 3 -1 7.929274e+17 is
## 4 -1 7.929274e+17 an
## 5 -1 7.929274e+17 interesting
## 6 -1 7.929274e+17 hustle
We then remove stop words (common words such as ‘the’ or ‘and’ that carry little meaning, listed in tidytext’s stop_words data frame), and keep only tokens containing at least one letter:
tidy_dat <- tidy_dat %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
head(tidy_dat)
## sentiment tweetid word
## 1 -1 7.929274e+17 climate
## 2 -1 7.929274e+17 change
## 3 -1 7.929274e+17 hustle
## 4 -1 7.929274e+17 global
## 5 -1 7.929274e+17 warming
## 6 -1 7.929274e+17 planet
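As an aside, stop_words is a data frame bundled with tidytext that combines three standard stop word lexicons (onix, SMART, and snowball), so you can inspect or subset it like any other data frame:
# take a look at the bundled stop word list
head(stop_words)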
The last thing we’ll need to do is match words with similar meanings. There are a couple of approaches to this: stemming and lemmatization. Stemming strips words back to their core stem, using stem_words() from the textstem library. For example, here are 5 different words related to programming. The stemmer converts them all to ‘program’:
words <- c("program","programming","programer","programs","programmed")
stem_words(words)
## [1] "program" "program" "program" "program" "program"
One disadvantage of this is that the stems may no longer be actual words. For example, the stem of ‘climate’ is ‘climat’:
stem_words("climate")
## [1] "climat"
The second issue is that stemming does not account for context: words with different meanings may be spelled the same, and can only be distinguished by the surrounding sentence.
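To see both problems in one place, try stemming some forms of ‘see’ (assuming textstem is loaded). The stemmer has no dictionary, so the irregular forms come back essentially unchanged, and the verb ‘saw’ is indistinguishable from the noun:
# Porter-style stemming cannot relate irregular forms to a base word,
# so these are returned essentially unchanged
stem_words(c("see", "saw", "seen"))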
Lemmatization attempts to avoid these issues by converting words to a standard form, accounting for the meaning of the surrounding words. Here we’ll use the spacyr package to perform lemmatization. Use this to compare the conversion of ‘saw’ in these two phrases:
library(spacyr)
spacy_parse("Owen saw a rabbit")
## doc_id sentence_id token_id token lemma pos entity
## 1 text1 1 1 Owen Owen PROPN
## 2 text1 1 2 saw see VERB
## 3 text1 1 3 a a DET
## 4 text1 1 4 rabbit rabbit NOUN
spacy_parse("Owen cut a plank with a saw")
## doc_id sentence_id token_id token lemma pos entity
## 1 text1 1 1 Owen Owen PROPN PERSON_B
## 2 text1 1 2 cut cut VERB
## 3 text1 1 3 a a DET
## 4 text1 1 4 plank plank NOUN
## 5 text1 1 5 with with ADP
## 6 text1 1 6 a a DET
## 7 text1 1 7 saw saw NOUN
We can now put all of these steps together and process the full dataset.
Step 1: clean
tidy_dat <- dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>%
  mutate(message = str_replace_all(message, "@\\w+", "")) %>%
  unnest_tokens(word, message) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
Step 2: lemmatize
Note: to keep things running quickly in this lab, we’ll use textstem’s lemmatize_words() function. This dictionary-based approach is not quite as robust as the spacyr library, but it is substantially faster.
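As a quick illustration (exact results may vary slightly with your textstem/lexicon versions), the dictionary maps inflected forms back to a common base word:
# each of these should map back to "see" in the default lemma dictionary
lemmatize_words(c("sees", "seeing", "seen"))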
tidy_dat$clean_word <- lemmatize_words(tidy_dat$word)
We can now use the cleaned text data to do some exploration. We’ll
start by making some word clouds. These are a very common visualization
of text data, where words are randomly placed on a figure and scaled
according to their frequency. We’ll use the wordcloud
package to make plots, and create a data frame of the counts of
individual words for use in the cloud.
tidy_count <- tidy_dat %>%
count(clean_word) %>%
arrange(-n)
head(tidy_count)
## clean_word n
## 1 climate 14478
## 2 change 14118
## 3 global 5401
## 4 warm 5138
## 5 trump 1460
## 6 real 793
First, let’s plot all the data. This is, not surprisingly, dominated by the words ‘climate’ and ‘change’.
library(wordcloud)
wordcloud(tidy_count$clean_word, tidy_count$n, max.words = 100)
For the next plot, we’ll extract only the ‘pro’ tweets, and skip plotting the dominant words (‘climate’, ‘change’, ‘global’, ‘warm’) by filtering them out:
tidy_count_pos <- tidy_dat %>%
filter(sentiment == 1,
!clean_word %in% c("climate", "change", "global", "warm")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)
And the same for the ‘anti’ tweets:
tidy_count_neg <- tidy_dat %>%
filter(sentiment == -1,
!clean_word %in% c("climate", "change", "global", "warm")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words = 100)
Next, we’ll estimate the sentiment of the tweets. The data already have a column labeled sentiment, which is a category describing whether the tweet was for or against climate change (or neutral). Sentiment analysis is a little different: it attempts to score a piece of text on whether its words are overall positive, neutral, or negative, irrespective of any belief for or against climate change. There are several different lexicons for sentiment analysis, some of which provide more fine-grained detail. Here we’ll use the AFINN lexicon, loaded with tidytext’s get_sentiments() function, which scores each word between -5 (negative) and 5 (positive). You may be prompted to download the AFINN lexicon (via the textdata package) the first time you run this.
get_sentiments("afinn") %>%
head()
## # A tibble: 6 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
We merge this with the cleaned data by joining on the cleaned word:
tidy_sentiment <- inner_join(tidy_dat, get_sentiments("afinn"), by = c("clean_word" = "word"))
head(tidy_sentiment)
## sentiment tweetid word clean_word value
## 1 -1 7.929274e+17 warming warm 1
## 2 -1 7.929274e+17 stopped stop -1
## 3 -1 7.929274e+17 warming warm 1
## 4 1 7.931244e+17 fabulous fabulous 4
## 5 1 7.931244e+17 brilliant brilliant 4
## 6 0 7.931254e+17 warming warm 1
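Before plotting, it can be useful to sanity-check the scores. For example, a quick sketch comparing the average AFINN value across the expert-assigned classes (your exact numbers will depend on the cleaning steps above):
# mean word-level AFINN score within each expert-assigned class
tidy_sentiment %>%
  group_by(sentiment) %>%
  summarize(mean_afinn = mean(value), n_words = n())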
And we can make a word cloud of the positive terms used in conjunction with climate change (I am well aware of the irony of trump being considered positive here, so I’m going to remove it):
tidy_count_pos <- tidy_sentiment %>%
filter(value > 1,
!clean_word %in% c("climate", "change", "global", "warm", "trump")) %>%
count(clean_word) %>%
arrange(-n)
wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)
To go further in the analysis of text data, we need to use a text embedding. This converts the text to a numeric representation in a high-dimensional space. The simplest form of this is one-hot encoding, which creates a binary matrix with one column per word and one row per tweet. If the word occurs in that tweet, it’s labeled with a 1, and a 0 if not. One-hot encoding works well with a small number of words, but scales poorly with richer text.
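To make this concrete, here is a minimal one-hot encoding sketch using a tiny invented example (the toy data frame is hypothetical, mimicking the tokenized one-word-per-row format above):
# two 'tweets' already tokenized to one word per row
toy <- data.frame(id = c(1, 1, 2, 2),
                  word = c("climate", "change", "warm", "climate"))
toy %>%
  distinct(id, word) %>%               # count each word once per tweet
  mutate(present = 1) %>%
  pivot_wider(names_from = word,       # one 0/1 column per word
              values_from = present, values_fill = 0)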
Embeddings are more complex representations of text, usually created by analyzing which words are likely to occur in similar contexts. This has a lot of similarities to principal component analysis for numeric data, in which complex data can be represented by a small number of components that capture correlations between the variables. For text, this means that the embeddings for ‘dog’ and ‘cat’ will be similar, but those for ‘dog’ and ‘car’ will be dissimilar. This can then be used to explore the similarity between pieces of text, or (as we’ll see below) to use text in machine learning models. These embeddings are a key part of large language models (e.g. ChatGPT), where they are used to relate prompts or questions to the appropriate text that makes up a response.
While it’s possible to create your own embedding (which is useful for specific projects), this can be quite time consuming, and can require a substantial amount of text. In the example we’ll use below, we’ll use an embedding that was created using a model called Word2Vec and trained using Google news articles. You can download the file that contains the embedding weights from the Google Drive folder:
https://drive.google.com/drive/folders/1GMEY1fYEj1YMI__u3hU4y6agnrz3ekna?usp=drive_link
A good selection of alternative, pre-trained embeddings can be found at Hugging Face:
https://huggingface.co/models?other=text-embedding
Load the word2vec package, which we’ll use to find the embeddings for different pieces of text. We also need to read in the embeddings file:
library(word2vec)
model <- read.word2vec(file = "./wgts/GoogleNews-vectors-negative300.bin", normalize = TRUE)
As an example, here is the embedding for the word ‘cat’ (I’ve just printed the first 50 values):
predict(model, "cat", type = "embedding")[1, 1:50]
## [1] 0.07029951 1.16377008 -1.62593722 1.23615777 0.67376167 0.47330365
## [7] 0.28398219 -0.05429071 1.25843084 -0.71830785 0.45938295 -3.34096694
## [13] -0.02540527 -1.69275653 -0.07482374 -0.47608778 0.28815839 0.86308312
## [19] -2.56140804 -0.07725986 1.22502124 -0.84081000 1.28070402 -0.71273959
## [25] -0.55404365 1.41991091 -1.64821029 2.08253598 2.34981346 -0.49000847
## [31] -0.44824639 -1.12479222 -0.51784986 -0.80740035 -0.58466923 0.74614930
## [37] -0.01974999 0.41205257 0.25196460 1.97117043 0.42597327 -0.64035201
## [43] 0.38421118 0.64035201 0.11275763 -0.70438719 1.19717979 -0.41205257
## [49] -0.15869592 0.31599978
It’s pretty meaningless to us mortals, but this is a representation of the word ‘cat’ that a computer can work with. To follow the example given above, we can extract the embeddings for ‘cat’, ‘dog’ and ‘car’, and explore the correlations between them:
cat_wv = predict(model, "cat", type = "embedding")[1, ]
car_wv = predict(model, "car", type = "embedding")[1, ]
dog_wv = predict(model, "dog", type = "embedding")[1, ]
cor(cat_wv, dog_wv)
## [1] 0.7607611
cor(car_wv, dog_wv)
## [1] 0.3075351
plot(cat_wv, dog_wv, xlab = "cat", ylab = "dog")
plot(cat_wv, car_wv, xlab = "cat", ylab = "car")
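Correlation gives a reasonable picture here, but embedding similarity is more often measured with cosine similarity. A minimal hand-rolled sketch (written out in full to avoid assuming any package helper):
# cosine similarity between two embedding vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(cat_wv, dog_wv)   # should be relatively high
cosine(cat_wv, car_wv)   # should be lower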
Now we’ll get the embedding for every word in the cleaned dataset:
vectorized_words = predict(model, tidy_dat$clean_word,
                           type = "embedding")
The result (vectorized_words) is a numeric array with 300 columns and the same number of rows as the cleaned words. We’ll now collapse the values into a mean embedding for each tweet. To do this, we have to add (and subsequently remove) the tweet id from the cleaned data.
vectorized_words = as.data.frame(vectorized_words)
vectorized_words$id = tidy_dat$tweetid
vectorized_docs <- vectorized_words %>%
  drop_na() %>%                         # words missing from the vocabulary give NA embeddings
  group_by(id) %>%                      # group the word embeddings by tweet
  summarise_all(mean, na.rm = TRUE) %>% # average each of the 300 dimensions within a tweet
  select(-id)
We can now use any of the usual tools for exploring and modeling numeric data. We’ll first use a k-means cluster function to group the tweets into 4 sets:
tweet_km <- kmeans(vectorized_docs, 4)
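Note that kmeans() starts from random centers, so the cluster labels and memberships will vary between runs. For more stable results you might fix the seed and use multiple starts (the values below are illustrative):
set.seed(1234)                                    # illustrative seed
tweet_km <- kmeans(vectorized_docs, centers = 4,  # same 4 clusters as above
                   nstart = 10)                   # keep the best of 10 random starts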
We can also visualize the embeddings using other dimension reduction techniques. Here we use UMAP, a non-linear, efficient way of collapsing high-dimensional data down to a small number of dimensions (usually 2):
library(uwot)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
viz <- umap(vectorized_docs, n_neighbors = 15,
min_dist = 0.001, spread = 4, n_threads = 2)
This can be plotted - each point here represents an individual tweet, and the colors are the clusters we created in the previous step. Note there are quite a lot of outliers that could potentially be removed, and that one cluster is very distinct from the others. This may suggest a group of tweets that deal with a different aspect of climate change. (You could plot the word cloud for these tweets to see if that shows some differences; a sketch of how to do this follows the figure below.)
library(ggplot2)
df <- data.frame(x = viz[, 1], y = viz[, 2],
cluster = as.factor(tweet_km$cluster),
stringsAsFactors = FALSE)
ggplot(df, aes(x = x, y = y, col = cluster)) +
geom_point() + theme_void()
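Following up on that suggestion, here is a hedged sketch of linking the clusters back to words for a per-cluster word cloud. It relies on group_by() + summarise() returning one row per tweet id in sorted order, which matches the row order of vectorized_docs:
# recover the tweet ids in the order used to build vectorized_docs
cluster_ids <- vectorized_words %>%
  drop_na() %>%
  group_by(id) %>%
  summarise(n_words = n()) %>%
  mutate(cluster = tweet_km$cluster)   # attach the k-means labels

# word counts for (say) cluster 1, then a cloud as before
cluster1_count <- tidy_dat %>%
  filter(tweetid %in% cluster_ids$id[cluster_ids$cluster == 1]) %>%
  count(clean_word) %>%
  arrange(-n)
wordcloud(cluster1_count$clean_word, cluster1_count$n, max.words = 100)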
As a last step, we’ll briefly look at using these embeddings in a machine learning model. We’ll build a model to try to predict the sentiment of a tweet (positive or negative) from its content. We’ll use a random forest model with the embeddings as features and the sentiment value as a label. We’ll first need to integrate our embedding data with the sentiment score we generated earlier. First, we’ll remake the average embedding values per tweet, but this time we’ll keep the tweet id.
vectorized_docs_ml <- vectorized_words %>%
drop_na() %>%
group_by(id) %>%
summarise_all(mean, na.rm = TRUE)
Next, we generate a mean sentiment score for each tweet, and convert it to a binary label (0 = negative, 1 = positive):
tidy_sentiment <- tidy_sentiment %>%
group_by(tweetid) %>%
summarize(value = mean(value)) %>%
mutate(sentiment = ifelse(value > 0, 1, 0))
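It’s worth a quick look at the class balance before modeling (the exact counts will depend on the cleaning steps above):
# how many tweets fall in each binary class?
table(tidy_sentiment$sentiment)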
Then we merge these two datasets together using the tweet id, and remove any columns we do not want to use in the ML model:
vectorized_docs_ml = inner_join(vectorized_docs_ml,
tidy_sentiment,
by = c("id" = "tweetid"))
vectorized_docs_ml = vectorized_docs_ml %>%
select(-id, -value) %>%
mutate(sentiment = as.factor(sentiment))
Now we’ll load the caret package. As the dataset is relatively large, we’ll use a different, more efficient package (ranger) to build the random forest model:
library(caret)
library(ranger)
Now form a training and test set (80/20 split):
train_id = createDataPartition(vectorized_docs_ml$sentiment, p = 0.8)
train = vectorized_docs_ml[train_id[[1]], ]
test = vectorized_docs_ml[-train_id[[1]], ]
Train the model:
fit_rf = ranger(sentiment ~ ., train)
Predict for the test dataset:
y_pred = predict(fit_rf, test)$predictions
And get the performance metrics:
confusionMatrix(test$sentiment, y_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1357 183
## 1 333 770
##
## Accuracy : 0.8048
## 95% CI : (0.7891, 0.8197)
## No Information Rate : 0.6394
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5907
##
## Mcnemar's Test P-Value : 5.404e-11
##
## Sensitivity : 0.8030
## Specificity : 0.8080
## Pos Pred Value : 0.8812
## Neg Pred Value : 0.6981
## Prevalence : 0.6394
## Detection Rate : 0.5134
## Detection Prevalence : 0.5827
## Balanced Accuracy : 0.8055
##
## 'Positive' Class : 0
##
There’s a lot of output here, but the key metric we’ll use is the accuracy, which is roughly 80%. This suggests that, given a tweet, we’d be able to predict its sentiment fairly well. This could be improved by tuning the model, using the continuous sentiment score rather than the 0/1 indicator and, of course, including more data.
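As a hedged sketch of the tuning idea, caret’s train() can search over ranger’s main hyperparameters with cross-validation. The grid values and fold count below are illustrative, and this can take a while to run on the full dataset:
# small illustrative tuning grid for ranger via caret
tune_grid <- expand.grid(mtry = c(5, 15, 30),
                         splitrule = "gini",
                         min.node.size = c(1, 5))
fit_tuned <- train(sentiment ~ ., data = train,
                   method = "ranger",
                   tuneGrid = tune_grid,
                   trControl = trainControl(method = "cv", number = 3))
fit_tuned$bestTune   # the best combination found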