Automating Wikipedia Article Categorization with R: A Step-by-Step Guide

Introduction to R and Wikipedia Article Categorization

Background and Motivation

In this article, we will explore the process of automatically categorizing Wikipedia articles using R. This task involves several steps, including data preparation, text processing, and clustering. We will use the tm package for text analysis and hclust for clustering.

The tm package provides a comprehensive set of tools for text mining in R, including functions for preprocessing, tokenization, stemming (via the SnowballC package), stopword removal, and more. The hclust function, from base R's stats package, performs hierarchical clustering, which lets us group similar articles based on their content.
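
As a quick taste of the tm API before we begin, here is a minimal standalone sketch of two of the transformations we will use later (the sentence is just a toy input):

library(tm)

# Stopword removal leaves gaps where the words used to be ...
cleaned <- removeWords("this is a short example sentence", stopwords("english"))

# ... which stripWhitespace then collapses into single spaces
stripWhitespace(cleaned)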

Prerequisites

To follow along with this article, you should have R installed, along with the following packages:

  • tm package (text mining)
  • stringi package (string processing; a standalone package, not part of tm)
  • proxy package (distance measures, including cosine distance)
  • ggplot2 package (optional, for visualization)

If you haven’t installed these packages yet, you can do so using the following commands:

install.packages("tm")
install.packages("stringi")
install.packages("proxy")
install.packages("ggplot2")

Section 1: Setting Up and Data Preparation

Loading Libraries and Setting Up Variables

First, let’s load the required libraries and set up our variables:

library(tm)
library(stringi)
library(proxy)

wiki <- "https://en.wikipedia.org/wiki/"
titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral",
           "Derivative", "Limit_of_a_sequence", "Edvard_Munch",
           "Vincent_van_Gogh", "Jan_Matejko", "Lev_Tolstoj",
           "Franz_Kafka", "J._R._R._Tolkien")

articles <- character(length(titles))

In this code, we load the tm, stringi, and proxy packages (three separate packages; stringi and proxy are not part of tm). We also set up a variable holding the base Wikipedia URL, a vector of article titles, and an empty character vector that will hold the downloaded texts.

Section 2: Reading and Preprocessing Articles

Next, let’s download the articles from Wikipedia. stri_paste builds each article’s URL, readLines fetches the raw HTML, and stri_flatten collapses the lines into a single string:

for (i in seq_along(titles)) {
  articles[i] <- stri_flatten(readLines(stri_paste(wiki, titles[i]), encoding = "UTF-8"),
                              collapse = " ")
}

In this code, we loop over the titles: stri_paste concatenates the base URL and the title, readLines downloads the raw HTML of the page, and stri_flatten collapses the lines into one long string (joined with spaces so that words on adjacent lines do not run together).
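
Because readLines aborts with an error if a page cannot be fetched, it is worth a quick sanity check that every article actually contains text. A minimal check using stringi:

stri_length(articles)   # character count of each downloaded page

Each entry should be a large number (the length of the raw HTML); a zero would point to a missed download.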

Section 3: Creating a Corpus Object

Now that we have our articles read in, let’s create a corpus object using the Corpus function:

docs <- Corpus(VectorSource(articles))

In this code, we use the Corpus function to create a corpus object from our vector of article texts.
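
It is also worth confirming that the corpus holds one document per article. length counts the documents, and tm's inspect prints them (verbose for raw HTML, so best used on a single document):

length(docs)       # should equal length(titles)
inspect(docs[1])   # peek at the first (still raw HTML) document

At this point the documents are still raw HTML; the next section cleans them up.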

Section 4: Preprocessing Texts

Next, let’s preprocess our texts using tm’s tm_map function together with a few stringi helpers:

docs2 <- tm_map(docs, content_transformer(function(x) stri_replace_all_regex(x, "<.+?>", " ")))
docs3 <- tm_map(docs2, content_transformer(function(x) stri_replace_all_fixed(x, "\t", " ")))
docs4 <- tm_map(docs3, content_transformer(tolower))
docs5 <- tm_map(docs4, stripWhitespace)
docs6 <- tm_map(docs5, removeWords, stopwords("english"))
docs7 <- tm_map(docs6, removePunctuation)

In this code, we chain tm_map calls to clean the texts: we strip HTML tags, replace tabs with spaces, convert everything to lowercase, collapse runs of whitespace, remove English stopwords, and remove punctuation. The custom stringi functions are wrapped in content_transformer so that tm_map returns proper corpus documents, and we lowercase before removing stopwords so that capitalized stopwords such as "The" are matched.
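
A quick peek at the cleaned text confirms the pipeline worked. stri_sub is stringi's substring function, and as.character extracts a document's text:

stri_sub(as.character(docs7[[1]]), 1, 200)   # first 200 characters of the cleaned article

You should see lowercase words with no HTML tags or punctuation.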

Section 5: Creating a Document-Term Matrix

Now that we have preprocessed our texts, let’s create a document-term matrix (one row per article, one column per distinct term) using the DocumentTermMatrix function:

docsTDM <- DocumentTermMatrix(docs7)

In this code, the DocumentTermMatrix function builds a sparse document-term matrix of term frequencies from our cleaned corpus.
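
Before clustering, a couple of sanity checks on the matrix are useful. dim gives its documents-by-terms shape, and tm's findFreqTerms lists terms above a frequency threshold (the cutoff of 100 below is an arbitrary choice for illustration):

dim(docsTDM)                            # 11 documents x many thousands of terms
findFreqTerms(docsTDM, lowfreq = 100)   # terms occurring at least 100 times in total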

Section 6: Clustering Articles

Next, let’s cluster our articles using hierarchical clustering with the hclust function:

docsTDM2 <- as.matrix(docsTDM)
rownames(docsTDM2) <- titles
docsdissim <- proxy::dist(docsTDM2, method = "cosine")
h <- hclust(docsdissim, method = "ward.D2")

In this code, we first convert the sparse document-term matrix to an ordinary matrix and label its rows with the article titles (so the dendrogram leaves are readable). We then compute pairwise cosine distances between the articles with the dist function from the proxy package, which is why proxy was loaded, and finally run hierarchical clustering on the resulting distance object with hclust. Note that hclust expects a dist object rather than a plain matrix, and that the old method name "ward" is deprecated in favour of "ward.D2".
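
If you want flat group labels rather than a tree, cutree (from base R's stats package) cuts the dendrogram at a chosen number of clusters. With these articles, two clusters should roughly separate the mathematics pages from the painters and writers:

cutree(h, k = 2)   # named vector mapping each title to a cluster id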

Section 7: Visualizing Clusters

Finally, let’s visualize the article clusters as a dendrogram. Base R’s plot method for hclust objects is all we need here; ggplot2 (with an add-on such as ggdendro) can produce a fancier version, but it is optional:

plot(h)

The dendrogram shows the hierarchical structure of the clusters and, because we set the row names to the article titles, its leaves are labelled accordingly. The mathematics articles should end up in one branch and the painters and writers in another.
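
To highlight the grouping directly on the dendrogram, rect.hclust (also in base R's stats package) draws boxes around a chosen number of clusters on the current plot:

plot(h)
rect.hclust(h, k = 2, border = "red")   # outline the two main clusters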

That’s it! We have now successfully categorized Wikipedia articles using R.


Last modified on 2024-07-21