Creating Output CSV Files for Each Text File with the Same Name Using R

In this article, we explore how to write one output CSV file for each text file in a directory, giving each CSV file the same base name as its source file. We walk through the steps one at a time, using base R functions together with the tm text-mining package.

Introduction

R is a popular programming language used for data analysis, statistical computing, and visualization. It has an extensive range of packages for various tasks, including text mining. The task here is a small text-mining one: compute word frequencies for each text file in a directory and save each result to a CSV file named after its source file.

Required Libraries and Packages

Before we begin, let’s ensure that we have the required libraries and packages installed. We will need the following:

  • pacman: a package management tool that installs and loads other packages
  • tm: a package for text mining
  • SnowballC: a package for word stemming
  • dplyr: a package for data manipulation (used here for the %>% pipe operator)

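We can install and load these packages with pacman's p_load() function, which installs any package that is missing before loading it. pacman itself has to be installed first:

if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(tm, SnowballC, dplyr)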

Reading Text Files and Creating Corpus

To create output CSV files for each text file with the same name, we first need to read the text files into R. We can use the dir() function to list all files in the current working directory whose names end in .txt:

infiles <- dir(pattern='\\.txt$')
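The listing step can be tried without touching real data. The following is a minimal sketch that builds a throwaway directory under tempdir(); the directory name txt_demo and the file names are made up for illustration.

```r
# Build a small throwaway directory so the example is self-contained
demo_dir <- file.path(tempdir(), "txt_demo")   # hypothetical directory name
dir.create(demo_dir, showWarnings = FALSE)
writeLines("first file",      file.path(demo_dir, "a.txt"))
writeLines("second file",     file.path(demo_dir, "b.txt"))
writeLines("not a text file", file.path(demo_dir, "notes.md"))

# dir() is an alias for list.files(); the pattern is a regular expression,
# so "\\.txt$" matches only names that end in ".txt".
# full.names = TRUE keeps the directory part of each path.
infiles <- dir(demo_dir, pattern = "\\.txt$", full.names = TRUE)
print(basename(infiles))
```

Using full.names = TRUE is a design choice: the returned paths can then be read from any working directory.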

Next, we can create a corpus from the text files using the Corpus() function with a DirSource, which treats each file in the given directory as one document:

docs <- Corpus(DirSource("D:\\PavanSOP\\txt"))

Preprocessing Text Data

Before we can create term-document matrices (TDMs) and calculate word frequencies, we need to preprocess the text data. We will remove punctuation, convert all text to lowercase, and remove stopwords.

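Here’s how we can do this. The variable bookJE is not defined above; it is assumed here to be a character vector holding the text of one file, for example as returned by readLines():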

corpusJE <- Corpus(VectorSource(bookJE)) %>% 
  tm_map(removePunctuation) %>% 
  #tm_map(removeNumbers) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  tm_map(stripWhitespace)

Creating Term-Document Matrices (TDMs)

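Now that we have preprocessed our text data, we can build the matrix using the DocumentTermMatrix() function. Strictly speaking, this produces a document-term matrix, the transpose of a term-document matrix: documents are the rows and terms are the columns. The removeSparseTerms() call then drops the sparsest terms; its second argument must lie strictly between 0 and 1, so the corpus needs to contain more than one document: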

tdmJE <- DocumentTermMatrix(corpusJE) %>% 
  removeSparseTerms(1 - (1/length(corpusJE)))
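To see what the preprocessing and the matrix look like, here is a minimal sketch on two made-up sentences. It assumes the tm package is installed, uses VCorpus() rather than Corpus() as a choice for this sketch, and leaves out removeSparseTerms(); the names texts, corpus, dtm and m are illustrative.

```r
library(tm)

# Two made-up documents
texts <- c("The cat sat on the mat.", "The dog chased the cat!")

# Same cleaning steps as in the article, written without the pipe
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Documents become rows, terms become columns
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)
print(m)
```

Each row of m is one document and each column is one term, which is why the later frequency step sums over columns.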

Calculating Word Frequencies

To calculate word frequencies, we sum each column of the matrix to get the total number of occurrences of each term, then sort these totals in descending order with the sort() function:

word.freqJE <- sort(colSums(as.matrix(tdmJE)), decreasing = TRUE)
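The column-sum-then-sort step can be illustrated with plain base R. This is a small sketch on a hand-made count matrix standing in for the real one; the documents, terms and counts are invented.

```r
# A made-up 2 x 3 count matrix: rows are documents, columns are terms
m <- matrix(c(2, 0, 1,
              2, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"), c("cat", "dog", "mat")))

# colSums() gives the total count per term; sort() orders the totals
word.freq <- sort(colSums(m), decreasing = TRUE)
print(word.freq)
```

The result is a named numeric vector, so the words travel along with their counts.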

Creating Frequency Tables

Now that we have calculated the word frequencies, we can create a frequency table using the data.frame() function. The relative frequency of a word is its count divided by the total count of all words:

tableJE <- data.frame(word = names(word.freqJE), 
                      absolute.frequency = word.freqJE, 
                      relative.frequency = word.freqJE/sum(word.freqJE))

Removing Words from Row Names

Because word.freqJE is a named vector, data.frame() copies the words into the row names as well as into the word column. We can drop this redundant copy by resetting the row names with the rownames() function:

rownames(tableJE) <- NULL
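Both the table construction and the row-name reset can be checked on a tiny example. This sketch uses made-up counts and assumes that relative frequency means each word's share of all counted words, so the column sums to 1; the names word.freq and tab are illustrative.

```r
# Made-up word counts as a named vector
word.freq <- c(cat = 4, dog = 3, mat = 1)

tab <- data.frame(word = names(word.freq),
                  absolute.frequency = word.freq,
                  relative.frequency = word.freq / sum(word.freq))

# The row names are inherited from the names of the vector
print(rownames(tab))

# Resetting them leaves the default integer sequence
rownames(tab) <- NULL
print(tab)
```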

Saving Frequency Tables as CSV Files

Now that we have created our frequency table, we can save it as a CSV file using the write.table() function, deriving the output name from the input name with sub():

write.table(x = tableJE, file = sub("\\.txt$", ".csv", file), quote = FALSE, sep = ",", row.names = FALSE)

Note that this will not work as written, because no variable named “file” holding the input filename has been defined. This is the remaining problem to solve.
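The renaming and writing steps can be tried in isolation. The following is a minimal sketch, assuming a hypothetical input name chapter1.txt and a made-up two-row table; it writes into a temporary directory so nothing else is touched.

```r
# Throwaway output directory (hypothetical name)
out_dir <- file.path(tempdir(), "csv_demo")
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical input file name; only the name is needed here
txt_file <- file.path(out_dir, "chapter1.txt")

# sub() swaps the trailing ".txt" for ".csv"
csv_file <- sub("\\.txt$", ".csv", txt_file)

# Made-up frequency table
tab <- data.frame(word = c("cat", "dog"), absolute.frequency = c(4, 3))

# The data frame goes in x, the output path in file
write.table(x = tab, file = csv_file, quote = FALSE, sep = ",",
            row.names = FALSE)

print(readLines(csv_file))
```

Using sep = "," without a trailing space keeps the output readable by read.csv() and by spreadsheet tools.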

How to Extract Filename and Save it as CSV?

What remains is to take the name of each text file and save its word frequencies to a CSV file with the same name.

To do this, we wrap the steps above in a function, change.files, that takes the name of a text file as its argument:

change.files <- function(file){
  # ... (same code as before)
  
  # Write a CSV file named after the input text file
  write.table(x = tableJE, quote = FALSE, sep = ",", 
              file = sub("\\.txt$", ".csv", file), 
              col.names = TRUE, row.names = FALSE)
}

In this function, sub() replaces the .txt extension of the input filename with .csv, so the word frequencies for each text file are saved to a CSV file with the same base name.
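Putting the pieces together, one possible complete version of the function is sketched below. It is an illustration rather than a definitive implementation: it assumes the tm package is installed, reads each file with readLines() so that every line becomes one document, uses VCorpus() instead of Corpus(), leaves out removeSparseTerms(), and computes relative frequency as a share of the total count. The demo directory wordfreq_demo and the file story.txt are made up.

```r
library(tm)

# Sketch: run the whole pipeline for one text file and write a CSV beside it
change.files <- function(file) {
  text <- readLines(file, warn = FALSE)        # one document per line

  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)

  # Total count per term, most frequent first
  counts <- colSums(as.matrix(DocumentTermMatrix(corpus)))
  word.freq <- sort(counts, decreasing = TRUE)

  tab <- data.frame(word = names(word.freq),
                    absolute.frequency = as.vector(word.freq),
                    relative.frequency = as.vector(word.freq) / sum(word.freq))

  # Same name as the input, with .csv in place of .txt
  out <- sub("\\.txt$", ".csv", file)
  write.table(x = tab, file = out, quote = FALSE, sep = ",",
              row.names = FALSE)
  invisible(out)
}

# Demo on a throwaway directory with one made-up file
demo_dir <- file.path(tempdir(), "wordfreq_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("The cat sat on the mat.", "The dog chased the cat!"),
           file.path(demo_dir, "story.txt"))

infiles <- dir(demo_dir, pattern = "\\.txt$", full.names = TRUE)
outfiles <- vapply(infiles, change.files, character(1))
print(read.csv(outfiles[[1]]))
```

Applying the function over infiles with vapply() (or lapply()) is what produces one CSV file per text file; as.vector() strips the names from the counts so the table needs no row-name reset afterwards.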

Conclusion

Creating output CSV files for each text file with the same name in a directory is a relatively simple task that can be achieved with base R functions and the tm package. By following the steps outlined in this article, you should be able to accomplish this task with ease.

Remember to install the required libraries and packages, read your text files into R, preprocess your text data, create term-document matrices (TDMs), calculate word frequencies, create frequency tables, remove words from row names, and save the output as CSV files using the write.table() function.


Last modified on 2023-09-06