HDP's Virtual Box Image Modified for STAT 430: Big Data Analysis Foundations

Background

Great news everyone! Next semester, a course that I proposed about a year ago is going to be offered as a Methods of Applied Statistics course (STAT 430) at the University of Illinois at Urbana-Champaign, taught by Darren Glosemeyer. This offering is acting as a trial run of the course I proposed, STAT 490 - Big Data Analysis Foundations. The goal behind the course was to orient students to view large data in a more reasonable scope.

More specifically, the goal is fulfilled by pulling back the mythical curtain on the application of the MapReduce algorithm, using Hadoop for historical data and Storm for real-time data. This is further accompanied by the use of Pig, Hive, and HBase. The course also covers different iterative estimation methods and data manipulation techniques. Lastly, the course looks at providing visualizations for big data.

Whether all of these subgoals, as stated in the initial proposal, are achieved in the first offering of the course remains to be seen, given how new these technologies are. Regardless, I'm very excited that this course is seeing the light of day. One day soon, I hope to be able to teach this course over the summer if my schedule allows.

UIUC's Modified HDP 2.2 Image

Over the last month, the image that will be used by students in STAT 430 - Big Data Analysis Foundations has been completed. Unfortunately, due to issues with the resources available to virtual environments at UIUC, students will not be able to use this image within a virtual environment supported by ATLAS; initial tests of very basic code took considerably longer than one would like.

As a result, students are encouraged to download the virtual image and adjust it for their machine.

Click here to download the virtual image via Box.

The virtual image is a modified version of Hortonworks' Sandbox with HDP 2.2 that contains preinstalled and configured versions of RStudio Server and Revolution Analytics' rmr2.

The instructions for creating this image are available.

Using the ff package to load the Kaggle Criteo data set into R

Intro

The goal of this post is to demonstrate how to load the Criteo data set associated with the Kaggle competition into R using the ff and ffbase packages. I'll also present a way to model the data using the biglm package, which requires you to clean the data before running the modeling command.

In order to proceed with this guide, you will need a computer with AT LEAST 4 gigs of RAM. Preferably, you should also save the data to a magnetic hard drive and NOT a solid state drive (SSD). This remark comes from personal experience of quickly burning out an SSD when manipulating big data.

Background

The size of the dataset is about 10 gigs with 45,840,617 observations and 40 variables. Yes, you read that correctly… The data set has over 45 million observations!
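To get a feel for why this will not fit comfortably in memory as an ordinary data.frame, here is a rough back-of-the-envelope calculation. This is only a sketch: it assumes 4 bytes per integer or factor code and ignores factor level tables and any copies R makes during import, so real usage will be higher.

# Rough lower bound on in-memory size as a plain data.frame
rows = 45840617
cols = 40            # 14 integer columns + 26 factor columns
rows * cols * 4 / 1024^3
# Roughly 6.8 gigs, before accounting for any temporary copies

Hence the appeal of ff, which keeps the data on disk and only pulls chunks into RAM as needed.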

A note…

The file paths provided below correspond to the storage drive I use for working with big data. You may not have an F:/ drive. You may need to place the data on your C:/ drive or within a specific volume on OS X or Linux.

As a result, before running the script, please change the file paths so that they are relative to your machine!

Load Script

# Any package that is required by the script below is given here:
# Check to see if packages are installed, if not install.
inst_pkgs = load_pkgs =  c("ff","ffbase","biglm")
inst_pkgs = inst_pkgs[!(inst_pkgs %in% installed.packages()[,"Package"])]
if(length(inst_pkgs)) install.packages(inst_pkgs)

# Dynamically load packages
pkgs_loaded = lapply(load_pkgs, require, character.only=T)

# Set Working Directory to where big data is
setwd("F:/bigdata/kaggle_criteo/dac/")

# Check temporary directory ff will write to (avoid placing on a drive with SSD)
getOption("fftempdir")

# Set new temporary directory
options(fftempdir = "F:/bigdata/kaggle_criteo/dac/temp")

# Load in the big data
ffx = read.table.ffdf(file="train.txt", # File Name
                      sep="\t",         # Tab separator is used
                      header=FALSE,     # No variable names are included in the file
                      fill = TRUE,      # Pad rows of unequal length with NA
                      colClasses = c(rep("integer",14),rep("factor",26)) 
                      # Specify the import type of the data
                      )

# Assign names to the columns
colnames(ffx) = c("Label",paste0("I",1:13),paste0("C",1:26))

Quicker load on subsequent runs

Instead of recreating the ffdf object each time we open the workspace, we opt to save the ffdf object using:

# Export created R Object by saving files 
ffsave(ffx, # ffdf object
       file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Permanent Storage location
       # Last name in the path is the name for the file you want 
       # e.g. ffdac.Rdata and ffdac.ff etc.
       rootpath="F:/bigdata/kaggle_criteo/dac/temp")     # Temporary write directory
       # where data was initially loaded via the options(fftempdir) statement

After the ffdf object has been saved, we are able to open the ffdf object using:

# Load Data R object on subsequent runs (saves ~ 20 mins)
ffload(file="F:/bigdata/kaggle_criteo/dac/ffdata/ffdac", # Load data from archive
       overwrite = TRUE) # Overwrite any existing files with new data

Note, if we modify the ffdf object (e.g. via data cleaning), we need to RESAVE the object! Otherwise, our modifications will not be stored in the permanent directory and will only exist for the duration of the R session.
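As a hedged illustration of such a modification (zero-imputation is used here purely for demonstration, not as a recommended cleaning strategy), here is one way to fix up a single integer column chunk-by-chunk using ff's chunk() idiom:

# Illustration only: replace missing values in I1 with 0, one chunk at a time
for (idx in chunk(ffx)) {
  v = ffx$I1[idx]      # read one chunk of the column into RAM
  v[is.na(v)] = 0      # clean the in-memory chunk
  ffx$I1[idx] = v      # write the cleaned chunk back to the ff column
}

After a modification like this, re-run the ffsave() call shown above so the change persists in the permanent storage location.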

Sample modeling using ffbase's biglm hook.

One of the nice benefits of using ffbase is the many options it provides for working with ff data. In particular, there is a wrapper that allows us to feed information into the biglm package without having to manually chunk the data or convert the ffdf object into an in-memory data frame.

# Model
# Get predictor variable names (I1-I13 plus the first three categorical variables C1-C3)
data_variables = colnames(ffx)[c(-1,-(18:40))]

# Create model formula statement
model_formula = as.formula(paste0("Label ~", paste0(data_variables, collapse="+")))

## YOU MUST CLEAN THE DATA BEFORE RUNNING THE REGRESSION! RUNNING THE REGRESSION WITH MISSING VALUES WILL YIELD AN ETA ERROR!

# Use a modified version of bigglm so that bigglm will not try to convert to a regular data.frame
model_out = bigglm.ffdf(model_formula, family=binomial(), data=ffx,chunksize=100, na.action=na.exclude)
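Assuming the data have been cleaned and the fit succeeds, the usual biglm extractors should apply, e.g.:

# Inspect the fitted model
summary(model_out)   # coefficient table with standard errors
coef(model_out)      # coefficient estimates only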

Installing the bigmemory package and biglm on Windows, OS X, or Linux.

Intro

The objective of this post is to install the bigmemory package along with its companion packages biganalytics and bigalgebra, as well as the closely related biglm package.

Installing bigmemory on Windows

The main reason this post is being written is the trouble many students are having installing bigmemory on Windows. In the last posting on their website, the bigmemory developers note that Windows support is lacking and that they were working on ways to update it.

What this means is that the CRAN version of bigmemory only works for OS X and Linux at the time of this writing.

To get around this, we download and install the source packages hosted on the project's R-Forge page:

# Install Bigmemory packages from r-forge page
install.packages("bigmemory", repos="http://R-Forge.R-project.org")
install.packages("bigmemory.sri", repos="http://R-Forge.R-project.org")
install.packages("biganalytics", repos="http://R-Forge.R-project.org")
install.packages("bigalgebra", repos="http://R-Forge.R-project.org")
# Install Boost Header and biglm package from CRAN
install.packages(c("BH","biglm"))

Installing bigmemory on OS X or Linux

To install bigmemory on OS X or Linux all you need to do is issue the following install command:

# Install the bigmemory family of packages from CRAN
install.packages(c("bigmemory","bigmemory.sri", "biganalytics", "bigalgebra","BH","biglm"))

Installing RStudio Server and the rmr2 (RHadoop) R package on Hortonworks' Virtual Box Image

Intro

This guide demonstrates how to install RStudio Server on Hortonworks' Virtual Box image (HDP) and set up the rmr2 R package from the RHadoop project. You should have already installed Oracle's Virtual Box and loaded the Hortonworks Virtual Box image into Virtual Box.

The commands below use sudo very liberally, given that by default you will access the shell as root. sudo is included in the event your enterprise has limited your access to root.

Port Forwarding on Virtual Box

Before we start the Virtual Box image, we must first forward a port so that we can access RStudio via our browser, in a similar vein to accessing Hortonworks' web interface.

Access the Setting Options via:

Virtual Box Image Access Settings

Open port forwarding by going to Network and then selecting Port Forwarding:

Virtual Box Network Menu for Port Forwarding

Creating a new port forwarding rule:

Virtual Box create new port forwarding rule

Enter the following for the new rule:

  • Name: rstudioserver
  • Protocol: TCP
  • Host IP: 127.0.0.1
  • Host Port: 8787
  • Guest Port: 8787

Virtual Box forwarding information for new rule

Accept the changes:

Virtual Box accept changes

Start the Hortonworks Virtual Box image:

Virtual Box start image

Installing RStudio Server on Hortonworks' image (based on CentOS 6)

Now, start the Virtual Box image, access the terminal within it, and run the following commands:

# Step 1: Install Necessary Programs
# R, wget (Download File), openssl098e (SSL), vim (text editor)
# git (Version Control), curl (command line protocols for http)
# sudo allows the execution to happen as a root user (lots of power)
# yum is a package manager for CentOS (Red Hat)
# The -y means yes to all.
sudo yum -y install R git wget openssl098e vim curl

# Step 2: Download RStudio Server
wget -O /tmp/rstudio-server-0.98.1091-x86_64.rpm http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm

# Step 3: Install RStudio Server
# --nogpgcheck temporarily disables the signature check of the package.
sudo yum -y install --nogpgcheck /tmp/rstudio-server-0.98.1091-x86_64.rpm

# Step 4: Any issues with the install?
sudo rstudio-server verify-installation

# Step 5: Add a user to log in to RStudio
sudo adduser rstudio
sudo passwd rstudio
# New password: <your password here> (I chose rstudio)

Now, you should be able to log in to R Studio Server using 127.0.0.1:8787 or localhost:8787 in your browser's URL bar.

Login page

Access R Studio via your browser by entering 127.0.0.1:8787

What R Studio looks like in your web browser (can you tell any difference vs. the client?):

After logging in with rstudio user

When you are done using R Studio, make sure you use: q() to exit the client. This will prevent the following error message from being displayed on subsequent starts…

21 Nov 2014 14:24:38 [rsession-rstudio] ERROR session hadabend;
LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&)
/root/rstudio/src/cpp/session/SessionMain.cpp:1692

In the event RStudio-Server is not able to start on a subsequent run of the image, log into shell and use the following two commands:

# Stop any rstudio-server process
sudo rstudio-server stop

# Start a new rstudio-server process
sudo rstudio-server start

Turning the virtual machine off

To turn off the HDP Virtual Box Image, we can either: "Save the machine state" or exit via shell. Both of these methods are safe and should avoid the RStudio-Server spawning issue. To exit via shell use:

# Stop the virtual image
init 0

Setting up rmr2 (RHadoop)

In order to install and use RHadoop, we must first set two bash variables, install some supporting R packages, and then install rmr2 itself.

Bash Variable Initialization

There are two ways to let R know where certain Hadoop components are located; which one you need depends on whether you access the R session through RStudio Server or from the command line. By initializing the variables in a file, we avoid having to set them each time we start R. Therefore, we will edit R_HOME/etc/Renviron and /etc/profile. Note: R_HOME is the location where R is installed and is given by the R command R.home().
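As a side note, for a single throwaway session you could instead set the variables from within R itself via Sys.setenv(); the file-based approach below is what makes the settings permanent. The HDP 2.2 streaming path used here is the one identified later in this section.

# Per-session alternative; must be re-run every time R starts
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar")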

For the first variable, HADOOP_STREAMING, obtaining the path is a bit more complicated and depends on the HDP version you are using. Specifically, I've identified the following cases: HDP 2.0 - HDP 2.1, HDP 2.2, and future-proofing against later releases. Before we begin, figure out the version number of the hadoop streaming file by using:

# Search for where it is located via:
find / -name 'hadoop-streaming*.jar'

On HDP 2.2, you should receive:


/usr/hdp/2.2.0.0-2041/oozie/share/lib/mapreduce-streaming/hadoop-streaming-2.6.0.2.2.0.0-2041.jar
/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.0.0-2041.jar
/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar

The file path should begin with /usr/lib/hadoop-mapreduce/... or /usr/hdp/<current version>/hadoop-mapreduce/...

Write down the version number at the end of this file. You will need it for the next steps.

Bash variable initialization for RStudio-Server

We set the variables such that RStudio Server is able to recognize them by modifying the Renviron file that is loaded during startup.

This is located at:

R_HOME/etc/Renviron

To obtain R_HOME open R via terminal and use: R.home()

The file should be at either:

/usr/lib/R/etc/Renviron

Or:

/usr/lib64/R/etc/Renviron

Access the file:

sudo vim R_HOME/etc/Renviron

To be able to write in the file, press the Insert key on your keyboard. Use the down arrow or Page Down keys to get to the end of the file. Then, add the following lines at the end:

# All versions
HADOOP_CMD='/usr/bin/hadoop'

# For HDP 2.0 - 2.1 (a symbolic link exists), use:
HADOOP_STREAMING='/usr/lib/hadoop-mapreduce/hadoop-streaming.jar'

# For HDP 2.2, use instead:
HADOOP_STREAMING='/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar'

Press Escape and type :wq to save the file.

Open R Studio Server web interface and execute:

Sys.getenv("HADOOP_CMD")
Sys.getenv("HADOOP_STREAMING")

This should verify the file paths have been set.

Bash variable initialization for Command Line R

Now, we will open /etc/profile to write the information to the file:

sudo vim /etc/profile

Before you start writing file paths…

NOTE: Do not fight Linux! Use tab to autocomplete words to decrease the amount you need to type in.

Pro tip: Try to use tab when the path is not ambiguous (e.g. not on hadoop-)

For version HDP version 2.0 - HDP 2.1, use:

# Set the HADOOP_STREAMING variable for HDP 2.0 and HDP 2.1: 
export HADOOP_STREAMING=/usr/lib/hadoop-mapreduce/hadoop-streaming.jar  

For version HDP version 2.2, use:

# Set the HADOOP_STREAMING variable for HDP 2.2 Preview: 
export HADOOP_STREAMING=/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/hadoop-streaming.jar  

The second variable, HADOOP_CMD, is straightforward to set on all HDP versions.

# Set the HADOOP_CMD bash variable
export HADOOP_CMD=/usr/bin/hadoop

Save file via :wq

To check to see whether the variables were set write the following in terminal:

echo $HADOOP_CMD
echo $HADOOP_STREAMING

Installing R Packages

First, open R using:

sudo R

Using the sudo prefix here is VERY important since it will place the packages in the system-wide library (e.g. /usr/lib64/R/library) instead of a user-specific library.

Then within R type:

install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","itertools","reshape2","stringr","plyr","caTools"),
                 repos='http://cran.revolutionanalytics.com')

# Exit R
q()

Installing rmr2 and other Revolution Analytics Hadoop-to-R technologies

If you are interested in obtaining the latest version of rmr2 or other Hadoop to R technology, then check out the official download page on github for Revolution Analytics software packages.

To complete the guide, we will install rmr2 from the shell (i.e. not in R):

# Download the latest version
wget -O /tmp/rmr2_3.3.0.tar.gz https://github.com/RevolutionAnalytics/rmr2/raw/master/build/rmr2_3.3.0.tar.gz

# Trigger install via shell R command. 
# Make sure to use sudo to place in system library!
sudo R CMD INSTALL /tmp/rmr2_3.3.0.tar.gz

We will also need to set up a folder to store logs and ensure we have read and write privileges to it.

Note: rstudio is the username I chose. If you picked a different username, substitute it wherever you see rstudio in this step.

# Create the log file directory recursively
# (i.e. if any directory in the path is missing, create it)
mkdir -p /var/log/hadoop/rstudio/

# !!THIS NEXT COMMAND IS VERY DANGEROUS FOR A PRODUCTION ENVIRONMENT!!
# A better solution is to remap where the hadoop logs are sent.
# e.g. modify hadoop-env.sh by adding to the end of the file
# the line: export HADOOP_LOG_DIR=<Your Location>

# Allow ANYONE to write to any files within the directory
chmod -R 777 /var/log/hadoop/rstudio

Quick Check

Here is a quick way to check whether the package has been set up correctly.

Note: The following is a modification of the first example on the rmr2 tutorial page on GitHub.

Here is some basic R code:

# Create an R vector with values ranging from 1 to 1000
small.ints = 1:1000
# Square each element of small.ints
sapply(small.ints, function(x) x^2)

Here is the equivalent code, still written in R, but executed as a MapReduce job via Hadoop:

# Load the package
require('rmr2')

# Write an R object to the hdfs backend
small.ints = to.dfs(1:1000)

# Create a map reduce job using data on the hdfs backend
small.ints.job =  mapreduce(
                  input = small.ints, 
                  map = function(k, v) cbind(v, v^2))

# Retrieve the results from hdfs
small.ints.df = from.dfs(small.ints.job)

# Results will be in a list form with the list structured:
# the $key [not supplied, so it'll be null]
# the $val (values)

# Display the top 6 observations from results
head(small.ints.df$val)

For some fun, check out these tutorials on using Hadoop within R!

Reproducible Research, R, R Studio, and You! (Also, UIUC Beamer Theme...)

A long long time ago, in a galaxy far far away, I was asked to give a talk about reproducible research in the R ecosystem during the University of Illinois at Urbana-Champaign's Department of Statistics Friday Brownbag series. The talk was a long time coming; in fact, it was rescheduled twice, first because a guest economics professor was slotted in for a lecture and then because I was struck with the worst cold of my life.

Last Friday, the stars aligned and I was able to give the presentation. I focused the presentation on removing point-and-click data modification, copy-and-paste transfer of tables and graphs, and documentation duplication by advocating an automated approach to data cleaning, data analysis, and presentation generation. Since the standard at UIUC is to use R and RStudio, the presentation focuses on using these two tools to create a workflow that enables reproducible research. Other software that I talk about includes git for version control (plus GitHub) and pandoc for document conversion. Beyond that, I recommend the following R packages: knitr for dynamic documents, rmarkdown for an output-agnostic file format, xtable for converting R objects into LaTeX or HTML tables, and roxygen2 for turning inline documentation comments into .Rd documentation.

If this is interesting, I recommend checking out the slides and r code used to create the presentation here: Reproducible Research via R, LaTeX, and knitr (R Code).

During the course of preparing my last two talks, I was a bit disappointed that UIUC lacked a beamer color theme. Previously, I had obtained a UIUC-oriented theme from a professor, but I thought I could do better. I modified the theme a bit so that it looks like: UIUC Beamer Theme (sample). If you are interested in using the theme, download: Illinois I Logo and .Rnw File. For those of you who are googling "UIUC Beamer Theme" and wondering "why is there no .tex?": note that .Rnw is a form of .tex. The main difference is that .Rnw allows R code to be embedded in it. Removing the code chunks (delimited by <<>>= and @) will revert the file to a normal .tex file.
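For the curious, a minimal (made-up) knitr chunk inside an .Rnw file looks like the following; strip out everything between <<...>>= and @ and you are left with ordinary LaTeX:

% Ordinary LaTeX text sits above and below the chunk
<<example-chunk, echo=TRUE>>=
# R code evaluated when the .Rnw file is knitted
summary(cars)
@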

Till next time....

Talkin' Bout A Big Data Revolution

Big Data is here and there is no turning back.

No longer are there whispers about Big Data... Big Data is being yelled from the mountain tops as a result of the complexity of the NSA programs. Not only that, but students are rising up to demand Big Data courses (e.g. hadoop, storm, pig, hive, etc.) be offered so that they are more attractive as a potential new hire.

Companies are appreciating the additional pressure students are exerting to be trained in Big Data due to the vast troves of data they possess on each individual. They are desperately seeking to capitalize on these data portfolios they have developed instead of letting them bit rot in a dark server room. But, are the companies actually performing a Big Data analysis? Or is it just another "Oooh, they're doing it... So, we're doing it! Even though, we're not" 'cause you want to be in the cool kids club 'n stuff.

To describe this new struggle, the term "Jabberwocky" comes to mind. "Jabberwocky" was used in Better Off Ted (S1E12). To companies and students, this is the next thing, this is something new, this is tomorrow land. But, it really isn't. The fact that "Big Data" can be broken down into keywords such as velocity, volume and variety has helped decrease the activation threshold for what constitutes "Big Data."

Truth be told, "Big Data" was available back in the 1980s (e.g. AA loyalty programs & preferred shoppers cards) and since then no one has really gotten a clue about "Big Data." In fact, the theory and hardware behind it is actively being developed as we speak. In the interim, we try to extend the modeling paradigm by creating data structures that enable us to look at high volumes of data (large n) and globs of predictors (large p).

Recently, I gave a talk at the Big Data and Analytics Council @ UIUC that centered around performing supervised learning (primarily logistic regression), using large amounts of data, and R. The slides are here: Supervised Learning with Big Data using R (R Code). Through the talk, the main focus was showing the current modeling paradigm and how it can extend into working with large n through R's bigmemory, biganalytics, and biglm packages.

STAT 426 - Take Home Midterm 3

The Take Home Midterm 3 was made available on Friday, May 2nd, 2014 and was due on May 16th, 2014.

The following is the text framing the midterm examination:

Exam 3
Statistics 426
Due May 16, 2014

This exam will involve analysis of the bacteria dataset from the MASS package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

The aim is to model the presence of the bacteria H. influenzae in children with a middle ear infection in the Northern Territory of Australia. Specifically, the aims are listed below:

  1. Fit regression models to uncover treatment effects on presence of H. influenzae, that consider possible effects due to treatment compliance and the timing of the test (week).
  2. Perform this analysis using generalized linear mixed models, generalized estimating equations, and transition models. Compare and contrast the three approaches, and state how they compare with regard to estimated treatment effects. Note that for transition models, some work might need to be done to get the dataset in shape to run with the glm() function.

Outline for reports

  • Introduction Do a little reading so you can give some background information on the bacteria and otitis media. Paraphrase the main objectives of your analysis.
  • The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses.
  • Methods Discuss your methods for developing your models, such as variable coding, utilization of interactions (if needed), selection of covariance structure (in gee case), and so forth.
  • Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and a comparison of the three techniques for modeling (gee, glmm,transition).
  • Summary Give a very brief summary of what you have found in a non-technical manner
  • Appendix Provide the R-code that you used to conduct your analysis.

The solution posted here is the exam that I turned in on the due date.

The Midterm 3 solution for STAT 426 can be found here: STAT 426 Midterm 3 (PDF)

The file is created using knitr, which allows R code to be intertwined with LaTeX:
STAT 426 Midterm 3 (RNW)

STAT 426 - Take Home Midterm 2

The take home midterm 2 was made available on April 9th, 2014 and was due on May 1st, 2014.

The following is the text framing the midterm examination:

Exam 2
Statistics 426
Due May 1, 2014

This exam will involve analysis of the pima dataset from the faraway package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

Below are the main issues to address with your analysis:

  1. Determine a regression model that accurately predicts the outcome of the diabetes test, which is indicated by the variable test, a test for diabetes done 5 years after the other variables were measured on a sample not having diabetes initially. Produce 2 such models, one that includes the predictor glucose and one that does not.
  2. Using the two models from the first aim, construct rules based on these models for assigning a positive or negative diagnosis. Taking test as the true status after 5 years, estimate the sensitivity, specificity, and predictive value of your rules. Try to define the rule (cut-off value) in such a way that the pair (sensitivity, specificity) is as close as possible to the point (1,1).

Outline for reports

  1. Introduction Do a little reading, maybe simply on Wikipedia, so you can give some background information on diabetes, Pima Indians, and the potential significance of the predictors in your dataset. Paraphrase the main objectives of your analysis.
  2. The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses. Perhaps do some bivariate analyses, such as smoothing techniques to see how test is functionally related to the predictors.
  3. Methods Discuss your methods for selecting variables and a link function for your models, and assessing the diagnostic properties for the two rules (one with glucose and one without) you choose for diagnosis. Comment on the leverage and influence of cases and how you dealt with this.
  4. Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and the role each predictor plays.
  5. Summary Give a very brief summary of what you have found in a non-technical manner
  6. Appendix Provide the R-code that you used to conduct your analysis.

The solution posted here is the exam that I turned in on the due date.

The Midterm 2 solution for STAT 426 can be found here: STAT 426 Midterm 2 (PDF)

The file is created using knitr, which allows R code to be intertwined with LaTeX:
STAT 426 Midterm 2 (RNW)

Setting up RStudio to work with RcppArmadillo

The Motivation

Over the past week, I've been working on converting R scripts into scripts that hook into R's C++ API using RcppArmadillo. The common reaction when I mentioned the project was, "Huh? Why would you bother converting from the R language into C++?" Then, I showed them the benchmarks. Needless to say, I now have several folks who want to learn how to write scripts using RcppArmadillo. Please note, this post is meant as the introduction to a series of posts on writing Rcpp/RcppArmadillo code. If you do not have basic C++ knowledge, please consider acquiring it before proceeding.

The Setup

Before we go further, let's talk about the basic work environment for RcppArmadillo. First, make sure you install the following packages: RcppArmadillo, Rcpp, inline, and rbenchmark.

install.packages("Rcpp","RcppArmadillo","inline","rbenchmark");

Development Environment

The typical development flow for code projects is to use an RStudio project. To create an Rcpp project from within RStudio, go through the normal project creation steps:

  1. File => New Project
  2. New Directory => R Package
  3. Click the Type: dropdown menu and select “Package w/ Rcpp”
    • By default it says, “Package”
  4. Fill in Package name
  5. Select appropriate directory for package
  6. Uncheck create a git repository for this project
    • Git is a version control system for code and is outside the scope of this tutorial.
  7. Press create project

Bold words indicate changes from the normal creation process.

If you create a new C++ file from the dropdown menu, note that in the upper right-hand corner of the code editor you no longer have “Run” and “Re-run the previous code region.” The only holdover from creating R scripts in the code editor is “Source,” which compiles the C++ script via the R console command sourceCpp().
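To make that concrete, here is a minimal, hypothetical RcppArmadillo source file (the file and function names are my own for illustration) that could be compiled by pressing “Source” or by calling sourceCpp() from the console:

// File: arma_crossprod.cpp (hypothetical example)
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat arma_crossprod(const arma::mat& X) {
  // Compute t(X) %*% X using Armadillo's matrix algebra
  return X.t() * X;
}

Once compiled (e.g. via Rcpp::sourceCpp("arma_crossprod.cpp")), arma_crossprod() is callable from the R session like any other function, e.g. arma_crossprod(matrix(rnorm(20), 5, 4)).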

In order to build your package using the Build tab, modifications need to be made to both the DESCRIPTION file and the NAMESPACE file.

The DESCRIPTION file should look like so:

Package: your package name here
Type: Package
Title: package title
Version: 1.0
Date: YYYY-MM-DD
Author: Your Name Here
Maintainer: Person to send complaints to <complain_to_me@gmail.com>
Description: Short description
License: MIT + file LICENSE
Imports: Rcpp (>= 0.11.1), RcppArmadillo (>= 0.4.200.0)
LinkingTo: Rcpp, RcppArmadillo
Depends: RcppArmadillo

Note, the main differences from the default file are the inclusion of RcppArmadillo in Imports, LinkingTo, and Depends!

Within the NAMESPACE file, make the following modifications:

useDynLib(packagename)
import(RcppArmadillo)
importFrom(Rcpp, evalCpp)
exportPattern("^[[:alpha:]]+")

The key is to add import(RcppArmadillo) to the file and substitute your package name in useDynLib(packagename).

With these modifications, RStudio will be able to handle RcppArmadillo based packages.

A short caveat to the above: avoid spaces or other special characters in the file path to the source file. This means you should not try to compile a source file with paths like the following:

C:/users/name/what up/did you/know spaces/are very/harmful to/rcpp files.cpp

or

C:/users/name/!@#$%^&*()-=+/rcppfile.cpp

Making a simple script

Please note, there are several alternatives to this workflow. For example, if you only want to outsource one function that is loop-intensive, then using the inline package or cppFunction() is preferred.

Here is an example creating an inline function declaration with cppFunction():

# In R
library("Rcpp")
 
# In R
cppFunction('
  // declare the return type, specify the function name and parameters
  int hello_world_rcpp(int n) {

    // C++ for loop
    for(int i = 0; i < n; i++){
      // prints to the R console, similar to print() or cat()
      Rcpp::Rcout << "Hello World!" << std::endl;
    }

    // send back the number of times hello world was said
    return n;
  }')

The R equivalent would be:

hello_world_r = function(n) {
    for (i in 1:n) {
        cat("Hello World!\n")  # newline added to mirror std::endl above
    }
    return(n)
}

Calling the function results in:

# In R
hello_world_rcpp(n = 2)
Hello World!
Hello World!
[1] 2

Why this wasn't futile…

In the beginning, I mentioned that the primary reason for converting from R scripts to R's C++ API was speed. To illustrate, note the following benchmarks created using library("rbenchmark").

library("rbenchmark")
benchmark(rcpp_v = hello_world_rcpp(2), r_v = hello_world_r(2))
##     test replications elapsed relative user.self sys.self user.child   sys.child
## 2    r_v          100    0.01       NA      0.02        0         NA          NA
## 1 rcpp_v          100    0.00       NA      0.00        0         NA          NA

So, we should prefer the rcpp implementation. Note, this may not always hold when you are echoing statements out to the console. There may be added lag using Rcpp out statements vs. R out statements. However, the looping procedures within Rcpp should be faster than looping procedures in R. Also, the output from this benchmark was suppressed. If you run this benchmark on your system, expect to receive 200 “Hello World!” and 200 returns of the number 2 in addition to the output table displayed above.