This guide demonstrates how to install RStudio Server on Hortonworks' Virtual Box Image (HDP) and set up the R package known as RHadoop. You should have already installed Oracle's Virtual Box and loaded the Hortonworks Virtual Box Image into Virtual Box.
The commands below use sudo liberally, even though by default you access the shell as root. sudo is listed in case your enterprise has limited your access to root.
Port Forwarding on Virtual Box
Before we start the Virtual Box image, we must first forward a port so that we can access R Studio via our browser, much as we access the Hortonworks web interface.
Access the Setting Options via:
Open port forwarding by going to Network and then selecting Port Forwarding:
Creating a new port forwarding rule:
Enter the following for the new rule:
- Name: rstudioserver
- Protocol: TCP
- Host IP: 127.0.0.1
- Host Port: 8787
- Guest Port: 8787
Accept the changes:
Start the Hortonworks Virtual Box image:
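The same rule can probably also be created from the host's command line with VirtualBox's VBoxManage tool, as long as the VM is powered off. The sketch below is a minimal example under assumptions: the VM name "Hortonworks Sandbox" is hypothetical (yours may differ), and the script only prints the command when VBoxManage is not on the PATH.

```shell
# Hypothetical VM name; list the real ones with: VBoxManage list vms
VM_NAME="Hortonworks Sandbox"

# Rule fields: name,protocol,host IP,host port,guest IP,guest port
# (the guest IP is left empty)
RULE="rstudioserver,tcp,127.0.0.1,8787,,8787"

if command -v VBoxManage >/dev/null 2>&1; then
  # The VM must be powered off for modifyvm to succeed
  VBoxManage modifyvm "$VM_NAME" --natpf1 "$RULE" \
    || echo "Could not add the rule; check the VM name and that the VM is off"
else
  echo "VBoxManage not found; would run:"
  echo "VBoxManage modifyvm \"$VM_NAME\" --natpf1 \"$RULE\""
fi
```

The natpf1 suffix refers to the first network adapter, which is assumed here to be the NAT adapter used for the other forwarded ports.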
Installing RStudio Server on Hortonworks' image (based on CentOS 6)
Now, within the Virtual Box image, open a terminal and run the following commands:
# Step 1: Install Necessary Programs
# R, wget (Download File), openssl098e (SSL), vim (text editor)
# git (Version Control), curl (command line protocols for http)
# sudo allows the execution to happen as a root user (lots of power)
# yum is a package manager for CentOS (Red Hat)
# The -y means yes to all.
sudo yum -y install R git wget openssl098e vim curl
# Step 3: Download R Studio
wget -O /tmp/rstudio-server-0.98.1091-x86_64.rpm http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm
# Step 4: Install R Studio Server
# --nogpgcheck temporarily disables the signature check of the package.
sudo yum -y install --nogpgcheck /tmp/rstudio-server-0.98.1091-x86_64.rpm
# Step 5: Any issues with the install?
sudo rstudio-server verify-installation
# Step 6: Add user to login to R Studio
sudo adduser rstudio
sudo passwd rstudio
# New password: <your password here> (I chose rstudio)
Now, you should be able to log in to R Studio Server using 127.0.0.1:8787 or localhost:8787 in your browser's URL bar.
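Before opening the browser, a quick check from a terminal can tell you whether anything is answering on the forwarded port. This is only a sketch: check_rstudio is a hypothetical helper name, and it reports whether the URL responds at all, not whether the response really comes from R Studio Server.

```shell
# Report whether anything answers HTTP at the given URL
check_rstudio() {
  url="${1:-http://127.0.0.1:8787}"
  # --max-time keeps the check from hanging if the port is filtered
  if curl --silent --output /dev/null --max-time 5 "$url"; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
  fi
}

check_rstudio
```

If the check reports unreachable, the port forwarding rule and the rstudio-server service are the first two things to look at.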
What R Studio looks like in your web browser (can you tell any difference vs. the client?):
When you are done using R Studio, make sure you use q() to exit the client. This will prevent the following error message from being displayed on subsequent starts:
21 Nov 2014 14:24:38 [rsession-rstudio] ERROR session hadabend;
LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&)
In the event RStudio-Server is not able to start on a subsequent run of the image, log into the shell and use the following two commands:
# Stop any rstudio-server process
sudo rstudio-server stop
# Start a new rstudio-server process
sudo rstudio-server start
Turning the virtual machine off
To turn off the HDP Virtual Box Image, we can either: "Save the machine state" or exit via shell. Both of these methods are safe and should avoid the RStudio-Server spawning issue. To exit via shell use:
# Stop the virtual image
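One common way to halt a CentOS guest from its shell is shutdown -h now. Treat that choice as an assumption rather than the exact command intended here. The sketch below only prints the command by default; the DRY_RUN switch is a hypothetical safety guard added for illustration.

```shell
# Assumption: a standard halt-now shutdown; not confirmed by this guide
stop_cmd="sudo shutdown -h now"

# Safety switch (hypothetical): prints by default, runs when DRY_RUN=0
DRY_RUN="${DRY_RUN:-1}"
if [ "$DRY_RUN" = "1" ]; then
  echo "Would run: $stop_cmd"
else
  $stop_cmd
fi
```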
Setting up rmr2 (RHadoop)
In order to install and use RHadoop, we must first set two bash variables, install some packages in R, and then install rmr2.
Bash Variable Initialization
There are two ways to let R know where certain Hadoop components are, depending on whether you access the R session through RStudio-Server or from the command line. By initializing the variables in a file, we avoid having to set them each time we start R. Therefore, we will have to edit
R_HOME is the location where R is installed and is given by the R command
For the first variable, HADOOP_STREAMING, obtaining the path is a bit more complicated and depends on the HDP version you are using. Specifically, I've identified three cases: HDP 2.0 to HDP 2.1, HDP 2.2, and future versions. Before we begin, find the version number of the hadoop streaming file by using:
# Search for where it is located via:
find / -name 'hadoop-streaming*.jar'
On HDP 2.2, you should receive:
The file path should begin with
Write down the version number at the end of this file. You will need it for the next steps.
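If you would rather not copy the version number by hand, it can be cut out of the file name with plain shell string handling. The path below is hypothetical and only there to make the sketch runnable; substitute whatever find printed on your image.

```shell
# Hypothetical path for illustration; substitute the one printed by find
jar_path="/some/dir/hadoop-streaming-1.2.3.jar"

# Drop the directory and the .jar extension
jar_file="$(basename "$jar_path" .jar)"

# Strip the fixed "hadoop-streaming-" prefix, leaving only the version
jar_version="${jar_file#hadoop-streaming-}"

echo "Streaming jar version: $jar_version"
```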
Bash variable initialization for RStudio-Server
We set the variables so that RStudio-Server is able to recognize them by modifying the Renviron file that is loaded during startup.
This is located at:
To find R_HOME, open R via terminal and use:
The file should be at either:
Access the file:
sudo vim R_HOME/etc/Renviron
To be able to type into the file, press the “insert” key on your keyboard. Use the “down arrow” or “page down” to get to the end of the file. Then, add the following two lines at the end:
# All versions
# Version 2.0-2.1 has a symbolic link
# Version HDP 2.2 use:
Press escape and type :wq to save the file.
Open R Studio Server web interface and execute:
This should verify the file paths have been set.
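As an alternative to editing in vim, the two entries could be appended from the shell. The sketch below writes to a temporary stand-in file so it is safe to run anywhere, and both paths are assumptions: /usr/bin/hadoop is a guess at the hadoop launcher location, and the streaming jar path is a placeholder for the one you located earlier.

```shell
# Stand-in for R_HOME/etc/Renviron so the sketch is safe to run anywhere
renviron="$(mktemp)"

# Assumed values: replace both with the paths found on your image
hadoop_cmd="/usr/bin/hadoop"
streaming_jar="/path/to/hadoop-streaming.jar"

# Renviron entries are plain NAME=value lines (no "export")
printf 'HADOOP_CMD=%s\nHADOOP_STREAMING=%s\n' \
  "$hadoop_cmd" "$streaming_jar" >> "$renviron"

# Show what was appended
grep '^HADOOP_' "$renviron"
```

To apply this to the real file, the printf output would need to be piped through sudo tee -a with the actual Renviron path, since a plain >> redirect does not run with sudo's privileges.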
Bash variable initialization for Command Line R
Now, we will open /etc/profile to write the information to the file:
sudo vim /etc/profile
Before you start writing file paths…
NOTE: Do not fight Linux! Use tab to autocomplete words to decrease the amount you need to type in.
Pro tip: Try to use tab when the path is not ambiguous (e.g. not on hadoop-)
For HDP 2.0 to HDP 2.1, use:
# Set the HADOOP_STREAMING variable for HDP 2.0 and HDP 2.1:
For HDP 2.2, use:
# Set the HADOOP_STREAMING variable for HDP 2.2 Preview:
The second variable, HADOOP_CMD, is straightforward to set on all HDP versions.
# Set the HADOOP_CMD bash variable
Save file via
To check to see whether the variables were set write the following in terminal:
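A minimal sketch of the whole round trip (write the exports, reload the file, print the variables) might look like the following. It uses a temporary file in place of /etc/profile, and both paths are placeholder assumptions rather than the values for any particular HDP version.

```shell
# Stand-in for /etc/profile so the sketch does not touch system files
profile="$(mktemp)"

# Placeholder values (assumptions); use the paths from your own image
cat >> "$profile" <<'EOF'
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/path/to/hadoop-streaming.jar
EOF

# Reload the file in the current shell, as a new login shell would
. "$profile"

# Print both variables to confirm they are set
echo "HADOOP_CMD=$HADOOP_CMD"
echo "HADOOP_STREAMING=$HADOOP_STREAMING"
```

On the real system, changes to /etc/profile only reach shells started afterwards, so either source the file as above or log in again before checking.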
Installing R Packages
First, open R using:
Using the sudo prefix here is VERY important, since it places the packages in the system-wide library instead of a user-specific library (e.g. /usr/lib64/R/).
Then within R type:
# Exit R
Installing rmr2 and other RevolutionAnalytics Hadoop to R technology
If you are interested in obtaining the latest version of rmr2 or other Hadoop to R technology, then check out the official download page on GitHub for Revolution Analytics software packages.
To complete the guide, we will install rmr2 in the shell (i.e. not in R):
# Download the latest version
wget -O /tmp/rmr2_3.3.0.tar.gz https://github.com/RevolutionAnalytics/rmr2/raw/master/build/rmr2_3.3.0.tar.gz
# Trigger install via shell R command.
# Make sure to use sudo to place in system library!
sudo R CMD INSTALL /tmp/rmr2_3.3.0.tar.gz
We will also need to set up a folder to store logs and ensure we have read and write privileges to it.
Note: rstudio is the username I chose. If you picked a different username, substitute that username wherever you see rstudio in this step.
# Create the log file directory recursively
# (i.e. if any one directory is missing, then create it)
sudo mkdir -p /var/log/hadoop/rstudio/
# !!THIS NEXT COMMAND IS VERY DANGEROUS FOR A PRODUCTION ENVIRONMENT!!
# A better solution is to remap where the hadoop logs are sent.
# e.g. modify hadoop-env.sh by adding to the end of the file
# the line: export HADOOP_LOG_DIR=<Your Location>
# Allow ANYONE to write to any files within the directory
sudo chmod -R 777 /var/log/hadoop/rstudio
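The safer alternative mentioned in the comments, remapping where the hadoop logs are sent, could be sketched roughly as follows. Everything here runs against temporary stand-ins: the log location is hypothetical, the env file stands in for hadoop-env.sh, and the 750 mode is just one example of a narrower permission set than 777.

```shell
# Stand-ins so the sketch can run without touching system paths
log_dir="$(mktemp -d)/hadoop-logs"   # hypothetical log location
env_file="$(mktemp)"                 # stand-in for hadoop-env.sh

# Create the directory, writable by its owner only
mkdir -p "$log_dir"
chmod 750 "$log_dir"

# Append the HADOOP_LOG_DIR override to the end of the env file
echo "export HADOOP_LOG_DIR=$log_dir" >> "$env_file"

# Show the resulting line
cat "$env_file"
```

On a real image, the directory would need to be owned by (or at least writable for) the user that runs the jobs, which is rstudio in this guide.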
Here is a quick way to check whether the package has been set up correctly.
Note: The following is a modification of the first example on the rmr2 tutorial page on GitHub.
Here is some basic R code:
# Create an R vector with values ranging from 1 to 1000
small.ints = 1:1000
# Apply the function to each element of small.ints
sapply(small.ints, function(x) x^2)
Here is the same computation written in R, but run as a map reduce job via Hadoop:
# Load the package
library(rmr2)
# Write an R object to the hdfs backend
small.ints = to.dfs(1:1000)
# Create a map reduce job using data on the hdfs backend
small.ints.job = mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
# Retrieve the results from hdfs
small.ints.df = from.dfs(small.ints.job)
# Results will be in a list form with the list structured:
# the $key [not supplied, so it'll be null]
# the $val (values)
# Display the top 6 observations from results
head(small.ints.df$val)
For some fun, check out these tutorials on using Hadoop within R!