Installing RStudio Server on Hortonworks' Virtual Box Image and rmr2 a.k.a. RHadoop (R Package)

Intro

This is a guide meant to demonstrate how to install RStudio Server on Hortonworks' Virtual Box Image (HDP) and set up the R package known as RHadoop. You should have already installed Oracle's Virtual Box and loaded the Hortonworks Virtual Box Image into Virtual Box.

Port Forwarding on Virtual Box

Before we start the Virtual Box image, we must first forward a port so that we can access RStudio via our browser, in a similar vein to accessing Hortonworks' web interface.

Access the Settings options via:

Virtual Box Image Access Settings

Open port forwarding by going to Network and then selecting Port Forwarding:

Virtual Box Network Menu for Port Forwarding

Creating a new port forwarding rule:

Virtual Box create new port forwarding rule

Enter the following for the new rule:

  • Name: rstudioserver
  • Protocol: TCP
  • Host IP: 127.0.0.1
  • Host Port: 8787
  • Guest Port: 8787

Virtual Box forwarding information for new rule

Accept the changes:

Virtual Box accept changes

Start the Hortonworks Virtual Box image:

Virtual Box start image

Installing RStudio Server on Hortonworks' image (based on CentOS 6)

Now, within the Virtual Box image, you will need to open a terminal using ALT+F5 (Windows / Linux) or CMD+ALT+F5 (OS X). With the terminal open, enter the following commands to install RStudio Server:

# Step 1: Install the necessary programs
# R, wget (file downloads), openssl098e (SSL), vim (text editor),
# git (version control), curl (command-line transfers over http)
# sudo allows the execution to happen as the root user (lots of power)
# yum is the package manager for CentOS (Red Hat)
# The -y means yes to all prompts.
sudo yum -y install R git wget openssl098e vim curl

# Step 2: Download RStudio Server
wget -O /tmp/rstudio-server-0.98.1091-x86_64.rpm http://download2.rstudio.org/rstudio-server-0.98.1091-x86_64.rpm

# Step 3: Install RStudio Server
# --nogpgcheck temporarily disables the signature check of the package.
sudo yum -y install --nogpgcheck /tmp/rstudio-server-0.98.1091-x86_64.rpm

# Step 4: Check for any issues with the install
sudo rstudio-server verify-installation

# Step 5: Add a user to log in to RStudio
sudo adduser rstudio
sudo passwd rstudio
# New password: <your password here> (I chose rstudio)

Now, you should be able to log in to RStudio Server by entering 127.0.0.1:8787 or localhost:8787 in your browser's URL bar.

Login page

Access R Studio via your browser by entering 127.0.0.1:8787

What R Studio looks like in your web browser (can you tell any difference vs. the client?):

After logging in with rstudio user

When you are done using RStudio, make sure you use q() to exit the R session. This will prevent the following error message from being displayed on subsequent starts:

21 Nov 2014 14:24:38 [rsession-rstudio] ERROR session hadabend;
LOGGED FROM: core::Error<unnamed>::rInit(const r::session::RInitInfo&)
/root/rstudio/src/cpp/session/SessionMain.cpp:1692

In the event RStudio Server is not able to start on a subsequent run of the image, log into the shell and use the following two commands:

# Stop any rstudio-server process
sudo rstudio-server stop

# Start a new rstudio-server process
sudo rstudio-server start

Setting up rmr2 (RHadoop)

In order to install and use RHadoop, we must first set two bash variables, install some packages in R, and then install rmr2.

Bash Variable Initialization

The first variable, HADOOP_CMD, is straightforward to set on all HDP versions.

# Set the HADOOP_CMD bash variable
export HADOOP_CMD=/usr/bin/hadoop

For the second variable, HADOOP_STREAMING, things are a bit more complicated and depend on the HDP version you are using. Specifically, there are two cases I've identified: HDP 2.0 - HDP 2.1 vs. HDP 2.2 Preview. Before we begin, we'll figure out the version number of the hadoop streaming jar by using:

# Search for where it is located via:
find / -name 'hadoop-streaming*.jar'

Write down the version number at the end of this file. You will need it for the next step.

Before you start writing file paths…

NOTE: Do not fight Linux! Use tab to autocomplete words and decrease the amount you need to type.

Pro tip: Tab completion only works while the prefix is unambiguous (e.g. it will stall on hadoop-, since several files share that prefix).

For HDP 2.0 - HDP 2.1, use:

# Set the HADOOP_STREAMING variable for HDP 2.0 and HDP 2.1: 
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-<YOUR VERSION>.jar  

For HDP 2.2 Preview, use:

# Set the HADOOP_STREAMING variable for HDP 2.2 Preview: 
export HADOOP_STREAMING=/usr/hdp/2.2.0.0-913/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.0.0-913.jar  

To check whether the variables were set, use:

echo $HADOOP_CMD
echo $HADOOP_STREAMING

Installing R Packages

First, open R using:

sudo R

Then within R type:

install.packages( c('RJSONIO', 'itertools', 'digest',
                    'Rcpp', 'functional', 'plyr',
                    'stringr', 'reshape2'),
repos='http://cran.revolutionanalytics.com')

# Exit R
q()

Installing rmr2 via shell

Back in shell use:

wget -O /tmp/rmr2_3.2.0.tar.gz https://github.com/RevolutionAnalytics/rmr2/raw/master/build/rmr2_3.2.0.tar.gz
R CMD INSTALL /tmp/rmr2_3.2.0.tar.gz
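
With rmr2 installed, a quick smoke test from within R confirms that everything is wired up. The following is a minimal sketch based on the canonical rmr2 tutorial example; the HADOOP_STREAMING path shown is the HDP 2.2 Preview one from above, so adjust it to match your image:

# In R: set the environment variables rmr2 expects, then run a tiny job
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/hdp/2.2.0.0-913/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.0.0-913.jar")

library(rmr2)

# Push a small vector to HDFS, square each value in a map-only job,
# and pull the key-value pairs back into R
small_ints = to.dfs(1:10)
result = mapreduce(input = small_ints,
                   map = function(k, v) keyval(v, v^2))
from.dfs(result)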

For some fun, check out these tutorials on using Hadoop within R!

Reproducible Research, R, R Studio, and You! (Also, UIUC Beamer Theme...)

A long long time ago, in a galaxy far far away, I was asked to give a talk about reproducible research in the R ecosystem during the University of Illinois at Urbana-Champaign's Department of Statistics Friday Brownbag series. The talk was a long time coming; in fact, it was rescheduled twice, once when a guest economics professor appeared for a lecture and once when I was struck with the worst cold of my life.

Last Friday, the stars aligned and I was able to give the presentation. The presentation focused on removing point-and-click data modification, copy-and-paste transfer of tables and graphs, and documentation duplication by advocating for an automated approach to data cleaning, data analysis, and presentation generation. Since the standard at UIUC is to use R and RStudio, the presentation really focuses on using these two tools to create a successful workflow that enables reproducible research. Other software packages I talk about are git for version control (plus GitHub) and pandoc for document conversion. On the R side, I recommend the following packages: knitr for dynamic documents, rmarkdown for an agnostic file format, xtable for transferring R data to LaTeX or HTML tables, and roxygen2 for turning inline code comments into .Rd documentation.

If this is interesting, I recommend checking out the slides and R code used to create the presentation here: Reproducible Research via R, LaTeX, and knitr (R Code).

During the course of preparing the last two talks I gave, I was a bit disappointed that UIUC lacked a beamer color theme. Previously, I had obtained a UIUC-oriented theme from a professor, but I thought I could do better. I modified the theme a bit so that it looks like: UIUC Beamer Theme (sample). If you are interested in using the theme, download: Illinois I Logo and .Rnw File. For those of you who are googling "UIUC Beamer Theme" and wondering "why is there no .tex?": note that .Rnw is a form of .tex. The main difference is that .Rnw allows R code to be embedded in it. Removing the code chunks (delimited by <<>>= and @) will revert the file to a normal .tex file.
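
For the curious, here is a minimal, hypothetical .Rnw skeleton showing what those chunks look like; strip out everything between <<>>= and @ and a plain .tex file remains:

\documentclass{article}
\begin{document}

<<summary-chunk, echo=TRUE>>=
# R code embedded in the LaTeX source; knitr runs this at compile time
summary(cars)
@

\end{document}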

Till next time....

Talkin' Bout A Big Data Revolution

Big Data is here and there is no turning back.

No longer are there whispers about Big Data... Big Data is being yelled from the mountaintops as a result of the revelations about the NSA programs. Not only that, but students are rising up to demand that Big Data courses (e.g. Hadoop, Storm, Pig, Hive, etc.) be offered so that they are more attractive as potential new hires.

Companies appreciate the additional pressure students are exerting to be trained in Big Data, because of the vast troves of data the companies possess on each individual. They are desperately seeking to capitalize on the data portfolios they have developed instead of letting them bit rot in a dark server room. But are these companies actually performing a Big Data analysis? Or is it just another "Oooh, they're doing it... So, we're doing it! Even though we're not," 'cause you want to be in the cool kids club 'n stuff.

To describe this new struggle, the term "Jabberwocky" comes to mind, as it was used in Better Off Ted (S1E12). To companies and students, this is the next thing, this is something new, this is tomorrowland. But it really isn't. The fact that "Big Data" can be broken down into keywords such as velocity, volume, and variety has helped decrease the activation threshold for what constitutes "Big Data."

Truth be told, "Big Data" was available back in the 1980s (e.g. AA loyalty programs and preferred-shopper cards), and since then no one has really gotten a clue about "Big Data." In fact, the theory and hardware behind it are actively being developed as we speak. In the interim, we try to extend the modeling paradigm by creating data structures that enable us to look at high volumes of data (large n) and globs of predictors (large p).

Recently, I gave a talk at the Big Data and Analytics Council @ UIUC that centered on performing supervised learning (primarily logistic regression) on large amounts of data using R. The slides are here: Supervised Learning with Big Data using R (R Code). The main focus of the talk was showing the current modeling paradigm and how it extends to working with large n through R's bigmemory, biganalytics, and biglm packages.
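
To give a flavor of the large-n side of that talk, here is a minimal sketch (with made-up stand-in chunks, not the talk's data) of how biglm folds data in piece by piece so the full dataset never has to sit in memory:

library(biglm)

# Stand-in chunks; in practice these would be read from disk or a database
chunk1 = data.frame(y = rnorm(1000), x = rnorm(1000))
chunk2 = data.frame(y = rnorm(1000), x = rnorm(1000))

fit = biglm(y ~ x, data = chunk1)  # fit on the first chunk
fit = update(fit, chunk2)          # fold in the next chunk
summary(fit)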

STAT 426 - Take Home Midterm 3

The Take Home Midterm 3 was made available on Friday, May 2nd, 2014 and was due on May 16th, 2014.

The following is the text framing the midterm examination:

Exam 3
Statistics 426
Due May 16, 2014

This exam will involve analysis of the bacteria dataset from the MASS package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

The aim is to model the presence of the bacteria H. influenzae in children with a middle ear infection in the Northern Territory of Australia. Specifically, the aims are listed below:

  1. Fit regression models to uncover treatment effects on presence of H. influenzae, that consider possible effects due to treatment compliance and the timing of the test (week).
  2. Perform this analysis using generalized linear mixed models, generalized estimating equations, and transition models. Compare and contrast the three approaches, and state how they compare with regard to estimated treatment effects. Note that for transition models, some work might need to be done to get the dataset in shape to run with the glm() function.

Outline for reports

  • Introduction Do a little reading so you can give some background information on the bacteria and otitis media. Paraphrase the main objectives of your analysis.
  • The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses.
  • Methods Discuss your methods for developing your models, such as variable coding, utilization of interactions (if needed), selection of covariance structure (in gee case), and so forth.
  • Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and a comparison of the three techniques for modeling (gee, glmm,transition).
  • Summary Give a very brief summary of what you have found in a non-technical manner.
  • Appendix Provide the R-code that you used to conduct your analysis.
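
For readers who want a starting point (a sketch only, not the graded solution), the GLMM and GEE fits might begin like the following, assuming the lme4 and geepack packages are installed:

library(MASS)      # bacteria dataset
library(lme4)      # glmer() for the GLMM
library(geepack)   # geeglm() for the GEE

data(bacteria)
# Recode the presence indicator as 0/1
bacteria$y01 = as.integer(bacteria$y == "y")

# GLMM: random intercept for each child (ID)
glmm_fit = glmer(y01 ~ trt + week + (1 | ID), data = bacteria,
                 family = binomial)

# GEE: exchangeable working correlation within each child
# (rows in bacteria are already grouped by ID, as geeglm requires)
gee_fit = geeglm(y01 ~ trt + week, id = ID, data = bacteria,
                 family = binomial, corstr = "exchangeable")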

The solution posted here is the exam that I turned in on the due date.

The Midterm 3 solution for STAT 426 can be found here: STAT 426 Midterm 3 (PDF)

The file is created using knitr, which allows R code to be intertwined with LaTeX:
STAT 426 Midterm 3 (RNW)

STAT 426 - Take Home Midterm 2

The take home midterm 2 was made available on April 9th, 2014 and was due on May 1st, 2014.

The following is the text framing the midterm examination:

Exam 2
Statistics 426
Due May 1, 2014

This exam will involve analysis of the pima dataset from the faraway package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

Below are the main issues to address with your analysis:

  1. Determine a regression model that accurately predicts the outcome of the diabetes test, which is indicated by the variable test, a test for diabetes done 5 years after the other variables were measured on a sample not having diabetes initially. Produce 2 such models, one that includes the predictor glucose and one that does not.
  2. Using the two models from the first aim, construct rules based on these models for assigning a positive or negative diagnosis. Taking test as the true status after 5 years, estimate the sensitivity, specificity, and predictive value of your rules. Try to define the rule (cut-off value) in such a way that the pair (sensitivity, specificity) is as close as possible to the point (1,1).

Outline for reports

  1. Introduction Do a little reading, maybe simply on Wikipedia, so you can give some background information on diabetes, Pima Indians, and the potential significance of the predictors in your dataset. Paraphrase the main objectives of your analysis.
  2. The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses. Perhaps do some bivariate analyses, such as smoothing techniques to see how test is functionally related to the predictors.
  3. Methods Discuss your methods for selecting variables and a link function for your models, and assessing the diagnostic properties for the two rules (one with glucose and one without) you choose for diagnosis. Comment on the leverage and influence of cases and how you dealt with this.
  4. Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and the role each predictor plays.
  5. Summary Give a very brief summary of what you have found in a non-technical manner.
  6. Appendix Provide the R-code that you used to conduct your analysis.
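
As with the other exam, here is only a sketch (not the turned-in solution) of the diagnostic-rule portion: fit a logistic model, sweep cut-off values, and keep the one whose (sensitivity, specificity) pair lands closest to (1, 1). The predictors below are placeholders, not a recommended model:

library(faraway)   # pima dataset

data(pima)
fit = glm(test ~ glucose + bmi + age, data = pima, family = binomial)
p = fitted(fit)

cutoffs = seq(0.05, 0.95, by = 0.01)
perf = sapply(cutoffs, function(cut) {
  pred = as.integer(p > cut)
  c(sens = mean(pred[pima$test == 1] == 1),   # true positive rate
    spec = mean(pred[pima$test == 0] == 0))   # true negative rate
})

# Cut-off whose (sensitivity, specificity) is closest to (1, 1)
best = cutoffs[which.min((1 - perf["sens", ])^2 + (1 - perf["spec", ])^2)]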

The solution posted here is the exam that I turned in on the due date.

The Midterm 2 solution for STAT 426 can be found here: STAT 426 Midterm 2 (PDF)

The file is created using knitr, which allows R code to be intertwined with LaTeX:
STAT 426 Midterm 2 (RNW)

Setting up RStudio to work with RcppArmadillo

The Motivation

Over the past week, I've been working on converting R scripts into scripts that hook into R's C++ API using RcppArmadillo. The common reaction when I mentioned the project was, "Huh? Why would you bother converting R code into C++?" Then I showed them the benchmarks. Needless to say, I now have several folks who want to learn how to write scripts using RcppArmadillo. Please note, this post is meant as the introduction to a series of posts on writing in Rcpp/RcppArmadillo. If you do not have basic C++ knowledge, please consider acquiring it before proceeding.

The Setup

Before we go further, let's talk about the work environment basics for RcppArmadillo.
First, make sure you install the following packages: RcppArmadillo, Rcpp, inline, and rbenchmark.

install.packages("Rcpp","RcppArmadillo","inline","rbenchmark");

Development Environment

The typical development flow is to use an RStudio project space. To create an Rcpp project from within RStudio, go through the normal project creation steps:

  1. File => New Project
  2. New Directory => R Package
  3. Click the Type: dropdown menu to engage options, Select “Package w/ Rcpp”
    • By default it says, “Package”
  4. Fill in Package name
  5. Select appropriate directory for package
  6. Uncheck create a git repository for this project
    • Git is a version control system for code and is outside the scope of this tutorial.
  7. Press create project

Bold words indicate changes from the normal creation process.

If you create a new C++ file from the drop-down menu, note that in the upper right-hand corner of the code editor you no longer have "Run" and "Re-run the previous code region." The only holdover from creating R scripts in the code editor is "Source," which compiles the Rcpp script using the R console command sourceCpp().
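
As a quick illustration (the function name and body here are my own, not part of the template), pressing Source is equivalent to calling sourceCpp() yourself; the code argument variant keeps everything in one R session:

# In R
library(Rcpp)

# Equivalent to pressing Source on a saved .cpp file
sourceCpp(code = '
#include <Rcpp.h>

// [[Rcpp::export]]
int add_one(int x) {
    return x + 1;
}
')

add_one(41)
## [1] 42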

In order to build your package using the Build tab, modifications need to be made to both the DESCRIPTION file and the NAMESPACE file.

The DESCRIPTION file should look like so:

Package: your package name here
Type: Package
Title: package title
Version: 1.0
Date: YYYY-MM-DD
Author: Your Name Here
Maintainer: Person to send complaints to <complain_to_me@gmail.com>
Description: Short description
License: MIT + file LICENSE
Imports: Rcpp (>= 0.11.1), RcppArmadillo (>= 0.4.200.0)
LinkingTo: Rcpp, RcppArmadillo
Depends: RcppArmadillo

Note, the only major differences are the inclusion of RcppArmadillo in Imports, LinkingTo, and Depends!

Within the NAMESPACE file, make the following modifications:

useDynLib(packagename)
import(RcppArmadillo)
importFrom(Rcpp, evalCpp)
exportPattern("^[[:alpha:]]+")

The key is to add import(RcppArmadillo) to the file and substitute your package name in useDynLib(packagename).

With these modifications, RStudio will be able to handle RcppArmadillo based packages.

A short caveat to the above: avoid spaces or any special characters in the file path to the source file.
This means you should not try to compile a source file with paths like the following:

C:/users/name/what up/did you/know spaces/are very/harmful to/rcpp files.cpp

or

C:/users/name/!@#$%^&*()-=+/rcppfile.cpp

Making a simple script

Please note, there are several alternatives to this workflow. For example, if you only want to outsource one loop-intensive function, then using the inline package or cppFunction() is preferred.

Here is an example creating an inline function declaration with cppFunction():

# In R
library("Rcpp")
 
# In R
cppFunction('
// declare the return type, function name, and function parameters
int hello_world_rcpp(int n) {

    // C++ for loop
    for(int i = 0; i < n; i++) {
        // prints to the R console, similar to print() or cat()
        Rcpp::Rcout << "Hello World!" << std::endl;
    }

    // send back the number of times hello world was said
    return n;
}')

The R equivalent would be:

hello_world_r = function(n) {
    for (i in 1:n) {
        cat("Hello World!\n")
    }
    return(n)
}

Calling the function results in:

# In R
hello_world_rcpp(n = 2)
Hello World!
Hello World!
[1] 2

Why this wasn't futile…

In the beginning, I mentioned that the primary reason for converting from R scripts to R's C++ API was speed. To illustrate, note the following benchmarks created using library("rbenchmark").

library("rbenchmark")
benchmark(rcpp_v = hello_world_rcpp(2), r_v = hello_world_r(2))
##     test replications elapsed relative user.self sys.self user.child   sys.child
## 2    r_v          100    0.01       NA      0.02        0         NA          NA
## 1 rcpp_v          100    0.00       NA      0.00        0         NA          NA

So, we should prefer the Rcpp implementation. Note, this may not always hold when you are echoing statements out to the console; there may be added lag using Rcpp output statements vs. R output statements. However, looping within Rcpp should be faster than looping in R. Also, the output from this benchmark was suppressed: if you run it on your own system, expect to see "Hello World!" printed 400 times (2 per call × 100 replications × 2 implementations) in addition to the output table displayed above.
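
Since this post is part of a series on RcppArmadillo, here is a small sketch of where Armadillo itself comes in: the depends argument of cppFunction() pulls in the RcppArmadillo headers, and matrix algebra then runs at C++ speed. The function below is my own example, not part of the benchmark above:

# In R
library(Rcpp)

cppFunction(depends = "RcppArmadillo", '
arma::mat mat_mult(const arma::mat& A, const arma::mat& B) {
    // Armadillo overloads * as true matrix multiplication
    return A * B;
}')

A = matrix(rnorm(9), 3, 3)
all.equal(mat_mult(A, A), A %*% A)
## [1] TRUE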

Rcpp, RcppArmadillo and OS X Mavericks "-lgfortran" and "-lquadmath" error

Intro

Over the past weekend, I've been working on translating R code into Rcpp, a C++ extension for R, to increase the speed of Markov Chain Monte Carlo (MCMC). There have been quite a few things I've learned. From the top: 1.) RStudio is fantastic, since not only does it handle R, it also works splendidly with Rcpp! 2.) Be prepared to write your own operations (i.e. matrix multiplication, solve, etc.) or use RcppArmadillo. 3.) In a Buzz Lightyear voice: "Compiler errors … everywhere!" 4.) Benchmark, benchmark, and then benchmark. I had such a positive experience with Rcpp that I will most likely be talking about it more in the future. The same cannot be said for the initial experience I had with RcppArmadillo, a critical extension package of Rcpp that all statisticians will want to use.

The Problem

The initial experience I had with RcppArmadillo was problematic, since it didn't work out of the box even with the slew of development tools (Xcode) I have installed on OS X. Therefore, I would like to draw attention to a wee bit of a hiccup users can potentially experience on OS X Mavericks (OS X 10.9): chiefly, the show-stopping errors "-lgfortran" and "-lquadmath." Examples of the error texts are:

“-lgfortran” Error

ld: warning: directory not found for option

'-L/usr/local/lib/gcc/i686-apple-darwin8/4.2.3/x86_64'

ld: warning: directory not found for option '-L/usr/local/lib/x86_64'

ld: warning: directory not found for option
'-L/usr/local/lib/gcc/i686-apple-darwin8/4.2.3'

ld: library not found for -lgfortran

clang: error: linker command failed with exit code 1 (use -v to see
invocation)

“-lquadmath” Error

ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'

ld: library not found for -lquadmath 

clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [sourceCpp_93074.so] Error 1
clang++
 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include  -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include"
-I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include"
-fPIC  -Wall -mtune=core2 -g -O2  -c rmvrnorm_arma.cpp -o rmvrnorm_arma.o clang++ -dynamiclib -Wl,-headerpad_max_install_names
-undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o sourceCpp_93074.so rmvrnorm_arma.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

The good news is that both errors originate from the same problem: R for OS X Mavericks was compiled using gfortran 4.8. What does this mean? For starters, the official CRAN page of R tools for OS X and the r.research.att.com page that lists the gfortran binary are out of date. Furthermore, the way NOT to solve this problem, but to make it more complicated than it was before, is to download the gcc gfortran binaries straight from the source. The reason this is a very bad idea and will not work is that those binaries were not the ones used to compile R. As a result, using the gcc binaries will merely switch you from receiving a "-lgfortran" error to a "-lquadmath" error, since a few files are missing from the expected locations.

The Solution

Go to the optional libraries, frameworks and applications page for Mac OS X on r.research.att.com and download gfortran-4.8.2-darwin13.tar.bz2. Extract the package at /, the filesystem root; the files should be unpacked into /usr/local/…

Alternatively, open terminal and type:

curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
# x: extract, v: verbose, j: handle .bz2, f: file; -C / unpacks at the filesystem root
sudo tar xvjf gfortran-4.8.2-darwin13.tar.bz2 -C /
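
Once the archive is extracted, a one-liner in R verifies that the linker errors are gone (a quick sketch; the test is simply that compilation succeeds):

# In R: if this compiles and returns 4, -lgfortran and -lquadmath resolve
library(Rcpp)
evalCpp("2 + 2", depends = "RcppArmadillo")
## [1] 4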

Happy Rcpp and RcppArmadillo'ing on OS X Mavericks!

Creating Stacked Barplots and Grouped Barplots in R using Base Graphics (no ggplot2)

Stacked barplots, or graphs that depict conditional distributions of data, are great for seeing a level-wise breakdown of the data. Unfortunately, base R has no function that builds a stacked barplot directly from long-format data; barplot() first needs the data arranged as a matrix. In fact, the majority of R users suggest creating stacked barplots using ggplot2, since it is easier and also looks a bit better. However, for those not yet aboard the ggplot2 train, or for those who prefer base R, fear not! For we are about to embark on an adventure that will bring stacked barplots to base R, as well as the ability to group bars together!

Examining the Data

For this example, we will use data concerning web domains in different journals. The dataset looks like the following:

internetrefs
Domain Journal Count
gov NEJM 41
gov JAMA 103
gov Science 111
org NEJM 37
org JAMA 46
org Science 162
com NEJM 6
com JAMA 17
com Science 14
edu NEJM 4
edu JAMA 8
edu Science 47
other NEJM 9
other JAMA 15
other Science 52

Notice, the data has a consistent order: all Domain entries are grouped together, and the Journal variable has a cyclic ordering in which NEJM, then JAMA, and finally Science appear. We will exploit this feature of the data. I will also provide a method that handles the data when this feature does not exist.

Creating the Matrix

In order to create a stacked bar plot, we must first know a variable's levels(). A factor is how R stores a categorical (string) variable, and the levels() function returns the unique values that the variable takes on. So:

# Attach the object so that we can reference Domain, Journal, and Count
# directly instead of internetrefs$Count
attach(internetrefs)

levels(Journal)
## [1] "JAMA"    "NEJM"    "Science"

levels(Domain)
## [1] "com"   "edu"   "gov"   "org"   "other"

Then, using this information, the data must be transformed into a matrix structure. The matrix structure takes on the following form: the rows represent the values of the Domain variable and the columns represent the values of the Journal variable. We induce this by:

### USING THE PRE-EXISTING ORDER FEATURE OF THE DATA ###

# Load the count values
data = Count

# Place data in a matrix that will have 3 columns since the number of
# levels() for Journal is 3. Also, based on the ordering feature of the
# data, load the matrix such that we fill the matrix by row.
data = matrix(data, ncol = 3, byrow = T)

# Label the columns and rows. Note that unique() preserves the order in
# which values first appear, which matches how the matrix was filled;
# levels() would return them alphabetically and mislabel the entries.
colnames(data) = as.character(unique(Journal))
rownames(data) = as.character(unique(Domain))

The matrix structure of data is then:

id    NEJM JAMA Science
gov     41  103     111
org     37   46     162
com      6   17      14
edu      4    8      47
other    9   15      52

But let's say that you lack the nice ordering feature of the data we discussed earlier. One way to obtain it is to order the rows of the initial data frame:

internetrefs_ordered = internetrefs[with(internetrefs, order(Domain, Journal)), ]
id Domain Journal Count
8 com JAMA 17
7 com NEJM 6
9 com Science 14
11 edu JAMA 8
10 edu NEJM 4
12 edu Science 47
2 gov JAMA 103
1 gov NEJM 41
3 gov Science 111
5 org JAMA 46
4 org NEJM 37
6 org Science 162
14 other JAMA 15
13 other NEJM 9
15 other Science 52

From here, the problem then simplifies to the code used for the ordered data:

### AFTER CREATING THE ORDER FEATURE OF THE DATA ###

# Access and load the count values
data_ordered = internetrefs_ordered$Count

# Place data in a matrix that will have 3 columns since the number of
# levels() for Journal is 3. Also, based on the ordering feature of the
# data, load the matrix such that we fill the matrix by row.
data_ordered = matrix(data_ordered, ncol = 3, byrow = T)

# Label the columns and rows. After ordering, row and column order match
# the alphabetical order that levels() returns.
colnames(data_ordered) = levels(internetrefs_ordered$Journal)
rownames(data_ordered) = levels(internetrefs_ordered$Domain)
id    JAMA NEJM Science
com     17    6      14
edu      8    4      47
gov    103   41     111
org     46   37     162
other   15    9      52
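
As an aside, base R can also build this matrix in one step with xtabs(), which cross-tabulates Count by Domain and Journal without any manual ordering; I mention it as an alternative to, not part of, the workflow above:

# One-step alternative: a two-way table that barplot() accepts directly
data_xtab = xtabs(Count ~ Domain + Journal, data = internetrefs)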

Building the Stacked Barplot

There are two ways to build a stacked barplot: percentage-based and counts-based.

However, special care needs to be taken when including a legend. By default, barplot()'s legend-generating capabilities are pretty lacking. As a result, one needs to modify the margin space and how clipping is handled. This is achieved by setting par():

# mar is defined to receive: c(bottom, left, top, right). The default
# margin is: c(5, 4, 4, 2) + 0.1. As a result, we have exploded the
# right-hand side of the figure to hold the legend.

# xpd = TRUE forces all plotting to be clipped to the figure region
par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)

To build percentage-based barplots, we must use prop.table() to generate each percentage for the columns:

# Here, margin represents whether it will be run on rows (1) or columns (2)
  # We&#39;ve selected to use prop.table() on columns since that was how we built
  # our data.
  prop = prop.table(data, margin = 2)

If we are looking to build a percentage-based vertical stacked barplot then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))), 
       legend = rownames(data))

Percentage-based vertical stacked barplot

If we are looking to build a counts-based vertical stacked barplot then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(data, col = heat.colors(length(rownames(data))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))),
       legend = rownames(data))

Counts-based vertical stacked barplot

Grouped Barplots

If we are looking to build a percentage-based grouped barplot then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))),
       legend = rownames(data))

Percentage-based grouped barplot

If we are looking to build a counts-based grouped barplot then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(data, col = heat.colors(length(rownames(data))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))),
       legend = rownames(data))

Counts-based grouped barplot

In Summary

We've talked about a lot of different components required to pull off a stacked barplot and a grouped barplot in R. Below is the script that you will want to modify to suit your own data:

internetrefs = read.delim("~/Desktop/internetrefs.txt")

# Force the ordering feature
data_ordered = internetrefs[with(internetrefs, order(Domain, Journal)), ]

# Load the count values
data = data_ordered$Count

data = matrix(data, ncol = 3, byrow = T)

colnames(data) = levels(data_ordered$Journal)
rownames(data) = levels(data_ordered$Domain)

prop = prop.table(data, margin = 2)

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)

# Percentage-based vertical stacked barplot
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))),
       legend = rownames(data))

# Percentage-based grouped barplot
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))),
       legend = rownames(data))

# Counts-based vertical stacked barplot
barplot(data, col = heat.colors(length(rownames(data))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))),
       legend = rownames(data))

# Counts-based grouped barplot
barplot(data, col = heat.colors(length(rownames(data))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))),
       legend = rownames(data))

Thanks!

Special thanks go out to Weihong Huang!

Tukey Studentized Range and You!

Definition of Tukey's Test Statistic

Suppose we have r independent observations, denoted:

{Y_1}, \cdots ,{Y_r} \mathop \sim\limits^{iid} N\left( {\mu ,{\sigma ^2}} \right)


Let

W = range(Y) = max(Y_{i})-min(Y_{i})

Now, suppose that we have an estimate s^2 of the variance \sigma^2, based on v degrees of freedom and independent of the Y_i (i = 1,\cdots,r). Typically, v is derived from an analysis of variance.

Then, Tukey's test statistic is:

q_{r,v} = \frac{W}{s}

Ranges: Internalized

Let W=X_{n}-X_{1}

Internally Studentized Range: Population \sigma^2 is unknown

q_{n,n-1}=\frac{W}{s}=\frac{X_{n}-X_{1}}{s},

where s=\left( \frac{1}{\left( {n - 1} \right)}{\sum\limits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} } \right)^{1/2}

Ranges: Externalized

Let W=X_{n}-X_{1}

Externally Studentized Range: Population \sigma^2 is unknown AND an independent estimator s_{v}^2 of \sigma^2 is available with degrees of freedom v.

T=\frac{W}{s_v}=\frac{X_{n}-X_{1}}{s_v}

The dependence of the distribution of W on the unknown \sigma can be removed by studentization, since vS_{v}^{2}/\sigma^2 \sim \chi_{v}^2 independently of W.

Ranges: Both

Let W=X_{n}-X_{1}

Externally and Internally Studentized Range:

T=\frac{W}{\tilde{s}}=\frac{X_{n}-X_{1}}{\tilde{s}},

where \tilde{s} = \left( \frac{1}{n - 1 + v}\left[ {\sum\limits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} + vs_{v}^2} \right] \right)^{1/2}

Our use

We will be using the Externally Studentized Range.

Specifically, the ANOVA statistic below resembles the first definition given above.

Tukey's Studentized Range Test Statistic ANOVA

Let {Y_{ij}} \mathop \sim\limits^{iid} N\left( {0 ,{\sigma ^2}} \right), where j=1,\cdots,n and i=1,\cdots,k be independent observations in a balanced one-way ANOVA with k treatments.

Then \bar{Y}_{1},\cdots,\bar{Y}_{k} are the sample averages and S^2 is the independent and unbiased estimator of \sigma^2 based on v=k\left({n-1}\right) degrees of freedom.

Let W be the range of \bar{Y}_i.

q_{n,v} = \frac{W}{S/\sqrt{n}} = \frac{max\left({\bar{Y_{i\cdot}}}\right)-min\left({\bar{Y_{i\cdot}}}\right)}{ \sqrt{\left({MS_{error}/n}\right)}}

Example - Data

Consider a dosage regiment with repeated measures:

           Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Row Total
Dose 1        27.00     26.20     28.80     33.50     28.80     144.30
Dose 2        22.80     23.10     27.70     27.60     24.00     125.20
Dose 3        21.90     23.40     20.10     27.80     19.30     112.50
Dose 4        23.50     19.60     23.70     20.80     23.90     111.50
Col Total     95.20     92.30    100.30    109.70     96.00     493.50

In R

(dose_type = c(rep("1", 5), rep("2", 5), rep("3", 5), rep("4", 5)))
##  [1] "1" "1" "1" "1" "1" "2" "2" "2" "2" "2" "3" "3" "3" "3" "3" "4" "4"
## [18] "4" "4" "4"
samples = c(data[1, ], data[2, ], data[3, ], data[4, ])
##  [1] 27.0 26.2 28.8 33.5 28.8 22.8 23.1 27.7 27.6 24.0 21.9 23.4 20.1 27.8
## [15] 19.3 23.5 19.6 23.7 20.8 23.9

ANOVA Table:

model = glm(samples ~ factor(dose_type))
aov(model)
## Call:
##    aov(formula = model)
## 
## Terms:
##                 factor(dose_type) Residuals
## Sum of Squares              140.1     116.3
## Deg. of Freedom                 3        16
## 
## Residual standard error: 2.696
## Estimated effects may be unbalanced

Tukey's 95% C.I. pairwise comparison process

All pairwise differences \mu_{i}-\mu_{j} are given by the 100 \times \left({1-\alpha} \right)\% C.I.:

\left( {{{\bar Y}_i} - {{\bar Y}_j}} \right) \pm \frac{{{q_{\alpha ;J,N - J}}}}{{\sqrt 2 }} \cdot {s_{pooled}} \cdot \sqrt {\frac{1}{{{n_1}}} + \frac{1}{{{n_2}}}} ,{\text{ }}{s_{pooled}} = \sqrt {MSW}

Note: Always start with the pair of the largest and smallest means; if that result is not significant, then the result will hold for all means between the largest and smallest.

Largest vs. Smallest

\left( {28.86 - 22.30} \right) \pm \frac{{{q_{.05 ;4,20 - 4}}}}{{\sqrt 2 }} \cdot \sqrt{\left(116.3/16\right)} \cdot \sqrt {\frac{1}{{{5}}} + \frac{1}{{{5}}}} ,

q_{.05 ;4,20 - 4}=qtukey(0.95,4,20-4)= 4.0461

6.56 \pm 4.8789
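
To reproduce this hand calculation in R (a quick check):

q_crit = qtukey(0.95, nmeans = 4, df = 16)   # about 4.0461
s_pooled = sqrt(116.3/16)                    # sqrt(MS_error), about 2.696
half_width = (q_crit/sqrt(2)) * s_pooled * sqrt(1/5 + 1/5)
c(diff = 28.86 - 22.30, half_width = half_width)  # roughly 6.56 and 4.88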

In R:

TukeyHSD(aov(model))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = model)
## 
## $`factor(dose_type)`
##      diff     lwr    upr  p adj
## 2-1 -3.82  -8.699  1.059 0.1546
## 3-1 -6.36 -11.239 -1.481 0.0089
## 4-1 -6.56 -11.439 -1.681 0.0070
## 3-2 -2.54  -7.419  2.339 0.4662
## 4-2 -2.74  -7.619  2.139 0.4027
## 4-3 -0.20  -5.079  4.679 0.9994

So, only the 3-1 and 4-1 comparisons are significant.

That's All folks!

References

  • David, H. A. "Studentized Range." Encyclopedia of Statistical Sciences. New York: Wiley, 2006. 1-3. Print.
  • Falk, Michael, and Frank Marohn. “The One-Way Analysis of Variance.” Foundations of statistical analyses and applications with SAS. Basel: Birkhauser Verlag, 2002. 193-194. Print.
  • Harter, H. Leon, and N. Balakrishnan. “The Studentized Range of Samples from a Normal Population.” Tables for the use of range and studentized range in tests of hypotheses. Boca Raton, Fla.: CRC Press, 1998. 52-53. Print.
  • Hochberg, Yosef. “Studentized Range.” Encyclopedia of Biostatistics. Hoboken, NJ: John Wiley & Sons, Ltd, 2005. 1-3. Print.
  • Lowry, Richard. “Ch 14: One-Way Analysis of Variance for Independent Samples.” Concepts & Applications of Inferential Statistics. 2000. 1-3. Print.

Alternative way to write the Tukey's Test Statistic

Let {X_1}, \cdots ,{X_m} \mathop \sim\limits^{iid} N\left( {0 ,{\sigma ^2}} \right), where m \ge 2, and let Z be \chi^{2} with n degrees of freedom.

q = \frac{{\mathop {\max }\limits_{1 \leqslant i,j \leqslant m} \left| {{X_i} - {X_j}} \right|}}{{\sqrt {Z/n} }} = \frac{{{X_{m:m}} - {X_{1:m}}}}{{\sqrt {Z/n} }}

In the case of m=2, q closely resembles the two sample t test statistic.

That is, X_{1} and X_{2} are are taken to be the standardized sample means of the two samples and Z/n is the pooled sample variance, S_p^2 = \frac{{S_1^2\left( {{n_1} - 1} \right) + S_2^2\left( {{n_2} - 1} \right)}}{{{n_1} + {n_2} - 2}},{\text{ }}{S_E} = {S_p}\sqrt {\frac{1}{{{n_1}}} + \frac{1}{{{n_2}}}}