## Talkin' Bout A Big Data Revolution

Big Data is here and there is no turning back.

No longer are there whispers about Big Data: it is being shouted from the mountaintops, thanks in part to the publicity around the NSA's programs. Not only that, but students are rising up to demand that Big Data courses (e.g. Hadoop, Storm, Pig, Hive) be offered so that they are more attractive as potential new hires.

Companies, for their part, appreciate the pressure students are exerting to be trained in Big Data, given the vast troves of data companies hold on each individual. They are desperately seeking to capitalize on these data portfolios instead of letting them bit-rot in a dark server room. But are the companies actually performing Big Data analyses? Or is it just another "Oooh, they're doing it... so we're doing it! (Even though we're not)" to get into the cool kids' club?

To describe this new struggle, the term "Jabberwocky," as used in Better Off Ted (S1E12), comes to mind. To companies and students, this is the next big thing, something new, tomorrowland. But it really isn't. The fact that "Big Data" can be reduced to keywords such as velocity, volume, and variety has lowered the activation threshold for what constitutes "Big Data."

Truth be told, "Big Data" was already around in the 1980s (e.g. AA loyalty programs and preferred-shopper cards), and since then few have really gotten a handle on it. In fact, the theory and hardware behind it are still being actively developed as we speak. In the interim, we extend the modeling paradigm by creating data structures that let us work with high volumes of observations (large $n$) and globs of predictors (large $p$).

Recently, I gave a talk at the Big Data and Analytics Council @ UIUC centered on performing supervised learning (primarily logistic regression) on large amounts of data using R. The slides are here: Supervised Learning with Big Data using R (R Code). The main focus of the talk was the current modeling paradigm and how it extends to large $n$ through R's bigmemory, biganalytics, and biglm packages.
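To give a flavor of the idea, here is a minimal sketch of my own (a toy example, assuming the biglm package is installed; the data, variable names, and sizes are made up): bigglm() fits a logistic regression by sweeping through the rows chunk by chunk, so the full design matrix never has to sit in memory at once.

```r
# Sketch: logistic regression fit in chunks with biglm::bigglm(),
# so only one chunk of rows is processed per update.
library("biglm")

set.seed(1)
n = 2e4                                   # stand-in for a much larger n
big_df = data.frame(x1 = rnorm(n), x2 = rnorm(n))
big_df$y = rbinom(n, 1, plogis(0.5 * big_df$x1 - 0.25 * big_df$x2))

# chunksize controls how many rows are swept into the fit at a time
fit = bigglm(y ~ x1 + x2, data = big_df, family = binomial(),
             chunksize = 5000)
summary(fit)
```

The same call pattern works when the rows live outside of RAM, e.g. against a bigmemory matrix (via biganalytics) or a database connection.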

## "The Qual"

It's been a while since I've written anything for TCP.
I lay the blame on the fact that I've been studying up a storm for the qualification exam, or, as the majority of graduate students and faculty say, "The Qual." Many people have asked why exactly I've been a bit antisocial over the last few weeks. When I respond with "Qual," I receive looks of confusion. So, I thought I would share with you what a qualification exam looks like, with some in-line solutions.

*Drum roll* The moment you've been waiting for... Check out a STATS qual here*

*I make no guarantee that the solutions are complete or accurate.

## STAT 426 - Take Home Midterm 3

The Take Home Midterm 3 was made available on Friday, May 2nd, 2014 and was due on May 16th, 2014.

The following is the text framing the midterm examination:

Exam 3
Statistics 426
Due May 16, 2014

This exam will involve analysis of the bacteria dataset from the MASS package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

The aim is to model the presence of the bacterium H. influenzae in children with a middle ear infection in the Northern Territory of Australia. Specifically, the aims are listed below:

1. Fit regression models to uncover treatment effects on the presence of H. influenzae that consider possible effects due to treatment compliance and the timing of the test (week).
2. Perform this analysis using generalized linear mixed models, generalized estimating equations, and transition models. Compare and contrast the three approaches, and state how they compare with regard to estimated treatment effects. Note that for transition models, some work might need to be done to get the dataset in shape to run with the glm() function.
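For readers curious what two of these fits look like mechanically, here is a rough sketch of my own (not the graded solution): a GLMM via MASS::glmmPQL() and a transition model via glm() with a lagged response. A GEE fit would follow the same formula with an add-on package such as geepack's geeglm().

```r
# Sketch on MASS::bacteria (not the graded solution).
library(MASS)    # bacteria dataset and glmmPQL()
library(nlme)    # glmmPQL() fits via lme()
data(bacteria)
bacteria$present = as.numeric(bacteria$y == "y")

# GLMM: random intercept for each child (ID)
glmm_fit = glmmPQL(present ~ trt + week, random = ~ 1 | ID,
                   family = binomial, data = bacteria, verbose = FALSE)

# Transition model: lag the response within each child, then plain glm()
bacteria = bacteria[order(bacteria$ID, bacteria$week), ]
bacteria$prev = ave(bacteria$present, bacteria$ID,
                    FUN = function(x) c(NA, head(x, -1)))
trans_fit = glm(present ~ trt + week + prev, family = binomial,
                data = bacteria, subset = !is.na(prev))
summary(trans_fit)
```

The lagged column `prev` is the "work to get the dataset in shape" the exam mentions: each child's first visit has no predecessor, so it drops out of the transition fit.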

Outline for reports

• Introduction Do a little reading so you can give some background information on the bacteria and otitis media. Paraphrase the main objectives of your analysis.
• The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses.
• Methods Discuss your methods for developing your models, such as variable coding, utilization of interactions (if needed), selection of covariance structure (in gee case), and so forth.
• Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and a comparison of the three techniques for modeling (gee, glmm, transition).
• Summary Give a very brief summary of what you have found in a non-technical manner.
• Appendix Provide the R-code that you used to conduct your analysis.

The solution posted here is the exam that I turned in on the due date.

The Midterm 3 solution for STAT 426 can be found here: STAT 426 Midterm 3 (PDF)

The file is created using knitr, which allows R code to be interwoven with LaTeX:
STAT 426 Midterm 3 (RNW)

## STAT 426 - Take Home Midterm 2

The Take Home Midterm 2 was made available on April 9th, 2014 and was due on May 1st, 2014.

The following is the text framing the midterm examination:

Exam 2
Statistics 426
Due May 1, 2014

This exam will involve analysis of the pima dataset from the faraway package, to answer some specific questions that are listed below. Guidelines are given on how to prepare your document, and you should imagine this is a report done for some professional purpose. A substantial part of the scoring will be due to how well your procedures and results are communicated. Keep in mind, this is an exam and there can be absolutely no collaboration or discussion of your analysis or report with other students, or anyone else.

1. Determine a regression model that accurately predicts the outcome of the diabetes test, which is indicated by the variable test, a test for diabetes done 5 years after the other variables were measured on a sample not having diabetes initially. Produce 2 such models, one that includes the predictor glucose and one that does not.
2. Using the two models from the first aim, construct rules based on these models for assigning a positive or negative diagnosis. Taking test as the true status after 5 years, estimate the sensitivity, specificity, and predictive value of your rules. Try to define the rule (cut-off value) in such a way that the pair (sensitivity, specificity) is as close as possible to the point (1,1).
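As an illustration of the second aim, here is a hedged sketch of my own (not the graded solution, and using simulated data as a stand-in for the pima variables): scan a grid of cut-offs and keep the one whose (sensitivity, specificity) pair sits closest to the ideal point (1, 1).

```r
# Sketch: choosing a diagnostic cut-off closest to (sens, spec) = (1, 1).
# Simulated data stands in for the pima dataset here.
set.seed(42)
n = 500
glucose = rnorm(n, mean = 120, sd = 30)
test = rbinom(n, 1, plogis((glucose - 120) / 20))

fit = glm(test ~ glucose, family = binomial)
p_hat = fitted(fit)

cutoffs = seq(0.05, 0.95, by = 0.01)
perf = t(sapply(cutoffs, function(cut) {
  pred = as.numeric(p_hat >= cut)
  c(sens = mean(pred[test == 1] == 1),   # true positive rate
    spec = mean(pred[test == 0] == 0))   # true negative rate
}))

# Euclidean distance from each (sensitivity, specificity) pair to (1, 1)
d = sqrt((1 - perf[, "sens"])^2 + (1 - perf[, "spec"])^2)
(best_cut = cutoffs[which.min(d)])
```

Swapping the simulated variables for the real pima columns (and a richer model formula) gives the exam's rule directly.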

Outline for reports

1. Introduction Do a little reading, maybe simply on Wikipedia, so you can give some background information on diabetes, Pima Indians, and the potential significance of the predictors in your dataset. Paraphrase the main objectives of your analysis.
2. The Data Provide some summaries of your data, search for outliers and consider what to do with them in your analyses. Perhaps do some bivariate analyses, such as smoothing techniques to see how test is functionally related to the predictors.
3. Methods Discuss your methods for selecting variables and a link function for your models, and assessing the diagnostic properties for the two rules (one with glucose and one without) you choose for diagnosis. Comment on the leverage and influence of cases and how you dealt with this.
4. Results Show the results and details of the fitted regression models in appropriate tables and possibly any relevant graphs. Provide an interpretation of the results and the role each predictor plays.
5. Summary Give a very brief summary of what you have found in a non-technical manner.
6. Appendix Provide the R-code that you used to conduct your analysis.

The solution posted here is the exam that I turned in on the due date.

The Midterm 2 solution for STAT 426 can be found here: STAT 426 Midterm 2 (PDF)

The file is created using knitr, which allows R code to be interwoven with LaTeX:
STAT 426 Midterm 2 (RNW)

## STAT 426 - In-class Midterm 1

The In-class Midterm 1 was given on March 13th, 2014 at 12:30 PM - 2:50 PM. The solutions posted here have been typed up from the solutions written on the in-class midterm.

The Midterm 1 Solutions for STAT 426 can be found here: STAT 426 Midterm 1 (PDF)

The file is created using knitr, which allows R code to be interwoven with LaTeX:
STAT 426 Midterm 1 (RNW)

# The Motivation

Over the past week, I've been working on converting R scripts into scripts that hook into R's C++ API using RcppArmadillo. The common reaction when I mentioned the project was, "Huh? Why would you bother converting from R into C++?" Then I showed them the benchmarks. Needless to say, I now have several folks who want to learn how to write scripts using RcppArmadillo. Please note, this post is meant as the introduction to a series of posts on writing in Rcpp/RcppArmadillo. If you do not have basic C++ knowledge, please consider acquiring it before proceeding.

# The Setup

Before we go further, let's talk about the basic working environment for RcppArmadillo.
First, make sure you install the following packages: RcppArmadillo, Rcpp, inline, and rbenchmark:

install.packages(c("Rcpp", "RcppArmadillo", "inline", "rbenchmark"))


# Development Environment

The development flow that is typical with code development projects is to use an RStudio project. To create an Rcpp project from within RStudio, go through the normal project creation steps:

1. File => New Project
2. New Directory => R Package
3. Click the Type: dropdown menu to engage options, Select “Package w/ Rcpp”
• By default it says, “Package”
4. Fill in Package name
5. Select appropriate directory for package
6. Uncheck create a git repository for this project
• Git is a version control system for code and is outside the scope of this tutorial.
7. Press create project

The quoted options indicate changes from the normal package creation process.

If you create a new C++ file from the drop-down menu, note that in the upper right-hand corner of the code editor you no longer have "Run" and "Re-run the previous code region." The only holdover from creating R scripts in the code editor is "Source," which compiles that Rcpp script using the R console command sourceCpp().

In order to build your package using the Build tab, modifications need to be made to both the DESCRIPTION file and the NAMESPACE file.

The DESCRIPTION file should look like so:

Package: your package name here
Type: Package
Title: package title
Version: 1.0
Date: YYYY-MM-DD
Maintainer: Person to send complaints to <complain_to_me@gmail.com>
Description: Short description
Imports: Rcpp (>= 0.11.1), RcppArmadillo (>= 0.4.200.0)
LinkingTo: Rcpp, RcppArmadillo


Note, the only main differences from a standard package are the inclusion of Rcpp and RcppArmadillo in the Imports and LinkingTo fields!

Within the NAMESPACE file, make the following modifications:

useDynLib(packagename)
importFrom(Rcpp, evalCpp)
exportPattern("^[[:alpha:]]+")


The key is to add importFrom(Rcpp, evalCpp) to the file and substitute your package name in useDynLib(packagename).

With these modifications, RStudio will be able to handle RcppArmadillo based packages.
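Once the setup is in place, a quick way to confirm RcppArmadillo compiles on your machine is a one-off inline function via cppFunction() with the RcppArmadillo dependency declared. This is a minimal example of my own (mat_mult is a made-up name), not part of the package workflow above:

```r
# Sketch: compile a small RcppArmadillo function inline
# (assumes Rcpp, RcppArmadillo, and a working C++ toolchain).
library("Rcpp")

cppFunction(depends = "RcppArmadillo", '
arma::mat mat_mult(const arma::mat& A, const arma::mat& B) {
  // Armadillo overloads * as true matrix multiplication
  return A * B;
}')

A = matrix(1:4, nrow = 2)
B = diag(2)
mat_mult(A, B)   # same result as A %*% B
```

If this compiles and runs, the package build in RStudio should work as well.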

A short caveat to the above note: avoid spaces or any special characters in the file path to the source file.
So, this means you should not try to compile a source file with paths like the following:

C:/users/name/what up/did you/know spaces/are very/harmful to/rcpp files.cpp

or

C:/users/name/!@#$%^&*()-=+/rcppfile.cpp

# Making a simple script

Please note, there are several alternatives to this workflow. For example, if you only want to outsource one loop-intensive function, then using the inline package or cppFunction() is preferred.

Here is an example creating an inline function declaration with cppFunction():

# In R
library("Rcpp")

cppFunction('
// declare return type, specify function name, and function parameters
int hello_world_rcpp(int n) {
  // C++ for loop
  for (int i = 0; i < n; i++) {
    // prints to the R console, similar to print() or cat()
    Rcpp::Rcout << "Hello World!" << std::endl;
  }
  // send back the number of times hello world was said
  return n;
}')

The R equivalent would be:

hello_world_r = function(n) {
  for (i in 1:n) {
    cat("Hello World!\n")
  }
  return(n)
}

Calling the function results in:

# In R
hello_world_rcpp(n = 2)
## Hello World!
## Hello World!
## [1] 2

# Why this wasn't futile…

In the beginning, I mentioned that the primary reason for converting from R scripts to R's C++ API was speed. To illustrate, note the following benchmarks created using library("rbenchmark"):

library("rbenchmark")
benchmark(rcpp_v = hello_world_rcpp(2), r_v = hello_world_r(2))
##     test replications elapsed relative user.self sys.self user.child sys.child
## 2    r_v          100    0.01       NA      0.02        0         NA        NA
## 1 rcpp_v          100    0.00       NA      0.00        0         NA        NA

So, we should prefer the Rcpp implementation. Note that this may not always hold when you are echoing statements out to the console; there may be added lag from Rcpp output statements vs. R output statements. However, looping within Rcpp should be faster than looping in R. Also, the printed output from this benchmark was suppressed: if you run it on your system, expect 200 "Hello World!" lines (and 200 returns of the number 2) in addition to the table displayed above.
## Rcpp, RcppArmadillo and OS X Mavericks "-lgfortran" and "-lquadmath" error

# Intro

Over the past weekend, I've been working on translating R code into Rcpp, a C++ extension for R, to increase the speed of Markov Chain Monte Carlo (MCMC) runs. There have been quite a few things I've learned. From the top:

1. RStudio is fantastic, since not only does it handle R, but it also works splendidly with Rcpp!
2. Be prepared to write your own operations (i.e. matrix multiplication, solve, etc.) or use RcppArmadillo.
3. Buzz Lightyear voice: "Compiler errors… everywhere!"
4. Benchmark, benchmark, and then benchmark.

I had such a positive experience with Rcpp that I will most likely be talking about it more in the future. The same cannot be said for the initial experience I had with RcppArmadillo, a critical extension package of Rcpp that all statisticians will want to use.

# The Problem

The initial experience I had with RcppArmadillo was problematic since it didn't work out of the box, even with the slew of development tools (Xcode) I have installed on OS X. Therefore, I would like to draw attention to a wee bit of a hiccup users can potentially experience on OS X Mavericks (OS X 10.9).
Chiefly, the show-stopping errors of "-lgfortran" and "-lquadmath." Examples of the error texts are:

"-lgfortran" Error

ld: warning: directory not found for option '-L/usr/local/lib/gcc/i686-apple-darwin8/4.2.3/x86_64'
ld: warning: directory not found for option '-L/usr/local/lib/x86_64'
ld: warning: directory not found for option '-L/usr/local/lib/gcc/i686-apple-darwin8/4.2.3'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)

"-lquadmath" Error

ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lquadmath
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [sourceCpp_93074.so] Error 1
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/RcppArmadillo/include" -fPIC -Wall -mtune=core2 -g -O2 -c rmvrnorm_arma.cpp -o rmvrnorm_arma.o
clang++ -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o sourceCpp_93074.so rmvrnorm_arma.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

The good news is that both errors originate from the same problem: R for OS X Mavericks was compiled using gfortran-4.8. What does this mean? For starters, the official CRAN mirror of R tools for OS X and the R tools for OS X page on r.research.att.com that lists the gfortran binary are out of date.
Furthermore, how NOT to solve this problem, but make it more complicated than it was before, is to download the gcc gfortran binaries straight from the source. The reason this is a very bad idea and will not work is that those binaries were not the ones used to compile R. As a result, using the gcc binaries will lead you to switch from receiving a "-lgfortran" error to a "-lquadmath" error, since a few files are missing from the expected locations.

# The Solution

Go to the optional libraries, frameworks and applications for Mac OS X on r.research.att.com and download gfortran-4.8.2-darwin13.tar.bz2. Extract the package at /, the filesystem root. The files should be unpacked into /usr/local/…

Alternatively, open terminal and type:

curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /

Happy Rcpp and RcppArmadillo'ing on OS X Mavericks!

## Creating Stacked Barplot and Grouped Barplot in R using Base Graphics (no ggplot2)

Stacked barplots, or graphs that depict conditional distributions of data, are great for seeing a level-wise breakdown of the data. Unfortunately, R has no easy built-in function for generating a stacked barplot. In fact, the majority of R users suggest creating stacked barplots using ggplot2, since it is easier and also looks a bit better. However, for those not yet aboard the ggplot2 train, or for those who prefer base R, fear not! For we are about to embark on an adventure that will bring stacked barplots to base R, as well as the ability to group bars together!

## Examining the Data

For this example, we will use data concerning web domains referenced in different journals.
The dataset looks like the following:

internetrefs

Domain Journal Count
gov NEJM 41
gov JAMA 103
gov Science 111
org NEJM 37
org JAMA 46
org Science 162
com NEJM 6
com JAMA 17
com Science 14
edu NEJM 4
edu JAMA 8
edu Science 47
other NEJM 9
other JAMA 15
other Science 52

Notice, the data has a consistent order: all Domain entries are grouped together, and the Journal variable has a cyclic ordering where NEJM, then JAMA, and finally Science appears. We will exploit this feature of the data. I will also provide a method for when such a feature does not exist.

### Creating the Matrix

In order to create a stacked bar plot, we must first know a variable's levels(). Levels are the set of values a factor, R's type for categorical (string) variables, can take on; the levels() function returns these unique values. So:

# Attach object so that we can reference Domain, Journal, and Count
# instead of internetrefs$Count
attach(internetrefs)
levels(Journal)
## [1] "JAMA"    "NEJM"    "Science"

levels(Domain)
## [1] "com"   "edu"   "gov"   "org"   "other"


Then, using this information, the data must be transformed into a matrix structure: the rows represent the domains and the columns represent the journals. One caveat: since the data appear in the order gov, org, com, edu, other (and NEJM, JAMA, Science), the dimension labels must follow that same order, which unique() preserves, rather than the alphabetical ordering of levels(). We induce this by:

### USING THE PRE-EXISTING ORDER FEATURE OF THE DATA ###

# Load the count values
data = Count

# Place data in a matrix that will have 3 columns since the number of levels()
# for Journal is 3. Also, based on the ordering feature of the data, load
# the matrix such that we fill the matrix by row.
data = matrix(data, ncol = 3, byrow = T)

# Label the columns and rows in the order they appear in the data
colnames(data) = as.character(unique(Journal))
rownames(data) = as.character(unique(Domain))

The matrix structure of data is then:

id NEJM JAMA Science
gov 41 103 111
org 37 46 162
com 6 17 14
edu 4 8 47
other 9 15 52

But let's say your data lack that nice ordering feature we discussed earlier. One way to obtain it is to order the initial dataframe by its columns:

internetrefs_ordered = internetrefs[with(internetrefs, order(Domain, Journal)), ]
id Domain Journal Count
8 com JAMA 17
7 com NEJM 6
9 com Science 14
11 edu JAMA 8
10 edu NEJM 4
12 edu Science 47
2 gov JAMA 103
1 gov NEJM 41
3 gov Science 111
5 org JAMA 46
4 org NEJM 37
6 org Science 162
14 other JAMA 15
13 other NEJM 9
15 other Science 52

From here, the problem then simplifies to the code used for the ordered data:

### AFTER CREATING THE ORDER FEATURE OF THE DATA ###

# Access and load the count values
data_ordered = internetrefs_ordered$Count

# Place data in a matrix that will have 3 columns since the number of levels()
# for Journal is 3. Also, based on the ordering feature of the data, load
# the matrix such that we fill the matrix by row.
data_ordered = matrix(data_ordered, ncol = 3, byrow = T)

# Label the columns and rows; after ordering, the alphabetical levels()
# match the data's order
colnames(data_ordered) = levels(internetrefs_ordered$Journal)
rownames(data_ordered) = levels(internetrefs_ordered$Domain)

The matrix structure of data_ordered is then:

id JAMA NEJM Science
com 17 6 14
edu 8 4 47
gov 103 41 111
org 46 37 162
other 15 9 52

## Building the Stacked Barplot

There are two ways to build a stacked barplot: percentage-based and counts-based. However, special care needs to be taken when including a legend. By default, barplot()'s legend-generating capabilities are pretty lacking. As a result, one needs to modify the margin space and how clipping is handled. This is achieved by setting par():

# mar is defined to receive: c(bottom, left, top, right). The default
# margin is: c(5, 4, 4, 2) + 0.1. As a result, we have expanded the
# right-hand side of the figure to hold the legend.
# xpd = TRUE clips plotting to the figure region, so the legend can
# sit in the margin.
par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)

To build percentage-based barplots, we must use prop.table() to generate each column's percentages:

# Here, margin represents whether it will be run on rows (1) or columns (2).
# We've selected to use prop.table() on columns since that was how we built
# our data.
prop = prop.table(data, margin = 2)

If we are looking to build a percentage-based vertical stacked barplot, then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))), legend = rownames(data))

If we are looking to build a counts-based vertical stacked barplot, then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(data, col = heat.colors(length(rownames(data))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))), legend = rownames(data))

## Grouped Barplots

If we are looking to build a percentage-based grouped (side-by-side) barplot, add beside = TRUE:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))), legend = rownames(data))

If we are looking to build a counts-based grouped barplot, then:

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
barplot(data, col = heat.colors(length(rownames(data))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))), legend = rownames(data))

## In Summary

We've talked about a lot of different components required to pull off a stacked barplot and a grouped barplot in R.
Below is the script that you will want to modify to suit your own data:

internetrefs = read.delim("~/Desktop/internetrefs.txt")

# Force order
data_ordered = internetrefs[with(internetrefs, order(Domain, Journal)), ]

# Load the count values
data = data_ordered$Count

data = matrix(data, ncol = 3, byrow = T)

colnames(data) = levels(data_ordered$Journal)
rownames(data) = levels(data_ordered$Domain)

prop = prop.table(data, margin = 2)

par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)

# Percent-based vertically stacked barplot
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))), legend = rownames(data))

# Percent-based grouped barplot
barplot(prop, col = heat.colors(length(rownames(prop))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(prop))), legend = rownames(data))

# Counts-based vertically stacked barplot
barplot(data, col = heat.colors(length(rownames(data))), width = 2)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))), legend = rownames(data))

# Counts-based grouped barplot
barplot(data, col = heat.colors(length(rownames(data))), width = 2, beside = TRUE)
legend("topright", inset = c(-0.25, 0), fill = heat.colors(length(rownames(data))), legend = rownames(data))

### Thanks!

Special thanks go out to Weihong Huang!

# Definition of Tukey's Test Statistic

Suppose $r$ independent observations, denoted $Y_1, Y_2, \cdots, Y_r$, from a normal distribution with mean $\mu$ and variance $\sigma^2$.

Let $W$ be their range: $W = \max\limits_{i} Y_i - \min\limits_{i} Y_i$.

Now, suppose that we have an estimate $s^2$ of the variance $\sigma^2$ which is based on $v$ degrees of freedom and is independent of the $Y_i$ ($i = 1,\cdots,r$). $v$ is usually derived from the analysis of variance.

Then, Tukey's test statistic is:

$q = \frac{W}{s}$

# Ranges: Internalized

Let $W=X_{n}-X_{1}$, the range of the ordered sample $X_{1} \le \cdots \le X_{n}$.

Internally Studentized Range: Population $\sigma^2$ is unknown, so we studentize with the same sample, giving $W/s$,

where $s=\left( \frac{1}{n - 1}{\sum\limits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} } \right)^{1/2}$

# Ranges: Externalized

Let $W=X_{n}-X_{1}$

Externally Studentized Range: Population $\sigma^2$ is unknown AND an independent estimator $s_{v}^2$ of $\sigma^2$ is available with $v$ degrees of freedom, giving $W/s_{v}$.

The dependence of the distribution of $W$ on the unknown $\sigma$ can be removed by studentization, since $vS_{v}^{2}/\sigma^2 \sim \chi_{v}^2$, independent of $W$.

# Ranges: Both

Let $W=X_{n}-X_{1}$

Externally and Internally Studentized Range: $W/\tilde{s}$,

where $\tilde{s} = \left( \frac{1}{n - 1 + v}\left( {\sum\limits_{i = 1}^n {{{\left( {{X_i} - \bar X} \right)}^2}} + vs_{v}^2 } \right) \right)^{1/2}$

# Our use

We will be using the Externally Studentized Range….

Specifically, the first definition resembles….

# Tukey's Studentized Range Test Statistic ANOVA

Let ${Y_{ij}} \mathop \sim\limits^{iid} N\left( {0 ,{\sigma ^2}} \right)$, where $j=1,\cdots,n$ and $i=1,\cdots,k$ be independent observations in a balanced one-way ANOVA with $k$ treatments.

Then $\bar{Y}_{1},\cdots,\bar{Y}_{k}$ are the sample averages and $S^2$ is the independent and unbiased estimator of $\sigma^2$ based on $v=k\left({n-1}\right)$ degrees of freedom.

Let $W$ be the range of the $\bar{Y}_i$; Tukey's statistic is then $q = \frac{W}{S/\sqrt{n}}$.

# Example - Data

Consider a dosage regimen with repeated measures:

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Row Total
Dose 1 27.00 26.20 28.80 33.50 28.80 144.30
Dose 2 22.80 23.10 27.70 27.60 24.00 125.20
Dose 3 21.90 23.40 20.10 27.80 19.30 112.50
Dose 4 23.50 19.60 23.70 20.80 23.90 111.50
Col Total 95.20 92.30 100.30 109.70 96.00 493.50

# In R

(dose_type = c(rep("1", 5), rep("2", 5), rep("3", 5), rep("4", 5)))
##  [1] "1" "1" "1" "1" "1" "2" "2" "2" "2" "2" "3" "3" "3" "3" "3" "4" "4"
## [18] "4" "4" "4"

# data is the dosage table above stored as a 4 x 5 matrix (doses x samples)
data = matrix(c(27.0, 26.2, 28.8, 33.5, 28.8,
                22.8, 23.1, 27.7, 27.6, 24.0,
                21.9, 23.4, 20.1, 27.8, 19.3,
                23.5, 19.6, 23.7, 20.8, 23.9), nrow = 4, byrow = TRUE)

(samples = c(data[1, ], data[2, ], data[3, ], data[4, ]))
##  [1] 27.0 26.2 28.8 33.5 28.8 22.8 23.1 27.7 27.6 24.0 21.9 23.4 20.1 27.8
## [15] 19.3 23.5 19.6 23.7 20.8 23.9


# ANOVA Table:

model = glm(samples ~ factor(dose_type))

aov(model)
## Call:
##    aov(formula = model)
##
## Terms:
##                 factor(dose_type) Residuals
## Sum of Squares              140.1     116.3
## Deg. of Freedom                 3        16
##
## Residual standard error: 2.696
## Estimated effects may be unbalanced

# Tukey's 95% C.I. pairwise comparison process

All pairwise differences $\mu_{i}-\mu_{j}$ are given $100 \times \left({1-\alpha} \right)$% C.I.s of the form $\bar{Y}_{i}-\bar{Y}_{j} \pm q_{\alpha;k,v}\frac{S}{\sqrt{n}}$.

Note: Always start with the largest mean and smallest mean pair, if the result is not significant, then the result will hold for all means between the largest and smallest.

# Largest vs. Smallest

$q_{.05; 4, 20 - 4} =$ qtukey(0.95, 4, 20 - 4) $= 4.0461$

Comparing the largest mean (Dose 1, $\bar{Y}=28.86$) with the smallest (Dose 4, $\bar{Y}=22.30$): $6.56 \pm 4.8789$
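As a quick check of where that margin comes from (a sketch of my own, using the residual standard error 2.696 from the ANOVA table, $k = 4$ means, $v = 16$ degrees of freedom, and $n = 5$ observations per dose):

```r
# Tukey half-width: q * s / sqrt(n)
q = qtukey(0.95, nmeans = 4, df = 16)
margin = q * 2.696 / sqrt(5)
q       # roughly 4.046
margin  # roughly 4.88, the interval half-width used above
```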

# In R:

TukeyHSD(aov(model))
## Tukey multiple comparisons of means
##   95% family-wise confidence level
##
## Fit: aov(formula = model)
##
## $factor(dose_type)
##      diff     lwr    upr  p adj
## 2-1 -3.82  -8.699  1.059 0.1546
## 3-1 -6.36 -11.239 -1.481 0.0089
## 4-1 -6.56 -11.439 -1.681 0.0070
## 3-2 -2.54  -7.419  2.339 0.4662
## 4-2 -2.74  -7.619  2.139 0.4027
## 4-3 -0.20  -5.079  4.679 0.9994

So, only 3-1 and 4-1 are significant.

That's all, folks!

# References

• David, H. A. "Studentized Range." Encyclopedia of Statistical Sciences. New York: Wiley, 2006. 1-3. Print.
• Falk, Michael, and Frank Marohn. “The One-Way Analysis of Variance.” Foundations of statistical analyses and applications with SAS. Basel: Birkhauser Verlag, 2002. 193-194. Print.
• Harter, H. Leon, and N. Balakrishnan. “The Studentized Range of Samples from a Normal Population.” Tables for the use of range and studentized range in tests of hypotheses. Boca Raton, Fla.: CRC Press, 1998. 52-53. Print.
• Hochberg, Yosef. “Studentized Range.” Encyclopedia of Biostatistics. Hoboken, NJ: John Wiley & Sons, Ltd, 2005. 1-3. Print.
• Lowry, Richard. “Ch 14: One-Way Analysis of Variance for Independent Samples.” Concepts & Applications of Inferential Statistics. 2000. 1-3. Print.

# Alternative way to write the Tukey's Test Statistic

Let ${X_1}, \cdots ,{X_m} \mathop \sim\limits^{iid} N\left( {0 ,{\sigma ^2}} \right)$, where $m \ge 2$, and let $Z$ be $\chi^{2}$ with $n$ degrees of freedom, independent of the $X_i$. Then the studentized range statistic is $q = \frac{\max\limits_{i} X_i - \min\limits_{i} X_i}{\sqrt{Z/n}}$.

In the case of $m=2$, $q$ closely resembles the two sample $t$ test statistic.

That is, $X_{1}$ and $X_{2}$ are taken to be the standardized sample means of the two samples and $Z/n$ is the pooled sample variance, $S_p^2 = \frac{{S_1^2\left( {{n_1} - 1} \right) + S_2^2\left( {{n_2} - 1} \right)}}{{{n_1} + {n_2} - 2}},{\text{ }}{S_E} = {S_p}\sqrt {\frac{1}{{{n_1}}} + \frac{1}{{{n_2}}}}$
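To make the resemblance concrete, here is the standard identity (my own addition, not from the original notes): with $m = 2$ and the $X_i$ standardized sample means,

```latex
q = \frac{\left| X_1 - X_2 \right|}{\sqrt{Z/n}}
  = \sqrt{2}\, \frac{\left| \bar X_1 - \bar X_2 \right|}
                    {S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
  = \sqrt{2}\, \left| t \right|
```

so the studentized range over two means is just $\sqrt{2}$ times the absolute two-sample $t$ statistic.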

## Installing the Open Source Shiny Server on Dreamhost VPS / Debian

So, you've decided you want to make your Shiny application available to the world? Good news! You can do that easily with a VPS or a dedicated server; shared hosting will not give you the shell access necessary to complete the installation. One day, RStudio may open up the Shiny cloud. Until then, I am using Dreamhost's VPS to house my Shiny creations. These are the steps that I took:

## Special Install Instructions for Dreamhost VPS!

For those using Dreamhost VPS, note that the OS is Debian Squeeze (6.0.x) and NOT Debian Wheezy (7.x)! As a result, a modification to /etc/apt/sources.list must be made:

# Make /etc/apt/sources.list writable
sudo touch /etc/apt/sources.list
sudo chmod a+w /etc/apt/sources.list

# Open vim to edit the file
vim /etc/apt/sources.list

# Add to the last line of /etc/apt/sources.list
deb http://cran.wustl.edu/bin/linux/debian squeeze-cran3/

# If you are far away from wustl, view the list of CRAN mirrors to choose from here:
# http://cran.r-project.org/mirrors.html

# To exit out of vim use :wq

Back in terminal execute the following:

# Add the key signed by the Debian backports archives on CRAN
apt-key adv --keyserver keys.gnupg.net --recv-key 381BA480

End Dreamhost Specific Install Instructions

## General Install Guidelines:

# Download newest information on the package lists
sudo apt-get update

# Install the latest and greatest r/r-dev
sudo apt-get install r-base r-base-dev

# Make sure we are fully up to date by fetching new versions of existing packages
sudo apt-get upgrade

# Install the Shiny package into the site-wide R library
sudo su - \
  -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""

# Error came up? See below for specifics!

# Install gdebi
sudo apt-get install gdebi-core

# Download the prebuilt server image
wget http://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.0.0.42-amd64.deb

# Install image
sudo gdebi shiny-server-1.0.0.42-amd64.deb

Assuming there are no errors, you are done! But, alas, life isn't so simple. So, here are a few errors I ran into.

## Errors, errors, where art thou?

Do NOT open R without sudo! Using regular R will give:

# In terminal, open R without sudo!
R

Inside r:

install.packages('shiny', repos='http://cran.rstudio.com')
Installing package into '/usr/local/lib/R/site-library'
(as 'lib' is unspecified)
Warning in install.packages("shiny", repos = "http://cran.rstudio.com") :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-pc-linux-gnu-library/3.0
to install packages into?  (y/n) n

This does NOT work well, since the packages are then installed into a user-specific library. They need to be installed into "/usr/local/lib/R/site-library", which is the site-wide library. Therefore, make sure you do:

# In terminal, open R WITH sudo!
sudo R

The dreaded configure error courtesy of RJSONIO. The error will be given as:

* installing *source* package 'RJSONIO' ...
** package 'RJSONIO' successfully unpacked and MD5 sums checked
ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'
* removing '/usr/local/lib/R/site-library/RJSONIO'
ERROR: dependency 'RJSONIO' is not available for package 'shiny'
* removing '/usr/local/lib/R/site-library/shiny'

The downloaded source packages are in
  '/tmp/RtmpWWV73L/downloaded_packages'
Warning messages:
1: In install.packages("shiny", repos = "http://cran.rstudio.com/") :
  installation of package 'RJSONIO' had non-zero exit status
2: In install.packages("shiny", repos = "http://cran.rstudio.com/") :
  installation of package 'shiny' had non-zero exit status

The solution?

# Option 1
# Create the directory
mkdir ~/tmp
# Make it read/write
chmod 777 ~/tmp
export TMPDIR=~/tmp

# Option 2
mkdir /home/YOUR-USERNAME-HERE/tmp
chmod 777 /home/YOUR-USERNAME-HERE/tmp
export TMPDIR=/home/YOUR-USERNAME-HERE/tmp

Within sudo R:

# First, check to see if TMPDIR was set:
Sys.getenv("TMPDIR")

# If this responds with the above path, skip the next step.
# Otherwise, set the TMPDIR variable to the correct folder:
Sys.setenv("TMPDIR" = "/home/YOUR-USERNAME-HERE/tmp")

# Install packages like normal:
install.packages('shiny', repos = 'http://cran.rstudio.com')