Stochastic Nonsense

Put something smart here.

Visualizing and Comparing Distributions -- Part 8 of a Series

This is post #08 in a running series about plotting in R.

Last time, I talked about visualizing the Uniform, Normal, Exponential, and Poisson Distributions. However, there are more useful methods than just plotting the density and distribution functions.

Of course, you can always simply ask R to output summary statistics: `

> n <- 10000
> d <- data.frame(unif=runif(n=n), norm=rnorm(n=n), exp=rexp(n=n),
+   pois=rpois(n, lambda=1))
>
> summary(d)
> summary(d)
      unif                norm                exp                 pois      
 Min.   :2.449e-05   Min.   :-3.603527   Min.   :0.0001143   Min.   :0.000  
 1st Qu.:2.489e-01   1st Qu.:-0.657498   1st Qu.:0.2963731   1st Qu.:0.000  
 Median :4.991e-01   Median :-0.003342   Median :0.6927739   Median :1.000  
 Mean   :5.001e-01   Mean   : 0.008527   Mean   :0.9958194   Mean   :1.006  
 3rd Qu.:7.498e-01   3rd Qu.: 0.662084   3rd Qu.:1.3718103   3rd Qu.:2.000  
 Max.   :1.000e+00   Max.   : 3.512077   Max.   :9.4726291   Max.   :6.000  
> 

`

Or perhaps, we could be even more detailed and ask for more finely detailed order statistics. The quantile function will calculate arbitrary order statistics: `

> quantile(d$unif, probs=seq(0,1,by=0.05))
          0%           5%          10%          15%          20%          25%          30%          35%          40%          45%          50%          55% 
2.448517e-05 4.870828e-02 9.949762e-02 1.450521e-01 1.953495e-01 2.488873e-01 3.011948e-01 3.541679e-01 4.020943e-01 4.508098e-01 4.991203e-01 5.527021e-01 
         60%          65%          70%          75%          80%          85%          90%          95%         100% 
6.038723e-01 6.540685e-01 6.990784e-01 7.498041e-01 7.989884e-01 8.480684e-01 8.982745e-01 9.496064e-01 9.999980e-01 
> t(t( quantile(d$unif, probs=seq(0,1,by=0.05)) ))
             [,1]
0%   2.448517e-05
5%   4.870828e-02
10%  9.949762e-02
15%  1.450521e-01
20%  1.953495e-01
25%  2.488873e-01
30%  3.011948e-01
35%  3.541679e-01
40%  4.020943e-01
45%  4.508098e-01
50%  4.991203e-01
55%  5.527021e-01
60%  6.038723e-01
65%  6.540685e-01
70%  6.990784e-01
75%  7.498041e-01
80%  7.989884e-01
85%  8.480684e-01
90%  8.982745e-01
95%  9.496064e-01
100% 9.999980e-01
> 
> s <- seq(0,1,by=0.02)
> e <- sapply(d, FUN=function(x) quantile(x, probs=s))
> e <- as.data.frame(e)
> e[1:10,]
            unif       norm          exp pois
0%  2.448517e-05 -3.6035270 0.0001142514    0
2%  1.900289e-02 -1.9884255 0.0226651895    0
4%  3.844486e-02 -1.7161347 0.0440634106    0
6%  6.012384e-02 -1.5041045 0.0634013971    0
8%  8.089568e-02 -1.3566198 0.0899735826    0
10% 9.949762e-02 -1.2407524 0.1103189153    0
12% 1.191770e-01 -1.1392377 0.1342348062    0
14% 1.369209e-01 -1.0483029 0.1590086026    0
16% 1.555910e-01 -0.9618975 0.1829207834    0
18% 1.748496e-01 -0.8799168 0.2056880019    0
> 

`

Plotting the quantile values is worth a shot, just to see what we get: `

>
> plot(x=s, y=e$unif, ylim=c(min(e), max(e)), type='b', pch='.', cex=3, lwd=2, col='black',
+   xlab='quantiles [0,100]', ylab='dist values', main='Quantile Plots, Four Distributions')
> for(j in 2:4){
+   points(x=s, y=e[,j], type='b', pch='.', cex=3, lwd=2, col=c('', 'red', 'blue', 'green')[j])
+ }
> legend(x=0, y=8, legend=colnames(e), col=c('black', 'red', 'blue', 'green'), lwd=3)
> 

` [caption id=“attachment_164” align=“aligncenter” width=“480” caption=“Quantile Plots, Four Distributions”]Quantile Plots, Four Distributions[/caption]

Quantile quantile plots are typically used when you want to see if a sample is a particular distribution. You plot the quantiles of the sample versus the assumed distribution and compare; if they are the same distribution you should get a straight line at roughly a forty five degree angle. `

> par(mfrow=c(1,2), oma=c(0,0,2,0))
> 
> qqplot(x=d$exp, y=d$pois, 
+   xlab='exponential', ylab='poisson', main = 'qq plot: different distributions')
> qqplot(x=d$norm, y=rnorm(n=n), 
+   xlab='our normal sample in d', ylab='new normal sample', main = 'qq plot: same distribution, resampled')
> 
> mtext('QQ plots -- different vs same distributions', side=3, outer=T)

` [caption id=“attachment_160” align=“aligncenter” width=“675” caption=“QQ Plots – Different Distribution and Same Distribution”]QQ Plots - Different Distribution and Same Distribution[/caption]

Box and whisker plots `

> dev.set(which=1)
> boxplot(x=d, xlab='distributions', main='Distributions Visualization - boxplot')

` [caption id=“attachment_161” align=“aligncenter” width=“800” caption=“Box and Whisker Plot, Four Distributions”]Box and Whisker Plot, Four Distributions[/caption]

Finally, for univariate distributions, I prefer box-percentile plots. These are similar to boxplots, but the width of the distribution graphs are proportional to the percent of observations more extreme in that direction. They are also marked at the 25th, 50th, and 75th percentiles.

`

> library(Hmisc)
>
> dev.set(which=1)
> bpplot(d, xlab='distributions', main='Distributions Visualization - box percentile plot')
> 

` [caption id=“attachment_162” align=“aligncenter” width=“800” caption=“Box-Percentile Plot, 4 Distributions”]Box-Percentile Plot, 4 Distributions[/caption]

And if the distributions are genuinely different, let’s examine 5 univariate Normal Distributions: our already sampled N(0,1), a resampled N(0,1), N(1,1), N(0,3), N(-1,0.5). Side by side, the box percentile plot really draws out the differences in the distributions. `

> 
> d2 = data.frame(norm=d$norm, resample=rnorm(n), mean1=rnorm(n=n, mean=1), sd3=rnorm(n=n, sd=3), sdhalf=rnorm(n=n, sd=0.5, mean=-2))
> 
> bpplot(d2, xlab='distributions', main='Multiple Normal Distribution Visualization - box percentile plot')
> 

` [caption id=“attachment_171” align=“aligncenter” width=“800” caption=“Box Percentile Plot, Five Different Normal Distributions”]Box Percentile Plot, Five Different Normal Distributions[/caption]