Visualizing and Comparing Distributions -- Part 8 of a Series

This is post #08 in a running series about plotting in R.

Last time, I talked about visualizing the Uniform, Normal, Exponential, and Poisson Distributions. However, there are more useful methods than just plotting the density and distribution functions.

Of course, you can always simply ask R to output summary statistics:

> n <- 10000
> d <- data.frame(unif=runif(n=n), norm=rnorm(n=n), exp=rexp(n=n),
+   pois=rpois(n, lambda=1))
>
> summary(d)
> summary(d)
unif                norm                exp                 pois
Min.   :2.449e-05   Min.   :-3.603527   Min.   :0.0001143   Min.   :0.000
1st Qu.:2.489e-01   1st Qu.:-0.657498   1st Qu.:0.2963731   1st Qu.:0.000
Median :4.991e-01   Median :-0.003342   Median :0.6927739   Median :1.000
Mean   :5.001e-01   Mean   : 0.008527   Mean   :0.9958194   Mean   :1.006
3rd Qu.:7.498e-01   3rd Qu.: 0.662084   3rd Qu.:1.3718103   3rd Qu.:2.000
Max.   :1.000e+00   Max.   : 3.512077   Max.   :9.4726291   Max.   :6.000

Or perhaps, we could be even more detailed and ask for more finely detailed order statistics. The quantile function will calculate arbitrary order statistics:

> quantile(d$unif, probs=seq(0,1,by=0.05))
0%           5%          10%          15%          20%          25%          30%          35%          40%          45%          50%          55% 
2.448517e-05 4.870828e-02 9.949762e-02 1.450521e-01 1.953495e-01 2.488873e-01 3.011948e-01 3.541679e-01 4.020943e-01 4.508098e-01 4.991203e-01 5.527021e-01 
60%          65%          70%          75%          80%          85%          90%          95%         100% 
6.038723e-01 6.540685e-01 6.990784e-01 7.498041e-01 7.989884e-01 8.480684e-01 8.982745e-01 9.496064e-01 9.999980e-01 
> t(t( quantile(d$unif, probs=seq(0,1,by=0.05)) ))
[,1]
0%   2.448517e-05
5%   4.870828e-02
10%  9.949762e-02
15%  1.450521e-01
20%  1.953495e-01
25%  2.488873e-01
30%  3.011948e-01
35%  3.541679e-01
40%  4.020943e-01
45%  4.508098e-01
50%  4.991203e-01
55%  5.527021e-01
60%  6.038723e-01
65%  6.540685e-01
70%  6.990784e-01
75%  7.498041e-01
80%  7.989884e-01
85%  8.480684e-01
90%  8.982745e-01
95%  9.496064e-01
100% 9.999980e-01
> 
> s <- seq(0,1,by=0.02)
> e <- sapply(d, FUN=function(x) quantile(x, probs=s))
> e <- as.data.frame(e)
> e[1:10,]
unif       norm          exp pois
0%  2.448517e-05 -3.6035270 0.0001142514    0
2%  1.900289e-02 -1.9884255 0.0226651895    0
4%  3.844486e-02 -1.7161347 0.0440634106    0
6%  6.012384e-02 -1.5041045 0.0634013971    0
8%  8.089568e-02 -1.3566198 0.0899735826    0
10% 9.949762e-02 -1.2407524 0.1103189153    0
12% 1.191770e-01 -1.1392377 0.1342348062    0
14% 1.369209e-01 -1.0483029 0.1590086026    0
16% 1.555910e-01 -0.9618975 0.1829207834    0
18% 1.748496e-01 -0.8799168 0.2056880019    0
>

Plotting the quantile values is worth a shot, just to see what we get:

>
> plot(x=s, y=e$unif, ylim=c(min(e), max(e)), type='b', pch='.', cex=3, lwd=2, col='black',
+     xlab='quantiles [0,100]', ylab='dist values', main='Quantile Plots, Four Distributions')
> for(j in 2:4){
+     points(x=s, y=e[,j], type='b', pch='.', cex=3, lwd=2, col=c('', 'red', 'blue', 'green')[j])
+ }
> legend(x=0, y=8, legend=colnames(e), col=c('black', 'red', 'blue', 'green'), lwd=3)
>

Quantile Plots, Four Distributions

Quantile quantile plots are typically used when you want to see if a sample is a particular distribution. You plot the quantiles of the sample versus the assumed distribution and compare; if they are the same distribution you should get a straight line at roughly a forty five degree angle.

> par(mfrow=c(1,2), oma=c(0,0,2,0))
> 
> qqplot(x=d$exp, y=d$pois, 
+     xlab='exponential', ylab='poisson', main = 'qq plot: different distributions')
> qqplot(x=d$norm, y=rnorm(n=n), 
+     xlab='our normal sample in d', ylab='new normal sample', main = 'qq plot: same distribution, resampled')
> 
> mtext('QQ plots -- different vs same distributions', side=3, outer=T)

QQ Plots - Different Distribution and Same Distribution

Box and whisker plots

> dev.set(which=1)
> boxplot(x=d, xlab='distributions', main='Distributions Visualization - boxplot')

Box and Whisker Plot, Four Distributions

Finally, for univariate distributions, I prefer box-percentile plots. These are similar to boxplots, but the width of the distribution graphs are proportional to the percent of observations more extreme in that direction. They are also marked at the 25th, 50th, and 75th percentiles.

> library(Hmisc)
>
> dev.set(which=1)
> bpplot(d, xlab='distributions', main='Distributions Visualization - box percentile plot')
>

Box-Percentile Plot, 4 Distributions

And if the distributions are genuinely different, let’s examine 5 univariate Normal Distributions: our already sampled N(0,1), a resampled N(0,1), N(1,1), N(0,3), N(-1,0.5). Side by side, the box percentile plot really draws out the differences in the distributions.

>
> d2 = data.frame(norm=d$norm, resample=rnorm(n), mean1=rnorm(n=n, mean=1), sd3=rnorm(n=n, sd=3), sdhalf=rnorm(n=n, sd=0.5, mean=-2))
> 
> bpplot(d2, xlab='distributions', main='Multiple Normal Distribution Visualization - box percentile plot')
>

Box Percentile Plot, Five Different Normal Distributions

Stochastic Nonsense

Put something smart here.

Visualizing and Comparing Distributions -- Part 8 of a Series