GroupBy summary statistics in R

First we install and load few packages – “data.table”, “Hmisc” and “doBy”. Here, we require data.table for its data input and data manipulation techniques, Hmisc for its data analysis techniques and doBy for its grouping and data analysis techniques. After loading the libraries/packages, we input the diamonds data using fread() and find the number of observations in the dataset using nrow().

library(data.table)
library(Hmisc)

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: ‘Hmisc’
##
## The following objects are masked from ‘package:base’:
##
##     format.pval, round.POSIXt, trunc.POSIXt, units

library(doBy)
diamonds = fread("diamonds.csv")
nrow(diamonds)

## [1] 53940

str() gives the basic structure of objects.

str(diamonds)

## Classes ‘data.table’ and ‘data.frame’:   53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 …
##  $ cut    : chr  “Ideal” “Premium” “Good” “Premium” …
##  $ color  : chr  “E” “E” “E” “I” …
##  $ clarity: chr  “SI2” “SI1” “VS1” “VS2” …
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 …
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 …
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 …
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 …
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 …
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 …
##  – attr(*, “.internal.selfref”)=<externalptr>

head() and tail() show the top six and bottom six observations, respectively.

head(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

tail(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 2:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 3:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 4:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 5:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 6:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

summary() provides the minimum, maximum, mean, median, 1st and 3rd quartiles for numeric variables and length(count) and type for character variables.

summary(diamonds)

##      carat            cut               color             clarity
##  Min.   :0.2000   Length:53940       Length:53940       Length:53940
##  1st Qu.:0.4000   Class :character   Class :character   Class :character
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character
##  Mean   :0.7979
##  3rd Qu.:1.0400
##  Max.   :5.0100
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800

describe() from Hmisc package provides number of observations, number of missing values, number of unique values, mean, percentiles in multiples of 5 and lowest and highest values(5 each) for each numeric variable.

For each character variable describe() displays frequency count and frequency ratio for each of its categories.

describe(diamonds)

## diamonds
##
##  10  Variables      53940  Observations
## —————————————————————————
## carat
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     273       1  0.7979    0.30    0.31    0.40    0.70
##     .75     .90     .95
##    1.04    1.51    1.70
##
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## —————————————————————————
## cut
##       n missing  unique
##   53940       0       5
##
##           Fair Good Ideal Premium Very Good
## Frequency 1610 4906 21551   13791     12082
## %            3    9    40      26        22
## —————————————————————————
## color
##       n missing  unique
##   53940       0       7
##
##              D    E    F     G    H    I    J
## Frequency 6775 9797 9542 11292 8304 5422 2808
## %           13   18   18    21   15   10    5
## —————————————————————————
## clarity
##       n missing  unique
##   53940       0       8
##
##            I1   IF   SI1  SI2  VS1   VS2 VVS1 VVS2
## Frequency 741 1790 13065 9194 8171 12258 3655 5066
## %           1    3    24   17   15    23    7    9
## —————————————————————————
## depth
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     184       1   61.75    59.3    60.0    61.0    61.8
##     .75     .90     .95
##    62.5    63.3    63.8
##
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## —————————————————————————
## table
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     127    0.98   57.46      54      55      56      57
##     .75     .90     .95
##      59      60      61
##
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## —————————————————————————
## price
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0   11602       1    3933     544     646     950    2401
##     .75     .90     .95
##    5324    9821   13107
##
## lowest :   326   327   334   335   336
## highest: 18803 18804 18806 18818 18823
## —————————————————————————
## x
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     554       1   5.731    4.29    4.36    4.71    5.70
##     .75     .90     .95
##    6.54    7.31    7.66
##
## lowest :  0.00  3.73  3.74  3.76  3.77
## highest: 10.01 10.02 10.14 10.23 10.74
## —————————————————————————
## y
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     552       1   5.735    4.30    4.36    4.72    5.71
##     .75     .90     .95
##    6.54    7.30    7.65
##
## lowest :  0.00  3.68  3.71  3.72  3.73
## highest: 10.10 10.16 10.54 31.80 58.90
## —————————————————————————
## z
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     375       1   3.539    2.65    2.69    2.91    3.53
##     .75     .90     .95
##    4.04    4.52    4.73
##
## lowest :  0.00  1.07  1.41  1.53  2.06
## highest:  6.43  6.72  6.98  8.06 31.80
## —————————————————————————

summaryBy() is present in doBY package. It provides summary statistics for numeric variables by group/category. Here, we find mean of price for each category of cut.

summaryBy(price ~ cut, data = diamonds, FUN = mean)

##          cut price.mean
## 1:      Fair   4358.758
## 2:      Good   3928.864
## 3:     Ideal   3457.542
## 4:   Premium   4584.258
## 5: Very Good   3981.760

Next, we find mean of price and cut for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = mean)

##           cut color price.mean carat.mean
##  1:      Fair     D   4291.061  0.9201227
##  2:      Fair     E   3682.312  0.8566071
##  3:      Fair     F   3827.003  0.9047115
##  4:      Fair     G   4239.255  1.0238217
##  5:      Fair     H   5135.683  1.2191749
##  6:      Fair     I   4685.446  1.1980571
##  7:      Fair     J   4975.655  1.3411765
##  8:      Good     D   3405.382  0.7445166
##  9:      Good     E   3423.644  0.7451340
## 10:      Good     F   3495.750  0.7759296
## 11:      Good     G   4123.482  0.8508955
## 12:      Good     H   4276.255  0.9147293
## 13:      Good     I   5078.533  1.0572222
## 14:      Good     J   4574.173  1.0995440
## 15:     Ideal     D   2629.095  0.5657657
## 16:     Ideal     E   2597.550  0.5784012
## 17:     Ideal     F   3374.939  0.6558285
## 18:     Ideal     G   3720.706  0.7007146
## 19:     Ideal     H   3889.335  0.7995249
## 20:     Ideal     I   4451.970  0.9130291
## 21:     Ideal     J   4918.186  1.0635937
## 22:   Premium     D   3631.293  0.7215471
## 23:   Premium     E   3538.914  0.7177450
## 24:   Premium     F   4324.890  0.8270356
## 25:   Premium     G   4500.742  0.8414877
## 26:   Premium     H   5216.707  1.0164492
## 27:   Premium     I   5946.181  1.1449370
## 28:   Premium     J   6294.592  1.2930941
## 29: Very Good     D   3470.467  0.6964243
## 30: Very Good     E   3214.652  0.6763167
## 31: Very Good     F   3778.820  0.7409612
## 32: Very Good     G   3872.754  0.7667986
## 33: Very Good     H   4535.390  0.9159485
## 34: Very Good     I   5255.880  1.0469518
## 35: Very Good     J   5103.513  1.1332153
##           cut color price.mean carat.mean

Finally, we find frequency/count, mean, median and standard deviation for price and carat for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = function(x)c(count = length(x), mean = mean(x), median = median(x), sd = sd(x)))

##           cut color price.count price.mean price.median price.sd
##  1:      Fair     D         163   4291.061       3730.0 3286.114
##  2:      Fair     E         224   3682.312       2956.0 2976.652
##  3:      Fair     F         312   3827.003       3035.0 3223.303
##  4:      Fair     G         314   4239.255       3057.0 3609.644
##  5:      Fair     H         303   5135.683       3816.0 3886.482
##  6:      Fair     I         175   4685.446       3246.0 3730.271
##  7:      Fair     J         119   4975.655       3302.0 4050.459
##  8:      Good     D         662   3405.382       2728.5 3175.149
##  9:      Good     E         933   3423.644       2420.0 3330.702
## 10:      Good     F         909   3495.750       2647.0 3202.411
## 11:      Good     G         871   4123.482       3340.0 3702.505
## 12:      Good     H         702   4276.255       3468.5 4020.660
## 13:      Good     I         522   5078.533       3639.5 4631.702
## 14:      Good     J         307   4574.173       3733.0 3707.791
## 15:     Ideal     D        2834   2629.095       1576.0 3001.070
## 16:     Ideal     E        3903   2597.550       1437.0 2956.007
## 17:     Ideal     F        3826   3374.939       1775.0 3766.635
## 18:     Ideal     G        4884   3720.706       1857.5 4006.262
## 19:     Ideal     H        3115   3889.335       2278.0 4013.375
## 20:     Ideal     I        2093   4451.970       2659.0 4505.150
## 21:     Ideal     J         896   4918.186       4096.0 4476.207
## 22:   Premium     D        1603   3631.293       2009.0 3711.634
## 23:   Premium     E        2337   3538.914       1928.0 3794.987
## 24:   Premium     F        2331   4324.890       2841.0 4012.023
## 25:   Premium     G        2924   4500.742       2745.0 4356.571
## 26:   Premium     H        2360   5216.707       4511.0 4466.190
## 27:   Premium     I        1428   5946.181       4640.0 5053.746
## 28:   Premium     J         808   6294.592       5063.0 4788.937
## 29: Very Good     D        1513   3470.467       2310.0 3523.753
## 30: Very Good     E        2400   3214.652       1989.5 3408.024
## 31: Very Good     F        2164   3778.820       2471.0 3786.124
## 32: Very Good     G        2299   3872.754       2437.0 3861.375
## 33: Very Good     H        1824   4535.390       3734.0 4185.798
## 34: Very Good     I        1204   5255.880       3888.0 4687.105
## 35: Very Good     J         678   5103.513       4113.0 4135.653
##           cut color price.count price.mean price.median price.sd
##     carat.count carat.mean carat.median  carat.sd
##  1:         163  0.9201227        0.900 0.4054185
##  2:         224  0.8566071        0.900 0.3645848
##  3:         312  0.9047115        0.900 0.4188899
##  4:         314  1.0238217        0.980 0.4927241
##  5:         303  1.2191749        1.010 0.5482389
##  6:         175  1.1980571        1.010 0.5219776
##  7:         119  1.3411765        1.030 0.7339713
##  8:         662  0.7445166        0.700 0.3631169
##  9:         933  0.7451340        0.700 0.3808900
## 10:         909  0.7759296        0.710 0.3700142
## 11:         871  0.8508955        0.900 0.4327176
## 12:         702  0.9147293        0.900 0.4977162
## 13:         522  1.0572222        1.000 0.5756366
## 14:         307  1.0995440        1.020 0.5371248
## 15:        2834  0.5657657        0.500 0.2993503
## 16:        3903  0.5784012        0.500 0.3125406
## 17:        3826  0.6558285        0.530 0.3745245
## 18:        4884  0.7007146        0.540 0.4106182
## 19:        3115  0.7995249        0.700 0.4868741
## 20:        2093  0.9130291        0.740 0.5537277
## 21:         896  1.0635937        1.030 0.5821001
## 22:        1603  0.7215471        0.580 0.3974635
## 23:        2337  0.7177450        0.580 0.4097847
## 24:        2331  0.8270356        0.760 0.4201959
## 25:        2924  0.8414877        0.755 0.4795344
## 26:        2360  1.0164492        1.010 0.5440777
## 27:        1428  1.1449370        1.140 0.6136041
## 28:         808  1.2930941        1.250 0.6137086
## 29:        1513  0.6964243        0.610 0.3692291
## 30:        2400  0.6763167        0.570 0.3779140
## 31:        2164  0.7409612        0.700 0.3888827
## 32:        2299  0.7667986        0.700 0.4180156
## 33:        1824  0.9159485        0.900 0.5029465
## 34:        1204  1.0469518        1.005 0.5519840
## 35:         678  1.1332153        1.060 0.5559197
##     carat.count carat.mean carat.median  carat.sd

New York Flights Dataset

Let us understand the New York Flights Dataset. This dataset is available in the package called nycflights13. I already have it installed. You can install the package by the command install.packages(“nycflights13”).

2015-02-4

Package

> library(nycflights13)

library command loads the package

Data Inspection

> dim(flights)

dim command gives the dimensions of the dataset = no. of observation * no. of variables

[1] 336776     16
> head(flights)

head gives the first few observations

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1 2013     1   1      517         2      830        11      UA  N14228
2 2013     1   1      533         4      850        20      UA  N24211
3 2013     1   1      542         2      923        33      AA  N619AA
4 2013     1   1      544        -1     1004       -18      B6  N804JB
5 2013     1   1      554        -6      812       -25      DL  N668DN
6 2013     1   1      554        -4      740        12      UA  N39463
  flight origin dest air_time distance hour minute
1   1545    EWR  IAH      227     1400    5     17
2   1714    LGA  IAH      227     1416    5     33
3   1141    JFK  MIA      160     1089    5     42
4    725    JFK  BQN      183     1576    5     44
5    461    LGA  ATL      116      762    5     54
6   1696    EWR  ORD      150      719    5     54
> tail(flights)

tail gives the last few observations

With the help of head and tail  commands we inspect whether the data has loaded properly or not.

       year month day dep_time dep_delay arr_time arr_delay carrier
336771 2013     9  30       NA        NA       NA        NA      EV
336772 2013     9  30       NA        NA       NA        NA      9E
336773 2013     9  30       NA        NA       NA        NA      9E
336774 2013     9  30       NA        NA       NA        NA      MQ
336775 2013     9  30       NA        NA       NA        NA      MQ
336776 2013     9  30       NA        NA       NA        NA      MQ
       tailnum flight origin dest air_time distance hour minute
336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
336772           3393    JFK  DCA       NA      213   NA     NA
336773           3525    LGA  SYR       NA      198   NA     NA
336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA
> str(flights)

str command gives the structure of dataset. It briefs us about the variable names and variable types.

Classes 'tbl_df', 'tbl' and 'data.frame':   336776 obs. of  16 variables:
 $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
 $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
 $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier  : chr  "UA" "UA" "AA" "B6" ...
 $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
 $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
 $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
 $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
 $ distance : num  1400 1416 1089 1576 762 ...
 $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
 $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...
> summary(flights)

summary gives the minimum and maximum values, mean and median and quartiles of all variables.

      year          month             day           dep_time   
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
 Median :2013   Median : 7.000   Median :16.00   Median :1401  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
                                                 NA's   :8255  
   dep_delay          arr_time      arr_delay          carrier         
 Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776     
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character  
 Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character  
 Mean   :  12.64   Mean   :1502   Mean   :   6.895                     
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000                     
 Max.   :1301.00   Max.   :2400   Max.   :1272.000                     
 NA's   :8255      NA's   :8713   NA's   :9430                         
   tailnum              flight        origin              dest          
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        

    air_time        distance         hour           minute     
 Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
 Median :129.0   Median : 872   Median :14.00   Median :31.00  
 Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76  
 3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
 Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00  
 NA's   :9430                   NA's   :8255    NA's   :8255