GroupBy summary statistics in R

First we install and load few packages – “data.table”, “Hmisc” and “doBy”. Here, we require data.table for its data input and data manipulation techniques, Hmisc for its data analysis techniques and doBy for its grouping and data analysis techniques. After loading the libraries/packages, we input the diamonds data using fread() and find the number of observations in the dataset using nrow().

library(data.table)
library(Hmisc)

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: ‘Hmisc’
##
## The following objects are masked from ‘package:base’:
##
##     format.pval, round.POSIXt, trunc.POSIXt, units

library(doBy)
diamonds = fread("diamonds.csv")
nrow(diamonds)

## [1] 53940

str() gives the basic structure of objects.

str(diamonds)

## Classes ‘data.table’ and ‘data.frame’:   53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 …
##  $ cut    : chr  “Ideal” “Premium” “Good” “Premium” …
##  $ color  : chr  “E” “E” “E” “I” …
##  $ clarity: chr  “SI2” “SI1” “VS1” “VS2” …
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 …
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 …
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 …
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 …
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 …
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 …
##  – attr(*, “.internal.selfref”)=<externalptr>

head() and tail() show the top six and bottom six observations, respectively.

head(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

tail(diamonds)

##    carat       cut color clarity depth table price    x    y    z
## 1:  0.72   Premium     D     SI1  62.7    59  2757 5.69 5.73 3.58
## 2:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
## 3:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
## 4:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
## 5:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
## 6:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

summary() provides the minimum, maximum, mean, median, 1st and 3rd quartiles for numeric variables and length(count) and type for character variables.

summary(diamonds)

##      carat            cut               color             clarity
##  Min.   :0.2000   Length:53940       Length:53940       Length:53940
##  1st Qu.:0.4000   Class :character   Class :character   Class :character
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character
##  Mean   :0.7979
##  3rd Qu.:1.0400
##  Max.   :5.0100
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800

describe() from Hmisc package provides number of observations, number of missing values, number of unique values, mean, percentiles in multiples of 5 and lowest and highest values(5 each) for each numeric variable.

For each character variable describe() displays frequency count and frequency ratio for each of its categories.

describe(diamonds)

## diamonds
##
##  10  Variables      53940  Observations
## —————————————————————————
## carat
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     273       1  0.7979    0.30    0.31    0.40    0.70
##     .75     .90     .95
##    1.04    1.51    1.70
##
## lowest : 0.20 0.21 0.22 0.23 0.24, highest: 4.00 4.01 4.13 4.50 5.01
## —————————————————————————
## cut
##       n missing  unique
##   53940       0       5
##
##           Fair Good Ideal Premium Very Good
## Frequency 1610 4906 21551   13791     12082
## %            3    9    40      26        22
## —————————————————————————
## color
##       n missing  unique
##   53940       0       7
##
##              D    E    F     G    H    I    J
## Frequency 6775 9797 9542 11292 8304 5422 2808
## %           13   18   18    21   15   10    5
## —————————————————————————
## clarity
##       n missing  unique
##   53940       0       8
##
##            I1   IF   SI1  SI2  VS1   VS2 VVS1 VVS2
## Frequency 741 1790 13065 9194 8171 12258 3655 5066
## %           1    3    24   17   15    23    7    9
## —————————————————————————
## depth
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     184       1   61.75    59.3    60.0    61.0    61.8
##     .75     .90     .95
##    62.5    63.3    63.8
##
## lowest : 43.0 44.0 50.8 51.0 52.2, highest: 72.2 72.9 73.6 78.2 79.0
## —————————————————————————
## table
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     127    0.98   57.46      54      55      56      57
##     .75     .90     .95
##      59      60      61
##
## lowest : 43.0 44.0 49.0 50.0 50.1, highest: 71.0 73.0 76.0 79.0 95.0
## —————————————————————————
## price
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0   11602       1    3933     544     646     950    2401
##     .75     .90     .95
##    5324    9821   13107
##
## lowest :   326   327   334   335   336
## highest: 18803 18804 18806 18818 18823
## —————————————————————————
## x
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     554       1   5.731    4.29    4.36    4.71    5.70
##     .75     .90     .95
##    6.54    7.31    7.66
##
## lowest :  0.00  3.73  3.74  3.76  3.77
## highest: 10.01 10.02 10.14 10.23 10.74
## —————————————————————————
## y
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     552       1   5.735    4.30    4.36    4.72    5.71
##     .75     .90     .95
##    6.54    7.30    7.65
##
## lowest :  0.00  3.68  3.71  3.72  3.73
## highest: 10.10 10.16 10.54 31.80 58.90
## —————————————————————————
## z
##       n missing  unique    Info    Mean     .05     .10     .25     .50
##   53940       0     375       1   3.539    2.65    2.69    2.91    3.53
##     .75     .90     .95
##    4.04    4.52    4.73
##
## lowest :  0.00  1.07  1.41  1.53  2.06
## highest:  6.43  6.72  6.98  8.06 31.80
## —————————————————————————

summaryBy() is present in doBY package. It provides summary statistics for numeric variables by group/category. Here, we find mean of price for each category of cut.

summaryBy(price ~ cut, data = diamonds, FUN = mean)

##          cut price.mean
## 1:      Fair   4358.758
## 2:      Good   3928.864
## 3:     Ideal   3457.542
## 4:   Premium   4584.258
## 5: Very Good   3981.760

Next, we find mean of price and cut for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = mean)

##           cut color price.mean carat.mean
##  1:      Fair     D   4291.061  0.9201227
##  2:      Fair     E   3682.312  0.8566071
##  3:      Fair     F   3827.003  0.9047115
##  4:      Fair     G   4239.255  1.0238217
##  5:      Fair     H   5135.683  1.2191749
##  6:      Fair     I   4685.446  1.1980571
##  7:      Fair     J   4975.655  1.3411765
##  8:      Good     D   3405.382  0.7445166
##  9:      Good     E   3423.644  0.7451340
## 10:      Good     F   3495.750  0.7759296
## 11:      Good     G   4123.482  0.8508955
## 12:      Good     H   4276.255  0.9147293
## 13:      Good     I   5078.533  1.0572222
## 14:      Good     J   4574.173  1.0995440
## 15:     Ideal     D   2629.095  0.5657657
## 16:     Ideal     E   2597.550  0.5784012
## 17:     Ideal     F   3374.939  0.6558285
## 18:     Ideal     G   3720.706  0.7007146
## 19:     Ideal     H   3889.335  0.7995249
## 20:     Ideal     I   4451.970  0.9130291
## 21:     Ideal     J   4918.186  1.0635937
## 22:   Premium     D   3631.293  0.7215471
## 23:   Premium     E   3538.914  0.7177450
## 24:   Premium     F   4324.890  0.8270356
## 25:   Premium     G   4500.742  0.8414877
## 26:   Premium     H   5216.707  1.0164492
## 27:   Premium     I   5946.181  1.1449370
## 28:   Premium     J   6294.592  1.2930941
## 29: Very Good     D   3470.467  0.6964243
## 30: Very Good     E   3214.652  0.6763167
## 31: Very Good     F   3778.820  0.7409612
## 32: Very Good     G   3872.754  0.7667986
## 33: Very Good     H   4535.390  0.9159485
## 34: Very Good     I   5255.880  1.0469518
## 35: Very Good     J   5103.513  1.1332153
##           cut color price.mean carat.mean

Finally, we find frequency/count, mean, median and standard deviation for price and carat for each combination of cut and color categories.

summaryBy(price + carat ~ cut + color, data = diamonds, FUN = function(x)c(count = length(x), mean = mean(x), median = median(x), sd = sd(x)))

##           cut color price.count price.mean price.median price.sd
##  1:      Fair     D         163   4291.061       3730.0 3286.114
##  2:      Fair     E         224   3682.312       2956.0 2976.652
##  3:      Fair     F         312   3827.003       3035.0 3223.303
##  4:      Fair     G         314   4239.255       3057.0 3609.644
##  5:      Fair     H         303   5135.683       3816.0 3886.482
##  6:      Fair     I         175   4685.446       3246.0 3730.271
##  7:      Fair     J         119   4975.655       3302.0 4050.459
##  8:      Good     D         662   3405.382       2728.5 3175.149
##  9:      Good     E         933   3423.644       2420.0 3330.702
## 10:      Good     F         909   3495.750       2647.0 3202.411
## 11:      Good     G         871   4123.482       3340.0 3702.505
## 12:      Good     H         702   4276.255       3468.5 4020.660
## 13:      Good     I         522   5078.533       3639.5 4631.702
## 14:      Good     J         307   4574.173       3733.0 3707.791
## 15:     Ideal     D        2834   2629.095       1576.0 3001.070
## 16:     Ideal     E        3903   2597.550       1437.0 2956.007
## 17:     Ideal     F        3826   3374.939       1775.0 3766.635
## 18:     Ideal     G        4884   3720.706       1857.5 4006.262
## 19:     Ideal     H        3115   3889.335       2278.0 4013.375
## 20:     Ideal     I        2093   4451.970       2659.0 4505.150
## 21:     Ideal     J         896   4918.186       4096.0 4476.207
## 22:   Premium     D        1603   3631.293       2009.0 3711.634
## 23:   Premium     E        2337   3538.914       1928.0 3794.987
## 24:   Premium     F        2331   4324.890       2841.0 4012.023
## 25:   Premium     G        2924   4500.742       2745.0 4356.571
## 26:   Premium     H        2360   5216.707       4511.0 4466.190
## 27:   Premium     I        1428   5946.181       4640.0 5053.746
## 28:   Premium     J         808   6294.592       5063.0 4788.937
## 29: Very Good     D        1513   3470.467       2310.0 3523.753
## 30: Very Good     E        2400   3214.652       1989.5 3408.024
## 31: Very Good     F        2164   3778.820       2471.0 3786.124
## 32: Very Good     G        2299   3872.754       2437.0 3861.375
## 33: Very Good     H        1824   4535.390       3734.0 4185.798
## 34: Very Good     I        1204   5255.880       3888.0 4687.105
## 35: Very Good     J         678   5103.513       4113.0 4135.653
##           cut color price.count price.mean price.median price.sd
##     carat.count carat.mean carat.median  carat.sd
##  1:         163  0.9201227        0.900 0.4054185
##  2:         224  0.8566071        0.900 0.3645848
##  3:         312  0.9047115        0.900 0.4188899
##  4:         314  1.0238217        0.980 0.4927241
##  5:         303  1.2191749        1.010 0.5482389
##  6:         175  1.1980571        1.010 0.5219776
##  7:         119  1.3411765        1.030 0.7339713
##  8:         662  0.7445166        0.700 0.3631169
##  9:         933  0.7451340        0.700 0.3808900
## 10:         909  0.7759296        0.710 0.3700142
## 11:         871  0.8508955        0.900 0.4327176
## 12:         702  0.9147293        0.900 0.4977162
## 13:         522  1.0572222        1.000 0.5756366
## 14:         307  1.0995440        1.020 0.5371248
## 15:        2834  0.5657657        0.500 0.2993503
## 16:        3903  0.5784012        0.500 0.3125406
## 17:        3826  0.6558285        0.530 0.3745245
## 18:        4884  0.7007146        0.540 0.4106182
## 19:        3115  0.7995249        0.700 0.4868741
## 20:        2093  0.9130291        0.740 0.5537277
## 21:         896  1.0635937        1.030 0.5821001
## 22:        1603  0.7215471        0.580 0.3974635
## 23:        2337  0.7177450        0.580 0.4097847
## 24:        2331  0.8270356        0.760 0.4201959
## 25:        2924  0.8414877        0.755 0.4795344
## 26:        2360  1.0164492        1.010 0.5440777
## 27:        1428  1.1449370        1.140 0.6136041
## 28:         808  1.2930941        1.250 0.6137086
## 29:        1513  0.6964243        0.610 0.3692291
## 30:        2400  0.6763167        0.570 0.3779140
## 31:        2164  0.7409612        0.700 0.3888827
## 32:        2299  0.7667986        0.700 0.4180156
## 33:        1824  0.9159485        0.900 0.5029465
## 34:        1204  1.0469518        1.005 0.5519840
## 35:         678  1.1332153        1.060 0.5559197
##     carat.count carat.mean carat.median  carat.sd

Advertisements

Creating a Dictionary in Python

Dictionary is similar to a list. It contains key:value pairs. Here, we will learn about creating and modifying dictionaries.
Creating Dictionary
dictionary_name = {key:value, key:value,…..}

We create a dictionary with colors of jams as keys and fruit names as values.

jam = {'red':'strawberry', 'yellow': 'mango', 'orange':'orange'}
print jam['orange']
print jam['red']

orange
strawberry

Insertion – Here, we add a new key/value pair to the existing dictionary.

jam['blue'] = 'blueberry'

{‘blue’: ‘blueberry’, ‘red’: ‘strawberry’, ‘yellow’: ‘mango’, ‘orange’: ‘orange’}

Deletion – Here, we delete a key/value pair

del jam['orange']
jam

{‘blue’: ‘blueberry’, ‘red’: ‘strawberry’, ‘yellow’: ‘mango’}

Replacing values – We replace value for key ‘red’ from strawberry to cherry.

jam ['red'] = 'cherry'
jam

{‘blue’: ‘blueberry’, ‘red’: ‘cherry’, ‘yellow’: ‘mango’}

Nesting : We nest a new dictionary within an existing dictionary.

 jam['red'] = {'light' : 'cherry', 'dark' : 'strawberry' }

{‘blue’: ‘blueberry’, ‘red’: {‘dark’: ‘strawberry’, ‘light’: ‘cherry’}, ‘yellow’: ‘mango’}
We set a list as a value for key ‘blue’.

jam['blue'] = ['blueberry','plum']
jam

{‘blue’: [‘blueberry’, ‘plum’], ‘red’: {‘dark’: ‘strawberry’, ‘light’: ‘cherry’}, ‘yellow’: ‘mango’}

Append to a list

jam['blue'].append('jamun')  
print jam['blue']

[‘blueberry’, ‘plum’, ‘jamun’]

Indexing the nested dictionary

print jam['red']['dark']

strawberry

jam

{‘blue’: [‘blueberry’, ‘plum’, ‘jamun’], ‘red’: {‘dark’: ‘strawberry’, ‘light’: ‘cherry’}, ‘yellow’: ‘mango’}