Let us explore the New York flights dataset, which ships in the nycflights13 package. I already have it installed; if you do not, you can install it with the command install.packages("nycflights13").
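If you rerun a script often, you may not want to reinstall the package every time. The sketch below installs a package only when it is missing. The helper name ensure_package is made up for this illustration; requireNamespace and install.packages are base R functions. It assumes a CRAN mirror is configured whenever an install is actually needed.

```r
# Hypothetical helper (not part of any package): install only if missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # assumes a CRAN mirror is configured
  }
  # Report whether the package can be loaded now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call returns TRUE without installing anything.
ok <- ensure_package("stats")
print(ok)
```

Calling ensure_package("nycflights13") would perform the install step described above only when the package is not yet available.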
2015-02-4
Package
> library(nycflights13)
The library command loads the package into the current session.
Data Inspection
> dim(flights)
The dim command gives the dimensions of the dataset: the number of observations (rows) followed by the number of variables (columns).
[1] 336776 16
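To make the link between dim and the row and column counts concrete, here is a minimal sketch. So that it runs without the package, it uses a three-row stand-in data frame; sample_flights is a made-up name, and its values are taken from the first three flights of this dataset.

```r
# Stand-in data frame (assumption): three flights, four of the 16 variables.
sample_flights <- data.frame(
  carrier  = c("UA", "UA", "AA"),
  origin   = c("EWR", "LGA", "JFK"),
  dest     = c("IAH", "IAH", "MIA"),
  distance = c(1400, 1416, 1089),
  stringsAsFactors = FALSE
)

d     <- dim(sample_flights)   # c(rows, columns)
n_obs <- nrow(sample_flights)  # same as d[1]
n_var <- ncol(sample_flights)  # same as d[2]
print(d)
```

With the package loaded, nrow(flights) and ncol(flights) would return the two numbers shown in the dim output above.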
> head(flights)
The head command shows the first few observations (six by default).
  year month day dep_time dep_delay arr_time arr_delay carrier tailnum
1 2013     1   1      517         2      830        11      UA  N14228
2 2013     1   1      533         4      850        20      UA  N24211
3 2013     1   1      542         2      923        33      AA  N619AA
4 2013     1   1      544        -1     1004       -18      B6  N804JB
5 2013     1   1      554        -6      812       -25      DL  N668DN
6 2013     1   1      554        -4      740        12      UA  N39463
  flight origin dest air_time distance hour minute
1   1545    EWR  IAH      227     1400    5     17
2   1714    LGA  IAH      227     1416    5     33
3   1141    JFK  MIA      160     1089    5     42
4    725    JFK  BQN      183     1576    5     44
5    461    LGA  ATL      116      762    5     54
6   1696    EWR  ORD      150      719    5     54
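head also accepts an n argument when six rows are too many or too few. The sketch below uses a small stand-in data frame (sample_flights is a made-up name; the values are copied from the six rows above) so that it runs without the package.

```r
# Stand-in data frame (assumption): three columns of the six rows shown above.
sample_flights <- data.frame(
  flight = c(1545, 1714, 1141, 725, 461, 1696),
  origin = c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR"),
  dest   = c("IAH", "IAH", "MIA", "BQN", "ATL", "ORD"),
  stringsAsFactors = FALSE
)

# A positive n keeps the first n rows.
first_two <- head(sample_flights, n = 2)

# A negative n keeps everything except the last |n| rows.
all_but_last_four <- head(sample_flights, n = -4)

print(first_two)
```

On the full dataset the same call would be head(flights, n = 2).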
> tail(flights)
The tail command shows the last few observations.
Together, head and tail let us check whether the data has loaded properly.
       year month day dep_time dep_delay arr_time arr_delay carrier
336771 2013     9  30       NA        NA       NA        NA      EV
336772 2013     9  30       NA        NA       NA        NA      9E
336773 2013     9  30       NA        NA       NA        NA      9E
336774 2013     9  30       NA        NA       NA        NA      MQ
336775 2013     9  30       NA        NA       NA        NA      MQ
336776 2013     9  30       NA        NA       NA        NA      MQ
       tailnum flight origin dest air_time distance hour minute
336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
336772           3393    JFK  DCA       NA      213   NA     NA
336773           3525    LGA  SYR       NA      198   NA     NA
336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA
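The NA entries in the last rows are missing values. As a rough sketch of how such rows can be flagged and set aside, the snippet below mixes three rows from the head output with three rows from the tail output in a stand-in data frame; sample_flights and the other variable names are made up for this illustration.

```r
# Stand-in data frame (assumption): rows 1-3 and rows 336774-336776 from above.
sample_flights <- data.frame(
  dep_time = c(517, 533, 542, NA, NA, NA),
  carrier  = c("UA", "UA", "AA", "MQ", "MQ", "MQ"),
  distance = c(1400, 1416, 1089, 764, 419, 431),
  stringsAsFactors = FALSE
)

# TRUE where dep_time is missing.
missing_dep <- is.na(sample_flights$dep_time)

# Number of rows with a missing dep_time.
n_missing <- sum(missing_dep)

# Keep only the rows where dep_time is present.
with_dep_time <- sample_flights[!missing_dep, ]

print(n_missing)
```

On the full dataset, sum(is.na(flights$dep_time)) would count the missing departure times in the same way.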
> str(flights)
The str command shows the structure of the dataset: the name and type of each variable, along with its first few values.
Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
$ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
$ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
$ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
$ carrier : chr "UA" "UA" "AA" "B6" ...
$ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
$ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
$ origin : chr "EWR" "LGA" "JFK" "JFK" ...
$ dest : chr "IAH" "IAH" "MIA" "BQN" ...
$ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
$ distance : num 1400 1416 1089 1576 762 ...
$ hour : num 5 5 5 5 5 5 5 5 5 5 ...
$ minute : num 17 33 42 44 54 54 55 57 57 58 ...
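When only the variable types are needed, they can be collected into a named vector instead of being read off the str output. This is a minimal sketch on a two-row stand-in data frame (sample_flights is a made-up name; the values come from the first two flights above). Note that str abbreviates the types as int, num and chr, while class spells them out.

```r
# Stand-in data frame (assumption): one integer, one numeric, one character column.
sample_flights <- data.frame(
  year      = c(2013L, 2013L),
  dep_delay = c(2, 4),
  carrier   = c("UA", "UA"),
  stringsAsFactors = FALSE
)

# Apply class() to every column; the result is a named character vector.
col_types <- sapply(sample_flights, class)
print(col_types)
```

The same call on the full dataset would be sapply(flights, class).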
> summary(flights)
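summary gives the minimum, maximum, mean, median and quartiles of each numeric variable, plus a count of missing values (NA's) where there are any. For character variables it reports only the length, class and mode.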
      year          month             day           dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907
 Median :2013   Median : 7.000   Median :16.00   Median :1401
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400
                                                 NA's   :8255
   dep_delay          arr_time       arr_delay           carrier
 Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character
 Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character
 Mean   :  12.64   Mean   :1502   Mean   :   6.895
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000
 Max.   :1301.00   Max.   :2400   Max.   :1272.000
 NA's   :8255      NA's   :8713   NA's   :9430
   tailnum              flight        origin              dest
 Length:336776      Min.   :   1   Length:336776      Length:336776
 Class :character   1st Qu.: 553   Class :character   Class :character
 Mode  :character   Median :1496   Mode  :character   Mode  :character
                    Mean   :1972
                    3rd Qu.:3465
                    Max.   :8500
    air_time        distance         hour           minute
 Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00
 1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00
 Median :129.0   Median : 872   Median :14.00   Median :31.00
 Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76
 3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00
 Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00
 NA's   :9430                   NA's   :8255    NA's   :8255
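summary leaves the missing values out of its statistics and reports them on the NA's line, but functions such as mean do not do this by default. The sketch below shows the difference on a stand-in data frame (sample_flights is a made-up name; the delays are the six head rows plus one all-NA row like those in the tail output).

```r
# Stand-in data frame (assumption): six observed delays and one missing row.
sample_flights <- data.frame(
  dep_delay = c(2, 4, 2, -1, -6, -4, NA),
  arr_delay = c(11, 20, 33, -18, -25, 12, NA)
)

# Missing values per column, like the NA's line of summary().
na_counts <- colSums(is.na(sample_flights))

# mean() returns NA when any value is missing ...
mean_default <- mean(sample_flights$dep_delay)

# ... unless na.rm = TRUE drops the missing values first.
mean_no_na <- mean(sample_flights$dep_delay, na.rm = TRUE)  # -0.5

print(na_counts)
```

On the full dataset, mean(flights$dep_delay, na.rm = TRUE) should reproduce the mean that summary reports for dep_delay.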