19 Summarizing Data
Let’s read in a dataset on heart disease from OpenML:
heart <- read.csv("https://www.openml.org/data/get_csv/51/dataset_51_heart-h",
  na.strings = "?",
  stringsAsFactors = TRUE)
One of the first things you might want to know is the size of the dataset:
dim(heart)
[1] 294 14
Since it does not contain too many columns, you can use str() to get the type of each column and a preview of some of the data:
str(heart)
'data.frame': 294 obs. of 14 variables:
$ age : int 28 29 29 30 31 32 32 32 33 34 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 2 1 1 1 2 2 2 1 ...
$ chest_pain: Factor w/ 4 levels "asympt","atyp_angina",..: 2 2 2 4 2 2 2 2 3 2 ...
$ trestbps : int 130 120 140 170 100 105 110 125 120 130 ...
$ chol : int 132 243 NA 237 219 198 225 254 298 161 ...
$ fbs : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
$ restecg : Factor w/ 3 levels "left_vent_hyper",..: 1 2 2 3 3 2 2 2 2 2 ...
$ thalach : int 185 160 170 170 150 165 184 155 185 190 ...
$ exang : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ oldpeak : num 0 0 0 0 0 0 0 0 0 0 ...
$ slope : Factor w/ 3 levels "down","flat",..: NA NA NA NA NA NA NA NA NA NA ...
$ ca : int NA NA NA NA NA NA NA NA NA NA ...
$ thal : Factor w/ 3 levels "fixed_defect",..: NA NA NA 1 NA NA NA NA NA NA ...
$ num : Factor w/ 2 levels "'<50'","'>50_1'": 1 1 1 1 1 1 1 1 1 1 ...
To look at the first few rows, use head() (6 rows by default):
head(heart)
age sex chest_pain trestbps chol fbs restecg thalach exang
1 28 male atyp_angina 130 132 f left_vent_hyper 185 no
2 29 male atyp_angina 120 243 f normal 160 no
3 29 male atyp_angina 140 NA f normal 170 no
4 30 female typ_angina 170 237 f st_t_wave_abnormality 170 no
5 31 female atyp_angina 100 219 f st_t_wave_abnormality 150 no
6 32 female atyp_angina 105 198 f normal 165 no
oldpeak slope ca thal num
1 0 <NA> NA <NA> '<50'
2 0 <NA> NA <NA> '<50'
3 0 <NA> NA <NA> '<50'
4 0 <NA> NA fixed_defect '<50'
5 0 <NA> NA <NA> '<50'
6 0 <NA> NA <NA> '<50'
The equivalent tail() prints the last few rows:
tail(heart)
age sex chest_pain trestbps chol fbs restecg thalach
289 52 male asympt 140 266 f normal 134
290 52 male asympt 160 331 f normal 94
291 54 female non_anginal 130 294 f st_t_wave_abnormality 100
292 56 male asympt 155 342 t normal 150
293 58 female atyp_angina 180 393 f normal 110
294 65 male asympt 130 275 f st_t_wave_abnormality 115
exang oldpeak slope ca thal num
289 yes 2.0 flat NA <NA> '>50_1'
290 yes 2.5 <NA> NA <NA> '>50_1'
291 yes 0.0 flat NA <NA> '>50_1'
292 yes 3.0 flat NA <NA> '>50_1'
293 yes 1.0 flat NA reversable_defect '>50_1'
294 yes 1.0 flat NA <NA> '>50_1'
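Both head() and tail() accept an n argument that changes how many rows are returned. The following is a minimal sketch; it assumes R's built-in iris dataset as a stand-in for heart so that it runs without downloading anything:

```r
# iris ships with R; it stands in for the heart data here
first3 <- head(iris, n = 3)   # first 3 rows instead of the default 6
last2 <- tail(iris, n = 2)    # last 2 rows

first3
last2

# A negative n drops rows instead: here, everything except the last 145 rows
all_but_last <- head(iris, n = -145)
nrow(all_but_last)
```

The same calls should work unchanged on heart, e.g. head(heart, n = 3).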
19.1 Get summary of an R object with summary()
R includes a summary() method for a number of different objects, including (of course) data.frames:
summary(heart)
age sex chest_pain trestbps chol
Min. :28.00 female: 81 asympt :123 Min. : 92.0 Min. : 85.0
1st Qu.:42.00 male :213 atyp_angina:106 1st Qu.:120.0 1st Qu.:209.0
Median :49.00 non_anginal: 54 Median :130.0 Median :243.0
Mean :47.83 typ_angina : 11 Mean :132.6 Mean :250.8
3rd Qu.:54.00 3rd Qu.:140.0 3rd Qu.:282.5
Max. :66.00 Max. :200.0 Max. :603.0
NA's :1 NA's :23
fbs restecg thalach exang
f :266 left_vent_hyper : 6 Min. : 82.0 no :204
t : 20 normal :235 1st Qu.:122.0 yes : 89
NA's: 8 st_t_wave_abnormality: 52 Median :140.0 NA's: 1
NA's : 1 Mean :139.1
3rd Qu.:155.0
Max. :190.0
NA's :1
oldpeak slope ca thal
Min. :0.0000 down: 1 Min. :0 fixed_defect : 10
1st Qu.:0.0000 flat: 91 1st Qu.:0 normal : 7
Median :0.0000 up : 12 Median :0 reversable_defect: 11
Mean :0.5861 NA's:190 Mean :0 NA's :266
3rd Qu.:1.0000 3rd Qu.:0
Max. :5.0000 Max. :0
NA's :291
num
'<50' :188
'>50_1':106
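summary() adapts its output to the class of its input, which is also why the numeric and factor columns above are summarized differently. The sketch below illustrates this on a few small objects; the vectors x and f are made-up examples, not part of the heart data:

```r
# Numeric vector with one missing value (made-up numbers)
x <- c(2.5, 4.1, NA, 3.3, 5.0)
sx <- summary(x)   # quantiles and mean, plus a count of NA's
sx

# Factor: summary() returns the number of cases in each level
f <- factor(c("a", "b", "a", "c", "a"))
sf <- summary(f)
sf

# A single data.frame column is summarized according to its own type
summary(iris$Sepal.Length)
summary(iris$Species)
```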
19.2 Fast builtin column and row operations
R has optimized built-in functions for some very common row and column operations. They have self-explanatory names and can be applied to matrices and data.frames:
- colSums(): column sums
- rowSums(): row sums
- colMeans(): column means
- rowMeans(): row means
a <- data.frame(matrix(1:20, nrow = 5))
a
X1 X2 X3 X4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
colSums(a)
[1] 15 40 65 90
rowSums(a)
[1] 34 38 42 46 50
colMeans(a)
[1] 3 8 13 18
rowMeans(a)
[1] 8.5 9.5 10.5 11.5 12.5
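These functions are also handy for summarizing missing data, like the NA counts that summary() reports. The sketch below counts missing values per column by combining colSums() with is.na(), and uses the na.rm argument to ignore them. The data.frame d is a small made-up example:

```r
# Small made-up data.frame with missing values (hypothetical data)
d <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, 6, 8),
  z = c(10, 20, 30, 40)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
n_missing <- colSums(is.na(d))
n_missing

# By default, a single NA in a column makes that column's mean NA
colMeans(d)

# na.rm = TRUE drops missing values before computing
col_means <- colMeans(d, na.rm = TRUE)
col_means
```

The same pattern, colSums(is.na(heart)), would give the number of missing values in each column of the heart data.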
19.3 See also
- aggregate() for grouped summary statistics.
- Loop Functions for applying any function on subsets of data.
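As a brief taste of grouped summaries, the sketch below uses the formula interface of aggregate(). It assumes the built-in iris dataset rather than heart so that it runs offline:

```r
# Mean sepal length within each species (formula: outcome ~ grouping variable)
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means

# Several outcome columns at once: combine them with cbind() on the left-hand side
medians <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species,
                     data = iris, FUN = median)
medians
```

The result is a data.frame with one row per group, so it can be passed on to any of the functions shown above.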