R dataset

A dataset is a data collection presented in a table.

The R programming language comes with many built-in datasets, which are commonly used as demo data to illustrate how R functions work.


Most Used Built-in Datasets in R

R offers many datasets to experiment with, but the most commonly used built-in datasets are:

  • airquality - New York Air Quality Measurements
  • AirPassengers - Monthly Airline Passenger Numbers 1949-1960
  • mtcars - Motor Trend Car Road Tests
  • iris - Edgar Anderson's Iris Data

These are a few of the most used built-in datasets. If you want to learn about other built-in datasets, please visit The R Datasets Package.
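The available datasets can also be listed from within R itself using the data() function. The snippet below is a minimal sketch, assuming a standard R installation in which the datasets package is present; the variable names available and dataset_names are only illustrative.

```r
# list the datasets shipped with the "datasets" package
available <- data(package = "datasets")

# the result holds one row per dataset: "Item" is the dataset name
# and "Title" is its short description
dataset_names <- available$results[, "Item"]
print(head(dataset_names))

# check that the four datasets named above are in the list
print(c("airquality", "AirPassengers", "mtcars", "iris") %in% dataset_names)
```

Running ?airquality in the R console opens the help page of a dataset, which describes its variables and source.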

In this tutorial, we will use the airquality dataset to demonstrate how to work with datasets in R.


Display R datasets

To display a dataset, we simply pass the name of the dataset to the print() function. For example,

# display airquality dataset
print(airquality)

Output

    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21

Here, only the first 21 rows of the output are shown. The airquality dataset has a total of 153 rows.

The dataset contains the New York air quality measurements.
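Because printing every row produces long output, it is often more convenient to look at only the beginning or the end of a dataset. The following is a small sketch using the head() and tail() functions; the row counts 5 and 3 are arbitrary choices for illustration.

```r
# show only the first 5 rows of the dataset
first_rows <- head(airquality, n = 5)
print(first_rows)

# show only the last 3 rows of the dataset
last_rows <- tail(airquality, n = 3)
print(last_rows)
```

If n is omitted, both functions show 6 rows by default.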


Get Information About a Dataset

In R, there are various functions we can use to get information about a dataset, such as its dimensions, the number of rows and columns, the names of its variables and so on. For example,

# use dim() to get dimension of dataset
cat("Dimension:",dim(airquality))

# use nrow() to get number of rows
cat("\nRow:",nrow(airquality))

# use ncol() to get number of columns
cat("\nColumn:",ncol(airquality))

# use names() to get the variable names of the dataset
cat("\nName of Variables:",names(airquality))

Output

Dimension: 153 6
Row: 153
Column: 6
Name of Variables: Ozone Solar.R Wind Temp Month Day

In the above example, we have used various functions to get information about the airquality dataset.

  • dim() - returns the dimensions of the dataset (rows, then columns) i.e. 153 6
  • nrow() - returns the number of rows (observations) i.e. 153
  • ncol() - returns the number of columns (variables) i.e. 6
  • names() - returns the names of all the variables
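The functions above can be complemented with str(), which gives a compact overview of a dataset in a single call. The following is a minimal sketch; variable_types is just an illustrative name.

```r
# str() prints the dimensions, the variable names, the type of each
# variable and its first few values
str(airquality)

# sapply() applies class() to every variable and returns the type of each
variable_types <- sapply(airquality, class)
print(variable_types)
```

In the airquality dataset, Wind is stored as numeric while the other five variables are stored as integer.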

Display Variable Values in R

To display all the values of a specific variable in R, we use the $ operator followed by the name of the variable. For example,

# display all values of Temp variable
print(airquality$Temp)

Output

  [1] 67 72 74 62 56 66 65 59 61 69 74 69 66 68 58 64 66 57 68 62 59 73 61 61 57
 [26] 58 57 67 81 79 76 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73
 [51] 76 77 76 76 76 75 78 73 80 77 83 84 85 81 84 83 83 88 92 92 89 82 73 81 91
 [76] 80 81 82 84 87 85 74 81 82 86 85 82 86 88 86 83 81 81 81 82 86 85 87 89 90
[101] 90 92 86 86 82 80 79 77 79 76 78 78 77 72 75 79 81 86 88 97 94 96 94 91 92
[126] 93 93 87 84 80 78 75 73 81 76 77 71 71 78 67 76 68 82 64 71 81 69 63 70 77
[151] 75 76 68

In the above example, we have used the $ operator followed by the name of a variable to select that variable from the dataset.

airquality$Temp

Here, airquality$Temp selects the Temp variable of the airquality dataset, and print() displays all 153 of its values.
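The $ operator is not the only way to select a variable. As a brief sketch, the code below shows two equivalent forms that use square brackets, and also how to display only part of a variable; the variable names are illustrative.

```r
# three equivalent ways to select the Temp variable
temp_dollar  <- airquality$Temp
temp_bracket <- airquality[["Temp"]]
temp_column  <- airquality[, "Temp"]

# all three return the same vector of values
print(identical(temp_dollar, temp_bracket))
print(identical(temp_dollar, temp_column))

# display only the first five values of Temp
first_five <- airquality$Temp[1:5]
print(first_five)
```

The bracket forms are useful when the variable name is stored in a string, which the $ operator cannot handle.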


Sort Variable Values in R

In R, we use the sort() function to sort the values of a variable in ascending order. The function returns a sorted copy and leaves the dataset itself unchanged. For example,

# sort values of Temp variable
sort(airquality$Temp)

Output

  [1] 56 57 57 57 58 58 59 59 61 61 61 62 62 63 64 64 65 65 66 66 66 67 67 67 67
 [26] 68 68 68 68 69 69 69 70 71 71 71 72 72 72 73 73 73 73 73 74 74 74 74 75 75
 [51] 75 75 76 76 76 76 76 76 76 76 76 77 77 77 77 77 77 77 78 78 78 78 78 78 79
 [76] 79 79 79 79 79 80 80 80 80 80 81 81 81 81 81 81 81 81 81 81 81 82 82 82 82
[101] 82 82 82 82 82 83 83 83 83 84 84 84 84 84 85 85 85 85 85 86 86 86 86 86 86
[126] 86 87 87 87 87 87 88 88 88 89 89 90 90 90 91 91 92 92 92 92 92 93 93 93 94
[151] 94 96 97
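The sort() function can also sort in descending order, and the related order() function can be used to sort the rows of the whole dataset by one variable. The following is a minimal sketch; temp_desc and sorted_rows are illustrative names.

```r
# sort the Temp values in descending order
temp_desc <- sort(airquality$Temp, decreasing = TRUE)
print(head(temp_desc))

# order() returns row positions, which we use to reorder
# the whole dataset by ascending Temp
sorted_rows <- airquality[order(airquality$Temp), ]
print(head(sorted_rows, n = 3))
```

Unlike sort(), which returns only the sorted values of one variable, indexing with order() keeps every row intact, so the other variables stay matched to their Temp values.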

Statistical Summary of Data in R

We use the summary() function to get a statistical summary of a dataset or of a single variable.

For a numeric variable, the summary() function returns six statistics:

  • Minimum
  • First Quartile
  • Median
  • Mean
  • Third Quartile
  • Maximum

Let's take a look at an example,

# get statistical summary of Temp variable
summary(airquality$Temp)

Output

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  56.00   72.00   79.00   77.88   85.00   97.00 

In the above example, we have used the summary() function to get a statistical summary of the Temp variable of the airquality dataset.

Here,

  • Min. - is the minimum value i.e. 56.00
  • 1st Qu. - is the first quartile i.e. 72.00
  • Median - is the median value i.e. 79.00
  • Mean - is the mean value i.e. 77.88
  • 3rd Qu. - is the third quartile i.e. 85.00
  • Max. - is the maximum value i.e. 97.00
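Each of the six values can also be computed on its own with a dedicated function, which is handy when only one statistic is needed. The following is a minimal sketch; the variable names are illustrative.

```r
temp <- airquality$Temp

# compute each summary statistic individually
temp_min    <- min(temp)
temp_q1     <- quantile(temp, 0.25)  # first quartile
temp_median <- median(temp)
temp_mean   <- mean(temp)
temp_q3     <- quantile(temp, 0.75)  # third quartile
temp_max    <- max(temp)

print(c(temp_min, temp_median, temp_max))
print(round(temp_mean, 2))
```

Note that Temp has no missing values. For a variable that contains NA values, such as Ozone, these functions return NA (quantile() stops with an error) unless na.rm = TRUE is passed, for example mean(airquality$Ozone, na.rm = TRUE).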