Data Science & R Programming

Data Science and using R for Data Science

R is a data-oriented programming language that uses statistical computing and graphics to model and gain insights from data. It directly addresses questions about data that matter to you as a programmer.

Data Science

Why data science matters and what data scientists can achieve.

§ 01
$100K+: average US data science salary
#1: job satisfaction, salary & fulfilment ranking
2×+: above national US median salary

What Data Scientists Do

Humans need to be connected, valued, or have a sense of belonging — these are core fundamentals of human motivation. Data science can help achieve some of these fundamental needs.

Data scientists find order, meaning, and value in unstructured data (online sources, audio, graphics, video, and so on).

Predict

Outcomes such as security threats, voting patterns, or which patients are likely to respond to certain treatments.

Automate

Processes such as shopping recommendations, friend identification, and AI chatbot support.

Insights

Reveal hidden insights normally invisible in raw data, providing significant advantages.

Poulson, B. (2019, August 8). Supply and demand for data science [Video]. LinkedIn Learning.
linkedin.com/learning/data-science-foundations-fundamentals-5

Supply & Demand Gap

Job ads for data science significantly outpace job search numbers — high demand with a gap in supply.

Top of the list for:

✓ Job satisfaction

✓ Salary & benefits

✓ Fulfilment & value

R for Data Science

Getting started with R — the environment, interfaces, and key RStudio panes.

§ 02

The R programming language takes advantage of the overwhelming availability of data to model it and gain insights through statistical computing and graphics. It is a great way to work with data.

R Base (CRAN)

cran.r-project.org

RStudio Desktop

rstudio.com/download

RStudio Cloud

rstudio.cloud

Jupyter (Azure)

notebooks.azure.com

RStudio Interface Panes

Environment: all objects created by the user; import datasets here.

Plots: displays plots. Zoom, or export as .png or .pdf.

Help: search built-in function documentation. Type a function name and press Enter.

Packages: lists installed packages. Install or update them here.

Files: browse folders, create/delete/rename files, set the working directory.

Packages

CRAN hosts over 10,000 packages — collections of functions and datasets. Bioconductor is another popular repository.

§ 03
Tidyverse — the essential R package

Tidyverse is a popular collection of R packages on CRAN, installed together through the tidyverse package, with a set of tools to process, manipulate, and visualise datasets. Several more specialised packages are available for machine learning and data visualisation.

packages.R
install.packages("tidyverse")   # install
library(tidyverse)              # load into session
help(package="tidyverse")       # online help

CRAN — Comprehensive R Archive Network.

10,000+ packages for machine learning, visualisation, bioinformatics, finance and more.

Bioconductor — another popular repository, focused on bioinformatics.

Variables

In R, a variable stores a value, data structure, function, or even a plot. Use <- or = for assignment.

§ 04
Naming rules: Must start with a letter or full-stop (not followed by a number). No symbols other than full-stop and underscore. Give meaningful, self-explanatory names — the coding golden rule.
variables.R
x <- 35            # assign 35 to x
x <- 18; x         # overwrite x, then print it
y <- x + 8; y      # add to x, store sum in y
y <- 3*x + 5; y    # linear equation

<- or = assigns/stores a value or object to a variable.
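
The naming rules above can be checked from the console. This is a minimal sketch, assuming base R's make.names() as the validity test: it returns its input unchanged only when the input is already a syntactically valid name. The candidate names are made up for illustration.

```r
# Hypothetical candidate names, chosen to illustrate the naming rules.
candidates <- c("total_sales", ".hidden", "2nd_value", ".2x", "my-var")

# make.names() rewrites invalid names, so an unchanged result means the
# candidate was already a valid variable name.
is_valid <- make.names(candidates) == candidates

# Show each candidate next to the result of the check.
data.frame(candidates, is_valid)
```

Names that start with a digit, start with a full-stop followed by a digit, or contain a symbol such as a hyphen are flagged as invalid.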

Continuous Variables: numeric variables that can take any value within a range, such as height or temperature.

Categorical Variables: variables whose values are labels from a fixed set of categories, such as age group, location, or country.

Data Types

There are six fundamental data types in R. Every value belongs to one of these types.

§ 05
Character Numeric Integer Logical Complex Raw
"a"; "abc"; "oranges"; "i like apples"
Character — string values defined within double quotes. Can be numbers, letters, symbols, or a combination. Mathematical operations cannot be performed on characters.
Use ; as a separator to run multiple commands on one line.
5L; -3L
Integer — whole numbers (positive or negative), defined by typing the letter L after the number.
TRUE; FALSE
Logical (Boolean) — indicates whether a statement is true or false.
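
As a quick check of the six types listed above, class() reports the type of any value. A small sketch; the example values are arbitrary, and charToRaw() is used here only as a convenient way to produce a raw value.

```r
# One value of each of the six basic types.
values <- list("abc", 5.5, 5L, TRUE, 3+2i, charToRaw("A"))

# class() reports the type of each value.
types <- sapply(values, class)
types
# "character" "numeric" "integer" "logical" "complex" "raw"
```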

Numeric & Logical

Numeric is a real or decimal value. Logical (boolean) indicates whether a statement is true or false.

§ 06
Numeric — real or decimal values
5+2; 5-2; 6/2; 6*3;
Add, Subtract, Divide, Multiply
5^4; sqrt(5.5); exp(5.5); log(5.5)
Power of, Square root, Exponential, Natural log
Integers are whole numbers. Defined by typing L after the number. e.g. 5L; -3L
Logical — booleans
5==5; 5!=2; 5>2; 5>=2
Exactly equals, Not equals, Greater than, Greater than or equals
!x; x|y; x&y
NOT x, x OR y, x AND y
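
The logical operators above can be tried with explicit TRUE/FALSE values. A short sketch; the variable age and its value are hypothetical.

```r
x <- TRUE
y <- FALSE

not_x  <- !x      # FALSE
x_or_y <- x | y   # TRUE
x_and_y <- x & y  # FALSE

# Comparisons return logical values, which can be combined with & and |.
age <- 35                          # hypothetical value
in_range <- age >= 18 & age < 65   # TRUE

# Division returns a numeric; arithmetic on L-suffixed values stays integer.
class(6/2)       # "numeric"
class(5L + 3L)   # "integer"
```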

Data Structures

Tools for holding multiple values. Data analysis is seldom performed on a single value; typical analysis involves working with groups of values simultaneously.

§ 07

Vector

Collection of elements of the same data type. Created using c(.).

Factor

Like a vector but holds elements from a finite set of values. Created with factor(.).

Matrix

2D array where all elements are the same type. Read column-wise by default.

Data Frame (Tibble)

Most commonly used structure. Columns can be different types; each column must be the same length.

List

Most flexible structure. Each component can be a different type, dimension, or length. Can nest lists.

Vector

Collection of elements of the same data type, created with c(.). Use square brackets [] to access elements.

§ 08
Creating Vectors
vectors.R
c()                    # null vector
c(1,2,3)               # numeric vector
c("A","B","C")         # character vector
c(TRUE,FALSE,TRUE)     # logical vector
a <- c(1:5); a         # integers 1 to 5
b <- c(6:10); b        # integers 6 to 10
c <- c(a,b); c         # combine a and b
Access & Operations
a[2]
Access 2nd element
a[3:5]
Access 3rd to 5th elements
a[c(2,5)]
Access 2nd and 5th elements
a*0.25
Multiply all elements of a by 0.25
a+b; b-a; a*b; b/a
Element-wise operations — 1st of a operates with 1st of b, and so on
Note: Square brackets [] access specific elements within a vector. Arithmetic on vectors works element by element, and the same principle applies to other mathematical operations.
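
The access and element-wise operations above, with the values they return, using the same vectors a (1 to 5) and b (6 to 10):

```r
a <- 1:5
b <- 6:10

a[2]          # 2
a[3:5]        # 3 4 5
a[c(2, 5)]    # 2 5
a * 0.25      # 0.25 0.50 0.75 1.00 1.25
a + b         # 7 9 11 13 15
a * b         # 6 14 24 36 50
```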

Factor

Similar to a vector but holds elements from a finite set of values (levels). Create a vector first, then convert using factor(.).

§ 09
factor.R
f1 <- c(1:5)                               # numeric vector
f1 <- factor(f1); f1                       # convert f1 to a factor
f2 <- c("male","female","female")          # character vector
f2 <- factor(f2); f2                       # factor with 2 levels: female, male
f3 <- factor(c("L","M","H","M","L")); f3   # ordered alphabetically by default
# Specify desired order of levels:
f3 <- factor(f3, levels=c("L","M","H")); f3
By default, factor levels are sorted alphabetically. Use the levels= argument inside factor(.) to specify a custom order.
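
To see the effect of the levels= argument, levels() can be inspected before and after. A small sketch; the variable name sizes is chosen here for illustration.

```r
sizes <- factor(c("L", "M", "H", "M", "L"))
default_levels <- levels(sizes)    # "H" "L" "M"  (alphabetical)

sizes <- factor(sizes, levels = c("L", "M", "H"))
custom_levels <- levels(sizes)     # "L" "M" "H"  (custom order)

table(sizes)                       # counts per level, in level order
codes <- as.integer(sizes)         # 1 2 3 2 1  (position of each value in the levels)
```

The level order matters in practice because tables and plots list categories in level order.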

Matrix

A 2D array where all elements are the same data type. Data are read column-wise by default (byrow=FALSE).

§ 10
Creating Matrices
matrix.R
# 3x3 matrix, column-wise (default)
Mat.A <- matrix(c(1:9),        # 9 entries
                nrow=3,        # 3 rows
                ncol=3,        # 3 columns
                byrow=FALSE); Mat.A
# 3x3 matrix, row-wise
Mat.B <- matrix(c(1:9), nrow=3, ncol=3, byrow=TRUE); Mat.B
Matrix Binding

Multiple vectors can be bound together to create a matrix using cbind() (column) or rbind() (row).

matrix_bind.R
v1 <- c(1:3); v2 <- c(4:6); v3 <- c(7:9)
Mat.A <- cbind(v1,v2,v3); Mat.A    # as columns
Mat.B <- rbind(v1,v2,v3); Mat.B    # as rows
Mat.A * Mat.B                      # element-wise multiply

Matrix Multiplication & Access

Matrix multiplication uses %*%. Access elements using A[row, col] notation.

§ 11
Matrix Multiplication

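The number of columns in A must equal the number of rows in B. Each row of A is multiplied with each column of B. Unlike element-wise multiplication, matrix multiplication is not commutative: in general A×B ≠ B×A. Uses the %*% operator.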

mat_multiply.R
Mat.A %*% Mat.B    # A matrix-multiply B
Mat.B %*% Mat.A    # B matrix-multiply A (not same!)
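
A worked 2x2 case makes the difference between * and %*% concrete. The matrices A and B below are small illustrative values, not the ones used elsewhere in this section.

```r
A <- matrix(1:4, nrow = 2)   # columns (1,2) and (3,4)
B <- matrix(5:8, nrow = 2)   # columns (5,6) and (7,8)

AB <- A %*% B   # rows of A times columns of B
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46

BA <- B %*% A   # a different result: order matters
#      [,1] [,2]
# [1,]   19   43
# [2,]   22   50

elementwise <- A * B   # element-wise: same result as B * A
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32
```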
Matrix Access — A[row, col]
Mat.A[2, 3]
Element at row 2, column 3
Mat.A[1:2, 3]
Rows 1–2 along column 3
Mat.A[1, 2:3]
Row 1 across columns 2–3
Mat.A[1, ]
All elements in row 1
Mat.A[, 3]
All elements in column 3
Mat.A[, -1]
All data except column 1
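
For reference, this is what each access expression above returns, assuming Mat.A is the column-wise 3x3 matrix built from 1:9.

```r
Mat.A <- matrix(1:9, nrow = 3, ncol = 3)   # columns: (1,2,3), (4,5,6), (7,8,9)

Mat.A[2, 3]      # 8
Mat.A[1:2, 3]    # 7 8
Mat.A[1, 2:3]    # 4 7
Mat.A[1, ]       # 1 4 7
Mat.A[, 3]       # 7 8 9
Mat.A[, -1]      # 3x2 matrix holding columns 2 and 3
```

A negative index drops the given row or column instead of selecting it.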

Data Frame

Most commonly used data structure in R. Columns can be different types; all columns must have the same length. Created with data.frame(.).

§ 12
Creating & Extending a Data Frame
dataframe.R
Name <- c("John","Sarah","Zach","Beth","Lachlan")
Age <- c(35,28,33,55,43)
Gender <- factor(c("Male","Female","Male","Female","Male"))
df <- data.frame(Name,Age,Gender); df     # combine into data frame
# Add a column, method 1 (data.frame); returns a new data frame, df is unchanged:
Coffee.Drinker <- c(TRUE,TRUE,FALSE,TRUE,FALSE)
data.frame(df, Coffee.Drinker)
# Add a column, method 2 (cbind); also leaves df unchanged:
cbind(df, Coffee.Drinker)
# Add using $ (modifies df in place):
df$Coffee.Drinker <- c(TRUE,TRUE,FALSE,TRUE,FALSE); df
df$Diabetes <- factor(c("Yes","No","No","No","Yes")); df

Data Frame Access — same as matrices:

df[1, c(1:3)]
df[2:3, ]
df[, c(1,3)]
df$Name
df$Age

Utility commands:

str(.) — structure
class(.) — class
dim(.) — dimensions
nrow(.) — row count
ncol(.) — col count
colnames(.) — column names
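
The access patterns and utility commands above, applied to a cut-down three-row data frame so the results are easy to verify by eye:

```r
df <- data.frame(
  Name = c("John", "Sarah", "Zach"),
  Age  = c(35, 28, 33)
)

dim(df)        # 3 2
nrow(df)       # 3
ncol(df)       # 2
colnames(df)   # "Name" "Age"
class(df)      # "data.frame"

df[2:3, ]      # rows 2 and 3, all columns
df$Age         # 35 28 33
mean(df$Age)   # 32
```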

Data Frame — Tibble

A tibble is a modern take on a data frame, defined with tibble(.). It avoids many annoying default behaviours of the classic data frame.

§ 13
tibble.R
library(tibble)                                  # provides tibble() and as_tibble(); also loaded by library(tidyverse)
str(df)                                          # examine structure
tib1 <- tibble(Name,Age,Gender,Coffee.Drinker); tib1
as_tibble(df)                                    # convert data frame to tibble
as.data.frame(tib1)                              # convert tibble back to data frame

Key differences from data.frame:

A tibble never changes input types (it won't silently convert character to factor, as data.frame() did by default before R 4.0).

Printing shows only the first 10 rows and only the columns that fit on screen.

An abbreviated column type is shown under each column name.

Some older R packages may not work with tibbles. Use as.data.frame() to convert back when needed.

List

The most flexible data structure in R. Each component can be a different type, dimension, or length. You can even nest a list within a list.

§ 14
Creating a List
lists.R
# List of vector, matrix, data frame:
list1 <- list(c,Mat.A,df); list1
str(list1)                                  # examine structure
# Labelled list:
list1 <- list(VecC=c, MatA=Mat.A, DatFrame=df)
str(list1)
list1$VecC                                  # access by label
list1$MatA
list1$DatFrame
List Access — [[ ]] and $
list1[[1]]
Access 1st component (vector c)
list1[[2]]
Access 2nd component (matrix Mat.A)
list1[[3]]
Access 3rd component (data frame df)
list1$VecC
Access by label (if components are named)
list1$MatA
Access matrix component by label
list1$DatFrame
Access data frame component by label
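
A self-contained sketch of the same access patterns, including a nested list. The names list2, Vec, Mat, and Inner are made up for this example.

```r
list2 <- list(Vec = 1:3, Mat = matrix(1:4, nrow = 2))

list2[[1]]                               # 1 2 3
list2$Vec                                # same component, accessed by label
same <- identical(list2[[1]], list2$Vec) # TRUE

# Nest a list inside the list.
list2$Inner <- list(note = "nested", flags = c(TRUE, FALSE))
list2$Inner$note                         # "nested"
length(list2)                            # 3
```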

Coercion of Data Types

Elements of a vector or matrix must be the same type. When mixed types are given, R silently coerces to the most flexible type in the mix.

§ 15
Coercion Examples
coercion.R
c(1, "a")       # numeric coerced to character
c(TRUE, "a")    # logical coerced to character
c(TRUE, 1)      # logical coerced to numeric (1=TRUE, 0=FALSE)
# All elements coerced to character in a matrix:
matrix(c(5, FALSE, 4.6, "No"), nrow=2, ncol=2, byrow=FALSE)

Flexibility order
least → most flexible

4. Logical (least flexible)
3. Integer
2. Numeric
1. Character (most flexible)
Watch out: Unintentional coercions can cause hard-to-trace errors. Always verify types with class(.).
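
Following that advice, class() makes each coercion visible. A short sketch using the same combinations as the coercion examples, plus one logical-with-integer case:

```r
class(c(1, "a"))       # "character"
class(c(TRUE, "a"))    # "character"
class(c(TRUE, 1))      # "numeric"
class(c(TRUE, 2L))     # "integer"

# In a matrix, one character entry turns every element into character.
m <- matrix(c(5, FALSE, 4.6, "No"), nrow = 2)
m[1, 1]                # "5"  (a string, no longer a number)
class(m[1, 1])         # "character"
```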