Data Science & R Programming
Data Science and using R for Data Science
R is a programming language built for statistical computing and graphics, used to model data and draw insights from it. It lets you directly answer the questions about your data that matter to you as a programmer.
Data Science
Why data science matters and what data scientists can achieve.
Humans need to be connected, valued, or have a sense of belonging — these are core fundamentals of human motivation. Data science can help achieve some of these fundamental needs.
Data scientists bring a valuable skill: finding order, meaning, and value in unstructured data (online sources, audio, graphics, videos, and so on).
Predict
Outcomes such as security threats, voting patterns, or which patients are likely to respond to certain treatments.
Automate
Processes such as shopping recommendations, friend identification, and AI chatbot support.
Insights
Reveal hidden insights normally invisible in raw data, providing significant advantages.
Poulson, B. (2019, August 8). Supply and demand for data science [Video]. LinkedIn Learning.
linkedin.com/learning/data-science-foundations-fundamentals-5
Supply & Demand Gap
Job ads for data science significantly outpace job search numbers — high demand with a gap in supply.
Top of the list for:
✓ Job satisfaction
✓ Salary & benefits
✓ Fulfilment & value
R for Data Science
Getting started with R — the environment, interfaces, and key RStudio panes.
The R programming language takes advantage of the abundance of available data, using statistical computing and graphics to model it and gain insights. It is a great way to work with data.
R Base (CRAN)
RStudio Desktop
RStudio Cloud
Jupyter (Azure)
RStudio Interface Panes
Environment — all objects created by user; import datasets here.
Plot — displays plots. Zoom, export as .png or .pdf.
Help — search built-in function docs. Type name, press Enter.
Package — lists installed packages. Install or update here.
Files — browse folders, create/delete/rename, set working directory.
Packages
CRAN hosts over 10,000 packages — collections of functions and datasets. Bioconductor is another popular repository.
The tidyverse is a popular collection of R packages on CRAN that provides tools to process, manipulate, and visualise datasets. Several other specialised packages are available for machine learning and data visualisation.
CRAN — Comprehensive R Archive Network.
10,000+ packages for machine learning, visualisation, bioinformatics, finance and more.
Bioconductor — another popular repository, focused on bioinformatics.
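As a minimal sketch of working with packages, the snippet below checks whether a package is available and then loads it. The package name "stats" is only a stand-in chosen because it ships with base R; the commented install.packages() call shows how a CRAN package such as the tidyverse would typically be installed.

```r
# One-time install from CRAN (needs an internet connection), left commented out:
# install.packages("tidyverse")

# Check whether a package is available before loading it.
# "stats" ships with base R, so it is a safe stand-in for this sketch.
pkg <- "stats"
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  library(pkg, character.only = TRUE)  # attach the package for this session
}

# Show the names of the first few installed packages.
head(rownames(installed.packages()))
```

In RStudio, the same install and update actions are available through the package pane.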
Variables
In R, a variable stores a value, data structure, function, or even a plot.
Use <- or = to assign a value or object to a variable; <- is the more conventional choice.
Continuous Variables — numeric variables that can range from small to large.
Categorical Variables — made up of label categories such as age group, location, country, etc.
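A short sketch of assignment, with one continuous and one categorical variable. All variable names and values here are hypothetical, chosen only for illustration.

```r
height_cm <- 172.5          # continuous (numeric) variable
weight_kg = 68              # '=' also assigns at the top level
country <- "Australia"      # a categorical value stored as text
square <- function(x) x^2   # a function can be stored in a variable too

# Variables can be combined in expressions and the result stored again.
bmi <- weight_kg / (height_cm / 100)^2
print(bmi)
print(square(4))
```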
Data Types
There are six fundamental data types in R. Every value belongs to one of these types.
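The sketch below builds one value of each type and asks R what it is with class(). Character, numeric, integer, and logical appear in the examples in this section; complex and raw are assumed here to be the remaining two of the six.

```r
# One example value per type (complex and raw are assumptions, see above).
values <- list(
  character = "oranges",
  numeric   = 5.5,
  integer   = 5L,        # the L suffix marks an integer
  logical   = TRUE,
  complex   = 2+3i,
  raw       = as.raw(255)
)

# class() reports the type of each value.
types <- sapply(values, class)
print(types)
```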
Character: "a"; "abc"; "oranges"; "i like apples"
Tip: use ; as a separator to run multiple commands on one line.
Integer: 5L; -3L (add L after the number)
Logical: TRUE; FALSE
Numeric & Logical
Numeric is a real or decimal value. Logical (boolean) indicates whether a statement is true or false.
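A brief sketch of numeric arithmetic and logical comparisons, using the same kinds of expressions this section lists. The variable names are hypothetical.

```r
# Numeric operations
total <- 5 + 2
power <- 5^4
root  <- sqrt(5.5)
nat   <- log(exp(5.5))   # log() is the natural log, so this undoes exp()

# Logical comparisons produce TRUE or FALSE
x <- 5 > 2     # TRUE
y <- 5 == 2    # FALSE

# Logical operators: not, or, and
not_x   <- !x
either  <- x | y
both    <- x & y

print(c(total = total, power = power))
print(c(not_x = not_x, either = either, both = both))
```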
Numeric operations: 5+2; 5-2; 6/2; 6*3; 5^4; sqrt(5.5); exp(5.5); log(5.5)
Integer: add L after the number, e.g. 5L; -3L
Logical comparisons: 5==5; 5!=2; 5>2; 5>=2
Logical operators: !x; x|y; x&y
Data Structures
Tools for holding multiple values. Data analysis is seldom performed on a single value — typical analysis involves working with groups simultaneously.
Vector
Collection of elements of the same data type. Created using c(.).
Factor
Like a vector but holds elements from a finite set of values. Created with factor(.).
Matrix
2D array where all elements are the same type. Read column-wise by default.
Data Frame (Tibble)
Most commonly used structure. Columns can be different types, but every column must have the same length.
List
Most flexible structure. Each component can be a different type, dimension, or length. Can nest lists.
Vector
Collection of elements of the same data type, created with c(.). Use square brackets [] to access elements.
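A minimal sketch of creating vectors, indexing them with square brackets, and doing element-wise arithmetic. The vectors a and b hold made-up values.

```r
a <- c(10, 20, 30, 40, 50)
b <- c(1, 2, 3, 4, 5)

second <- a[2]         # the second element
middle <- a[3:5]       # elements 3 to 5
picked <- a[c(2, 5)]   # elements 2 and 5

scaled <- a * 0.25     # every element of a multiplied by 0.25
summed <- a + b        # 1st of a plus 1st of b, 2nd plus 2nd, and so on

print(middle)
print(summed)
```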
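a[2]: the second element of a
a[3:5]: elements 3 to 5
a[c(2,5)]: elements 2 and 5
a*0.25: multiplies each element of a by 0.25
a+b; b-a; a*b; b/a: the 1st element of a operates with the 1st of b, and so on
Square brackets [] access specific elements within a vector. The same element-wise principle applies with other mathematical operations.
Factor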
Similar to a vector but holds elements from a finite set of values (levels). Create a vector first, then convert using factor(.).
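A small sketch of converting a character vector to a factor, first with default levels and then with a custom order set through levels=. The size labels are hypothetical.

```r
sizes <- c("medium", "small", "large", "small", "medium")

# Default: levels are sorted alphabetically.
sizes_default <- factor(sizes)

# Custom order given through the levels argument.
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))

print(levels(sizes_default))
print(levels(sizes_ordered))
print(table(sizes_ordered))   # counts per level, in the custom order
```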
Use the levels= argument inside factor(.) to specify a custom order.
Matrix
A 2D array where all elements are the same data type. Data are read column-wise by default (byrow=FALSE).
Multiple vectors can be bound together to create a matrix using cbind() (column) or rbind() (row).
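The following sketch shows the column-wise default, the byrow=TRUE alternative, and binding vectors with cbind() and rbind(). The names Mat.B, v1, and v2 are hypothetical.

```r
# Filled column-wise by default: the first column is 1, 2.
Mat.A <- matrix(1:6, nrow = 2, ncol = 3)

# Filled row-wise instead: the first row is 1, 2, 3.
Mat.B <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
by_col <- cbind(v1, v2)   # vectors become columns: 3 rows, 2 columns
by_row <- rbind(v1, v2)   # vectors become rows: 2 rows, 3 columns

print(Mat.A)
print(Mat.B)
```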
Matrix Multiplication & Access
Matrix multiplication uses %*%. Access elements using A[row, col] notation.
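The number of columns in A must equal the number of rows in B, because each row of A multiplies each column of B. Unlike element-wise multiplication (A * B), matrix multiplication is not commutative: in general A %*% B ≠ B %*% A.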
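A worked sketch of matrix multiplication and element access, with two small made-up matrices. A is 2x3 and B is 3x2, so both products are defined but have different shapes.

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3, rows are (1, 3, 5) and (2, 4, 6)
B <- matrix(1:6, nrow = 3)   # 3 x 2, columns are (1, 2, 3) and (4, 5, 6)

AB <- A %*% B    # 2 x 2 result
BA <- B %*% A    # 3 x 3 result, so the order of the operands matters

elementwise <- A * A   # element-wise product needs equal dimensions

# Access with [row, col]; leave an index blank to take everything.
one_value  <- A[2, 3]    # row 2, column 3
first_row  <- A[1, ]     # all of row 1
drop_first <- A[, -1]    # every column except the first

print(AB)
print(dim(BA))
```

For example, AB[1, 1] is 1*1 + 3*2 + 5*3 = 22: row 1 of A times column 1 of B.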
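A[row, col]: general form
Mat.A[2, 3]: the element in row 2, column 3
Mat.A[1:2, 3]: rows 1 to 2 of column 3
Mat.A[1, 2:3]: columns 2 to 3 of row 1
Mat.A[1, ]: all of row 1
Mat.A[, 3]: all of column 3
Mat.A[, -1]: every column except the first
Data Frame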
Most commonly used data structure in R. Columns can be different types; all columns must have the same length. Created with data.frame(.).
Data Frame Access — same as matrices:
Utility commands:
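A minimal sketch of creating a data frame, accessing it with matrix-style indexing, and inspecting it. The column names and values are made up, and the utility functions shown (str, head, nrow, ncol, names, summary) are an assumption about which commands are meant; they are all standard base R functions.

```r
df <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(28, 35, 42),
  member = c(TRUE, FALSE, TRUE)
)

# Access works as with matrices: df[row, col]
cell     <- df[2, 2]      # row 2, column 2
two_rows <- df[1:2, ]     # rows 1 to 2, all columns
ages     <- df[, "age"]   # one column by name
ages2    <- df$age        # the same column using $

# Utility commands for a quick look at the data
str(df)           # structure: column names and types
head(df, 2)       # first 2 rows
n_rows <- nrow(df)
n_cols <- ncol(df)
names(df)         # column names
summary(df)       # per-column summary statistics
```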
Data Frame — Tibble
A tibble is a modern take on a data frame, defined with tibble(.). It avoids many annoying default behaviours of the classic data frame.
Key differences from data.frame:
A tibble never changes input types (won't silently convert character to factor).
Printing shows only the first 10 rows and columns that fit on screen.
An abbreviated column type is shown under each column name.
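A short sketch of these behaviours, assuming the tibble package is installed (it is not part of base R, so the code checks first). The column names and values are hypothetical.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(
    id    = 1:3,
    fruit = c("apple", "banana", "cherry")
  )
  print(tb)   # prints column types under the column names

  fruit_class <- class(tb$fruit)   # stays character, not factor

  # Convert back to a classic data frame when a function requires one.
  df_back <- as.data.frame(tb)
  print(class(df_back))
} else {
  message("tibble is not installed; install.packages('tibble') would add it")
}
```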
Use as.data.frame() to convert back when needed.
List
The most flexible data structure in R. Each component can be a different type, dimension, or length. You can even nest a list within a list.
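The sketch below builds a list holding a vector, a matrix, and a data frame, then accesses components by position and by name. The component names VecC, MatA, and DatFrame follow this section's examples; the contents are made up, and the vector is named vec.c here (an assumption) to avoid reusing the name of the c() function.

```r
vec.c <- c(1, 2, 3)
Mat.A <- matrix(1:6, nrow = 2)
df    <- data.frame(x = 1:2, y = c("a", "b"))

list1 <- list(VecC = vec.c, MatA = Mat.A, DatFrame = df)

first    <- list1[[1]]        # by position: the vector
by_name  <- list1$MatA        # by name: the matrix
x_column <- list1$DatFrame$x  # drill into the data frame

# A list can contain another list.
nested <- list(inner = list1, note = "lists can hold lists")
deep   <- nested$inner$VecC[2]

print(names(list1))
print(deep)
```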
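Access components with [[ ]] and $:
list1[[1]]: first component (the vector c)
list1[[2]]: second component (Mat.A)
list1[[3]]: third component (df)
list1$VecC; list1$MatA; list1$DatFrame: the same components accessed by name
Coercion of Data Types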
Elements of a vector or matrix must be the same type. When mixed types are given, R silently coerces to the most flexible type in the mix.
Flexibility order (least → most flexible): logical → integer → numeric → complex → character
Check the resulting type with class(.).
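A quick sketch of silent coercion: each vector mixes types, and class() shows which type R settled on. The values are arbitrary.

```r
# Each mix is coerced to the most flexible type present.
cls_lgl_int <- class(c(TRUE, 2L))         # logical + integer
cls_int_num <- class(c(2L, 3.5))          # integer + numeric
cls_num_chr <- class(c(3.5, "a"))         # numeric + character
cls_all     <- class(c(TRUE, 1.5, "x"))   # all three together

# Logical values become 1 and 0 when mixed with numbers.
mixed <- c(TRUE, FALSE, 3)

print(c(cls_lgl_int, cls_int_num, cls_num_chr, cls_all))
print(mixed)
```

Because coercion happens without a warning, checking class() after combining values is a cheap way to catch numbers that were accidentally turned into text.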