IST 3420 Study Guide 2
Data Visualization
-"A picture is worth a thousand words" -"The purpose of visualization is insight, not pictures"
R Code: Scatter Plot Matrix
-A better way to roughly show the relationship between multiple variables -Scatterplot matrices are good for determining rough linear correlations of metadata that contain continuous variables. -Scatterplot matrices are not so good for looking at discrete variables.
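A minimal sketch of a scatterplot matrix in base R. The built-in mtcars data set and the four columns picked here are assumptions made for illustration only.

```r
# Built-in mtcars data; the four continuous columns are an arbitrary choice.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Draw every pairwise scatter plot in one grid.
pairs(vars, main = "Scatterplot matrix of four mtcars variables")

# A correlation matrix gives the matching numeric summary.
cor_matrix <- cor(vars)
print(round(cor_matrix, 2))
```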
Simpson's Paradox
-A statistical trend that appears in several different groups may disappear or even reverse when all groups are combined.
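The reversal can be reproduced with a tiny made-up data set. The values below are hypothetical and chosen only so that the trend is positive inside each group but negative once the groups are pooled.

```r
# Hypothetical data: two groups, each with an upward trend of y on x.
d <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x = c(1, 2, 3, 6, 7, 8),
  y = c(8, 9, 10, 1, 2, 3)
)

# Correlation inside each group (positive in both groups).
within <- sapply(split(d, d$group), function(g) cor(g$x, g$y))

# Correlation after combining the groups (negative).
overall <- cor(d$x, d$y)

print(within)
print(overall)
```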
Merge/Join Methods
-Use merge() function -Use join() function in plyr package -Use the dplyr package: inner_join(), left_join(), right_join(), full_join()
Sorting Methods
-Use order() function in base R -Use arrange() function in dplyr package
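A short base R sketch of sorting with order(). The data frame and its column names are hypothetical.

```r
# Hypothetical data frame.
df <- data.frame(name = c("Ann", "Bob", "Cy"), score = c(85, 92, 78))

# order() returns the row indices that put the column in sorted order.
asc  <- df[order(df$score), ]                     # ascending
desc <- df[order(df$score, decreasing = TRUE), ]  # descending

print(asc)
print(desc)
```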
Customize ggplot2 Graphs
-We know that the display of base R graphs can be customized by using par(). -For ggplot2 graphs, we use a set of theme functions. -We can use "+" to chain ggplot2 functions to draw the final graph.
Two common data sources
-Web server log files (e.g., Apache log file) -Page tagging or "Web bugs" (e.g., analytics.js)
Procedure of Tabulating Quantitative Data
1. Calculate Range 2. Determine Break Points 3. Classify Observations into Sub-Intervals 4. Calculate Frequency in Each Sub-Interval
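The four steps can be sketched in base R as follows. The data values and the choice of width-10 intervals are assumptions made for illustration.

```r
# Hypothetical quantitative data.
x <- c(12, 15, 18, 21, 27, 33, 41, 45, 52, 58)

# Step 1: calculate the range.
r <- range(x)

# Step 2: determine break points that cover the range.
breaks <- seq(10, 60, by = 10)

# Step 3: classify observations into sub-intervals [10,20), [20,30), ...
intervals <- cut(x, breaks = breaks, right = FALSE)

# Step 4: calculate the frequency in each sub-interval.
freq <- table(intervals)
print(freq)
```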
Merge/Join Data Frames
4 Major Types of Merging/Joining x and y -Inner join: Only rows with matching keys in both x and y -Left outer join: All rows in x, adding matching columns from y -Right outer join: All rows in y, adding matching columns from x -Full outer join: All rows in x with matching columns in y, then the rows of y that don't match x.
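The four join types can be tried with base R merge(), which selects the join through its all, all.x, and all.y arguments. The two small data frames are hypothetical.

```r
# Hypothetical data frames sharing the key column "id".
x <- data.frame(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), y_val = c("B", "C", "D"))

inner <- merge(x, y, by = "id")                # keys present in both x and y
left  <- merge(x, y, by = "id", all.x = TRUE)  # all rows of x
right <- merge(x, y, by = "id", all.y = TRUE)  # all rows of y
full  <- merge(x, y, by = "id", all = TRUE)    # all rows of x and y

print(full)
```

Rows without a match in the other data frame get NA in the columns that come from it.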
Correlation vs. Causation
A common mistake is to confuse correlation with causation. -Correlation measures the degree to which two variables move together. -Causation means that one variable has an effect on another.
Histograms
A histogram reveals the shape of distribution.
What is the result of the following statement? gsub("[A-z]", " ", "a1B2c3") A) A string " 1 2 3" B) A string "aBc" C) A string "a1B2c3" D) A string "a1 2c3"
A string " 1 2 3"
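The statement can be checked directly. Note that in ASCII order the range A-z spans more than letters (it also covers the punctuation characters that sit between "Z" and "a"), so [A-Za-z] is the safer way to say "any letter". This is a sketch and range behavior can depend on the locale.

```r
# Every character matched by the class is replaced with a single space.
result <- gsub("[A-z]", " ", "a1B2c3")
print(result)

# Spelling out both letter ranges gives the same result for this input.
result_letters <- gsub("[A-Za-z]", " ", "a1B2c3")
print(identical(result, result_letters))
```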
Assume we have a student score dataset shown below. If we want to calculate average student score for each department (for example, the average student score for department IST is 90.67, according to the data shown above), which of the following data manipulation operations is most appropriate? A) Sort B) Aggregate C) Merge D) Subset
Aggregate
Aggregate
Base R -aggregate(): group data and calculate summary statistics dplyr Package -group_by(): group data -summarize(): calculate summary statistics
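A minimal base R sketch of a per-group average with aggregate(). The scores data frame is hypothetical and is not the data set from the question above.

```r
# Hypothetical student scores.
scores <- data.frame(
  dept  = c("IST", "IST", "CS", "CS"),
  score = c(90, 80, 70, 90)
)

# Formula interface: summarize score within each level of dept.
avg <- aggregate(score ~ dept, data = scores, FUN = mean)
print(avg)
```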
Cross Tabulate Quantitative Variables
Like creating a frequency table, cross-tabulating quantitative variables involves classifying observations into a sequence of sub-intervals.
Summarize and Visualize Data
Summarizing and visualizing data facilitates communication of data analysis to the users or customers. Data types -Qualitative data (nominal, ordinal) -Quantitative data (interval, ratio) Approaches -Tabular methods -Graphical methods
Why Is Visualization Important? Anscombe's Quartet
The dangers of summary statistics: -All four datasets are identical when examined using simple summary statistics, but vary considerably when graphed. -Always plot your data
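Base R ships the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4), so the claim is easy to check. This sketch compares a few summary statistics and then plots the four data sets.

```r
# Summary statistics are nearly identical across the four data sets.
x_means <- sapply(anscombe[, paste0("x", 1:4)], mean)
y_means <- sapply(anscombe[, paste0("y", 1:4)], mean)
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(cbind(x_means, y_means, cors), 2))

# The plots tell four different stories.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```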
Subset Methods
Use base R features -Use index -Use which() function -Use subset() function Use dplyr package -Use dplyr::select() to select variables -Use dplyr::filter() to select observations
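The three base R approaches can be sketched as follows. The data frame is hypothetical.

```r
# Hypothetical data frame.
df <- data.frame(
  name  = c("Ann", "Bob", "Cy"),
  dept  = c("IST", "CS", "IST"),
  score = c(85, 92, 78)
)

# Index: a logical row condition plus a vector of column names.
by_index <- df[df$score > 80, c("name", "score")]

# which(): convert the condition to row numbers first.
by_which <- df[which(df$dept == "IST"), ]

# subset(): condition and columns are written without the df$ prefix.
by_subset <- subset(df, score > 80, select = c(name, score))

print(by_subset)
```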
Reshape
Use tidyr Package -gather(data, key, value, ...): to gather columns into rows -spread(data, key, value): to spread rows into columns
Determine the Number of Bins
We can set the breaks parameter of hist() to other numbers. Different bin sizes can reveal different features of the data: -Fewer bins lead to oversmoothing and bias -More bins result in imprecise estimation due to extra noise There is no "best" number of bins. By default, the R hist() function uses Sturges' rule to determine the number of breaks: k = ceiling(log2(n) + 1), where n is the number of observations.
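Sturges' rule, k = ceiling(log2(n) + 1), can be computed by hand and compared with the base R helper nclass.Sturges(). The simulated sample is an assumption for illustration. hist() treats the number only as a suggestion and rounds to "pretty" break points, so the final bin count may differ.

```r
set.seed(1)
x <- rnorm(100)  # hypothetical sample with n = 100

# Sturges' rule by hand.
k <- ceiling(log2(length(x)) + 1)

# The same rule as implemented in base R (grDevices).
k_builtin <- nclass.Sturges(x)

# Number of bins hist() actually ends up using.
h <- hist(x, plot = FALSE)
print(c(sturges = k, builtin = k_builtin, hist_bins = length(h$counts)))
```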
Use ggplot() and Other Functions
We typically construct a plot incrementally, using the "+" operator to add layers to the existing ggplot object.
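A minimal sketch of building a plot layer by layer, assuming the ggplot2 package is installed. The mtcars data and the chosen layers are illustrative only.

```r
library(ggplot2)

# Start with the data and aesthetic mappings, then add layers with "+".
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                            # layer 1: points
  geom_smooth(method = "lm", se = FALSE) +  # layer 2: linear trend line
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()                           # theme function adjusts the non-data display

print(p)
```

Because each "+" returns a new plot object, intermediate versions can be stored and extended later.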
Tabulate 2 or More Categorical Variables
-A contingency table is a special type of frequency distribution table, where two or more variables are shown simultaneously. -xtabs() returns results similar to table() but takes its arguments as a formula, e.g., xtabs(~ region + income)
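A short comparison of table() and xtabs() on the same two categorical variables. The survey data frame is hypothetical.

```r
# Hypothetical categorical data.
survey <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  income = c("High", "Low", "Low", "Low", "High")
)

# table() takes the vectors themselves.
t1 <- table(survey$region, survey$income)

# xtabs() takes a one-sided formula plus the data frame.
t2 <- xtabs(~ region + income, data = survey)

print(t2)
```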
Scatter Plots
-A scatter plot presents the relationship between two quantitative variables. -A trend line can be used to approximate the relationship. -A scatterplot matrix is preferred when we want to roughly determine the relationship among multiple variables.
Think Deeper about Regex
-Anchors: do not match any character; instead they match a position before, after, or between characters -Metacharacters: characters that have a special meaning in regular expressions; to match one literally in R, prefix it with a double backslash (\\) -Quantifiers: act on the item to their immediate left and specify the number of times it must be matched -Character classes: enclosed in square brackets ([]); a class matches exactly one character, which must be one of those listed in the brackets
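One small grepl() call per concept may make the four ideas concrete. The test strings are made up for illustration.

```r
# Anchor: ^ matches the position at the start of the string.
anchor <- grepl("^a", c("apple", "banana"))

# Metacharacter: "." means "any character" unless escaped with \\.
meta <- grepl("\\.", c("a.b", "ab"))

# Quantifier: {2} requires the item to its left ("o") exactly twice in a row.
quant <- grepl("o{2}", c("book", "bok"))

# Character class: [0-9] matches a single digit.
cls <- grepl("[0-9]", c("a1", "ab"))

print(rbind(anchor, meta, quant, cls))
```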
ggplot2 Package
-Based on the Grammar of Graphics, a general scheme for data visualization A graph can be built from several components: -A dataset -A set of geoms (geometric objects) -A coordinate system ... Two common types of usage -Use quick plotting qplot() function to create basic graphs -Use ggplot() and other functions to create ggplot2 graphs
Why Does Data Management Matter?
-Data cleansing/transformation is an essential (usually the most time-consuming) part of a data analytics project. -A properly prepared dataset is the prerequisite of statistical modeling, prediction, and inference. -The "Garbage in, garbage out" rule applies.
Kernel Density Plots
-Density, or probability density function (PDF), is a function that describes the relative probability for a variable to hold a given value. -Density is usually a better way to describe the distribution of a quantitative variable. -Use the R function density() to compute a univariate kernel density estimate.
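A minimal sketch of density() on the built-in mtcars data (an illustrative choice). The area under the estimated curve should be close to 1, which the last lines check with a simple Riemann sum.

```r
# Kernel density estimate with the default Gaussian kernel and bandwidth.
d <- density(mtcars$mpg)

# The result can be passed straight to plot().
plot(d, main = "Kernel density of mpg")

# Approximate the area under the curve on the evaluation grid.
area <- sum(d$y) * diff(d$x[1:2])
print(area)
```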
Histograms vs. Bar Charts
-Histograms are used to visualize the distribution of quantitative data. -Bar charts are used to summarize qualitative data.
Data Management for Data Science
-Our goal is to get prepared datasets that are ready for in-depth data analysis. -In R, the prepared datasets are usually data frames, which mimic the SAS or SPSS data set, i.e. a "cases by variables" matrix of data.
Forward Pipe Operator magrittr::%>%
-Pipe an object forward into a function or call expression. -Forward pipe operator makes R code more readable and elegant.
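A sketch of the pipe in use, assuming the dplyr package is installed (dplyr re-exports %>% from magrittr). The data are hypothetical.

```r
library(dplyr)

# Hypothetical student scores.
scores <- data.frame(
  dept  = c("IST", "IST", "CS", "CS"),
  score = c(90, 80, 70, 90)
)

# Nested call: read from the inside out.
n_nested <- length(unique(scores$dept))

# Piped call: read from left to right.
n_piped <- scores$dept %>% unique %>% length

# A longer pipeline: group the rows, then summarize each group.
by_dept <- scores %>%
  group_by(dept) %>%
  summarize(avg_score = mean(score))

print(by_dept)
```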
Tabulate Quantitative Data
-Quantitative data are measured in interval and/or ratio scales. -Directly counting the occurrences of each unique value is usually meaningless.
Visualize Data in R
-R provides strong data visualization capabilities -Basic graphs for qualitative variables *bar plots (simple, stacked, grouped) *pie charts (simple, annotated) -Basic graphs for quantitative variables *dot plots *boxplots *density plots (histograms and kernel density plots) *line charts *scatter plots -Advanced graphs
Why Do We Need String Manipulation?
-Strings usually contain unstructured or semi-structured data -Regular expressions (regexps) are a concise language for describing patterns in strings *find a word in a string *replace a string *match a single character *match any one of several letters *match a range of characters
Subset
-Subset variables -Subset observations
Bar Plots (Qualitative Data)
-Syntax: barplot(height, ...) -Three types *Simple bar plots *Stacked bar plots *Grouped bar plots
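The three types can be sketched with the built-in mtcars data (an illustrative choice). A one-way table gives a simple bar plot. A two-way table gives a stacked plot by default and a grouped plot with beside = TRUE.

```r
# Simple bar plot: counts of cars by number of cylinders.
counts <- table(mtcars$cyl)
barplot(counts, main = "Cars by cylinders", xlab = "Cylinders")

# Two-way table: gears (rows) by cylinders (columns).
two_way <- table(mtcars$gear, mtcars$cyl)

# Stacked bar plot (default for a matrix).
barplot(two_way, legend.text = TRUE, xlab = "Cylinders")

# Grouped bar plot.
barplot(two_way, beside = TRUE, legend.text = TRUE, xlab = "Cylinders")
```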
Dot Plots (Quantitative Data)
-Syntax: dotchart(x, labels = ,...) *cex: the character size to use (a very useful setting to avoid label overlap). *groups: an optional factor indicating how the elements of x are grouped. *color: the color(s) to be used for points and labels. *gcolor: the single color to be used for group labels and values.
Pie Plots (Qualitative Data)
-Syntax: pie(x, labels = names(x),...)
Sort
-To sort data frames into ascending or descending order along one or more variables
Understand Spatial Data
-US state boundaries map data in "maps" package -Use ggplot2::map_data("state") to create a data frame containing map boundary data for all US states -Check the Missouri state map boundary data
Common reasons to use Web Analytics
-Understand website traffic -Track mass user activity -Improve site design and user experience
Create Variables
-Use index ($ operator) -Use transform() function -Use dplyr::mutate() function
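A brief base R sketch of the first two approaches. The data frame and column names are hypothetical.

```r
# Hypothetical data frame.
df <- data.frame(height_cm = c(160, 175), weight_kg = c(55, 80))

# $ operator: assign a new column directly.
df$height_m <- df$height_cm / 100

# transform(): columns can be referred to by name inside the call.
df <- transform(df, bmi = weight_kg / height_m^2)

print(df)
```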
Quick Plotting
-qplot() is the basic quick plotting function in the ggplot2 package. -It is very similar to the base R plot() function. -It's a convenient wrapper for creating a number of different types of plots using a consistent calling scheme.
Use Cases of Regular Expressions
-email validation -password validation -date validation -phone number validation -search and replace in text editors -web scraping
To find the position of matches in a string, do NOT use _ _ _ _().
-grep() returns the index of matched string in a vector, NOT the position of the match in the text. -Instead, use regexpr() to get the position of the 1st match, and use gregexpr() to get positions of all matches.
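The difference shows up clearly on a small example. The strings are made up for illustration.

```r
fruits <- c("apple", "banana", "cherry")

# grep(): which ELEMENTS of the vector contain the pattern.
element_index <- grep("an", fruits)

# regexpr(): character position of the FIRST match within one string.
first_pos <- as.integer(regexpr("an", "banana"))

# gregexpr(): character positions of ALL matches within one string.
all_pos <- as.integer(gregexpr("an", "banana")[[1]])

print(element_index)
print(first_pos)
print(all_pos)
```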
To count the number of characters in a string, do NOT use _ _ _ _ _().
-length() returns the length of the vector containing the string. -Instead, use nchar() function.
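A two-line check of the difference:

```r
s <- "hello"

n_elements <- length(s)  # number of elements in the vector
n_chars    <- nchar(s)   # number of characters in the string

print(c(length = n_elements, nchar = n_chars))
```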
Spatial Data
-Spatial data represent the location, size, and shape of physical objects (such as cities, lakes, mountains, buildings, etc.) by numbers in a geographic coordinate system. -Geographic information systems (GIS) can visualize and analyze spatial data.
Tabulate A Single Variable
Both qualitative and quantitative data -Frequency distribution: shows the frequency (or count) of items -Relative frequency distribution: shows the proportion of items Quantitative data -Cumulative frequency distribution: shows the frequency below a level -Cumulative relative frequency distribution: shows the proportion below a level
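These distributions can be sketched in base R with table(), prop.table(), and cumsum(). Both data vectors and the break points are hypothetical.

```r
# Qualitative example: frequency and relative frequency.
grades <- c("B", "A", "C", "B", "A", "B")
freq <- table(grades)
rel_freq <- prop.table(freq)

# Quantitative example: bin first, then accumulate.
scores <- c(55, 62, 68, 71, 77, 83, 88, 94)
bins <- cut(scores, breaks = c(50, 60, 70, 80, 90, 100), right = FALSE)
cum_freq <- cumsum(table(bins))
cum_rel_freq <- cum_freq / length(scores)

print(rel_freq)
print(cum_freq)
print(cum_rel_freq)
```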
Boxplots (a.k.a. Box-and-Whisker Plots)
Box-and-whisker plots summarize data based on a five-number summary. -First/lower quartile (Q1) -Middle quartile/median (Q2) -Third/upper quartile (Q3) -Interquartile range: IQR = Q3 - Q1 -Upper limit = Q3 + 1.5 * IQR -Lower limit = Q1 - 1.5 * IQR -Outliers: values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
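The limits can be computed by hand as a check. This sketch uses quantile() with its default method on made-up data. boxplot() itself uses hinges from fivenum(), which can differ slightly from these quartiles on small samples.

```r
# Hypothetical data with one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Values beyond the limits are flagged as outliers.
outliers <- x[x < lower_limit | x > upper_limit]

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_limit, upper = upper_limit))
print(outliers)

boxplot(x, horizontal = TRUE)
```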
Density Plots (Quantitative Data)
Histogram -A histogram visualizes the distribution of a variable in terms of frequency, relative frequency, or percent frequency. -A histogram can be thought of as a simplified kernel density plot. Kernel Density Plot -A kernel density plot uses a kernel algorithm to smooth frequencies over the bins. -This yields a smoother probability density function, which will in general more accurately reflect the distribution of the underlying variable.
Which of the following statements about data management is NOT true? A) In R, the prepared datasets are usually represented as matrices. B) Data cleansing/transformation is an essential (usually the most time-consuming) part of a data analytics project. C) The goal is to get a prepared dataset that is ready for in-depth analysis. D) The "garbage in, garbage out" rule applies to data management.
In R, the prepared datasets are usually represented as matrices.
Visualization Amplifies Human Cognition
Information visualization amplifies human cognitive capability in six basic ways: -by increasing cognitive resources, such as by using a visual resource to expand human working memory -by reducing search, such as by representing a large amount of data in a small space -by enhancing the recognition of patterns, such as when information is organized in space by its time relationships -by supporting the easy perceptual inference of relationships that are otherwise more difficult to induce -by perceptual monitoring of a large number of potential events -by providing a manipulable medium that, unlike static diagrams, enables the exploration of a space of parameter values
Explain the meaning of the following statements in the box. library(magrittr) log$remote.host %>% unique %>% length
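Loads the magrittr package, then counts the number of unique values in log$remote.host, i.e., the number of unique remote hosts (IP addresses). The pipeline is equivalent to length(unique(log$remote.host)).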
Suppose we have a data frame called log and its structure is shown as below. 'data.frame': 132258 obs. of 14 variables:$ V1 : chr "65.55.147.227" "65.55.86.34" "148.188.55.88" "72.30.57.238" ...$ V2 : chr "-" "-" "-" "-" ...$ V3 : chr "-" "-" "-" "-" ...$ V4 : chr "[15/Oct/2009:02:00:24" "[15/Oct/2009:02:00:58" "[15/Oct/2009:02:01:41" "[15/Oct/2009:02:01:59" ...$ V5 : chr "+0000]" "+0000]" "+0000]" "+0000]" ...$ V6 : chr "GET /index.html HTTP/1.1" "GET /index.html HTTP/1.1" "GET /faq.html HTTP/1.1" ...$ V7 : int 200 200 200 200 200 200 200 200 200 200 ...$ V8 : int 21878 1416 10946 39943 17247 7883 18119 10946 1416 37122 ...$ remote.host : chr "65.55.147.227" "65.55.86.34" "148.188.55.88" "72.30.57.238" ...$ status : Factor w/ 7 levels "200","206","301",..: 1 1 1 1 1 1 1 1 1 1 ...$ request.datetime: POSIXct, format: "2009-10-15 02:00:24" "2009-10-15 02:00:58" "2009-10-15 02:01:41" "2009-10-15 02:01:59" ...$ weekday : chr "Thursday" "Thursday" "Thursday" "Thursday" ...$ request.method : chr "GET" "GET" "GET" "GET" ...$ request.uri : chr "index.html" "index.html" "faq.html" "contribute.txt" ... Explain the meaning of the following statements in the box. library(dplyr) log <- log %>% select(-starts_with("V"))
Loads the dplyr package, keeps only the columns whose names do not start with "V" (that is, drops V1 through V8), and assigns the result back to log.
What is the difference between paste() function and paste0() function?
They are quite similar; the only difference is the separator placed between the pieces. paste() separates them with a single space by default (sep = " "), whereas paste0() uses no separator, so paste0(...) is equivalent to paste(..., sep = "").
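A quick check of the two defaults:

```r
with_space <- paste("data", "frame")             # default sep = " "
no_space   <- paste0("data", "frame")            # no separator
same_as_0  <- paste("data", "frame", sep = "")   # equivalent to paste0()

print(with_space)
print(no_space)
```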
A Five-Number Summary (Quantitative Data)
Use the following 5 numbers to summarize data 1. Smallest value 2. First quartile (Q1) 3. Median (Q2) 4. Third quartile (Q3) 5. Largest value Procedure: Step 1: Sort the data in ascending order; Step 2: Calculate the smallest value, the three quartiles, and the largest value.
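In base R the procedure is wrapped up in fivenum(). The data below are made up for illustration, and sort() is shown only to mirror Step 1 (fivenum() sorts internally).

```r
# Hypothetical data.
x <- c(7, 1, 9, 3, 5)

# Step 1: sort in ascending order.
sorted <- sort(x)

# Step 2: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)

print(sorted)
print(five)
```

summary(x) reports similar values plus the mean, but its quartiles are computed with quantile() and may differ slightly from the hinges on some samples.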