LSUS MKT 715 Final

How do you select a variable from a data frame?

$ is the "extraction operator". It is used to view a single variable/column in a data frame, for example airlines$name (airlines = data frame and name = variable). Here the $ operator extracts only the name variable and returns it as a vector of length 16.
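
A minimal sketch of the extraction operator in action. The small airlines data frame built here is a hypothetical stand-in with three illustrative rows (the one described above has 16), so only the mechanics carry over.

```r
# Hypothetical stand-in for the airlines data frame (illustrative values only).
airlines <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name = c("American Airlines Inc.", "Delta Air Lines Inc.", "United Air Lines Inc."),
  stringsAsFactors = FALSE
)

# $ pulls out one column and returns it as a plain vector.
airline_names <- airlines$name

is.vector(airline_names)  # TRUE
length(airline_names)     # 3 (one value per row of the data frame)
```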

Differentiate between the different types of logical operators (&, |, !)

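& = AND, | = OR, ! = NOT. The single forms & and | compare vectors element by element. The double forms && and || are meant for single TRUE/FALSE values, such as in an if() condition: older versions of R looked only at the first element of each vector, and R 4.3.0 and later give an error for longer vectors. There is no double form of !.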
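
A short sketch of the three operators applied to two made-up logical vectors (x and y are illustrative names, not from the course material).

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

and_result <- x & y  # element-wise AND: TRUE FALSE FALSE
or_result  <- x | y  # element-wise OR:  TRUE TRUE FALSE
not_result <- !x     # element-wise NOT: FALSE FALSE TRUE

# && and || are for single values, for example inside an if() condition.
single_result <- x[1] && y[1]  # TRUE
```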

Over plotting -- Jitter Code

+ geom_jitter(width = 30, height = 30)

Over plotting -- Transparency Code

+ geom_point(alpha = 0.2)

4. You will be asked to examine lines of code for errors. Understand the proper structure of a ggplot line of code.

-- Good Example -- ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) + geom_point()

5 stages of team development

1. Forming 2. Storming 3. Norming 4. Performing 5. Adjourning

What is a vector?

A series of values. These are created using the c() function, where c() stands for "combine" or "concatenate." For example: c(6, 11, 13, 31, 90, 92).

group_by()

Add grouping structure to rows in data frame. Note this does not change values in data frame.

arrange()

Arrange (sort) the rows of a data frame by a variable, in ascending (default) or descending order

Factor

Categorical data are represented in R as factors. factor(x) turns a vector into a factor; you can set the levels of the factor and their order. Factors are the data objects used to categorize data and store it as levels. The levels are labels; internally, a factor is stored as an integer vector.
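
A small sketch of creating a factor with an explicit level order. The sizes vector is made up for illustration.

```r
sizes <- c("medium", "small", "large", "small")

# Set the levels and their order explicitly.
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

levels(size_factor)      # "small" "medium" "large"
as.integer(size_factor)  # 2 1 3 1 (the integer codes behind the labels)
class(size_factor)       # "factor"
typeof(size_factor)      # "integer"
```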

How do you install a package?

In RStudio: 1) Click on the "Packages" tab. 2) Click on "Install" next to Update. 3) Type the name of the package under "Packages (separate multiple with space or comma):". In this case, type ggplot2. 4) Click "Install." An alternative but slightly less convenient way is to type install.packages("ggplot2") in the console pane of RStudio and press Return/Enter on your keyboard. General form: install.packages("packagename")

2. What is the color aesthetic and the size aesthetic mainly used to illustrate?

Color is mainly used to show which group or category of a variable each observation belongs to. Size is mainly used to show the relative value (magnitude) of a variable.

Joseph Juran

Connected to the Pareto Principle. Believed that results should be "fit for use" by the customer. Believed in getting top management involved.

mutate()

Create new variables by mutating existing ones

Data Frame

Data frames are like rectangular spreadsheets: they are representations of datasets in R where the rows correspond to observations and the columns correspond to variables that describe the observations.

What is the difference between a vector, data frame, and factor?

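Vector: a series of values of the same type, created with c(). Factor: a vector that represents categorical data, with its categories stored as levels. Data frame: like a rectangular spreadsheet; a representation of a dataset in R where the rows correspond to observations and the columns correspond to variables that describe the observations. Each column of a data frame is itself a vector (or factor).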

Data Types in R (also called class) -- class()

Atomic data types: integers (whole numbers: positive, negative, or zero; examples: 2, -4; written in R with an L suffix, as in 7L), doubles/numerics, logicals, characters.
Numeric examples: 12.3, 5, 999 (this is what class() reports)
Character examples: "a", "good", "TRUE", "23.4" (labels; see factor())
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
Atomic vectors hold elements of the same type. Lists hold elements of different types.

Know the different ways you can ask R for help documentation with a dataset or function.

Examples: help(sd) or ?sd

Walter Shewhart

Father of statistical quality control

7. Differentiate conceptually between a boxplot and a histogram

Histogram: shows the distribution of one numerical variable across a series of values. It answers: What are the smallest and largest values? What is the "center" or "most typical" value? How do the values spread out? What are frequent and infrequent values? Remember: the HISTOry of that variable in numbers. Barplot: counts the frequency of the different categories of a categorical variable, like apples and oranges. Remember: frozen fruit bars.

Philip Crosby

In 1979, emphasized that the costs of poor quality far outweigh the cost of preventing poor quality. In 1984, defined the absolutes of quality management: conformance to requirements, prevention, and "zero defects".

Know how to differentiate between the different scales of measurement: (Nominal, Ordinal, Interval, Ratio)

In summary: nominal variables are used to "name," or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a "true zero" can be defined.
Nominal: used for labeling variables, without any quantitative value. Nominal scales could simply be called "labels."
Ordinal: the order of the values is what is important and significant, but the differences between them are not really known. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc.
Interval: numeric scales in which we know both the order and the exact differences between the values. The classic example is Celsius temperature, because the difference between each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees.
Source: https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/

4. Is a dataset transformed when only using the group_by() function?

It is important to note that the group_by() function doesn't change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize() function that the data frame changes.
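
A minimal sketch of this behavior, assuming the dplyr package is installed. It uses R's built-in mtcars data frame rather than a dataset from the course.

```r
library(dplyr)

# group_by() alone: same rows, same columns, same values.
grouped <- mtcars %>% group_by(cyl)

dim(grouped)         # same dimensions as mtcars
group_vars(grouped)  # "cyl" (only the grouping metadata was added)

# Only after summarize() does the data frame itself change:
# one row per group instead of one row per car.
summary_by_cyl <- grouped %>% summarize(mean_mpg = mean(mpg))
nrow(summary_by_cyl)  # 3, one row for each value of cyl
```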

Load Data versus Load Packages

Load data -- read.*() functions, such as read.csv(). Load a package -- library()

filter()

Pick out a subset of rows

%>%

Pipe (%>%) Operator: The principal function provided by the magrittr package is %>%, or what's called the "pipe" operator. This operator will forward a value, or the result of an expression, into the next function call/expression. For instance, a call to filter data can be written in either of these ways:
filter(data, variable == numeric_value)
data %>% filter(variable == numeric_value)
Both calls complete the same task, and the benefit of using %>% may not be immediately evident; however, when you want to perform multiple functions its advantage becomes obvious. For instance, if we want to filter some data, group it by categories, summarize it, and then order the summarized results we could write it out three different ways.
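
A sketch of the filter, group, summarize, and order sequence written two ways, nested and piped, assuming dplyr is installed. The built-in mtcars data and the hp > 100 cutoff are illustrative choices, not from the course material.

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- arrange(
  summarize(
    group_by(filter(mtcars, hp > 100), cyl),
    mean_mpg = mean(mpg)
  ),
  desc(mean_mpg)
)

# Piped calls: read top to bottom, one step per line.
piped <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

# Both versions produce the same table.
same_result <- identical(as.data.frame(nested), as.data.frame(piped))
```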

Kaoru Ishikawa

Promoted the use of quality circles. Developed the "fishbone" diagram. Emphasized the importance of the internal customer.

Project Charter Appendix

Risk Analysis, Gantt chart, PERT chart, Voice of Customer

3. Know the appropriate ggplot codes to create the 5NG.

Scatter Plot:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point()
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2)
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30)
Line Graph:
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line()
Histogram:
ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram()
ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(color = "white")

Line Graph versus a Scatter Plot

Scatter plot: shows the relationship between two variables. Line graph: the variable on the x-axis is sequential/ordered. Remember: a timeLINE is a line graph; time is sequential and moves in a LINE.

What is the Data Pipeline?

Software that eliminates many manual steps from the process and enables a smooth, automated flow of data from one station to the next. It starts by defining what, where, and how data is collected. It automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It provides end-to-end velocity by eliminating errors and combatting bottlenecks or latency.

W. Edwards Deming

Special vs. common cause variation The 14 points

Know when to use the str() command and the summary() command.

str(): a compact way to display the structure of an R object. You can use str() as a diagnostic function and as an alternative to summary(). It prints one line for each basic structure and is well suited to displaying the contents of lists, even nested lists. It can also be applied to a function. It gives things like the name and type of each element and, for a data frame, the number of observations and the number of variables. summary(): for a data frame, gives a summary of each individual column; for a numeric vector, gives the min, 1st quartile, median, mean, 3rd quartile, and max.
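
A brief sketch contrasting the two commands, using a made-up numeric vector and R's built-in mtcars data frame.

```r
scores <- c(1, 2, 3, 4, 100)

# summary() of a numeric vector: min, quartiles, median, mean, max.
score_summary <- summary(scores)
names(score_summary)  # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

# str() shows structure rather than statistics:
# the type, the length, and the first few values.
str(scores)

# For a data frame, str() lists every column with its type,
# while summary() reports statistics column by column.
str(mtcars)
summary(mtcars)
```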

summarize()

Summarize many values to one using a summary statistic function like mean(), median(), etc.

What does (na.rm = TRUE) mean?

That the function should remove missing values before executing commands / calculations
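
A quick sketch of the effect, using a made-up vector of temperatures with one missing value.

```r
temps <- c(50, 62, NA, 71)

# Without na.rm, a single NA makes the result NA.
mean_with_na <- mean(temps)

# With na.rm = TRUE, the NA is dropped before the mean is calculated.
mean_without_na <- mean(temps, na.rm = TRUE)  # (50 + 62 + 71) / 3 = 61
```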

5. How do you remedy a scatterplot that suffers from overplotting?

There are two methods to address overplotting.
1) Adjust the transparency of the points. Change the transparency/opacity of the points by setting the alpha argument in geom_point(). alpha can be any value between 0 and 1, where 0 makes the points 100% transparent and 1 makes them 100% opaque. By default, alpha is set to 1; in other words, if we don't explicitly set an alpha value, R will use alpha = 1. The following code is identical to the code in Section 2.3 that created the scatterplot with overplotting, but with alpha = 0.2 added to geom_point():
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2)
2) Add a little random "jitter", or random "nudges", to each of the points. In other words, give each point a small "nudge" in a random direction. Keep in mind that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged. To create a jittered scatterplot, use geom_jitter() instead of geom_point(). The following code is very similar to the code that created the scatterplot with overplotting in Subsection 2.3.1, but with geom_point() replaced by geom_jitter():
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30)

Know how to create a regression table.

This is done in two steps. First, we "fit" the linear regression model using the lm() function and save it in score_model:
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch6)
Second, we get the regression table by applying get_regression_table() from the moderndive package to score_model:
# Get regression table:
get_regression_table(score_model)
A related function returns all fitted/predicted values for a fitted model:
# Get all fitted/predicted values
get_regression_points(model_score_2)
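
A sketch of the same two-step pattern using only base R, for when the moderndive package is not at hand. This swaps in summary() for get_regression_table() and the built-in mtcars data for evals_ch6, so the variable names here (mpg, wt, mpg_model) are illustrative assumptions.

```r
# Step 1: fit the regression model and save it.
mpg_model <- lm(mpg ~ wt, data = mtcars)

# Step 2: pull the coefficient table out of the model summary.
# Rows are the intercept and the slope; columns are the estimate,
# standard error, t value, and p-value.
coef_table <- summary(mpg_model)$coefficients
coef_table

# Fitted/predicted values, one per row of the data.
fitted_values <- fitted(mpg_model)
```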

Know how to load datasets that come with R.

Through the RStudio menus: 1) Environment --> Import Dataset 2) Tools --> Import Dataset. From the console: read.table("x"), read.csv(), read.delim() (and lots of other types listed in the R console help notes).

How do you load a package?

We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane: library(ggplot2). You can also go to the list of packages in the Packages tab and tick the checkbox next to the package name; the video called this "attaching" / "docking" the package.

Know the three basic steps to an Exploratory Data Analysis and how to conduct it through R.

You should always perform an exploratory data analysis (EDA) of your variables before any formal modeling.
ONE: Most crucially, look at the raw data values.
glimpse(df)
TWO: Compute summary statistics, such as means, medians, and interquartile ranges.
df %>% summarize(mean_bty_avg = mean(bty_avg), mean_score = mean(score), median_bty_avg = median(bty_avg), median_score = median(score))
evals_ch6 %>% select(score, bty_avg) %>% skim()
THREE: Create data visualizations.
ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_point() + labs(x = "Beauty Score", y = "Teaching Score", title = "Scatterplot of relationship of teaching and beauty scores")

Total Quality Management (TQM)

a comprehensive approach - led by top management and supported throughout the organization - dedicated to continuous quality improvement, training, and customer satisfaction

Know how to identify a default argument within a function.

args(function)
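
A small sketch of reading default arguments. It uses mean.default (the default method behind mean()) as the example function; any argument shown with "= value" in the printed signature has that value as its default.

```r
# Print the function signature, defaults included.
args(mean.default)

# formals() returns the same arguments as a list,
# which makes individual defaults easy to look up.
defaults <- formals(mean.default)
defaults$na.rm  # FALSE: missing values are kept unless you say otherwise
defaults$trim   # 0: no trimming by default
```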

DPLYR arrange

arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp.
freq_dest %>% arrange(num_flights)
arrange() always returns rows sorted in ascending order by default. To switch the ordering to "descending" order instead, we use the desc() function like so:
freq_dest %>% arrange(desc(num_flights))

Which command will check the data type of a variable?

class() - what kind of object is it (high-level)? typeof() - what is the object's data type (low-level)?
> y <- x/5 + rnorm(10)
> class(y)
[1] "numeric"
> typeof(y)
[1] "double"
> g <- lm(y ~ x)
> class(g)
[1] "lm"
> typeof(g)
[1] "list"
A variable provides us with named storage that our programs can manipulate. In fact, everything in R is an object. An object is a data structure having some attributes and methods which act on its attributes. Objects are where values are saved in R. Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Example: blood type (categories = levels).
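
A self-contained sketch of class() versus typeof() on a few made-up objects (the names values, whole_numbers, and small_df are illustrative).

```r
values <- c(1.5, 2, 3)
class(values)   # "numeric"
typeof(values)  # "double"

whole_numbers <- c(1L, 2L, 3L)
class(whole_numbers)   # "integer"
typeof(whole_numbers)  # "integer"

small_df <- data.frame(a = 1:3)
class(small_df)   # "data.frame"
typeof(small_df)  # "list" (a data frame is stored as a list of columns)
```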

Know how to compute and interpret a correlation in R

df %>% get_correlation(formula = score ~ bty_avg)
We put the name of the response variable on the left-hand side of the ~ "tilde" sign and the name of the explanatory variable on the right-hand side. OR:
df %>% summarize(correlation = cor(score, bty_avg))
A correlation coefficient is a quantitative expression of the strength of the linear relationship between two numerical variables. Its value ranges between -1 and 1, where:
-1 indicates a perfect negative relationship: as the value of one variable goes up, the value of the other variable tends to go down.
0 indicates no relationship: the values of both variables go up/down independently of each other.
+1 indicates a perfect positive relationship: as the value of one variable goes up, the value of the other variable tends to go up as well.
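
A base R sketch of computing and reading a correlation, using cor() directly on two columns of the built-in mtcars data (car weight and miles per gallon). These variables are an illustrative stand-in for score and bty_avg.

```r
# Correlation between car weight (wt) and fuel economy (mpg).
r <- cor(mtcars$wt, mtcars$mpg)
r

# Interpretation: the sign gives the direction, the size gives the strength.
direction <- if (r < 0) "negative" else "positive"
direction  # "negative": heavier cars tend to have lower mpg

# The order of the two variables does not matter for cor().
r_swapped <- cor(mtcars$mpg, mtcars$wt)
```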

Sample set of rows

evals_ch6 %>% sample_n(size = 5)

6. When should you use facet_wrap?

facet_wrap() wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular. Think of facet_wrap() as a ribbon of plots that arranges panels into rows and columns and chooses a layout that best fits the number of panels. Example: show multiple histograms, one for each month:
ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 5, color = "white") + facet_wrap(~ month)

1. Know when to use the different dplyr verbs (filter, arrange, group_by, mutate).

Short version:
filter(): Pick out a subset of rows.
summarize(): Summarize many values to one using a summary statistic function like mean(), median(), etc.
group_by(): Add grouping structure to rows in a data frame. Note this does not change values in the data frame.
mutate(): Create new variables by mutating existing ones.
arrange(): Arrange (sort) the rows of a data frame by a variable, in ascending (default) or descending order.
inner_join(): Join/merge two data frames, matching rows by a key variable.
Longer version:
filter() a data frame's existing rows to only pick out a subset of them. For example, the alaska_flights data frame.
summarize() one of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range of temperatures, as we saw in Section 2.7 on boxplots. Note there is a subtle but important difference between sum() and n(): while sum() returns the sum of a numerical variable, n() returns a count of the number of rows/observations.
group_by() its rows. In other words, assign different rows to be part of the same group. Then we can combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don't want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one for each of the three origin airports.
mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.
arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp.
join() it with another data frame by matching along a "key" variable. In other words, merge these two data frames together.

DPLYR filter

filter() a data frame's existing rows to only pick out a subset of them. For example, the alaska_flights data frame.
portland_flights <- flights %>% filter(dest == "PDX")
View(portland_flights)
btv_sea_flights_fall <- flights %>% filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
View(btv_sea_flights_fall)

In tidy data: Each variable ...

forms a column.

In tidy data: Each observation

forms a row.

In tidy data: Each type of observational unit

forms a table.

Know how to convert an untidy dataset to a tidy dataset using the gather() function.

gather() "collapses multiple columns into two columns": key and value. General form: gather(key, value, columns). The new names you give for key and value can be put in quotes; the columns being gathered do not need quotes because they already exist. Think about the KEY as turning a KEY in a lock: it moves a single row into a single column (and the names repeat as needed).
drinks_smaller_tidy <- drinks_smaller %>% gather(key = type, value = servings, -country)
drinks_smaller_tidy
drinks_smaller_tidy <- drinks_smaller %>% gather(key = type, value = servings, c(beer, spirit, wine))
drinks_smaller_tidy
key is the name of the variable in the new "tidy" data frame that will contain the column names of the original data. Observe how we set key = type. In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol: beer, spirit, and wine.
value is the name of the variable in the new "tidy" data frame that will contain the rows and columns of values of the original data. Observe how we set value = servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 × 3 = 12 numerical values.
The third argument is the columns you either want to or don't want to tidy. Observe how we set this to -country, indicating that we don't want to tidy the country variable in drinks_smaller and rather only beer, spirit, and wine.

Know how to create a ggplot using the tidy dataset you created (figure 4.4 in textbook)

ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) + geom_col(position = "dodge")
The fill aesthetic of any bar corresponds to the color used to fill the bars.

Understand how to fit a "best fitting" regression line with ggplot.

ggplot(evals_ch6, aes(x = bty_avg, y = score)) + geom_point() + labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship between teaching and beauty scores") + geom_smooth(method = "lm", se = FALSE) The method = "lm" argument sets the line to be a "linear model" i.e. a line, while the se = FALSE argument suppresses "standard error" uncertainty bars.

DPLYR group_by

group_by() its rows. In other words, assign different rows to be part of the same group. Then we can combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don't want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one for each of the three origin airports.
summary_monthly_temp <- weather %>% group_by(month) %>% summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
To remove the grouping structure again, use ungroup():
diamonds %>% group_by(cut) %>% ungroup()
If you want to group_by() two or more variables, include all the variables at the same time in the same group_by(), with a comma between the variable names.

Dysfunctional teams

lack of trust fear of conflict lack of commitment avoidance of accountability inattention to results

How to write the factor function into the graph code

mapping = aes(x = factor(month), y = temp)

DPLYR mutate

mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.
weather <- weather %>% mutate(temp_in_C = (temp - 32) / 1.8)
flights <- flights %>% mutate(gain = dep_delay - arr_delay)
flights <- flights %>% mutate(gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours)

What is tidy data

A set of rules by which data is saved. In the context of doing data science in R, long/narrow format is also known as "tidy" format. In order to use the ggplot2 and dplyr packages for data visualization and data wrangling, your input data frames must be in "tidy" format. Thus, all non-"tidy" data must be converted to "tidy" format first. In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table.

1. Know the five named graphs (5NG) and when to use each one. Define each.

Scatterplots: They allow you to visualize the relationship between two numerical variables. Also called bivariate plots.
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point()
Linegraphs: Show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words, there is an inherent ordering to the variable. The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called time series plots.
ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp)) + geom_line()
Boxplots: While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is side-by-side boxplots. A boxplot divides the data into four parts that each contain 25% of the observations. In the first line below month is numerical, so it is converted with factor() in the second line to get one box per month.
ggplot(data = weather, mapping = aes(x = month, y = temp)) + geom_boxplot()
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot()
Histograms: A histogram is a plot that visualizes the distribution of a numerical variable as follows: We first cut up the x-axis into a series of bins, where each bin represents a range of values. For each bin, we count the number of observations that fall in the range corresponding to that bin. Then for each bin, we draw a bar whose height marks the corresponding count. It's basically a bar chart of the bin counts.
ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram()
Barplots: Another common task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories of a categorical variable, also known as the levels of the categorical variable. Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts).
ggplot(data = fruits, mapping = aes(x = fruit)) + geom_bar()
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + geom_col()
When the categorical variable whose distribution you want to visualize is not pre-counted in your data frame, use geom_bar(). When it is pre-counted in your data frame, use geom_col() with the y-position aesthetic mapped to the variable that has the counts. Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time.
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar()

str() (data frame)

Shows you the structure of your dataset. For a data frame it tells you: the total number of observations (e.g. 32 car types); the total number of variables (e.g. 11 car features); a full list of the variable names (e.g. mpg, cyl ...); the data type of each variable (e.g. num); the first observations.

What is the purpose of the str() and summary() command?

str(): compactly displays the structure of an R object. summary(): a generic function used to produce result summaries, including summaries of the results of various model-fitting functions; it invokes particular methods that depend on the class of the first argument. summary() vs. summarize(): summarize() applies summary functions to columns to create a new table of summary statistics; summary functions take vectors as input and return one value.

