MATH144: Data Analysis with R (Coursera)
Correlation coefficient interpretation
+1 = large positive relationship; -1 = large negative relationship; 0 = no relationship
Grid Search
Makes tuning a model's hyperparameters easier by training a new model for every candidate parameter value and picking the value that gives the smallest error.
Linear regression assumptions
1) Linear relationship 2) Independence of each observation 3) Normality: for any fixed value of X, Y is normally distributed 4) Equal variance (homoscedasticity): the variance of the residuals is the same for any value of X. Mnemonic: LINE.
1. Format - way data is encoded (.csv, .json, .xlsx) 2. File path of dataset - where data is stored, either on local file or online
2 factors to consider when using the R readr package
1. R automatically assigns types based on the encoding it detects, and these may be incorrect 2. To see which R tidyverse functions can be applied to a specific variable
2 reasons to check data types in a dataset
Accuracy of data and Efficiency with which we're able to access the data
2 things that determine the value of data
1. select() - selects variables by their names, 2. filter() - filters observations based on values 3. summarize() - computes summary statistics, 4. arrange() -reorders the rows 5. mutate() -creates new variables
5 key dplyr functions
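The five verbs on the card above can be chained in one pipeline. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in dataset, and the column name wt_kg and the cutoff of 15 are made up for the example.

```r
library(dplyr)

# mtcars is a stand-in dataset (assumption for illustration only).
cars_summary <- mtcars %>%
  select(mpg, cyl, wt) %>%           # select(): keep three variables by name
  filter(mpg > 15) %>%               # filter(): keep rows with mpg above 15
  mutate(wt_kg = wt * 453.592) %>%   # mutate(): new variable (wt is in 1000 lbs)
  arrange(desc(mpg)) %>%             # arrange(): reorder rows, highest mpg first
  summarize(mean_mpg = mean(mpg),    # summarize(): collapse to one summary row
            n_cars = n())

cars_summary
```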
a strong correlation
A large F-test score and a small P-value indicate
High bias
An underfit model is said to have which of the following?
What is the target variable?
ArrDelay: arrival delay in minutes. This is the value that we want to predict from the dataset.
linear_model <- lm(Y ~ X, data = new_dataset)
Assume you have a dataset called "new_dataset", a predictor variable called X, and a target called Y, and you want to fit a simple linear regression model. Which command should you use?
linear_model <- lm(Z ~ X + Y, data = new_dataset)
Assume you have a dataset called "new_dataset", two predictor variables called X and Y, and a target variable called Z, and you want to fit a multiple linear regression model. Which command should you use?
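As a rough, runnable counterpart to the two lm() cards above, the sketch below fits both kinds of model on the built-in mtcars data. The choice of mpg as the target and wt and hp as predictors is an assumption made only for the example.

```r
# mtcars stands in for new_dataset: mpg is the target, wt and hp are predictors.
simple_model   <- lm(mpg ~ wt, data = mtcars)       # simple: one predictor
multiple_model <- lm(mpg ~ wt + hp, data = mtcars)  # multiple: two predictors

coef(simple_model)    # intercept and the slope for wt
coef(multiple_model)  # intercept plus one coefficient per predictor
```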
coefficient close to +/-1 and a P-value less than 0.001
Based on Pearson correlation, a strong correlation between variables is
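One way to obtain both Pearson values in base R is cor.test(). The sketch below uses the built-in mtcars data purely as an illustration; the variable pair is an arbitrary choice.

```r
# Pearson correlation between car weight and fuel efficiency (example data).
pearson <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

r_value <- unname(pearson$estimate)  # correlation coefficient, between -1 and +1
p_value <- pearson$p.value           # P-value for the test that the true correlation is 0

r_value
p_value
```

Here the coefficient is close to -1 and the P-value is well below 0.001, which the card above would read as a strong (negative) correlation.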
What is R squared?
The coefficient of determination: a measure of how close the data are to the fitted regression line.
.csv file
Common file format for datasets that separates the column values within each row with commas
1. Predicting the future 2. Answering questions 3. Discovering useful information
Data analysis plays an important role in which of the following scenarios?
What are the predictors?
Distance, CarrierDelay, WeatherDelay, NASDelay (National Aviation System), SecurityDelay, or LateAircraftDelay. These are the categories for reasons of delay
Bias-variance tradeoff
Bias: error due to simplistic assumptions (underfitting). Variance: error due to complexity (overfitting, too much noise). The bias-variance decomposition looks for the optimally reduced amount of error, with neither high bias nor high variance.
summarize() and group_by(): summarizes each group into a single-row summary of that group
Function used for describing data
Why use a regression plot? (scatter plot, target values, fitted regression line)
Good estimate of: 1. Relationship between two variables 2. Strength of correlation 3. Direction of relationship
89% of the response variable variation is explained by a linear model.
How should you interpret an R-squared result of 0.89?
1. Check with the data collection source 2. Drop missing values (drop the variable or the data entry, whichever has the least impact) 3. Replace the missing values, usually with an average or mode (but this is less accurate) 4. Leave it missing
How to deal with missing data?
- Add dummy variables for each unique category - Assign 0 or 1 in each category
How to make categorical to numeric values?
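The dummy-variable idea above can be sketched in base R with model.matrix(). This is a swapped-in technique for illustration, not the spread() approach the course uses, and the carrier codes are made-up sample values.

```r
# Hypothetical categorical column with three unique categories.
flights <- data.frame(carrier = factor(c("AA", "DL", "UA", "AA")))

# "- 1" drops the intercept so every category gets its own 0/1 column.
dummies <- model.matrix(~ carrier - 1, data = flights)

dummies  # one column per category; each row has a single 1
```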
Bell-shaped curve
How to tell if it is a normal distribution?
Valid data can be treated as missing data
If you don't ensure that data is stored in the correct format (such as numeric or character), what can happen?
Column
In a dataset, a ___________________________ is also referred to as a variable, feature, or attribute.
Relevant data.
In model development, you can develop more accurate models when you have which of the following?
Coherence IS relevant. Coherence is an indication of the quality of the information in a single data set.
Is coherence irrelevant when assessing the quality of a data set?
Difference of MSE value for MLR and SLR models
The MSE for a Multiple Linear Regression model will be smaller than the MSE for a Simple Linear Regression model. A polynomial model will also have a smaller MSE.
x_new = (x_old - x_min) / (x_max - x_min). The new values range between 0 and 1.
Min-Max
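A minimal sketch of min-max scaling, assuming a small made-up vector of values; the helper name min_max is hypothetical.

```r
# Min-max scaling: subtract the minimum, then divide by the range.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

delays <- c(5, 20, 35, 50)  # made-up values
scaled <- min_max(delays)

scaled  # 0, 1/3, 2/3, 1
```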
What does Quantile-quantile (Q-Q plot) evaluate?
Normality. Errors are normally distributed if the points are close to the 45-degree (diagonal) line.
What problem is being solved?
Predict the flight delay rate from LAX to JFK (the locations are the fixed variables). The goal of this project is to predict "ArrDelay" for a given flight date and airline.
What is regularization
Regularization is a technique you can use to avoid overfitting by restricting the magnitude of model coefficients.
What is Residual plot good for?
Residuals vs Fitted plot is good for linearity and homoscedasticity
x_new = x_old / x_max. Divides each value by the maximum value for that feature, so the new values range between 0 and 1.
Simple Feature scaling
One-hot encoding
The process by which categorical variables are converted into binary form (0 or 1) using the spread() function
a. Data wrangling b. Data pre-processing c. Data cleaning
The process of converting or mapping data from the initial raw form to another format to prepare it for further analysis goes by several names. What is this process commonly called? Select three answers.
Histogram
To visualize its distribution, binned data is often plotted in which of the following type of chart?
Training set vs Test set
Training set: the larger portion of the data, used to build the model. Test set: the smaller portion (20%), used to evaluate how the model performs in the real world.
False. To return the expected coefficients, you must set the raw parameter to TRUE.
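An 80/20 split like the one described above could be done in base R as follows. This is a sketch only: the seed, the use of mtcars, and the exact split method are assumptions, and the course may use a tidymodels helper instead.

```r
set.seed(42)  # makes the random split reproducible

n <- nrow(mtcars)                              # 32 rows in the example data
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of the row numbers, chosen at random

train_set <- mtcars[train_idx, ]   # used to build the model
test_set  <- mtcars[-train_idx, ]  # held out to evaluate the model

nrow(train_set)
nrow(test_set)
```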
True or False: When using the poly() function to fit a polynomial regression model, you must specify "raw = FALSE" so you can get the expected coefficients.
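The effect of the raw argument can be checked directly. In the sketch below (mtcars as example data, a degree-2 fit chosen arbitrarily), raw = TRUE reproduces the coefficients of a hand-written polynomial, while the default raw = FALSE uses orthogonal polynomials: the fitted values are the same but the coefficients are on a different scale.

```r
raw_fit    <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)
manual_fit <- lm(mpg ~ wt + I(wt^2), data = mtcars)  # same model written by hand
ortho_fit  <- lm(mpg ~ poly(wt, 2), data = mtcars)   # default: raw = FALSE

coef(raw_fit)    # matches coef(manual_fit)
coef(ortho_fit)  # different numbers, same fitted curve
```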
the correlation coefficient and P-value
Two values provided by Pearson correlation are
- Manually by scientists - Digitally every time you click on a website or access an app on your mobile device
Ways that data are collected
Train the model, Make Predictions, Compute Metrics.
What are the 3 steps to evaluating a model?
Residuals vs Fitted plot, Q-Q plot, Scale-Location plot, Residuals vs Leverage plot
What are the 4 different diagnostic plots?
a. Determine the relationships between variables. b. Identify any special structures that may exist in the data. c. Understand how the data were generated.
What are the key reasons to develop a model for your data analysis? Select three answers.
- Integer - Double - Logical - Character - Date
What are the known data types in tidyverse? (IDLCD)
F-test score: variation between sample group means divided by variation within each sample group. A larger value means a larger difference between the group means. P-value: degree of confidence, the statistical significance of the results. Smaller than 0.1 or 0.05 = significant.
What are the two values obtained from ANOVA? aov() function
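A small sketch of pulling both ANOVA values out of aov(), using mtcars with cyl as the grouping variable. The dataset and grouping are assumptions made for the example.

```r
# Compare mean mpg across the cylinder groups (4, 6, 8).
anova_fit   <- aov(mpg ~ factor(cyl), data = mtcars)
anova_table <- summary(anova_fit)[[1]]  # the ANOVA table as a data frame

f_value <- anova_table[["F value"]][1]  # F-test score for the grouping variable
p_value <- anova_table[["Pr(>F)"]][1]   # its P-value

f_value
p_value
```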
a. Minimize the effects of outliers, which can influence the result more. b. Enables a fair comparison between the different features and making sure they have the same impact.
What are two benefits of data normalization?
mean squared error (MSE), the root mean squared error (RMSE), the mean absolute error (or MAE), R-squared.
What are ways to evaluate regression models?
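The four metrics above can be computed from a fitted model's residuals. The sketch below does so on the training data of a simple mtcars model, which is an assumption for brevity; in practice they would usually be computed on a held-out test set.

```r
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)  # actual minus predicted values

mse  <- mean(res^2)             # mean squared error
rmse <- sqrt(mse)               # root mean squared error
mae  <- mean(abs(res))          # mean absolute error
r2   <- summary(fit)$r.squared  # R-squared

c(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2)
```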
The mean, the total number of data points, the standard deviation, the quartiles, and the extreme values, which give an idea of the distribution of the variables
What can summary statistics show?
1. Predict data trends 2. Analyze and interpret data 3. Solve problems and drive effective decision-making in real-world scenarios.
What can we do with data analysis?
?, zero, N/A, or blank cell
What does a missing value in a dataset appear as?
Noise (or bias) which results in a significant drop in variance.
What does regularization introduce into a model that results in a drop in variance?
separate() or correct data types with mutate_all() and mutate_if()
What functions are used to reformat columns?
N-fold cross-validation
What is a cross validation technique that can help improve the evaluation of models when datasets are small?
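N-fold cross-validation can be sketched by hand in base R. The version below assumes 4 folds, the mtcars data, and a simple linear model; the course itself may rely on tidymodels functions instead.

```r
set.seed(1)
k <- 4
# Assign each row a random fold label from 1 to k (8 rows per fold here).
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: train on the other folds, then score on the held-out fold.
fold_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
})

cv_mse <- mean(fold_mse)  # average held-out error across the folds
cv_mse
```

Every observation is used for testing exactly once, which is why the method helps when the dataset is small.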
count() function
What is a method to summarize categorical data?
Increasing model complexity helps with underfitting, for example switching from a linear to a non-linear model.
What is a strategy you can employ to address an underfit model?
Pearson correlation
What is a way to measure correlation between continuous numerical variables?
Creating discrete categories, called bins, from numerical values (often continuous). Can improve the accuracy of predictive models. Helps understand the data distribution.
What is binning?
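Binning can be illustrated with base R's cut(). The delay values, break points, and labels below are all made up for the example.

```r
delays <- c(2, 14, 31, 47, 58)  # made-up arrival delays in minutes

# Intervals are (0, 20], (20, 40], (40, 60] because cut() closes on the right by default.
bins <- cut(delays,
            breaks = c(0, 20, 40, 60),
            labels = c("low", "medium", "high"))

table(bins)  # low: 2, medium: 1, high: 2
```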
An indication of the quality of the information within a single dataset. Fully coherent data are logically consistent and can be reliably combined for analysis
What is coherence in statistics?
Bringing data into a common standard of expression that allows you to make meaningful comparisons.
What is data formatting?
A way to bring all data into a similar range. Enables a fair comparison between different features.
What is data normalization?
converting or mapping data from one "raw" form into another format to prepare it for further analysis.
What is data pre-processing? (aka data cleaning, data wrangling)
The variance of residual is the same for any value of X.
What is the definition of the assumption homoscedasticity?
Lasso regression penalizes the sum of the absolute values of the coefficients while Ridge regression penalizes the sum of squared coefficients.
What is the difference between Ridge and Lasso regression?
Understanding the variables
What is the first critical step to data analysis?
Both RETURN a statistical summary of the data
What is the main similarity between the summarize() and group_by() functions?
Unzip the file using the untar() function.
What step must you perform next after you download a dataset file from a URL?
Counts the missing values in all columns in the dataset. If the period is replaced with a variable name, only missing values for that variable are counted.
What is the result of the following statement? sub_airline %>% map(~sum(is.na(.)))
readr: parses a flat file into a tibble
What packages does Data Import and Management group include?
ggplot2: produces charts and visualizations, e.g. box plots, density plots, violin plots, tile plots, and time series plots
What packages does Data Visualization and Exploration group include?
dplyr and tidyr: help filter and group data
What packages does Data Wrangling and Transformation group include?
purrr: provides statistics, e.g. calculating the mean for each column
What packages does Functional Programming group include?
Use or create a data frame containing never-before-seen data. This set of randomly selected values provides new predictor values on which to base the predictions.
What step must you take before you can obtain a prediction based on a fitted simple linear regression model?
This depends on your data. The model that fits the data better has the smaller MSE.
When comparing linear regression models, when will the mean squared error (MSE) be smaller?
Lowest value of MSE
When comparing the MSE of different models, do you want the highest or lowest value of MSE?
Boxplots
When conducting exploratory data analysis, which visualizations are particularly useful for examining the distribution of numerical data and skewness through displaying the data quartiles (or percentiles) and averages?
Heatmaps
When conducting exploratory data analysis, which visualizations are particularly useful for plotting the target variable over multiple variables to get visual clues of the relationship between these variables and the target.
an overfit model
When evaluating models, what is the term used to describe a situation where a model fits the training data very well but performs poorly when predicting new data?
summarize() function produces descriptive statistics based on the groups defined.
When grouping data and calculating the mean of each group as part of your exploratory data analysis, you typically use the group_by() function with which other function?
Error (grid search)
When tuning a model, a grid search attempts to find the value of a parameter that has the smallest ______________.
The P value is less than 0.001.
When using the Pearson method to evaluate the correlation between two variables, how do you know you can have strong certainty in the result?
The correlation coefficient is 0.85 and the P-value is 0.00037: the coefficient is close to 1 and the P-value is less than 0.001.
When using the Pearson method to evaluate the correlation between two variables, which set of numbers indicates a strong positive correlation?
The default confidence level is 95%.
When using the predict() function in R, what is the default confidence level?
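The default level can be checked by comparing a call that omits level with one that sets it to 0.95. The sketch below assumes a simple mtcars model and a hypothetical data frame of unseen weights.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# New, unseen predictor values must be supplied in a data frame
# whose column names match the predictors in the model.
new_cars <- data.frame(wt = c(2.5, 3.5))  # hypothetical weights

pred_default <- predict(fit, newdata = new_cars, interval = "confidence")
pred_95      <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

pred_default  # columns: fit, lwr, upr
```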
Simple feature scaling
Which data normalization technique divides each value by the maximum value for that variable, resulting in new values that range between 0 and 1?
read_delim()
Which function can you use to read a text file that uses the "%" character as a delimiter?
replace_na()
Which function replaces missing values in a dataset?
The mutate_all() function specifies that the change should apply to all variables in the dataset. The mutate_if() function changes any instance of one data type to another data type.
Which functions do you use together to correct data types in all columns of your dataset? Select two answers.
Convert categorical variables to dummy variables and assign the value of another variable to each category. Convert categorical variables to dummy variables.
Which of the following can you accomplish using the spread() function? Select two answers.
Descriptive statistics
Which of the following forms of exploratory data analysis generates short summaries about the sample and measures of the data?
Analysis of variance (ANOVA): a statistical comparison of groups of data
Which of the following forms of exploratory data analysis is a statistical comparison of groups of data?
A model helps predict a value given one or more other variables
Which of the following is NOT true about a model?
The degree of slope has nothing to do with correlation. (This statement is FALSE: the degree of slope indicates the strength of the correlation.)
Which of the following is NOT true about a regression line?
A small F-test score implies a poor correlation between variable categories and the target variable A large F-test score implies a strong correlation between variable categories and the target variable.
Which of the following statements about the ANOVA F-test score are true? Select two answers.
The correlation coefficient is greater than zero. Both variables move in the same direction.
Which of the following statements describe a positive correlation between two variables? Select two answers.
MSE is the mean of the square of the residuals. You can get this value in R by squaring the residuals from the model and then taking the mean.
Which performance metric for regression is the mean of the square of the residuals (error)?
Q-Q plot (quantile-quantile plot)
Which plot type helps you validate assumptions about normality?
Regression plot and Residual plot
Which plot types help you validate assumptions about linearity? Select two answers.
Determining if a model can be generalized for a broader group. Working with models with small amounts of data.
Which situations are helped by using the cross-validation method to train your model? Select two answers.
grid_regular()
Which tidymodels function do you use to create the grid for a grid search?
readr
Which tidyverse package is used for data import and management?
2nd quartile is the median of the observations in the dataset.
With data binning, observations are often organized into defined intervals called quartiles. Which quartile is the median of the dataset?
Evaluate how you plan to use this variable in your data analysis.
You are checking your data using the glimpse() function before beginning your analysis and determine that the data type of a variable called TimeStamp is in a character format. What should you do next?
Add a regression line.
You can visualize the correlation between two variables by plotting them on a scatter plot and then doing which of the following?
dataframe %>% separate(Status, sep = "-", into = c("error_type", "severity_level")). The separate() function reformats the column by separating the values in one column into two or more columns.
You have a variable called "Status" that contains a status code in the format "error_type-severity_level", for example "10-07", and you want to reformat the column so that the "error_type" and "severity_level" are in different columns. What is the correct function to do this?
sales_data$Date. To refer to this column, separate the data frame name and column name using the $.
You want to access the "Date" column of a data frame called sales_data so you can perform an operation on it. What is the correct way to refer to this column?
x_new = (x_old - mean) / standard deviation. The new values typically range between -3 and 3.
Z-score or standard score
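A minimal sketch of the z-score calculation on a made-up vector. After standardizing, the values have mean 0 and standard deviation 1.

```r
x <- c(10, 20, 30, 40, 50)  # made-up values

# Subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

z
mean(z)  # 0 (up to floating-point rounding)
sd(z)    # 1
```

Base R's scale() performs the same calculation for each column of a matrix or data frame.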
boxplots
____ are a great way to visualize numeric data, since you can see the various distributions of the data. With boxplots, you can easily spot outliers and the distribution and skewness of the data.
polynomial regression
a regression model which does not assume a linear relationship; a curvilinear correlation coefficient is computed
Analysis of Variance (ANOVA)
A statistical comparison of groups. Finds the correlation between different groups of a categorical variable.
What is a residuals vs leverage plot?
allows us to identify influential observations in a regression model
Separating data into training and testing sets
an important part of model evaluation
Data Asset eXchange
an important resource for sample datasets
What does a good residual plot look like?
Data points are evenly distributed around the x-axis, and there is no curvature.
How do you measure MSE?
Find the difference between the actual value y and the predicted value y hat (the residual), square each residual, then take the mean of the squared residuals.
Descriptive statistical analysis
helps to describe the basic features of a dataset and generates a short summary about the sample and measures of the data
P-value interpretation (certainty of the correlation)
less than 0.001 = strong certainty; less than 0.05 = moderate certainty; between 0.05 and 0.1 = weak certainty; larger than 0.1 = no certainty
Data acquisition
process of loading and reading data from various sources
What is a curvilinear relationship?
squaring or setting higher order terms of the predictor variables
Scale-location plot
The square root of the standardized residuals vs. the predicted values. Checks homoscedasticity.
1. Data Wrangling and Transformation 2. Data Import and Management 3. Functional Programming 4. Data Visualization and Exploration
tidyverse library groups
scatter plots
used for continuous variables