MIST5620 Q3
Normalization
Divide degree by the maximum possible, i.e., (N - 1), where N is the total number of nodes in the network
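A minimal sketch of this normalization in base R, using a hypothetical 4-node adjacency matrix (no packages needed; libraries such as igraph compute this directly):

```r
# Hypothetical undirected 4-node network as an adjacency matrix
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0),
            nrow = 4, byrow = TRUE)
N <- nrow(A)                     # total number of nodes in the network
degree <- rowSums(A)             # undirected degree of each node
norm_degree <- degree / (N - 1)  # divide by the maximum possible, N - 1
norm_degree                      # node 1 is tied to every other node, so it gets 1.0
```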
Model Evaluation Variables
Error for a data record = predicted (p) minus actual (a)
•RMSE: Root Mean Squared Error - how much the predicted values diverge from the actual values, on average
•MAE: Mean Absolute Error
•MAPE: Mean Absolute Percentage Error
•Total SSE: Total Sum of Squared Errors
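A sketch of these measures in base R, for hypothetical actual and predicted vectors:

```r
# Hypothetical actual and predicted values
actual    <- c(10, 12, 15, 11)
predicted <- c(11, 11, 14, 13)

error <- predicted - actual               # error = predicted (p) minus actual (a)
rmse  <- sqrt(mean(error^2))              # Root Mean Squared Error
mae   <- mean(abs(error))                 # Mean Absolute Error
mape  <- mean(abs(error / actual)) * 100  # Mean Absolute Percentage Error (in %)
sse   <- sum(error^2)                     # Total Sum of Squared Errors
```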
Linear interpretation
Find the values of the intercept (b0) and slope (b1) that best fit - try to get the predicted value of y as close as possible to the actual value of y
Model Evaluation
How to assess model quality?
•A common measure is the Misclassification Rate: the number of wrongly predicted outcomes divided by the total number of observations
•Also useful to look at False Positives and False Negatives
•False Positive (we predicted 1 but the actual was 0)
•False Negative (we predicted 0 but the actual was 1)
•(False Negatives + False Positives) / total observations = Misclassification Rate
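A sketch of these counts in base R, for hypothetical 0/1 actual and predicted vectors:

```r
# Hypothetical actual outcomes and model predictions (0/1)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 0, 1, 0)

fp <- sum(predicted == 1 & actual == 0)  # false positives: predicted 1, actual 0
fn <- sum(predicted == 0 & actual == 1)  # false negatives: predicted 0, actual 1
misclass_rate <- (fp + fn) / length(actual)  # wrong predictions / total observations
```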
Model evaluation
How well the model predicts new data (not how well it fits the data it was trained with)
•Key component of most measures is the difference between the actual outcome and the predicted outcome (i.e., error)
Multiple linear regression techniques
Multiple linear regression is a set of statistical techniques used to assess the relationship between one dependent variable (DV), or target attribute, and several independent variables (IVs), or predictor attributes. Regression is used to make predictions; correlation is used to assess the relationship between the DV and the IVs. Regression techniques can be applied to a data set in which the IVs are correlated with one another and with the DV to varying degrees
Multiple regression model
Multiple regression is an extension of simple linear regression in which several IVs, instead of just one, are combined to predict a value on a DV for each case.
y = b0 + b1x1 + b2x2 + ... + bkxk
•y = target (or dependent variable)
•b0 = y-intercept
•b1 ... bk = coefficients assigned to each of the IVs
•x1 ... xk = predictors (or independent variables)
Disadvantages of Explanatory Modeling
Outliers - high sensitivity to the data: erroneous or otherwise outlying data points can severely skew the resulting linear function. The data may require intensive manual manipulation and transformations
Logistic Regression: Kinds of questions
Prediction of group membership
•Can hay fever be predicted from the geographic area, season, degree of nasal stuffiness, and body temperature?
Importance of predictors
•Does a particular variable increase or decrease the probability that a case has hay fever?
•How much is the model harmed by eliminating a predictor?
•How much are the odds of observing an outcome changed by the predictor?
Classification of cases (Tabachnick)
Machine learning use
Predictive modeling - the goal is to predict the target using a new dataset where we have values for the predictors but not the target
•Evaluate based on prediction error
•Build the model using training data
•Assess performance on test (hold-out) data
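A minimal sketch of a random train/test (hold-out) split in base R, using the built-in mtcars data; the 70/30 ratio is an arbitrary illustrative choice:

```r
set.seed(42)                                   # for reproducibility
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of rows chosen at random
train <- mtcars[train_idx, ]                   # build the model on this
test  <- mtcars[-train_idx, ]                  # assess prediction error on this
```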
Logistic Regression Limitations
•Relatively flexible and free of restrictions
•Be cautious about causal inference
•Multiple regression is likely to be more powerful than logistic regression when the outcome is continuous
•Select predictors on the basis of a well-justified model
•A solution is impossible when the outcome groups are perfectly separated
•No true form of "adjusted R2"; risk of overfitting
•Sensitive to outliers in the solution
Coefficients in logistic regression
The coefficient indicates the effect of the predictor on the odds ratio, which is the relative chance of success (odds ratio > 1 implies more likely, odds ratio < 1 implies less likely)
•A negative coefficient decreases the relative odds, while a positive one increases them
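A sketch of turning logistic regression coefficients into odds ratios with exp(). It uses the built-in mtcars data; predicting the am (transmission) dummy from wt is only an illustration, not the course example:

```r
# Fit a logistic regression on built-in data
fit <- glm(am ~ wt, data = mtcars, family = 'binomial')
odds_ratios <- exp(coef(fit))  # odds ratio per one-unit increase in each predictor
# wt has a negative coefficient here, so its odds ratio is below 1:
# each extra unit of weight makes a manual transmission (am = 1) less likely
```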
Linear Regression Elements
The model function: y = b0 + b1x
•y = target (or dependent variable)
•b0 = y-intercept
•b1 = slope
•x = predictor (or independent variable)
The fitted line minimizes the sum (or mean) of the squares of the errors
Also known as ordinary least squares (OLS) regression - very popular!
Using numbers for mapping
Undirected degree, e.g., nodes with more friends are more central
•Assumption: the connections that your friend has don't matter; it is what they can do directly that does (e.g., have a beer with you, help you with your business...)
Simple Linear Regression in R
Use the lm() function
•formula - a formula of the form y ~ x1 + x2 + ..., where y is the dependent variable and x1, x2, ... are the independent variables. To include all columns (excluding y) as independent variables, enter y ~ .
•data - the data frame containing the columns specified in the formula
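A sketch of lm() in action, using the built-in mtcars data rather than the course dataset:

```r
fit1 <- lm(mpg ~ wt, data = mtcars)       # simple regression: one predictor
fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # multiple regression: two predictors
fit_all <- lm(mpg ~ ., data = mtcars)     # all other columns as predictors

coef(fit1)     # y-intercept and slope
summary(fit2)  # coefficients, significance tests, R-squared
```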
How to run a logistic regression in R?
glm(formula, data, family = 'binomial')
•formula - an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted
•family - a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function, or the result of a call to a family function
•data - an optional data frame, list, or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which glm is called
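A sketch of glm() in action, again on the built-in mtcars data, using the dichotomous vs (engine shape) column as the target:

```r
# Logistic regression: dichotomous DV, continuous IVs
fit <- glm(vs ~ mpg + wt, data = mtcars, family = 'binomial')
summary(fit)                               # coefficients and significance tests

probs <- predict(fit, type = 'response')   # predicted probabilities in [0, 1]
preds <- ifelse(probs > 0.5, 1, 0)         # classify with a 0.5 cutoff
```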
The ROC curve
Assessing model quality
•Plots the True Positive rate against the False Positive rate
•Should always be above the 45-degree line
•The more area under the curve, the better the model
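A sketch of a manual ROC curve and its area under the curve (AUC) in base R; packages such as pROC do this for you, and the tiny score vector here is hypothetical:

```r
# Hypothetical actual classes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# Sweep a cutoff from high to low; at each cutoff compute the two rates
cuts <- sort(unique(c(0, score, 1)), decreasing = TRUE)
tpr <- sapply(cuts, function(c) mean(score[actual == 1] >= c))  # true positive rate
fpr <- sapply(cuts, function(c) mean(score[actual == 0] >= c))  # false positive rate

# Area under the curve via the trapezoidal rule; 0.5 is the 45-degree (chance) line
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```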
Limitations to multiple regression analyses
•Regression analyses reveal relationships among variables but do not imply that the relationships are causal
•Inclusion of variables: which DV should be used, and how is it to be measured? Which IVs should be examined, and how are they to be measured?
•A multiple regression solution is extremely sensitive to the combination of variables included in it
•Ratio of observations to IVs - what is the ideal sample size? n ≥ 20 + 5 × (number of IVs)
•Extreme cases have too much impact on the regression solution and affect the precision of estimation of the regression weights
Linear Regression Interpretation and example
•Falling down on a ski run is tested against the difficulty of the run and the season (1 = autumn, 2 = winter, 3 = spring)
•The variable season was recoded into dummy variables, season(1) and season(2): season(1) = 1 if autumn, season(2) = 1 if winter
What does degree not capture?
•In what ways does degree fail to capture centrality in the following graphs?
Kind of research questions
•Investigate the relationship between a DV and some IVs with the effect of other IVs statistically eliminated
•Compare the ability of several competing sets of IVs to predict a DV
•Regression analyses can be used with either continuous or dichotomous IVs
•Predict scores on a DV for observations in which only data on the IVs are available
•E.g., state location: Georgia = 1, Texas = 2, California = 3, Others = 4
•Converted into 3 variables: a) Georgia = 1 vs. non-Georgia = 0, b) Texas = 1 vs. non-Texas = 0, c) California = 1 vs. non-California = 0
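A sketch of this dummy-variable conversion in R; the tiny data frame is hypothetical:

```r
# Hypothetical state-location column with 4 categories
df <- data.frame(state = c('Georgia', 'Texas', 'California', 'Others', 'Georgia'))

# Convert 4 categories into 3 dummy variables
df$georgia    <- ifelse(df$state == 'Georgia', 1, 0)
df$texas      <- ifelse(df$state == 'Texas', 1, 0)
df$california <- ifelse(df$state == 'California', 1, 0)
# 'Others' is the reference category: all three dummies are 0 for it.
# (R builds these dummies automatically when a factor is used in lm()/glm().)
```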
Purpose of logistic regression: Model types
•Logistic regression emphasizes the probability of a particular outcome for each case
•The model produced by logistic regression is non-linear
•It can be used to fit and compare models
•Simplest model: constant and none of the predictors
•Best-fitting model: constant, predictors, and interactions between predictors
Purpose of logistic regression
•Predict a discrete outcome such as group membership •Logistic regression dependent variable is dichotomous •Independent variables are continuous, discrete, dichotomous, or a mix •Example: Can the presence or absence of hay fever be diagnosed from geographic area, season, and degree of nasal stuffiness?
Explanatory modeling
•The goal is to explain the relationship between predictors (independent variables) and target (dependent variable)
Logistic regression model outcome
•The outcome variable, Y, is the probability of having one outcome or another based on a non-linear function of the best linear combination of predictors, with two outcomes:
P(Yi) = e^u / (1 + e^u)
where P(Yi) is the estimated probability that the ith case (i = 1, ..., n) is in one of the categories and u is the usual linear regression equation, u = A + B1X1 + B2X2 + ... + BkXk
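A sketch of evaluating this non-linear outcome function in R, with hypothetical coefficients A = -1 and B1 = 0.5 and a predictor value x = 4:

```r
# Hypothetical intercept, coefficient, and predictor value
A <- -1; B1 <- 0.5; x <- 4

u <- A + B1 * x             # the usual linear regression equation
p <- exp(u) / (1 + exp(u))  # estimated probability of being in the category

# R's built-in logistic CDF computes the same quantity
stopifnot(all.equal(p, plogis(u)))
```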