Quant Method Final

Logistic Regression Strengths:

1. Outputs can be interpreted nicely as probabilities
2. Easy to implement and use
3. Very efficient to train

Coefficient interpretation:

Difference in the mean value of the dependent variable between the two categories.

Dissimilarity between Groups (Weaknesses):

1. Can be quite computationally expensive, especially with high-dimensional data.
2. Dendrograms become much more difficult to interpret with high numbers of samples.
3. The clustering solution provided by hierarchical clustering is rarely optimal.

K-means (Weaknesses):

1. Have to choose the cluster number, K, manually
2. Solution depends on initial centroid values
3. Struggles with outliers - they can end up in their own cluster or be ignored
4. Can struggle with high numbers of variables

Logistic Regression Weaknesses:

1. Makes strong assumptions about the data
2. Does not perform well with missing data
3. Tends to underperform when there are multiple or non-linear decision boundaries
4. Does not naturally capture complex relationships

Using PCA in Regression

1. Multicollinearity - When the available data contain highly correlated variables, using principal components instead of the original features can help to reduce multicollinearity.
2. Overfitting - By reducing the number of features used in the model, PCA leaves us with a less complex model, which in turn can help avoid overfitting.

The key idea here is that in most cases a small number of principal components suffice to explain most of the variability in the data. In other words, we assume that the principal components with the most variation are the most associated with the response variable. While this assumption is not guaranteed to be true, it often turns out to be reasonable enough to give good results. Using principal components in regression can perform well when only a few principal components are needed to model the response; when a high number of PCs are required, the model can perform worse than one using the original features.
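A minimal sketch of principal components regression with scikit-learn, assuming a numeric feature matrix X and response y; the simulated data and the choice of 3 components are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: 200 samples, 10 features, response driven by a few of them
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize, keep the first 3 principal components, then regress on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # R^2 of the principal components regression
```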

Dissimilarity between Groups (Strengths):

1. Produces a series of clusterings with different levels of granularity, which can reveal hierarchical structures within the data.
2. The clustering solution can be easily visualized using a dendrogram.
3. Does not require us to specify the number of clusters prior to applying the algorithm.
4. Can handle many different forms of similarity and distance and as such can be applied to many different forms of data.

Impact of additional variables on R squared

1. R2 quantifies how well our model explains the variance in the data. If a model with a set of variables explains, say, 50% of the variance, adding another variable can either maintain that level of explanation or increase it, but it cannot decrease it.
2. No matter how insignificant, every added variable will explain some portion of the variance in the dependent variable, even if by pure chance. Hence, the overall R2 will never decrease and will usually increase.
3. While R2 may go up, it doesn't always mean the added variable is genuinely meaningful. It could be capturing noise rather than true signal. This is where the concept of "overfitting" comes into play. In overfitting, a model becomes too tailored to the specific dataset it was trained on, capturing its noise and anomalies, making it less generalizable to new data.

Recognizing the inflationary nature of R2 with added variables, the adjusted R2 metric was introduced. The adjusted R2 accounts for the number of predictors in the model. If a new variable doesn't significantly improve the model's predictive capability, the adjusted R2 might decrease.

K-means (Strengths):

1. Relatively simple to implement
2. Scales to large datasets
3. Guaranteed to converge to a solution
4. Can warm start the position of the centroids
5. Adapts easily to new examples
6. Generalizes to clusters of different shapes and sizes

We need to use linkage, which extends DISSIMILARITY BETWEEN GROUPS of samples. There are three types of linkage measure we will use:

1. Single Linkage: Distance between the two closest points in each cluster.
2. Complete Linkage: Distance between the two farthest points in each cluster.
3. Average Linkage: Average of the distances between each of the points in each cluster. (think of diagrams)

- Average and complete linkage usually perform better than single linkage as they tend to lead to more balanced clusters.
- Single linkage tends to lead to one large cluster and a few small clusters.
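A short sketch comparing the three linkage measures with SciPy; the random data and the cut into 3 clusters are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # illustrative 2-D samples

# Build a hierarchy with single, complete and average linkage
for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])  # cluster sizes often differ by linkage
```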

K-means Overview:

1. The K in K-means refers to the number of clusters which are used.
2. For each cluster the algorithm calculates a cluster centroid, which is the mean value of the samples in the cluster (hence the "means" in K-means).
3. The objective of K-means is to minimize the distance between points and their cluster centroid.

The idea behind K-means is that a good clustering is one for which the within-cluster variation is as small as possible.
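A minimal K-means sketch with scikit-learn; the simulated data and the choice of K = 3 are for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with three true groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment of each sample
print(kmeans.cluster_centers_)  # the centroid (mean) of each cluster
print(kmeans.inertia_)          # total within-cluster sum of squares
```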

Choosing the Number of Principal Components to Use

1. The Scree Plot Method: This involves plotting the principal components in decreasing order based on how much variation they capture. You look for the "elbow" point in the plot, where the decline between PCs becomes less pronounced. Components before the "elbow" tend to capture most of the significant variance, while those after capture less and are potentially noise.
2. Cumulative Explained Variance: Another approach is to choose the number of components that cumulatively explain a significant proportion of the variance (e.g., 95% or 99%).
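A sketch of both approaches using scikit-learn's explained variance ratios; the random data and the 95% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 8)))  # illustrative data

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_             # values you would plot in a scree plot
cumulative = np.cumsum(ratios)
n_components = np.argmax(cumulative >= 0.95) + 1   # smallest number reaching 95%
print(ratios, cumulative, n_components)
```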

In summary for Dendrograms:

1. Where samples are joined on the vertical axis represents their dissimilarity.
2. We draw conclusions about the similarity of points using the vertical axis.
3. We can use the selected cut-off to split the data into however many clusters we like.

Choosing the number of Clusters

A key weakness of K-means clustering compared to hierarchical clustering is that it requires us to specify how many clusters we want to detect in the data. This is often a challenging problem and we often have to rely on expert knowledge of the area. If we do not have that knowledge, then one method for choosing the cluster number is to look at the within-cluster variation for different numbers of clusters.
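One common way to look at within-cluster variation for different numbers of clusters is the "elbow" approach below; a sketch with made-up data, using K-means inertia as the within-cluster variation measure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # illustrative data

# Within-cluster sum of squares (inertia) for a range of K values
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the "elbow" where the drop flattens out
```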

The F-test in Regression Analysis (Interpretation):

A large F-statistic suggests that the model explains a significant amount of variation in the dependent variable, leading to a rejection of the null hypothesis. The p-value associated with the F-statistic indicates the probability of observing such an F-statistic if the null hypothesis were true.

Hierarchical Clustering (Dendrogram):

A tree diagram used to visualize the hierarchy of clusters. Each merge between branches indicates two clusters joining together.

Including interaction terms

An interaction term is a variable in a regression model that captures the combined effect of two (or more) variables. It is typically created by multiplying these variables together. For example, if you have two independent variables, X1 and X2, the interaction term would be X1 × X2.
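A sketch of including an interaction term via a statsmodels formula; the column names x1, x2 and the simulated data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1 + 2 * df.x1 + 3 * df.x2 + 1.5 * df.x1 * df.x2 + rng.normal(size=100)

# "x1:x2" adds the product X1*X2 as an interaction term alongside the main effects
model = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(model.params)
```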

Positive Linear Relationship

As one variable increases, the other also rises consistently. This means the two variables move in the same direction. (think of the graph. As x increases, y also increases.)

Negative Linear Relationship

As one variable increases, the other decreases. In other words, the two variables move in opposite directions. (think of the graph. As x increases, y decreases.)

Logistic Regression (Interpretation of Coefficients):

As we are not fitting a straight line, the interpretation of the coefficients is different for logistic regression. The coefficient for a predictor represents the change in the log odds of the outcome for a one-unit increase in that predictor, all else equal. Exponentiating a coefficient yields the odds ratio for that predictor.
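A sketch of fitting a logistic regression and exponentiating the coefficients to get odds ratios; the simulated data and variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df.x1 - 0.8 * df.x2)))  # true probabilities
df["y"] = rng.binomial(1, p)

model = smf.logit("y ~ x1 + x2", data=df).fit()
print(model.params)          # change in log odds per one-unit increase in each predictor
print(np.exp(model.params))  # odds ratios
```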

Assumptions of Linear Regression (No Multicollinearity):

Assumption: In multiple regression, the independent variables are not highly correlated with each other.
Why It's Important: High correlation among predictors can cause problems in estimating the individual effects of variables.

Assumptions of Linear Regression (Normality of Errors):

Assumption: The errors follow a normal distribution (especially important for hypothesis testing).
Why It's Important: Non-normality can lead to inefficient estimates and incorrect inferences.

Assumptions of Linear Regression (Independence):

Assumption: The observations are independent of each other.
Why It's Important: Dependence between observations can lead to biased estimates.

Assumptions of Linear Regression (Linearity):

Assumption: The relationship between the dependent and independent variables is linear.
Why It's Important: Non-linear relationships cannot be accurately captured by a linear model.

Assumptions of Linear Regression (Homoscedasticity (Constant Variance)):

Assumption: The variance of the error terms is constant across all levels of the independent variables.
Why It's Important: If the errors exhibit patterns or trends, it can affect the stability and efficiency of the estimates.

Confidence Intervals for Coefficients in Linear Regression (Limitations):

Assumptions: The validity of CIs in linear regression relies on assumptions such as linearity, independence, homoscedasticity, and normality.
Interpretation: A common misconception is that a 95% CI implies a 95% probability that the interval contains the true coefficient. Instead, it means that if we were to take many samples and build a CI for each one, we expect about 95% of those intervals to contain the true coefficient.

Binary Response, Logistic Regression, and Imbalanced Data (Intro):

Binary response data arise when the outcome variable has only two possible outcomes. Examples include success/failure, yes/no, and 1/0 outcomes. We call these types of problems classification problems, or specifically binary classification problems. We often want to model these sorts of responses in a similar way to linear regression; however, linear regression is not suitable for these problems.

Binary Variables

Binary variables, also known as dummy variables or indicators, take on two possible values, often represented as 0 and 1. However, when we receive the data, they can often be stored as characters, and we need to convert them to 0-1 values. In regression analysis, binary variables represent a distinction between two categories.

Categorical Variables (Definition):

Categorical variables have multiple categories without a natural ordering. Examples of this would be variables like states, colors, etc. To use categorical variables in regression we need to create K-1 dummy variables to represent K categories. Here, one category (the reference category) is left out to avoid the "dummy variable trap", which leads to multicollinearity.
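A sketch of creating K-1 dummy variables with pandas, dropping one reference category; the column name and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "CA", "TX", "CA", "NY"]})

# drop_first=True leaves out one reference category, giving K-1 dummy columns
dummies = pd.get_dummies(df["state"], prefix="state", drop_first=True)
print(dummies)
```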

Clustering

Clustering is an unsupervised machine learning technique that is used to identify groups of similar samples within a dataset. That is, there is no response variable for the clustering; we only have explanatory variables to use to group the data. As clustering is unsupervised, there is no measure of accuracy or R^2 for clustering; instead we must judge the sensibility of our results.

Why Use PCA?

Data Reduction: Reduce the number of variables and ensure minimal loss of information.
Visualization: High-dimensional data can be visualized in 2D or 3D space.
Noise Reduction: Emphasize the most significant patterns and often improve other data analyses.

Hierarchical Clustering (Definition):

Hierarchical clustering creates a tree of clusters. At the bottom, each item is its own cluster, and at the top, all items belong to a single cluster. Intermediate levels represent different numbers of clusters.

Interpreting Dendrograms

Hierarchical clustering will return a dendrogram to us that we must then use to decide how many clusters are present. When we generate a clustering dendrogram, we see the dissimilarity on the y-axis. On the dendrogram, where the different clusters join up indicates their level of dissimilarity:
- In a dendrogram each leaf represents a single observation.
- As we move up the tree, some leaves fuse into branches, corresponding to similar observations.
- The earlier (lower in the tree) the fusions occur, the more similar the samples.
- Observations fusing at the top are quite dissimilar.
- We create clusters by cutting the tree at a certain height; points joined below that are in the same cluster.
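A sketch of drawing a dendrogram and cutting the tree at a chosen height with SciPy; the random data, the average linkage, and the cut height of 2.5 are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # illustrative samples

Z = linkage(X, method="average")
dendrogram(Z)                  # leaves are samples, the y-axis shows dissimilarity
plt.show()

labels = fcluster(Z, t=2.5, criterion="distance")  # cut the tree at height 2.5
print(labels)  # points joined below the cut share a cluster label
```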

Imbalanced Data Problems

In binary classification, data is imbalanced when one class significantly outnumbers the other class. Common examples include fraud detection (where fraudulent transactions are rare compared to legitimate ones) and disease prediction (where the disease might be rare in the population).

Confidence Intervals for Coefficients in Linear Regression (Definition):

In linear regression, confidence intervals give a range in which the true coefficient is likely to fall, with a certain level of confidence. A confidence interval provides a range of plausible values for an unknown parameter. The interval has an associated confidence level that quantifies the level of confidence that the parameter lies within the interval.

Calculation of the T-Statistic in Regression (Estimation of Coefficient)

In simple linear regression, the estimated coefficient (β) represents the change in the dependent variable for a one-unit change in the independent variable, assuming all other variables are held constant.

linear regression (summary)

In summary, __________ provides a means to predict a dependent variable based on the value(s) of one or more independent variables. It provides a clear view of relationships between variables and is an essential tool in various fields, from business to biology.

The F-test in Regression Analysis (Hypothesis Setup):

In the context of regression analysis, the F-test is primarily used to test the hypothesis that a particular model fits the data well.
Null Hypothesis (𝐻0): All regression coefficients are equal to zero, implying the model has no explanatory power.
Alternative Hypothesis (𝐻𝐴): At least one regression coefficient is not zero.

Practical Considerations

Interpretation: Dummy variable coefficients are interpreted relative to the reference category.
Multicollinearity: Always omit one dummy to avoid multicollinearity. This omitted dummy serves as the reference category.
Interaction with Other Variables: Binary and categorical variables can interact with other variables. Include interaction terms if it's hypothesized that the effect of one variable depends on the category of another variable.

Interaction Terms

Linear regression is used to establish relationships between independent variables and a dependent variable. Sometimes, however, the effect of two independent variables on the dependent variable is not just additive. Interaction terms allow us to model how the relationship between one independent variable and the dependent variable changes depending on the level of another independent variable.

Logistic Regression (Assumptions):

Linearity: The relationship between the log odds and the predictors is linear.
Independence: Observations are independent.
No Multicollinearity: Predictors are not perfectly correlated.

Logistic Regression

Logistic regression is a classification algorithm used to predict a binary outcome. Instead of fitting a straight line, as is done in linear regression, logistic regression fits an 'S'-shaped logistic function.

Weaknesses of PCA

Loss of Interpretability: The original features' meaning might get lost when transformed to principal components.
Assumption of Linearity: PCA assumes linear relationships between variables.
Variance Might Not Always Equate to Predictive Power: The main components explain the most variance, but they might not necessarily be the most predictive.

Imbalanced data can lead to a variety of difficulties for modelling including:

Model Bias: The model becomes biased towards the majority class, often predicting it by default because it reduces the overall error.
Poor Generalization: The model may perform well in terms of overall accuracy but fail to correctly identify the minority class, which is often of higher interest in many problems.
Misleading Metrics: Accuracy can be misleading. For instance, in a dataset with 95% class A and 5% class B, a naive model always predicting class A would still have 95% accuracy.

Importance in Regression

Model Validity: While individual p-values tell us about specific coefficients, examining many p-values together (as in an F-test) can give insights about the validity of the model as a whole.
Model Selection: P-values can be used to decide which variables to retain in a model. However, they shouldn't be the only criterion. Practical significance, domain knowledge, and other statistical measures should also be considered.

Caveats and Limitations

Not a Measure of Effect Size: A small p-value does not necessarily mean the effect is practically significant.
Multiple Comparisons: Running multiple tests increases the chance of finding at least one significant result just by chance. Corrections like the Bonferroni correction can be applied to address this.
P-hacking: This refers to the practice of trying multiple methods to find significant p-values. It's considered a questionable research practice and can lead to non-reproducible results.

SMOTE

One way of handling imbalanced data is to use SMOTE (Synthetic Minority Over-sampling Technique). This is an oversampling method which creates synthetic samples for the minority class. It works by selecting two or more similar instances (using distance measures) and perturbing an instance one attribute at a time by a random amount within the difference of the instances.
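A sketch using the imbalanced-learn package (a separate install from scikit-learn); the simulated data and the roughly 95/5 imbalance are illustrative.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # minority class topped up with synthetic samples
```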

Ordinal Variables

Ordinal variables represent ordered categories, where the order has meaning, but the distance between categories is not uniform. In regression analysis, ordinal variables can sometimes be treated as continuous if the ordered values have a linear relationship with the dependent variable. Alternatively, ordinal variables can be treated similarly to categorical variables, using dummy coding.

Hierarchical Clustering (Similarity):

Our objective in clustering is to group similar examples into clusters. In order to judge how similar samples are, we first need a measure of similarity. To calculate the similarity between points we need to calculate the distance between them; for this we use a distance metric.
Distance Metric: A function that defines a distance between each pair of points in a set.

Interpretation of P-Values

P-Value < Significance Level (e.g., 0.05): There's sufficient evidence to reject the null hypothesis. In regression, this often implies the coefficient is statistically different from zero.
P-Value ≥ Significance Level: There's insufficient evidence to reject the null hypothesis.

Introduction to Principal Components Analysis (PCA)

Principal components analysis is a dimension reduction technique. It uses an orthogonal transformation to convert the variables into a set of linearly uncorrelated variables called principal components. The main objective is to capture as much of the variability as possible with a smaller number of principal components.

Some points about R squared

R2 varies between 0 and 1. An R2 of 0 indicates that the independent variables do not explain any of the variability in the dependent variable. An R2 of 1 indicates that the independent variables perfectly explain the variability in the dependent variable. In finance, values of R2 are often far from 1, since financial phenomena are influenced by numerous unobserved factors. If R2 is close to 1, the model captures most of the variability. If it's close to 0, the model doesn't capture much of it, and the residuals (errors) are large.

What is RMSE?

RMSE measures the average magnitude of errors between predicted and actual observations. It provides a consolidated view of the model's accuracy across all predictions. A lower RMSE indicates that the model's predictions are close to the actual values, suggesting a good fit. Conversely, a higher RMSE suggests a poorer fit and larger discrepancies between predicted and actual values.
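A minimal sketch of computing RMSE with NumPy; the example arrays are made up.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # actual observations
y_pred = np.array([2.8, 5.4, 2.0, 6.5])  # model predictions

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # square root of the mean squared error
print(rmse)
```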

Strengths of PCA

Reduces Overfitting: By using fewer principal components, models might generalize better, that is, perform better on unseen or test data.
Uncorrelated Features: PCA ensures that the features are orthogonal.
Noise Reduction: PCA can help in focusing on the main variance directions and eliminating noise.

Confidence Intervals for Coefficients in Linear Regression (Factors Affecting the Width of CIs):

Sample Size: Larger samples generally lead to narrower CIs, as larger samples tend to offer more precise estimates.
Variability in Data: More variability can lead to wider CIs.
Confidence Level: A higher confidence level (e.g., 99% vs. 95%) will produce a wider CI.

The F-test in Regression Analysis (Definition):

The F-test compares the fits of different linear models. Unlike t-tests, which can test only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.

The F-test in Regression Analysis

The F-test is a statistical test used to compare nested models and determine the significance of groups of variables in regression analysis. In the context of regression, the F-test commonly assesses the overall significance of a model.

Interpreting K-means Results.

The K-means algorithm returns to us:
1. The cluster assignment of each sample
2. The centre or mean values of each cluster.
(We can characterize the clusters by viewing the centre values for each cluster. We will often do this with a heatmap.)
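A sketch of characterizing clusters by viewing the centre values as a heatmap; the data, the variable names, and the use of seaborn are illustrative assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
feature_names = ["var1", "var2", "var3", "var4"]  # hypothetical variable names

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centres = pd.DataFrame(kmeans.cluster_centers_, columns=feature_names)

sns.heatmap(centres, annot=True, cmap="viridis")  # one row per cluster centre
plt.show()
```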

R squared and Adjusted R squared in analysis

The coefficient of determination, denoted R2, and the adjusted R2 are statistics used to evaluate the goodness of fit of a linear regression model. The value represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.

Interpreting the Correlation Coefficient

The correlation coefficient (r) ranges from -1 to 1. A positive correlation coefficient indicates a positive relationship while a negative correlation coefficient indicates a negative relationship. A correlation value of 0 indicates that there is no linear relationship between the variables. The closer the coefficient is to either -1 or 1, the stronger the relationship is.

T-Statistic Calculation and P-Values

The p-value associated with each coefficient in the regression model provides insight into the significance of the predictors in explaining the variation in the dependent variable. That is, we can make use of p-values to see if the relationship we are seeing is statistically significant or more likely to be random chance. The hypotheses we test here are:
Null Hypothesis (𝐻0): The regression coefficient β𝑖 is equal to zero.
Alternative Hypothesis (𝐻𝐴): The regression coefficient β𝑖 is not equal to zero.
In the context of regression analysis, the p-value for each coefficient tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model. To calculate a p-value for a regression coefficient we first need to calculate the t-statistic.
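For reference, the standard calculation (not written out on the original card) is t = β̂i / SE(β̂i), i.e. the estimated coefficient divided by its standard error; the p-value then comes from comparing this statistic to a t-distribution with n − k − 1 degrees of freedom (n observations, k predictors).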

P-Values (Definition):

The p-value is a fundamental concept in hypothesis testing. It measures the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

Training Model

The process of selecting the best set of values for β0 and β1 (and other model coefficients we will cover later) is referred to as training a model. For linear regression the method used to calculate the coefficients is Ordinary Least Squares.

Curvilinear Relationship

The relationship between the two variables doesn't follow a straight line. Instead, it might be best described by a curve or polynomial function. This type indicates a more complex relationship. (think of the graph. y follows a curved pattern relative to x.)

No Relationship

There's no discernible pattern between the two variables. Changes in one variable don't provide any consistent information about changes in the other. (think of the graph. no apparent pattern or trend.)

The Essence of PCA

Variation is Important: In PCA, the goal is to capture as much variation in the data as possible using fewer 'composite' variables.
The First Principal Component: This is the direction in the data that has the most variation. Think of it as the longest length if the data was put inside a box.
Subsequent Components: Each subsequent component captures the most variance possible, under the constraint that it's perpendicular to the preceding one(s). The second component is like the width of the box, and so on.

Adjusted R-squared: k = Number of independent variables.

We can see from the term that as k increases, the value of adjusted R2 will go down unless the R2 value goes up. Adjusted R2 can be less than R2 but never greater. A significant increase in adjusted R2 after adding a variable suggests that the variable is providing meaningful explanatory power. If adjusted R2 decreases after adding a new variable, it suggests the variable might not be adding meaningful information to the model, given the complexity cost.
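For reference, the usual formula (a standard result, not written out on the original card) is: Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables.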

Confidence Intervals for Coefficients in Linear Regression (Usage):

We can use the confidence interval to understand the uncertainty of the estimate of the coefficient. In addition, if the confidence interval does not include 0, this indicates that the coefficient is statistically significant at the significance level corresponding to the critical value 𝑡∗ used to build the interval.

Fitting the Model

We then square the errors and average them for each model, giving us the mean squared error (MSE). (think of the equation.)
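For reference, the equation referred to here is the standard mean squared error: MSE = (1/n) Σ (yi − ŷi)^2, the average of the squared differences between the actual and predicted values.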

Why Not Linear Regression?

When modelling a binary response variable, we want the model to return a probability (between 0 and 1) that the sample is a 1. If we use a linear regression model, we can get probabilities that are below 0 and above 1, which do not make sense.

K-means Formula:

When viewed this way (the formula) we can see that the main objective is to minimize the distance from each cluster centre to each sample in the cluster.
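For reference, the formula referred to here is the standard within-cluster sum of squares objective (not written out on the original card): minimize Σk Σ(x in Ck) ||x − μk||^2, where μk is the centroid (mean) of cluster Ck.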

Confidence Intervals for Coefficients in Linear Regression (Interpretation).

Width of the CI: A wide interval suggests more uncertainty, while a narrow interval suggests more precision.
Position Relative to Zero: If the CI does not include zero, this indicates statistical significance at the chosen confidence level.
Practical Implications: Even if a coefficient is statistically significant, the CI can help in assessing if the range of plausible values is practically significant.

Linear Regression

is a statistical tool that allows us to model and analyse the relationships between a dependent variable and one or more independent variables. Its simplicity and interpretability make it a cornerstone in predictive modelling. At its core, linear regression seeks to fit a linear equation to observed data. The simplest form involves a single independent variable and results in a line of best fit. A vital part of linear regression is selecting the best line to fit our data.

Correlation

provides a quantified measure of the strength and direction of a linear relationship between two continuous variables.

