3000 PS final review

residual :

the difference between an actual observation and the line of best fit

first differences model:

a time-series framework: not looking at income in a given year, but at the change in income from year to year (a change in one variable leads to a change in another)

violating the assumptions

the assumption that independent variables are actually independent is almost always false for some combination of covariates - weak correlations are not problematic - strong correlations cause variables to appear insignificant (because a variable's significance rests on the variation it explains beyond what is correlated with the other variables) - test with correlations or VIF (gives you a list of variables and numbers describing the relationship between each variable and every other variable)... VIFs > 10 say I have a problem - what to do? you cannot simply drop the variable, because then you have omitted-variables bias (see the sketch below)
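A minimal sketch of the VIF check in Python, assuming statsmodels is available; the covariates are simulated stand-ins, with "wealth" built to be nearly collinear with "income":

```python
# Hypothetical covariates; wealth is constructed to be nearly
# collinear with income, so its VIF should blow up.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
income = rng.normal(50, 10, 200)
education = rng.normal(14, 2, 200)
wealth = 0.9 * income + rng.normal(0, 1, 200)

# Include a constant column, as a regression design matrix would.
X = np.column_stack([np.ones(200), income, education, wealth])
for i, name in enumerate(["income", "education", "wealth"], start=1):
    print(name, variance_inflation_factor(X, i))  # VIF > 10 signals trouble
```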

unstandardized coefficients:

when two independent variables are measured in different metrics, comparing their unstandardized coefficients is misleading

fitted values:

predicted value of y: y hat = B0 - B1(x1) + B2(x2) + ... ^ missing the error term / the residuals... the difference exists because you cannot pick up all of the variables (there is still some distance between observed values and fitted values) ^ expecting a negative relationship for x1 because of the negative sign in front of B1

R squared statistic

ranges between zero and one, indicating the proportion of the variation in the dependent variable that is explained by the model

omitted-variables bias:

results from the failure to include a variable that belongs in our regression model

How strong is the relationship?

(determined by the steepness of the slope; in the pictured pair, the left is weaker than the right) - comes with a sign (+ or -) - says nothing about statistical significance - a weak relationship can be statistically significant if the observations are clustered tightly around the regression line... the model is predicting what we see out there in the world

Analysis of Variance test

(the ANOVA) in SPSS - checks whether the model, as a whole, has a significant relationship with the DV - part of the predictive 'value' of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole - F: how ALL of your independent variables jointly help predict the dependent variable

* look at times when different terminology means the same concept

ex: R squared = 1

- 100 percent of points lie directly on the line, a perfect relationship between X and Y - implies that the best-fit line is a perfect description of the data - the p-value of the slope will be essentially 0 * normally not going to happen, but R squared can be very high if a big organization's behavior is being predicted and it doesn't tend to change much year to year ** non-directional

Why study Statistics?

- Allow you to be as precise as possible in describing phenomena in the world and the relationships between them - Allow you to draw inferences from samples to the population as a whole with some level of confidence - Allow you to engage in the scientific production of knowledge

Measures of Central Tendency

- Mean (point of balance, the point at which cases are evenly distributed on both sides... add up all values, divide by the total) - Median (value of the middle case in the distribution) - Mode (most frequently occurring value) - if measuring categorical variables without rank order, you can only measure central tendency through the mode (a quick check follows)
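A quick check of all three measures with Python's statistics module; the values are invented:

```python
import statistics

values = [2, 3, 3, 5, 8, 9, 9, 9, 40]
print(statistics.mean(values))    # point of balance: sum / n
print(statistics.median(values))  # value of the middle case
print(statistics.mode(values))    # most frequently occurring value
```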

Multiple regression's goodness of fit

- R squared is determined in the same way, as 1 - SSR/SST (residual sum of squares over total sum of squares) - mathematically, the sum of squared residuals never increases when a new regressor is added - therefore R squared will likely get larger every time another regressor is added to the model, even though the new regressor may only provide a tiny improvement in the amount of variance in the data explained by the model

be able to calculate

- R squared, if he gives us the pieces of information - adjusted R squared: know what the equation looks like and why those pieces are in it (what makes it different from R squared)

Dispersion

- Standard deviation: the most commonly used and most statistically useful measure of dispersion - s = square root of the sum of all (xi - x bar)^2 over n, where xi is an observed value - how far values are from the mean on average - sort of the average difference from the mean (really the square root of the average squared difference) - squaring the differences gives a large role to outliers, but it takes care of the negative numbers and makes it easy to prove statistical theorems (see the sketch below)
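A small sketch of the formula in Python, using the card's "over n" population form; the observations are invented:

```python
import math

xs = [4, 8, 6, 5, 3, 7]                      # invented observations
x_bar = sum(xs) / len(xs)
# Square root of the average squared difference from the mean.
s = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / len(xs))
print(round(s, 3))
```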

Testing for differences between groups

- The logic of a difference of means test (are two distinct samples/populations the same or different on a dimension I care about?): I draw two samples and observe differences in the mean value on a variable of interest; I subtract one from the other to get a precise measure of the difference (the statistic); I combine the error inherent in both sampling processes (the standard error of the statistic) and divide the statistic by that number (the t-score); I use that to determine the probability that the difference I observe is really different from 0 (testing the null) ** looking at association between groups, not variables (a sketch follows)
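A sketch of that logic with scipy's pooled two-sample t-test; the group scores are invented:

```python
from scipy import stats

group_a = [72, 75, 70, 68, 74, 77, 71]       # invented sample 1
group_b = [65, 69, 66, 70, 64, 67, 68]       # invented sample 2
# t = (difference in means) / (combined standard error);
# the p-value asks how likely the difference is really 0.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```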

Univariate Statistics

- Understanding a single variable (using stats to describe the variable) - Frequency - Central tendency (most common measure: mean) - Dispersion (how good a measurement is the middle? people may not actually score at the mean, since it is just the average, so dispersion can be high. Ex: cases group around two extremes and the mean sits in the middle even though nobody scored in the middle)

multiple regression terminology

- a: constant/intercept: now the value of Y when all X's are at 0 - b: slope/regression coefficient: means the same, just estimated with the other variables held at their means - goodness of fit: same logic, but different method - need a measure of variance explained by ALL IVs simultaneously - still typically called R squared, but may also be called "Multiple R" or "Coefficient of Determination" - in multiple regression this value is still linked to the probabilities of the individual slopes, but not perfectly

assumptions

- any time you estimate a regression model you are implicitly making a large set of assumptions about the unseen population model - the assumption that ui is normally distributed allows us to use the t-table to make probabilistic inferences about the population regression model from the sample regression model - the assumption that ui has a mean or expected value equal to zero is known as the assumption of zero bias - the assumption that ui has variance = s^2 - assume no autocorrelation, meaning that the covariance between the population error terms is equal to zero (autocorrelation is most common in time-series data)

Chapter 8 - bivariate hypothesis testing

- are x and y related? - means two variables - doesn't let you answer "is there some confounding variable Z that is related to both x and y and makes the observed association between x and y spurious?"

relationship between confidence intervals and two-tailed hypothesis tests

- because the 95% confidence interval for our slope parameter does not include 0, the p-value for the hypothesis test that beta = 0 is less than .05 - because the 95% confidence interval for our intercept parameter does not include 0, the p-value for the hypothesis test that alpha = 0 is less than .05 - because the 95% confidence interval for our intercept parameter does not include 50, the p-value for the hypothesis test that alpha = 50 is less than .05

ex: R squared = .02

- can explain 2 percent of the variation (virtually no apparent relationship between x and y) - the best-fit line is a very poor description of the data - little confidence that the slope is not actually 0 ** it can still be meaningful when the sample size is so large that 2 percent of the population is still a lot of individuals

limitations of chi-squared test

- cannot evaluate the direction of the relationship - does not address the strength of the relationship - evaluates the significance of the relationship across all the cells (it is a combined significance test) - the gamma test will address these issues (concordant pairs: observations that are consistent with the hypothesis, with what we expect) (discordant pairs: observations that are not consistent with the hypothesis)

Simple regression

- correlation cannot tell you causation - it cannot tell us how much y moves when x moves (cannot give us a coefficient for x); it can tell us whether two variables correlate and the strength of the correlation

R^2 (Pearson's r squared)

- for simple regression, R^2 is the square of the correlation coefficient - reflects the variance in the data accounted for by the best-fit line - takes values between 0 (0%) and 1 (100%)... how much variation does it explain? - frequently expressed as a percentage rather than a decimal - in simple regression this value is perfectly linked to the p-value of the slope (how good a job is the line of best fit doing at explaining variation)

understanding model fit

- how close are the fitted values to the real values? - total sum of squares (total sample variation in the actual values of y): SST = sum of all (y observed - y average)^2 - explained sum of squares (total sample variation in the fitted values of y): SSE = sum of all (y hat - y average)^2 **** similar to the standard deviation: if SST is small, the mean does a good job of describing the data

Statistically significant

- whether there is a relationship between two variables - most use the standard .05: if the p-value is less than that, they consider the relationship significant - others use the standard .01: if the p-value is less than that, it is significant

Correlation

- it is interpreted in the same way as Gamma (once again, think of the percentage that the two variables are related) - correlations assess the strength and direction of the relationship - the significance or p-value is interpreted in the same way as well

same logic, but different than chi-squared...

- it is skewed - censored (it doesn't have negative numbers) to the right of zero - a more appropriate test for when conditions are not normal

measures of association

- know the differences in the amount of info that comes from each (what they tell you) - know the differences between their significances - why do we have to have different measurements for different variables?

know skews and distribution

- look at pictures

*** Usually use median

- the middle observation in a set of numbers when observations are rank ordered (useful for ordinal-level data) - provides the median case, not the median value - median position = (N+1)/2 (take the observation that is closest to that) - not affected by extreme values - the actual values of observations are not included in the calculation - not an accurate representation of centrality when data do not cluster around the mean - represents the 50th percentile in a distribution

model specification

- no causal variables left out, no noncausal variables included - assume parametric linearity (the relationship doesn't vary) * the number of cases (n) must exceed the number of parameters to be estimated (k)

The assumptions of regression

- random sample (an inferential technique: say something about a bigger group based on a smaller group) - no error in the measurement of variables (never really true) - variables are measured at the interval/continuous/ratio level - errors (unobserved factors... are there other things that can influence the variable?): are uncorrelated with each other, are uncorrelated with the IV, E(u|x) = 0, and have constant variance for all values of the IV, Var(u|x) = s^2 - need a good independent variable and good controls to make a good prediction - errors are normally distributed (so a + Bx + u is going to look like a normal distribution)

partial derivation method to obtain slope (B1)

- regress x1 on x2 (predict x1 with x2) and obtain the residuals (everything of x1 that x2 did not explain) - regress y on those residuals (giving the relationship between y and the portion of x1 that is not explained by x2); a sketch follows
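A small numpy demonstration of this partialling-out idea on simulated data: the slope from regressing y on the x1 residuals matches the coefficient on x1 from the full multiple regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)            # x1 partly driven by x2
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

# Step 1: regress x1 on x2 (with a constant) and keep the residuals.
X2 = np.column_stack([np.ones(n), x2])
r1 = x1 - X2 @ np.linalg.lstsq(X2, x1, rcond=None)[0]

# Step 2: regress y on those residuals; the slope is the coefficient
# on x1 with x2 held constant.
b1 = (r1 @ y) / (r1 @ r1)

# Same number from the full multiple regression.
X = np.column_stack([np.ones(n), x1, x2])
print(b1, np.linalg.lstsq(X, y, rcond=None)[0][1])   # both ~1.5
```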

Univariate Distributions

- report all data or summary statistics? - frequency distributions are still a lot of information, not parsimonious (not simple way of conveying information) - how are data distributed? (general) - what is "average" or "typical" case (what does the middle look like) - do values "deviate" much from average (how accurate of a measurement is this middle) - how are data distributed? (specific) - central tendency: mean, median and mode - dispersion: standard deviation

adjusted R square

- report it if you have more than one regressor in the model - takes into account the number of regressors in the model - calculated as: adjusted R squared = 1 - (1 - R squared)(N - 1)/(N - n - 1) ** the denominator (N - n - 1) is the degrees of freedom; N = number of data points, n = number of regressors - note that adjusted R squared will always be smaller than R squared (see the sketch below)
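The formula in code, as a toy sketch; the R squared, N, and n values are invented:

```python
def adjusted_r2(r2, N, n):
    """Adjusted R-squared: N data points, n regressors."""
    return 1 - (1 - r2) * (N - 1) / (N - n - 1)

print(adjusted_r2(r2=0.40, N=100, n=3))    # ~0.381, barely below 0.40
print(adjusted_r2(r2=0.40, N=100, n=20))   # ~0.248, many regressors cost you
```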

equations

- required to recognize equations and know which one goes with what

What Statistics cannot do

- settle normative arguments ("this is the way the world should be"; political science, via statistics, is all about "the way the world is") - provide information about phenomena that cannot be measured (measures may have measurement error, so we do not know if they precisely capture what we are trying to measure) - allow you to "prove" anything

Reliability

- stable measurements over time and space - ex: debt is not stable over time, debt is defined differently depending on the state you are looking at

p value

- stands for probability (ranges from 0 to 1) - the probability that we would see the relationship that we are finding because of random chance - the lower the p value the greater confidence we have that there is a systematic relationship between the two variables - doesn't prove causal relationships - based on the assumption that you are drawing a perfectly random sample from the underlying population

line of best fit

- a statistical model of reality - m and b are the line's parameters

null hypothesis

- a theory-based statement about what we would expect to observe if our theory were incorrect

bimodal relationships

- two bumps on the graph - skewed left = tail to the left, skewed right = tail to the right - if a distribution is normal then... - 68.3 percent of all cases will lie within +1 and -1 standard deviations from the mean - 95.5 percent of all cases will lie within +2 and -2 standard deviations from the mean - 99.7 percent of all cases will lie within +3 and -3 standard deviations from the mean

interpreting regression coefficients: which regressor has the most effect on the dependent variable?

- the units for each regression coefficient are different, so we must standardize them if we want to compare one with another - column headed standardized coefficients (Beta... not the same as B) - you need the standard deviation and mean to standardize a variable - can now compare the Beta weights for each regressor to compare the effects of each on the dependent variable

interpreting regression coefficients: relationship each individual regressor has with dependent variable

- unstandardized coefficients (B): the coefficients as they were measured, the impact of a one-unit change in the independent variable on the dependent variable * can't directly compare different coefficients: a one-unit change in age and a one-unit change in income cannot really be compared... but standard-deviation changes can * a positive number = an increase, a negative number = a decrease in the dependent variable... caused by a one-unit change in the independent variable

estimating error variance

- the variance of estimates is based on residuals, not errors - theoretically based on the latter, but cannot be in reality because errors are unobservable (can't observe the "real world": every person in the world and every variable that might influence their behavior) - requires creation of intermediary terms: 1. Residual = proxy used to measure error: 1/(n - 2) times the sum of the squared residuals - using 2 degrees of freedom: one degree goes away because it is a sample, not the real world; the other goes away because you're trying to calculate a statistic 2. Standard error of the estimate (used for the t-score) 3. Standard error of the slope (the answer above divided by the sum of all variation in x); a sketch follows
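A sketch of those intermediary terms for a simple regression, with invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # invented data
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)                  # proxy for the unseen errors

s2 = np.sum(residuals ** 2) / (len(x) - 2)     # error variance, n - 2 df
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of slope
print(b1, se_b1, b1 / se_b1)                   # slope, its SE, its t-score
```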

Logic

- we have rho, which gives us direction, overall magnitude, and significance - simple regression gives us all of those, plus a way to summarize the data visually and to predict the change in the DV given a known change in the IV - simple regression also lays the groundwork for dealing with alternative causes

The logic of multiple regression

- we know that alternative explanations for observed outcomes may exist - multiple regression does everything that simple regression does, plus allows us to isolate the impact of each IV on the DV - more precisely, it estimates the impact of our explanation when the alternatives (the other variables) are at their mean values - e.g., looking at murder rates and ice cream sales while controlling for temperature

Probabilities

- much of the world is distributed along a normal curve - necessary for statistical inference - lots of cases in the middle - slow decay of cases off to both sides

two-variable regression model

- y intercept = alpha - slope parameter = beta - y is the dependent variable - x is the independent variable - ui = stochastic or random component of our dependent variable (because we do not expect all of our data points to line up perfectly on a straight line)

simplifying through interpretation

- yields a partial effect - "ceteris paribus" or all other things being equal - core of the "quasi-experimental" design: controlling for variation in observed qualities without actually having control over those qualities

Validity

1. Do our measures capture what we want them to? (MAP testing... is it testing your knowledge or your ability to sit through a test?) 2. Are our measures appropriate & complete? (diversity... it doesn't just mean race; it can be class or region, but in terms of appropriateness you wouldn't include left-handed people... example given in terms of affirmative action) 3. Is the measure systematically wrong in some way? (the MAP example, because it under-rates what students know because of the type of test it is) (if you are measuring wealth and are using income to determine it... income doesn't measure what wealth is supposed to capture)

Types of Validity

1. Face validity: a test can be said to have face validity if it "looks like" it is going to measure what it is supposed to measure 2. Construct validity: appropriateness and completeness - internal validity: if I say X predicts Y I need to make sure Z isn't influencing it; is the causal relationship captured in what I am showing you? (predicting success by increased diversity... but if diversity is measured only by race you cannot rule out everything else, so you cannot call it causal) - external validity: would your measure still depict the same thing no matter where you take your sample from? (is the measure complete enough to take to different samples and show the same result?) 3. Discriminant validity: when it is hard to distinguish between two concepts (putting party and ideology into a model... are they really different from one another? emotional ties could come with party identification, not just ideology... a party could shift ideologies)

Frequencies

1. Frequency Distributions - most basic restructuring of data to facilitate understanding (simply pairs data values or ranges of values with frequency of occurrence) 2. Percentage Distributions - allows for comparison between two distinct groups (frequency distributions) and displays percentage of total cases that fall into each class - saying how does MU compare to itself year after year or how does MU compare to other universities

Necessary pieces of information

1. The number in each sample 2. The mean values of the variable of interest 3. The variances of the individual means: s = square root of the sum of (Xi - X bar)^2 over (n - 1), where Xi is an individual observation of x; calculate for each sample mean

3 bivariate hypothesis tests

1. Tabular Analysis: the dependent variable in rows and the independent in columns; figure out what the individual marked values represent; see patterns; to know if the pattern is statistically significant use the chi-squared test (the sum of (observed value minus expected value) squared over the expected value); to compare the chi-squared value you need a critical value - critical value: you need the degrees of freedom, which come from the number of rows and columns: (rows - 1) x (columns - 1) 2. Difference of means: we have a continuous dependent variable and a categorical independent variable; we want to determine whether the distribution of the dependent variable is different across the values of the independent variable; we do so by comparing our real-world data with what we would expect to find if there were no relationship between our independent and dependent variables 3. Correlation coefficient: covariance summarizes the general pattern of association between two variables (variance is the analogous quantity for a single variable)

Paper code

1. analyze: give the main divisions or elements 2. classify: arrange into main divisions 3. compare: point out the likenesses 4. contrast: point out the differences 5. criticize: give your perspective on good and bad features 6. describe: name the features of or identify the steps 7. discuss: examine in detail 8. evaluate: give your perspective on the value or validity 9. explain: make clear, give reasons for 10. illustrate: give one or more examples of 11. interpret: give an explanation or clarify the meaning of 12. justify: defend 13. review: examine on a broad scale 14. significance: show why something is meaningful 15. summarize: briefly go over the essentials

terminology

1. best-fit or regression line: a guess about the values of the dependent variable based on known values of the independent variable, trying to draw the line that minimizes the sum of squared differences between the line and the actual values 2. b is the slope of that line, or the regression coefficient for x - a t-test is used to arrive at the probability that b is different from 0 (that the slope is not due to chance) 3. x is the predictor or regressor variable for y 4. fitted values: predicted values of the dependent variable given the slope and the intercept of the regression equation 5. residuals: the difference between the actual observation value and the fitted value - the regression model selects B0 and B1 to minimize the sum of squared residuals, hence OLS ** B0 = the y-intercept or constant, the value of y if x = 0

Calculation of Gamma

1. calculate the number of concordant and discordant pairs 2. calculate the difference between concordant and discordant pairs 3. divide that number by the sum of the pairs (a sketch follows) * continuous variables give you the most information/precision
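The three steps in a short Python sketch; the paired ordinal ratings are invented:

```python
from itertools import combinations

xs = [1, 1, 2, 2, 3, 3, 3]                     # invented ordinal ratings
ys = [1, 2, 2, 3, 2, 3, 3]

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
    product = (x1 - x2) * (y1 - y2)
    if product > 0:
        concordant += 1        # pair ordered consistently with the hypothesis
    elif product < 0:
        discordant += 1        # pair ordered inconsistently
gamma = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, gamma)
```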

for gamma, tau and rho:

1. if relationship is negative and perfect the measure equals -1 2. sign equals direction, - is negative, + is positive ** all of these measures are calculated by comparing the observed values and the expected values (the values we would expect if there is no relationship) ** they are used to determine the strength and direction of association or the statistical likelihood that a relationship actually exists

for lambda, gamma, tau and rho:

1. if relationship is positive and perfect, the measure equals 1 2. if no relationship exists the measure equals zero 3. the further the number is from zero the stronger the relationship

Chapter 12 applications of multiple regression models

1. presidential popularity: it is hard to say no to a popular president, popularity is used in bargaining situations, and unpopular presidents are not normally influential presidents - approval fluctuates based on economic reality (measured by inflation and unemployment rates, which drive approval ratings up and down) - consumer confidence: the public's perception of the economy - for a 1-point increase in Consumer Confidence we expect to see an immediate increase in presidential approval, controlling for the effects of inflation and unemployment

Assumptions of a t-test

1. the populations have the same variance (how good a representation the data are of the mean) 2. the populations are normally distributed 3. the samples are drawn independently of one another * violating the first two has only minor consequences; violating the third biases the result

Chapter 12 applications of multiple regression models

2. transition societies: like those in Eastern Europe, for which democracy and markets are new experiences - approach 1: citizens' sentiments about democracy are often a function of the public's experiences with the market economy and how these actual experiences differ from their expectations; if expectations are exceeded, they will be supportive - approach 2: citizens' evaluations of the pros and cons of democracy will be based primarily on visible evidence that the new democratic institutions are working to represent citizens' interests

Chapter 12 applications of multiple regression models

3. causes of international trade: the size of each nation's economy, the physical distance between them, and the overall level of development - approach 1: friendly nations are more likely to trade than those in conflict (interstate conflict can result in embargoes and can raise the risks for firms that wish to engage in cross-border trading) - approach 2: trade will be higher when both nations are democracies and lower when one (or both) is an autocracy (firms in a democracy have more open political and judicial systems, so disputes will be resolved openly and fairly in courts) - approach 3: states in alliance with each other are more likely to trade (outside an alliance, one nation may fear that the gains from trade will be used by the other to arm itself for future conflict)

Chi-Squared

chi-squared = sum of (O - E)^2 / E: observed minus expected, squared (to get rid of negative numbers), divided by the expected number ** add all of those together across the different cells (a sketch follows)
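The same computation by hand and then via scipy, on an invented 2x2 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],                 # invented 2x2 table
                     [20, 40]])
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()          # E for each cell
chi2 = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p, dof, _ = chi2_contingency(observed, correction=False)
print(chi2, chi2_scipy, p)                     # hand value matches scipy
```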

expected values:

y hat = B0 + B1(x bar 1) + B2(x bar 2) + ... + Bk(x bar k), where x bar = the mean of the sample that you have

Begs the question.. Ambiguous or vague... Descriptive or historical...

Begs the question.. do not answer the question asked, make assertions about some small aspect of the question without really answering it Ambiguous or vague... lack specific detail/need clarification. should ask : so what? Descriptive or historical... factually correct but no critical or analytical response

intercept

B0 (with a hat over it) = y bar - B1(x bar) * the hat over B naught is because it is a sample guess about the population / a prediction * subtracting the estimated slope times the mean value of x * taking x out of the equation * the mean value of the dependent variable when the value of the independent variable is 0 (see the sketch below)
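The formula in code, with invented data:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])             # invented data
y = np.array([3.0, 7.0, 8.0, 12.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()                   # y-bar minus slope times x-bar
print(b0, b1)                                   # 0.5 and 1.4 here
```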

Deductive vs. inductive

Deductive: broad assertion, define its limits, give reasons, provide evidence Inductive: state common belief, give reasons, explain specific implications, explain general implications

violating the assumptions

Error variance is equal - in statistics, a collection of random variables is heteroskedastic (often spelled heteroscedastic, and commonly pronounced with a hard k sound regardless of spelling) if there are sub-populations that have different variabilities from others; "variability" could be quantified by the variance or any other measure of statistical dispersion, so heteroskedasticity is the absence of homoskedasticity - the possible existence of heteroskedasticity is a major concern in regression analysis (including the analysis of variance) because it can invalidate statistical tests of significance that assume the modelling errors are uncorrelated and normally distributed and that their variances do not vary with the effects being modelled - the equal-variance assumption is rarely true - small deviations as x increases are not problematic - large deviations cause the estimates of standard errors to be biased - test with hettest (the Breusch-Pagan / Cook-Weisberg test for heteroskedasticity)... testing for consistency of the error variance: we reject the null that the error variance is equal if this test is significant (reject homoskedasticity in favor of heteroskedasticity) - what to do: the robust option (White/sandwich robust standard errors, which adjust the standard error variances); significance can change (see the sketch below)
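A sketch of the same test in Python via statsmodels, mirroring Stata's hettest; the data are simulated so that the error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 300)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)      # error spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)                                # small p: reject equal variance

robust = fit.get_robustcov_results(cov_type="HC1")  # White/sandwich SEs
print(fit.bse, robust.bse)                      # significance can change
```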

Total sum of squares = all variation in the dependent variable

Explained sum of squares = the fitted values, the line we drew: how much of that variation we managed to actually explain, given the variables we included Residual sum of squares = what is left over (total minus what we explained), so it tells us how poorly the line fits

Central Tendency (Mean)

FOR POPULATIONS: mu = sum of all x over N FOR SAMPLES: x bar = sum of all x over n - includes the value of every observation - rigidly determined - may take on unrealistic values - extreme values have a disproportionate impact (particularly when n is small) - average number of children in a household = 2.7: there is not .7 of a kid somewhere, but when you calculate the mean it ends up like that - if it is us and Bill Gates in a room, the mean income will be skewed (the mean is going to look high because of his millions)

Levels of measurement

Mode - Nominal/categorical data (distinguished by name only, not by numerical value) ** least informative Median - Ordinal data (rank ordered where the distance between the categories is not the same) Mean - Interval/ratio data (money, age, grades) *** interval/ratio gives you most information because it has the most values PRECISION INCREASES WITH MORE INFORMATION IN DATA Mode: does not need to be central and can take on more than one value

On april 17th the professor's socks were: lime green (extra credit on final exam)

residual sum of squares - sample variation in errors

SSR = sum of all u hat squared * how much we did not explain in the real world... R squared = SSE/SST = 1 - SSR/SST (1 minus the ratio of what we didn't explain to everything there was to explain: the maximum value it can have is 1, the minimum is 0) - people are harder to predict than state governments (which basically do the same thing from year to year) - it is easier with 50 people to get a lower sum of squares, harder with 1000 (see the sketch below)
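The decomposition in numpy, on invented data; SSE here is the explained sum of squares and SSR the residual sum of squares, matching the card:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])         # invented data
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b1 * (x - x.mean())          # fitted values

sst = np.sum((y - y.mean()) ** 2)               # all variation in y
sse = np.sum((y_hat - y.mean()) ** 2)           # variation the line explains
ssr = np.sum((y - y_hat) ** 2)                  # variation left over
print(sst, sse + ssr)                           # SST = SSE + SSR for OLS
print(sse / sst, 1 - ssr / sst)                 # both are R-squared
```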

Hypothesis testing

the alternative hypothesis is the opposite of the null hypothesis - we observe a sample slope parameter, which is an estimate of the population slope; then we build a confidence interval around it and evaluate how likely it is that we would observe this sample slope if the true but unobserved population slope were equal to zero

change in y = B1(change in x1) + ... + Bk(change in xk), where k indexes any of the x's in the sequence

root mean squared error

is also called the standard error of the regression; it works like a standard deviation of the residuals

model significance

is the F test - individual p-values = each variable's significance - chi-squared = overall significance

t test info

keep adding observations... you keep adding degrees of freedom... this gets the standard error closer and closer to zero, but each observation's contribution gets smaller as the number of observations grows - want to be 95% sure

Measures of association

lambda - nominal; high association = 1.0 (tells us if the relationship is strong, but nothing about the direction... which is hard to define anyway since the data are nominal) gamma - ordinal; high association is +1 or -1 (tells us about strength and direction) tau - ordinal; high association is +1 or -1 (tells us about strength and direction) chi-squared - categorical; high association is infinity (tells us how sure we are about the relationship, a step toward estimating the probability of the relationship) rho - interval/ratio/continuous; high association is +1 or -1 (the correlation coefficient) (gives strength and the probability that you are right) (see the sketch below)
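Two of these measures are available directly in scipy (lambda and gamma are not; see the gamma sketch earlier); each call returns strength/direction plus a p-value. The paired scores are invented:

```python
from scipy import stats

x = [1, 2, 2, 3, 4, 5, 5, 6]                    # invented paired scores
y = [2, 1, 3, 4, 4, 6, 5, 7]

tau, p_tau = stats.kendalltau(x, y)             # ordinal: strength + direction
rho, p_rho = stats.pearsonr(x, y)               # interval/ratio correlation
print(tau, p_tau)
print(rho, p_rho)
```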

Bivariate hypothesis testing: observing relationships between two variables / Error in testing

major questions: how strong is the relationship? what direction is the relationship? how likely is it that the observed relationship is not due to chance? - Type 1 error: rejecting the null hypothesis when it is true - Type 2 error: accepting the null hypothesis when it is false (the primary concern) ** the null hypothesis is "no relationship," not the reverse of the alternative hypothesis

root mean-squared error

a measure of the overall fit between a regression model and the dependent variable - the standard error of the regression model - provides a measure of the average accuracy of the model in the metric of the dependent variable (see the sketch below)
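A one-line sketch with invented observed and fitted values; note that regression software often divides by the residual degrees of freedom rather than n, so this is the simple mean-squared version:

```python
import numpy as np

y = np.array([10.0, 12.0, 9.0, 14.0, 11.0])     # invented actual values
y_hat = np.array([9.5, 12.5, 10.0, 13.0, 11.5]) # invented fitted values
rmse = np.sqrt(np.mean((y - y_hat) ** 2))       # typical miss, in units of y
print(rmse)
```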

are the relationships of each regressor with the dependent variable statistically significant?

one-tail vs. two-tail tests (do a one-tail test when you have a directional hypothesis, which is most often... e.g., an increase in x leads to an increase in y) -- you need to hit 1.69 on the t-test (two-tail) for it to be significant (the cutoff) ^ for one-tail tests, the p-value will be half of the two-tail p-value, making it more significant - gender: females typically coded 0, males typically coded 1 - party: Republicans typically coded 0, Democrats typically coded 1

directional hypothesis

parameter is either positive or negative (not just different from zero) - ex: the better the economy is performing, the higher will be the vote percentage for the incumbent-party candidate

degree of freedom

the number of parameters that can be used to calculate a statistic - defined by how much you begin with and how much you have already used up through other means - begin with the number of observations... used up = using a sample, or the number of variables you use in a regression model

F test

the significance of the amount of variation explained by all variables in the model (do the independent variables jointly predict the dependent variable?)

assumption:

that there is no exact linear relationship between any two or more of our independent variables (X and Z)... aka assumption of no perfect multicollinearity

Example

the relationship between age and feeling thermometer score (Bush): Age = .109 (weak positive relationship); significance (p-value) = .000 (more than 99.9 percent sure it did not happen by chance) ** we can have a weak correlation but be sure of it because of the large sample size

talk about error... talk about residuals.. how are they related and how are they different

think short answer

standardized coefficients:

though not normally comparable, there is a rather simple method to remove the metric of each variable to make them comparable with one another

sample regression model

used to make inferences about the unseen population regression model - its terms carry hats (^), meaning they are estimates

residual

u hat = Yi - Y hat i * looking at error in sample * Yi = dot on that scatter plot * Y hat sub i = predicted value * we care about how good the prediction was

crosstabulation

used to assess the relationship between two categorical variables (nominal or ordinal) - chi squared and gamma tests used most often - correlations are used to assess the strength and direction of the relationship between two interval level variables - rho used most often - difference of means testing is used to compare two groups - T-statistic

Chapter 9: bivariate regression models

we control for another variable (Z) as we measure the relationship between our independent variable of interest (X) and our dependent variable (Y)

go over

when you need a one tail vs two tail test

will have print out

with sets of info you need to assess size/variation of model (SPSS) - know what makes things significant

best-fit

y = a+bx+u a = y intercept (constant)... when x = 0 b = slope of best-fit line.... rise over run (if I increase x what changes in y do I observe?) y = dependent variable x = independent variable u = error or unobserved factors (most basic presentation... also seen as a lowercase e, shows the difference between regression line I draw and the real world.. if I calculate it for the whole world)

fitted values

y hat = B0 + B1(x), for any given value of x * the predicted value of y * leaves off the error term

Properties of a z-score

z = (x - x bar) / s - it has a mean of 0 (because half the cases are higher, half lower) - it has a standard deviation of 1 (because each deviation is divided by the "standard deviation," and on average this is like dividing a thing by itself) - the z-score is the number of standard deviations an observation falls from the mean (see the sketch below)
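The transformation in code; the scores are invented, and pstdev is the population form of the standard deviation:

```python
import statistics

scores = [55, 60, 65, 70, 75, 80, 85]           # invented scores
x_bar = statistics.mean(scores)
s = statistics.pstdev(scores)                   # population-form SD
z = [(x - x_bar) / s for x in scores]
print([round(v, 2) for v in z])                 # mean 0, SD 1 by construction
```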

