Categorical Data Analysis quizlet


what does it mean to say ML estimators are asymptotically normal?

in large samples, the sampling distribution of the estimators is approximately Normal (similar to the Central Limit Theorem)
-this justifies use of a Normal table to calculate p-values and confidence intervals

how does this affect the interpretation of a linear regression model where the dependent variable is dichotomous?

since the model implies that the expected value of the dependent variable is a linear function of the x's, this is the same as saying the probability of an event is a linear function of the x's
-this leads to the "linear probability model"

how do you find the values of the betas that maximize the log-likelihood?

take the first derivative of the log-likelihood, set it equal to 0, and solve the resulting equation
-we are calculating a formula for the tangent at the maximum point...the slope of the tangent at the maximum will be 0, so we use that information to find the corresponding beta values
-since we have multiple betas, the derivative is actually a vector derivative

what is the log likelihood?

takes the log of the likelihood, which is easier to work with because it converts products into sums and exponents into coefficients
-there are also attractive theoretical properties of the log likelihood
-also, log is an increasing function (so maximizing the log likelihood also maximizes the likelihood itself)
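
In symbols (the standard Bernoulli form, consistent with the likelihood described elsewhere in these cards):

\log \mathcal{L}(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]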

what if you want to compare specific categories, where null hypothesis is that the coefficients on the 2 categories are equal? e.g. compare category 3 with category 2?

test i2.varname = i3.varname
-if the p-value is below 0.05, we reject the null and conclude that there is evidence that the 2 coefficients are not equal
-this test is also invariant to what you choose as your reference category

what if you want an overall test of the null hypothesis that all of the coefficients for a categorical variable are equal to 0?

testparm i.varname
-this test is invariant to choice of reference category
-if the p-value is below 0.05, we reject the null and conclude that at least one of the coefficients for the categorical variable is not equal to 0

what happens if you use the asis command in presence of quasi-complete separation?

the Newton-Raphson algorithm iterates forever!
-can force it to produce results by "asis iterate(# of iterations)"
-it will then produce a valid likelihood ratio chi2 with 1 df, but again, if the sample is small, can't be confident about this number
-the coefficient on the variable with quasi-complete separation will get bigger and bigger as stata does more iterations (so can't trust it)
-z-statistics aren't valid

what do we use as a measure of predictive power in linear regression? Can we do this for logistic regression?

the R-squared
-there is no agreement on what measure of predictive power to use for the logistic model

what is the linearity assumption?

the dependent variable is a linear function of a set of variables plus an error term (where the error term represents the variables not included)
-this linear equation applies to all individuals in the sample

what is the assumption of normally distributed error term?

the error term is Normally distributed
-note: this doesn't make any assumptions about the distribution of the X's (they can be non-Normal)

why can't assumption 5 (Normal distribution of error term) be true in case of LPM?

the error term is equal to the observed value of y minus the expected value of y
-but in the case of the LPM, this can only take 2 different values: 0 - πi (so -πi) or 1 - πi
-so, it's impossible for the error term to be normally distributed!

what is the expected value of a dichotomous/dummy variable?

the expected value of a dichotomous variable is equal to the probability (πi) that the variable = 1

what is another name for the vector of first derivatives?

the gradient

what exactly does hazard measure/tell us?

the instantaneous probability of an event occurring

what is the link function in a probit model?

the inverse of the cumulative distribution function for a standard normal variable
g = Φ^-1

if Z is a standard Normal variable, then
Φ(a) = Pr(Z ≤ a)

and Φ^-1(πi) = a linear function of the x variables
πi = Φ(bxi)

what feature of logistic regression makes it particularly well-suited for maximum likelihood estimation?

the likelihood function for logistic regression is strictly concave, meaning there are no local maxima
-therefore, if the algorithm converges, you know you've got the right result

what is the log likelihood in maximum likelihood estimation?

the likelihood function is the probability of observing the entire set of y values in our sample
-since each probability is between 0 and 1, its log is negative, so the log likelihood will always be negative

what is the most direct measure of fit?

the log-likelihood
-a model that perfectly fits the data will have a log-likelihood of 0 (because log(1) = 0)

when used as a fit measure, it's common to use -2log(L):
1. the statistic is positive (log-likelihoods are always negative, unless the model fits the data perfectly)
2. differences in -2log(L) for nested models are Likelihood-Ratio chi2 statistics
3. when fitting models for individual-level data, this is equivalent to the deviance (but not for aggregated or grouped data)

why does quasi-complete separation occur in this situation?

the maximum likelihood estimator of the slope coefficient for a 2x2 table is the logarithm of the cross-product (odds) ratio. For the table in the prev. flashcard, that is:

bhat = ln((5x10)/(15x0))
-but if there's a 0 in the denominator, the ratio is undefined!
-if there's a 0 in the numerator, the answer is also undefined b/c you can't take the log of 0!

what is the likelihood ratio chi2 in the model output testing?

the null hypothesis that all of the coefficients on the independent variables in the model are equal to 0

how do probit results compare to logit results?

the probit curve is very similar in shape to the logit curve
-the probit model's curve has tails that approach 0 a little more quickly than the logit
-thus, you will rarely get different qualitative results using the 2 methods
-it would take very large samples to discriminate b/w them

what is the simpler model? what is the more complex model?
-i.varname vs varname
-x1 vs x1 x2
-x1##x2 vs x1 x2

the simpler model is the one that imposes more restrictions on the parameters
-treating a variable as continuous is always simpler than treating it as categorical (constrains the variable's effect to be linear)
-including fewer variables in the model is always simpler than including more (constrains the coefficients on the variables not included to equal 0)
-including no interactions in the model is always simpler than including them (constrains all interaction coefficients to 0)

what is the assumption of mean independence?

the x's are unrelated to the random disturbance (not the same as saying the x's are independent of the random disturbance)
-no correlation between the error term and any of the x's
-the expected value of the error term, given a particular value of x, is 0 (i.e. knowing x's value doesn't affect the expected value of the error term, which is always 0)
-this is important, b/c if there is correlation b/w the error term and the x's, this would lead to bias (it would mean there is a lurking variable that is associated both with the x's and with y, leading to a spurious relationship b/w x and y)

what is the assumption of homoskedasticity?

there is constant variance of the error term across individuals
-so knowing x does not affect the expected variance of the error term, which is always equal to a constant sigma^2

how does the linear regression equation for log odds differ from the usual linear regression equation we see?

there is no error term
-because the left-hand side is a function of the expected value (the probability πi), not an individual observed outcome, so there is nothing left over for an error term to capture

what is the problem with maximizing the log-likelihood via setting vector of first derivatives equal to 0?

there is no explicit solution
-so we need to use iterative numerical methods

what is McFadden's method?

this is how Stata produces a pseudo-R-squared
-L0: log-likelihood for the model with no covariates
-L1: log-likelihood for the model with all covariates

McFadden R-squared = (L0-L1)/L0

in other words, how much do we improve the fit of the model by putting all the x-variables into the model?

Pros & Cons:
-can't be interpreted like a linear regression R-squared (can't say the model explains 20% of the variance in y, because variance doesn't appear in the formula)
-but it tends to be roughly the same magnitude as a linear regression R-squared
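
A minimal Python sketch of this calculation, using statsmodels on made-up data (the fitted Logit result exposes the full-model log-likelihood as llf and the intercept-only log-likelihood as llnull):

import numpy as np
import statsmodels.api as sm

# Toy data: binary outcome y and one predictor x (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(0.5 + 1.2 * x)))).astype(int)

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)

# McFadden pseudo R-squared: (L0 - L1) / L0 = 1 - L1/L0
mcfadden = 1 - fit.llf / fit.llnull
print(mcfadden, fit.prsquared)   # statsmodels reports the same quantity as prsquared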

why do we use standardized coefficients?

to allow you to compare the effects of different predictor variables, even if they use different units of measurement
-tells you how many standard deviations y changes for a one-standard-deviation increase in x

what can you do to compare coefficients in logit models across two different groups?

use the complogit command

ex. complogit y x1 x2, group(female)
-will produce model(s) and perform tests
-make sure to exclude the group variable (don't include it as an indep var)
-performs 2 likelihood-ratio tests:
1. tests the null hypothesis of equal residual variation for the two groups
2. tests the null hypothesis that all coefficients are the same
-also performs a Wald test of the null hypothesis of equal residual variation

ex2: complogit y x1 x2, group(female) specific(variable that you think has unequal residual variation)
-performs 1 likelihood ratio test and 1 Wald test:
1. LR test of the null hypothesis that the coefficient of the variable specified is the same across groups
2. Wald test of the null hypothesis that the coefficient of the variable specified is the same across groups

once have the formula for log likelihood, what do you want to do?

want to find the values of the betas for each independent variable that maximize the log likelihood

where does the term 'logit' come from?

we call the left side of the logistic regression model the 'logit'
-the part that says: log(πi/(1-πi))
-the logit is sometimes called the "log odds"

why do we like to use odds ratios and what are they?

we can measure the 'effect' of a dichotomous variable by taking the ratio of the odds of the outcome event for the two categories of the independent variable
-to find the odds ratio, divide the cross products of a 2x2 table
e.g. if looking at the effect of a drug on survival, and we observe that 90 people who took the drug survived while 10 died, and 70 people who didn't take the drug survived while 30 died, the odds ratio would be (90x30)/(70x10) = 3.86. This says that the effect of getting the drug is to multiply the odds of survival by 3.86. In other words, getting the drug increases the odds of survival by 286%
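
The arithmetic from the drug example, as a quick Python check (using the numbers quoted above):

# 2x2 table: rows = drug (yes/no), columns = survived/died
drug_survived, drug_died = 90, 10
nodrug_survived, nodrug_died = 70, 30

odds_ratio = (drug_survived * nodrug_died) / (nodrug_survived * drug_died)
print(round(odds_ratio, 2))            # 3.86
print(round((odds_ratio - 1) * 100))   # ~286% increase in the odds of survival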

when does quasi-complete separation most often occur?

when an explanatory variable x is a dummy variable and, for one value of x, either every case has the event y=1 or every case has the event y=0

e.g.
        y=1   y=0
x=1      5     0
x=0     15    10

when can you NOT use log likelihood as a measure of goodness of fit?

when comparing non-nested models with different numbers of predictors
-by chance alone, we would expect models with more predictors to have lower -2log(L)
-so the -2log(L) is typically adjusted to penalize models with more predictors

where does the variance of the error term reach its maximum in a linear probability model?

when the probability of the event (πi) = 0.5
-the variance of the error term will be very small when πi is near 1 or 0

what does goodness of fit tell us?

whether or not we need a more complicated model than the one we have fitted to adequately represent the data
-i.e. whether we need interactions and non-linearities
-does NOT include adding more variables into the model

diff b/w predictive power and goodness of fit:
-predictive power: is this model better than nothing? by how much?
-goodness of fit: is there something better than the model, given the current set of variables?

if you treat a variable as categorical instead of continuous, what would you expect to happen to standard errors?

would expect them to go up because estimating more things (the more you estimate, the bigger you expect the standard errors to be for all variables)

what is the general premise of latent variable interpretation of logit, probit, and complementary log-log models?

yi = dep var
xi = indep var
yi* = latent variable

yi* is correlated with xi such that
yi* = alpha0 + alpha1(xi) + sigma(ei)

suppose there is some threshold theta such that:
if yi* > theta, yi = 1
if yi* ≤ theta, yi = 0

what is the probability that yi=1? (i.e. what is πi?) and how does it depend on x?
-that depends on the probability distribution of the error term...different assumptions about the error term lead to different conclusions about the probability of an event

if the odds are 3.5, what is the probability of the event?

π = odds/(1+odds)

π = 3.5/(1+3.5) = 0.78

but what else is wrong with the linear probability model?

πi = alpha + beta(xi)
-in this equation, the left-hand side is constrained to lie between 0 and 1, but the right-hand side has no such constraints
-for any values of the betas, we can always find some values of x that give values of π that are outside the permissible range
-thus, a strictly linear model just isn't plausible

what do the various assumptions imply about the OLS estimator b (i.e. the slope coefficient)?

-assumptions 1 and 2 imply that b is unbiased (linearity and mean independence --> unbiased)
-assumptions 3 and 4 imply that OLS b is BLUE
-assumption 5 implies that OLS b is normally distributed across repeated samples (so we can use a normal table for significance tests and confidence intervals), b is MVUE, and b is MLE

is the linear probability model a reasonable model?

-assumptions 1 and 2 of the linear regression model seem to hold for the linear probability model (i.e. linearity and mean independence)
-this means that the OLS estimators a and b are unbiased
-BUT if the linear probability model is true, then assumption 3 (homoskedasticity) cannot also be true --> there is an internal contradiction
-assumption 4 (no autocorrelation) is fine
-but assumption 5 (normal error term) is also necessarily violated by the LPM

how does N-R algorithm get us an estimate of the covariance matrix of the coefficients?

-at the last iteration, take the inverse of the negative Hessian matrix (the observed information matrix) to get an estimate of the covariance matrix of the coefficients
-this gives us the estimated variances
-standard errors of the coefficients are calculated by taking the square root of the main-diagonal elements of this matrix
-the estimated standard errors are OIM (Observed Information Matrix) standard errors

why can't you use Pearson chi2 as overall test of goodness of fit for a model that includes continuous predictor variables?

-b/c with individual-level data, the Pearson chi2 does not have a chi2 distribution
-the Pearson chi2 is calculated by subtracting the expected value from the observed value, squaring, and dividing by the expected value
-but each "cell" in the contingency table and the expected values (which are the predicted probabilities) are too small to get a good approximation to the chi2 distribution

why can't homoskedasticity assumption be true for a LPM?

-because the variance of the error term is NOT a constant. It is equal to the probability of the event times (1 - prob. of the event), i.e. p(1-p), which varies across individuals

what are some solutions we might use for this imbalance in the linear probability model?

-broken line: within a particular range of values, the relationship is linear, but when x goes above a certain value, the probability stays at 1, and if x goes below a certain value, the probability stays at 0
--> but this isn't really plausible
-s-shaped curve, so that as π gets closer and closer to 1, the tangent of the line has a smaller and smaller slope, and as π gets closer to 0, the slope also gets smaller and smaller

what is limitation of BIC and AIC?

-can be used to compare non-nested models w/ diff. sets of covariates
-BUT they cannot be used to construct a formal hypothesis test, so the comparison is only informal
-for both statistics, smaller values mean a better fit, but the overall magnitude of these statistics is heavily dependent on sample size

how do we find betas that make the first derivative of the log likelihood equal to 0?

-choose values of beta that make the sum of cross products of the observed values of y and x equal to the sum of cross products of the observed x and the predicted y
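
Written out for a logit model (the standard score equations; x_i here is assumed to include the constant):

\sum_{i=1}^{n} x_i \, y_i \;=\; \sum_{i=1}^{n} x_i \, \hat{\pi}_i , \qquad \hat{\pi}_i = \frac{1}{1 + e^{-x_i'\hat{\beta}}}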

after regression was 'discovered,' what was an important limitation of this approach?

-correct only when dependent variable is at interval or ratio level (not really appropriate for categorical dependent variables)

what is the consequence of the violation of the Normality assumption?

-even if the standard errors are right, we can't use a Normal table to do tests, calculate CIs, etc.

what is the classification table approach?

-get predicted probabilities that the event will occur and then use a cut-off point to determine which predicted probabilities lead to a prediction of the event occurring vs not occurring (a common cut-off point is a predicted probability of 0.5)
-then, compare what our prediction would be based on the cut-off point to the observed data (e.g. a person has a predicted probability of 0.6...so we predict that they will have the event...look at the observed data and either they do or do not have the event)
-then, get the ratio of individuals who were correctly predicted over the total number of individuals in the sample
-can do this in stata using the "estat classification, cutoff(0.5)" command, where 0.5 = cut-off point...can be changed (0.5 is the default)
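
A small Python sketch of the same bookkeeping, on made-up predicted probabilities and outcomes (it mirrors what the classification table reports; the cutoff is the 0.5 default mentioned above):

import numpy as np

p_hat = np.array([0.81, 0.35, 0.62, 0.10, 0.55, 0.44])   # hypothetical predictions
y_obs = np.array([1,    0,    1,    0,    0,    1   ])   # hypothetical outcomes

cutoff = 0.5
y_pred = (p_hat >= cutoff).astype(int)

correctly_classified = (y_pred == y_obs).mean()
sensitivity = y_pred[y_obs == 1].mean()          # P(predict event | event)
specificity = (1 - y_pred[y_obs == 0]).mean()    # P(predict non-event | no event)
print(correctly_classified, sensitivity, specificity)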

how can you get a really high number for correctly classified even if model isn't very good at prediction?

-if event is very rare and we use model that predicts no one has event, w/o using any analysis of data, we would get a high correctly classified level

how serious are these problems with the linear probability model?

-if the sample is moderately large, lack of normality is rarely a problem because the Central Limit Theorem tells us that test statistics will be approximately Normal
-heteroskedasticity is more serious, but in many applications it makes little difference. Moreover, we can use robust standard errors to correct for heteroskedasticity (note: there is less justification for using robust standard errors as a solution if the sample size is small)

if π = 0.75, what are the odds? if π = 0.6, what are the odds?

-if π=0.75, the odds are 3 or "3 to 1" (for every 1 time the event did not occur, there are 3 times where the event did occur) --> 0.75/0.25 = 3
-if π=0.6, the odds are "3 to 2" --> 0.6/0.4 = 3/2

in all three cases, what do coefficients for the models depend on? what implications does this have?

-in all 3 cases, the coefficients depend on sigma (i.e. the coefficient on the error term), because beta1 = alpha1/sigma
-as sigma increases, the estimated coefficient b gets smaller
-since sigma represents the effect of 'unobserved heterogeneity' on y, we see that unobserved heterogeneity (residual variance) leads to attenuation of regression coefficients (toward 0)
-this is why it is desirable to include all relevant predictors of y, not just those that are correlated w/ the other x's
-this also means that cross-population comparisons of coefficients can be misleading when estimating logit, probit, or comploglog models, b/c sigma can vary across populations

what are features of complementary log-log models? when is it used?

-like other link functions, it takes the probability and converts it into a quantity that doesn't have lower or upper bounds
-unlike logit and probit, it is asymmetrical (π approaches 1 more rapidly than it approaches 0)
-b/c the model is asymmetrical, it's critical that 1s and 0s are coded properly, so that 1 = event (coefficients will change in size depending on how the outcome is coded...the z-stats and p-values will also change; for logit, they will only change in sign)
-primarily used for predicting the probability that some event will occur in an interval of time
-it is implied by an underlying proportional hazards model (Cox model)
-exponentiated coefficients can be interpreted as hazard ratios
-the model is invariant to the length of the time interval

if we convert to log odds ratio, how does this affect interpretation?

-log odds ratio will be positive for a positive effect
-log odds ratio will be zero for no effect
-log odds ratio will be negative for a negative effect

what estimation method do we use for probit model? what computing algorithm do we use?

-maximum likelihood
-Newton-Raphson algorithm

what should you do if you have multicollinearity?

-might delete one of the highly correlated variables
-might combine the x-variables that are highly correlated (but can only do this if you truly think they are measuring the same thing)

what are the 2 types of categorical data?

-nominal: no ordering
-ordinal: categories are ordered

what is an odds ratio of 1? what is an odds ratio between 0 and 1? what is an odds ratio greater than 1?

-odds ratio of 1 = no effect
-odds ratio between 0 and 1 = negative effect
-odds ratio above 1 = positive effect

when you produce classification table for fitted model with command "estat class," what other output will you see?

-sensitivity: of all individuals who had the event, what percent were predicted to have the event (i.e. if testing for a disease, sensitivity is the ability to give a positive result in the presence of the disease)
-specificity: of all those who didn't have the event, what percent were predicted not to have the event (i.e. if testing for a disease, specificity is the ability to produce a negative result in the absence of the disease)
-positive predictive value: given that we predicted someone will have the event, what percent actually do have the event
-negative predictive value: given that we predicted someone would not have the event, what percent do not have the event

what is the consequence of a violation of the homoskedasticity assumption?

-the OLS estimates will no longer be fully efficient (they will have more sampling variability than other estimates)...in other words, no longer BLUE
-plus, our formulas for standard errors require homoskedasticity, so we can't trust the SEs, and thus can't trust p-values or CIs

what assumption represents the biggest danger in any regression method?

-the assumption of no omitted explanatory variables, b/c if this isn't true, can lead to bias

if we stick to the probability scale and yet we estimate a logit model, this leads to what strange property?

-the effect of x on the probability will depend on the level of the other x's (since πi is a function of all the x's in the model)
-on the logit scale, this is not the case...the effect of each x is independent of the other x's on the logit scale

why doesn't the algorithm for maximum likelihood estimation converge in context of complete separation?

-the estimated slope coefficient is tending toward infinity...it gets larger at each iteration, and the log-likelihood continues to increase
-x is "too good" a predictor of y
-the log likelihood function will get closer and closer to 0, but will never reach a maximum

how do Wald tests, score tests, and likelihood ratio tests compare to each other?

-they are asymptotically equivalent (i.e. in large samples they have approximately the same distribution)
-all three only have approximately a chi2 distribution (as n increases, they get closer to a chi2 distribution)
-in small samples, there is some evidence that Likelihood Ratio tests are superior
-however, the usual p-values reported for individual coefficients are Wald statistics b/c they are more easily computed

which of the various means of calculating R^2 analog does Prof like best?

-used to prefer McFadden R^2 (and Cox-Snell) because they are based on the log likelihood
-but in some cases, we want to compare models based on different methods (e.g. might want to compare an OLS model to a logistic model), so it's nice to have a measure that can be used for any method
-also, when you add variables to the model, McFadden R^2 and Cox-Snell R^2 will go up (b/c based on the log likelihood), but this isn't necessarily the case for Tjur R^2
-Tjur R^2 is bounded by 0 and 1

what is the only situation in which the homoskedasticity assumption will hold for a LPM?

-when b=0

what if you want an overall test of the goodness of fit of the model (not just a statistic like BIC, AIC, or Log-Likelihood, and not just comparison of fit of 2 models)?

-when the predictor variables are all categorical and the data are aggregated (i.e. can put them into a contingency table), you can perform a valid chi2 test using either the deviance or the Pearson chi-square
-but if the predictor variables are not all categorical and thus you can't create a contingency table (unless you create dummy variables for the continuous variables), you can't use the deviance or Pearson chi2

what is the relationship b/w probabilities and odds?

-when probabilities are low, odds tend to be just a little higher than probabilities
-as probabilities increase, odds increase more rapidly
-as probabilities get greater than 0.5, odds get greater than 1 and at an increasingly rapid rate
-as the probability gets closer to 1, the odds get bigger and bigger

what problem is described by the concept of "multicollinearity"?

-when there are high correlations among the independent variables in a model, it makes it difficult to separate out the effect of each indep. var on dep var

describe a logarithm function (shape, etc.)

-when we talk about logarithm functions, we usually work with the natural log
-when x is small, there is a rapid increase in log of x, but as x gets bigger, the rate of increase gets less and less rapid
-it's not possible to take the log of zero or negative numbers
-so log(x) is defined only for x > 0 (the log itself can be any real number: negative for x < 1, 0 at x = 1, positive for x > 1)

what do we need to keep in mind when working with s-shaped curves as models?

-within certain bounds, the linear approximation will be quite good, but weighted least squares tends to put most weight on values closest to 1 and 0 (those that are most nonlinear)

what measures of goodness of fit do we use when comparing non-nested models?

1. Akaike's Information Criterion (AIC)

AIC = -2log(L) + 2k

where k is the # of covariates

2. Bayesian Information Criterion (BIC)

BIC = -2log(L) + k*log(n)
-more severe penalization for additional covariates (in most cases)
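
A quick Python sketch of both formulas, with hypothetical values for the log-likelihood, k, and n (note the card counts k as the number of covariates; some software counts all estimated parameters):

import numpy as np

log_likelihood = -250.0   # hypothetical log L for the fitted model
k = 4                     # hypothetical number of covariates
n = 500                   # hypothetical sample size

aic = -2 * log_likelihood + 2 * k
bic = -2 * log_likelihood + k * np.log(n)
print(aic, bic)           # smaller values indicate better fit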

what do you do to solve the problem of complete separation?

1. Delete the problem variable(s) from the model: this will give you a model that converges, but it's not good because you're taking out the variable with the biggest effect, i.e. this variable is a strong predictor of the dep var. You should also report a likelihood ratio chi2 for this variable by itself
2. Use "exact methods" with the exlogistic command: this is a very good method but it's very computationally intensive, so it's only feasible for small sample sizes with a single predictor variable. Exact logistic regression is a generalization of Fisher's exact test for independence in contingency tables. Basically, it enumerates all possible values of the dep var given a set of indep vars and calculates p-values based on the enumeration of outcomes (i.e. all sample space outcomes)
3. Use penalized maximum likelihood: described in a future flashcard

what are some of the various measures of predictive power proposed for logistic regression models?

1. McFadden's method
2. Cox and Snell's method
3. Nagelkerke R-squared
4. Tjur R-squared
5. Classification Tables
6. ROC curves

what are the 3 different things that the term "multiple regression" is used to describe?

1. Model: statistical model for how the data were generated
2. Estimation method: approximate guesses for the unknown parameters in a model
3. Computing algorithm: for any given estimation method, there are often several competing ways of doing the computations, all of which will give the same numerical answers

what are the implicit assumptions of the logit model?

1. No omitted explanatory variables (especially those correlated with the included variables...note: in linear regression, we only need to worry about omitted variables that are correlated with the independent variables...but in logistic regression, any omitted variables are a problem!)...this is b/c more unobserved heterogeneity leads to attenuation of the regression coefficients toward 0
2. No measurement error in the explanatory variables
3. Observations are independent (i.e. one person having the event does not raise the probability that another person has the event...we also make this assumption for linear regression)
4. The dependent variable doesn't affect the independent variables (i.e. no reverse causality)

what are 3 kinds of tests that can be used to test the same null hypotheses?

1. Wald tests: based on the coefficients and the estimated covariance matrix (the testparm and test commands are both Wald tests)
2. Score tests: based on the first and second derivatives of the log-likelihood function
3. Likelihood ratio tests: twice the positive difference b/w the log-likelihoods under the null hypothesis [restricted model] and under a less restrictive model [full model] (i.e. take the difference between the log likelihoods of the 2 models and multiply it by 2)
-Note: for the LR test, the two models must be nested

how can you get LR test for a categorical variable (note: testparm command is a Wald test)

1. estimate a model with the variable as a factor variable
2. store the estimates from the model
3. estimate a model without the variable of interest
4. store the estimates from that model
5. lrtest model1 model2

null hypothesis: the simpler model is fine
-if the p-value is less than 0.05, reject the null and conclude that adding the categorical variable results in a statistically significant improvement, so use the full model (the model w/ the categorical variable)
-if the p-value is 0.05 or greater, conclude that the categorical variable can be left out of the model

what are the 5 assumptions of the linear regression model?

1. linearity
2. mean independence
3. homoskedasticity
4. no autocorrelation
5. normal distribution of the error term

what 3 s-shaped curves are most widely used in practice?

1. logit - logistic curve
2. probit - cumulative normal distribution
3. complementary log-log (asymmetrical)

logit and probit curves are more symmetrical, while complementary log-log has a very long tail

what are the different assumptions we could make about the error term in the equation
yi* = alpha0 + alpha1(xi) + sigma(ei)?
-what model would we have based on each assumption?

1. suppose the probability distribution of the error term is standard normal
-we would then have a probit model
-note: if beta1=0, then alpha1=0
-note: beta1 does not depend on theta

2. suppose the error term has a standard logistic distribution
-we would have a logit model

3. suppose the error term has a standard extreme-value distribution (also known as the Gumbel or double-exponential distribution)
-this is an asymmetric distribution that is skewed to the left
-so we would have a complementary log-log model

if estimate logit model of acceptance into college and find that coefficient on gpa is smaller for women compared to men, what are 2 interpretations of this?

1. there is a stronger effect of gpa on college acceptance for men than for women
2. women may have a higher sigma than men (i.e. a higher amount of residual variation among women...i.e. a lot more things affect college acceptance for women than for men), leading the coefficient on gpa for women to be lower than that for men

what does it mean to say 2 models are nested?

2 models are nested if the simpler model can be obtained by imposing restrictions on the parameters of the more complex model (e.g. setting parameters equal to 0 [by excluding variables from the model], or equal to each other [by treating a categorical variable as continuous])

how do you interpret odds ratios?

2 options:
1. talk about odds ratios as multiplying the odds by a number: e.g. if talking about the relationship b/w age and whether or not someone is in the labor force, and the coeff. on age is .9069, we would say "a one unit increase in age multiplies the odds of being in the labor force by .9069"
2. talk about odds ratios as a percent change in the odds: e.g. for the same example, we would first subtract .9069 from 1 to get 0.093 and then multiply this by 100 to get 9.3%. We would say "a one unit increase in age is associated with a 9.3% decrease in the odds of being in the labor force"
-note: if the odds ratio is greater than 1, subtract 1 from the odds ratio (rather than subtracting the odds ratio from 1)
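
The percent-change arithmetic from option 2, as a short Python check (using the .9069 odds ratio quoted above):

odds_ratio = 0.9069
pct_change = (odds_ratio - 1) * 100    # negative = decrease, positive = increase
print(f"{pct_change:.1f}%")            # about -9.3%, i.e. a 9.3% decrease in the odds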

what are Raftery's guidelines for comparing BIC for 2 models?

Absolute diff. in BIC    Evidence that model is better
0-2                      Weak
2-6                      Positive
6-10                     Strong
>10                      Very strong

which of the assumptions are part of the Gauss-Markov Theorem?

Assumptions 1-4
-linearity, mean independence, homoskedasticity, and no autocorrelation
-the same assumptions that make b BLUE

what is something to keep in mind when running a logit model in stata and requesting odds ratios?

Confidence intervals are based on a Normal distribution. Because odds ratios are bounded below by 0, the sampling distribution of the odds ratio is not really Normal
-so it's better to get CIs on the regular log-odds scale rather than on the odds-ratio scale

what is Cox and Snell's method?

G^2 = -2(L0-L1)
-this is the likelihood ratio chi2 for the null hypothesis that all coefficients are zero

then define R^2(C&S) = 1-exp(-G^2/n)

Pros & Cons
-for an OLS linear model, this formula reduces to the usual R^2 (so this seems like a good reason to use it)
-unlike McFadden R^2, the Cox-Snell R^2 has an upper bound that is less than 1 (so no matter how good your predictions are, the Cox-Snell R^2 can't go above an upper bound of less than 1...which seems undesirable)

what has been proposed as a way to perform overall test of goodness of fit for model with individual-level data?

Hosmer-Lemeshow Test
-after fitting the model, sort the cases by their predicted probabilities and divide them into approx. 10 groups of roughly equal size
-within each group, calculate the observed and expected numbers of events (and non-events). The expected # is just the sum of the predicted probabilities
-summarize the discrepancies using Pearson's chi2 statistic, with df = number of groups minus 2

does professor prefer probit or logit? why?

Logit
1. intimately related to log-linear models for contingency tables (for every logit model for a contingency table, there is an equivalent log-linear model)
2. can easily be generalized to unordered multiple categories
3. the coefficients are more readily interpretable (there are no corresponding odds ratios that we could use in probit)
4. it has desirable sampling properties (ex. can over-sample some categories and under-sample others without negative consequences)

what iterative numerical method do we use to maximize?

Newton-Raphson algorithm
-uses the gradient (the vector of first derivatives of the log-likelihood) and the Hessian matrix (the matrix of second derivatives)
-the Newton-Raphson algorithm sets the new beta value equal to the old beta value minus the inverse of the Hessian matrix times the gradient
-basically, what you do is pick a starting value for beta (b=0 is fine). Then, plug that starting value into the right-hand side of the Newton-Raphson update and calculate the next beta (left-hand side). Then, plug the new beta into the right-hand side and calculate the next beta, etc.
-keep doing this until the left side equals the right side (convergence). At that point, we get an estimate of the covariance matrix of the coefficients
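
A minimal Python sketch of this update for a logit model, on toy data (for the logit log-likelihood the gradient is X'(y - pi) and the Hessian is -X'WX with W = diag(pi(1 - pi))):

import numpy as np

def newton_raphson_logit(X, y, max_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson.
    X: n x p design matrix (include a column of ones for the intercept).
    y: length-n vector of 0/1 outcomes."""
    beta = np.zeros(X.shape[1])           # starting values (b = 0 is fine)
    for _ in range(max_iter):
        pi = 1 / (1 + np.exp(-X @ beta))  # predicted probabilities
        gradient = X.T @ (y - pi)         # vector of first derivatives
        hessian = -(X.T * (pi * (1 - pi))) @ X   # matrix of second derivatives
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                # beta_new = beta_old - H^{-1} * gradient
        if np.max(np.abs(step)) < tol:    # convergence: the update is negligible
            break
    # At convergence, covariance = inverse of the negative Hessian (the OIM).
    pi = 1 / (1 + np.exp(-X @ beta))
    cov = np.linalg.inv((X.T * (pi * (1 - pi))) @ X)
    return beta, cov

# Usage sketch with made-up data:
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.3 + 0.8 * x)))).astype(int)
beta_hat, cov_hat = newton_raphson_logit(X, y)
se = np.sqrt(np.diag(cov_hat))            # OIM standard errors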

if π is the probability of an event, then the odds is equal to what?

Odds = π/(1-π)

what is the ROC curve approach?

ROC = Receiver Operating Characteristic
(examines the sensitivity and specificity over the whole range of cut-offs)
-note: there is an inverse relationship b/w specificity and sensitivity
-as the cut-off increases, sensitivity goes down and specificity goes up
-ROC is a graph of sensitivity vs 1-specificity, as both depend on r (where r = cut-off)
-use the "lroc" command after running the logistic model to get a graph of the ROC curve
-the area under the curve is known as the "C statistic" and is a summary measure of the predictive performance of the model
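
A short Python sketch using scikit-learn on made-up outcomes and predicted probabilities (roc_curve traces sensitivity and 1-specificity across cut-offs; roc_auc_score is the C statistic):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_obs = np.array([1, 0, 1, 0, 0, 1, 1, 0])                    # hypothetical outcomes
p_hat = np.array([0.9, 0.2, 0.7, 0.4, 0.35, 0.6, 0.55, 0.5])  # hypothetical predictions

fpr, tpr, cutoffs = roc_curve(y_obs, p_hat)   # tpr = sensitivity, fpr = 1 - specificity
c_statistic = roc_auc_score(y_obs, p_hat)     # area under the ROC curve
print(c_statistic)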

if S is the expected number of individuals who experience an event and F is the expected number who do not experience the event, what is the odds of experiencing the event?

S/F = (expected # of successes)/(expected # of failures)

what is the Tjur R^2?

Take the difference between the mean predicted value when y=1 and the mean predicted value when y=0

R^2(tjur) = mean(π1hat) - mean(π0hat)
(mean predicted value of π for individuals that have the event) minus (mean predicted value of π for individuals that don't have the event)
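
A one-line Python version, on made-up predicted probabilities and outcomes:

import numpy as np

p_hat = np.array([0.81, 0.35, 0.62, 0.10, 0.55, 0.44])   # hypothetical predictions
y_obs = np.array([1,    0,    1,    0,    0,    1   ])   # hypothetical outcomes

tjur_r2 = p_hat[y_obs == 1].mean() - p_hat[y_obs == 0].mean()
print(round(tjur_r2, 3))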

why do we need these 5 assumptions?

The assumptions are plausible approximations in many cases, and they justify the OLS estimator
-in other words, the 5 assumptions imply that the OLS estimator has certain optimal properties
-if the 5 assumptions are true, OLS will be as good as or better than any other estimation method

what was the original approach to looking at relationships between variables? what were the problems with this approach?

Treat all variables as categorical, usually dichotomized. Create contingency tables. Would use chi-square tests of independence.
Problems:
-loses information (b/c dichotomizes variables that may be continuous, ignoring variation)
-complicated for more than one test variable (if you try to control for too many variables, you get very small #s)
-statistical theory not well developed (e.g. no chi-square test for the hypothesis "the relationship b/w income and party preference is the same for high school and college grads")

what can we look at to check for multicollinearity?

Variance Inflation Factors (VIF)
-can use the command estat vif
-Tolerance = 1/VIF
-if VIF > 2.5, this is reason for concern about multicollinearity

what is a link function?

a function that converts the probability, which is bounded by 0 and 1, to something that isn't bounded, to match the other side of the equation

g(πi) = bxi

where g is a function

how would you interpret the following coefficient from a logit regression in stata:
1.25

a one unit increase in x is associated with a 1.25 increase in the log odds of y

what are models and what are examples of models?

a set of probabilistic assumptions (usually expressed as a set of equations and using probabilistic notation) about how the data were generated
-often a set of equations expressing what we believe or are willing to assume (usually deliberately oversimplified)
-for example, for a dichotomous dependent variable, there are several alternative models: the linear probability model, the logit model, the probit model, the log-linear model
-models usually contain unknown parameters that need to be estimated

what are estimation methods and what are examples of estimation methods?

a way of using sample data to get an approximation for the unknown parameters
-for a linear regression model, we could use ordinary least squares, weighted least squares, or maximum likelihood
-each method has advantages and disadvantages
-a given estimation method, like weighted least squares, can be used for many different kinds of models

what is Nagelkerke R^2?

adjusts for the fact that the Cox-Snell formula has an upper bound that's less than 1

R^2(max) = 1-exp(2L0/n)
so
R^2(nag) = R^2(c&s)/R^2(max)

Pros/Cons
-has an upper bound of exactly 1
-Prof. thinks it tends to over-correct the C-S R^2: the adjusted R^2 is always greater than the unadjusted R^2, which is the opposite of the adjusted R^2 for linear models (which adjusts for the number of covariates)
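
Both the Cox-Snell and Nagelkerke calculations in Python, with hypothetical log-likelihoods L0 and L1 and sample size n:

import numpy as np

L0, L1, n = -340.0, -295.0, 500           # hypothetical values

G2 = -2 * (L0 - L1)                       # likelihood-ratio chi2
r2_cox_snell = 1 - np.exp(-G2 / n)
r2_max = 1 - np.exp(2 * L0 / n)           # upper bound of the Cox-Snell R^2
r2_nagelkerke = r2_cox_snell / r2_max
print(r2_cox_snell, r2_max, r2_nagelkerke)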

what is Paul's preferred way of writing the model?

as a model for the probability itself:
πi = 1/(1+e^-bxi)
-now, no matter what values we plug in, π will vary between 0 and 1
-this is the formula we use to generate predicted probabilities
-if we graph this, we get the s-shaped curve seen before
-the model says that the closer you get to 0 or 1, the smaller the change in π for a given change in x
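
A small Python sketch of generating predicted probabilities from this formula, with made-up coefficients b0 and b1:

import numpy as np

b0, b1 = -1.0, 0.5                        # hypothetical intercept and slope

def predicted_prob(x):
    # pi = 1 / (1 + exp(-(b0 + b1*x))) -- always between 0 and 1
    return 1 / (1 + np.exp(-(b0 + b1 * x)))

for x in (-6, -2, 0, 2, 6):
    print(x, round(predicted_prob(x), 3))  # changes in pi flatten out near 0 and 1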

what does it mean to say ML estimators are consistent?

as the sample gets larger, the estimators converge in probability to the true values (i.e. the larger the sample, the smaller the probability that the estimators diverge greatly from the true values)
-implies that ML estimators are approximately unbiased

what can you do to force stata to estimate logit model instead? what can and can't you use from output?

asis option
-it will give a valid likelihood ratio chi2 test with 1 df (valid in the sense that it does have an approximately chi2 distribution and is a valid test of the null hypothesis, but as in any case, you might not be able to trust it if the sample is small, b/c likelihood ratio tests only approximate chi2 distributions well with a large sample size)
-note: some software will report a huge SE for the coefficient...I don't think you can trust the coefficient on the variable w/ complete separation or the SE on that coefficient

which of the 5 assumptions of the linear regression model is the least important and why?

assumption 5 (error term is Normally distributed)
-if the sample is reasonably large, the Central Limit Theorem will ensure that the distribution of b is approximately Normal regardless (so we can go ahead and use the Normal distribution for significance tests, etc.)

why can't we just use the odds as our transformation of π?

b/c the odds varies between 0 and +infinity, but we need a transformation of π that varies between -infinity and +infinity

why do we say that exponentiating the coefficient is the adjusted odds ratio, rather than just the odds ratio?

because it's the odds ratio, controlling for other variables in the model

what does it mean to say assumptions 3 and 4 imply that OLS b is BLUE?

best linear unbiased estimator
-i.e. among all linear unbiased estimators, b has minimum sampling variance
-ensures that the usual standard error estimates are approximately unbiased, i.e. consistent

what is a more complicated way to standardize coefficients from logit, probit or comploglog?

by multiplying b times the standard deviation of x (same as in the simple approach) and then dividing by the estimated standard deviation of y* (the latent variable)
-this leads the standardized coefficients to be lower than the ones produced from the simpler approach...but there's not much of a difference if you're trying to compare the relative magnitudes of the x's, b/c it's multiplying by the standard deviation of x that makes the coefficients comparable across diff. variables

so, overall, what conclusions do we reach about using LPM?

can do OLS where we treat the dep var as a dummy variable
-would give unbiased estimates of the coefficients
-but can't trust the standard errors or p-values

what is maximum likelihood estimation of logit model? what are its properties?

choose parameter estimates which, if true, would make the observed data as likely as possible. ML is typically an iterative method (keep updating the estimates until they converge; at each step along the way, it calculates the log-likelihood, which is the criterion it's trying to maximize)

1. consistent
2. asymptotically efficient
3. asymptotically normal
-ML is useful b/c we can often get ML estimators when other approaches are unfeasible or nonexistent

if the 2x2 table looks like this, what kind of separation is it?

        y=1   y=0
x=1      5     0
x=0      0    10

complete separation
-in which case, there will be zeros in both the numerator and the denominator of the cross-product ratio

what is the first step in maximum likelihood estimation?

construct the 'likelihood'
-this is the probability of observing the data we observed, expressed as a function of the unknown parameters
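
In symbols, for independent binary observations the likelihood is the product of each observation's probability (the standard Bernoulli form):

\mathcal{L}(\beta) = \prod_{i=1}^{n} \pi_i^{\,y_i} (1 - \pi_i)^{1 - y_i}

Each factor reduces to πi when yi = 1 and to 1-πi when yi = 0, which is the 'switch' behavior described in a later card.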

what command can you use in stata to get hazard ratios for complementary log-log model? if get hazard ratio of 1.147, how would you interpret that?

eform
-a one unit increase in x is associated with a 15% increase in the hazard of y

how do you get BIC and AIC in stata?

estimate the model, then:
estat ic

what is one easy way to get an "adjusted odds ratio" from the log-odds coefficient?

exponentiate the coefficient (but ONLY if it's a dummy variable)
-can either do this by hand (ex. e^0.024 = 1.024), or
-by requesting odds ratios in Stata, or
-by running the logistic command in Stata (as opposed to logit)

if it's not a dummy variable (i.e. if it's on a quantitative scale), use the following transformation:
100([e^beta] - 1)
-this gives the percent change in the odds for a one-unit increase in x

when b is small (i.e. |b| < 0.10), e^b - 1 is approximately equal to b, meaning you can just move the decimal over 2 places to get the percent change in the odds

what if we don't want a logistic regression model and instead want an odds model?

exponentiating both sides gives us an odds model
-(πi/(1-πi)) = exp(bxi)
-in this case, the odds is an exponential function of the linear function of the x's

what is quasi-complete separation?

far more common than complete separation
-occurs when there exists some linear combination of the predictors that separates the two outcomes, except for at least one case in each outcome for which the combination equals 0

e.g. when x < 4, y = 0
     when x > 4, y = 1
     when x = 4, y = 0 OR y = 1

how can you get LR statistics for individual variables (since Wald statistics are reported for the independent variables in stata output)?

fit a set of models, each excluding one of the predictor variables

what is the assumption of no autocorrelation?

for any 2 random errors for any 2 individuals, the covariance (i.e. correlation) is 0
-no correlation between the error terms of individuals in the sample
-example of violation: ask a married couple something...there may be a variable not included in the model that affects their answers and that is the same for both of them

so what is the general rule as to when the ML estimate for the regression coefficient doesn't exist?

for any dichotomous indep var in a logistic regression, if there is a zero in the 2x2 table formed by that variable and the dependent variable, the ML estimate for the regression coefficient does not exist
-this also generalizes to categorical variables with more than 2 categories

what is a computing algorithm and what are examples of computing algorithms?

for any given estimation method, there are often several competing ways of doing the computations, all of which will give the same numerical answers
-for maximum likelihood estimation, for example, we could use iterative proportional fitting, Newton-Raphson, or the EM algorithm

what does it mean to measure predictive power?

how well you can predict the y variables based on the x variables

how does the likelihood formula work?

if the observed y is 1, the factor reduces to πi. If the observed y is 0, it reduces to 1-πi
-works like a switch

what does it mean to say ML estimators are asymptotically efficient?

in large samples, ML estimators have (approximately) minimum sampling variation, as compared to all possible consistent estimators

under what conditions would the algorithm in maximum likelihood estimation NOT converge?

in some cases, the ML estimates do not exist and the maximization algorithm may not converge. This is especially likely with:
1. small samples
2. an extreme split on the dependent variable
3. complex models with lots of interactions

so how do we know if the model fits the data?

in what ways might it not fit the data?
-check for interactions and non-linearities directly using Wald or LR tests

also, it turns out that the Pearson chi2 doesn't have a chi2 distribution with individual-level data, but it does have an approximate normal distribution
-can use the pearsonx2 command

however, overall goodness-of-fit tests tend to have low power, so it might be best to do specific tests for non-linearities and interactions

what happens if you estimate a logit model in presence of quasi-complete separation?

it will run (not the case for complete separation...stata won't even run that unless you use the asis command) but it's useless!
-it deletes the cases that are perfectly predicted
-it drops the predictor variable that has quasi-complete separation from the model
-the likelihood ratio chi2 isn't even useful!

what does the likelihood ratio chi2 test reported in stata output report?

it's a test of a model with no predictors (null) versus a model with all the predictors specified
-null hypothesis: no predictors are needed (all coefficients equal 0)
-we reject the null and conclude that at least one of the coefficients on the indep. vars is not equal to 0 if the p-value is below 0.05

what command in stata can you use to standardize, and what column should you look at to rank order coefficients in terms of size?

listcoef, std help

bStdXY = fully standardized coefficients
-note: the rank order of the standardized coefficients tends to be the same as the rank order of the z-statistics, but z-stats are heavily dependent on sample size

if you don't include the std option, you will get odds ratios standardized by x
-i.e. it reports the coefficient multiplied by the standard deviation of x, then exponentiated
-e.g. a 1 standard deviation increase in x multiplies the odds of y by 0.47

if you do listcoef, percent help, you get percent changes in the odds for a one-standard-deviation increase in x
-e.g. a 1 standard deviation increase in x is associated with a 53.2% decrease in the odds of y

what is one reason people often prefer logit models over probit models?

the log odds is called the "natural parameter" of the binomial distribution (its canonical parameter when the distribution is written in exponential-family form)

what transformation do we use to get π to vary between -infinity and +infinity?

log of the odds

what are some rules of logs?

log(y) = x is equivalent to y = e^x
log(1) = 0
log(e) = 1
log(e^x) = x
if 0 < x < 1, then log(x) < 0
log(xy) = log(x) + log(y)
log(x^y) = y*log(x)
log(x/y) = log(x) - log(y)
for all y, e^y > 0
e^(x+y) = e^x * e^y
e^log(x) = x
(e^x)^y = e^(xy)

what is the function for complementary log-log?

log[-log(1-πi)] = bxi

-i.e. take the log of (1 minus the probability of the event), negate it, and then take the log of that value
-this is equal to a linear function of the set of x variables

another way to write the function:

πi = 1-exp{-exp(bxi)}
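
A short Python check that the two ways of writing the function agree (made-up values of the linear predictor bx):

import numpy as np

bx = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # hypothetical linear predictor values

# Complementary log-log model: log(-log(1 - pi)) = b*x, so pi = 1 - exp(-exp(b*x))
pi = 1 - np.exp(-np.exp(bx))

recovered_bx = np.log(-np.log(1 - pi))       # apply the link in the other direction
print(np.allclose(recovered_bx, bx))         # True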

what does it mean to say assumption 5 implies that OLS b is the maximum likelihood estimator?

means that the OLS estimator b coincides with the maximum likelihood estimator: least squares and maximum likelihood give the same estimates

what does it mean to say assumption 5 implies that OLS b is MVUE?

minimum variance unbiased estimator
-stronger than BLUE
-among all unbiased estimators (both linear and nonlinear), b has minimum sampling variance

why does anyone use probit?

more mathematically convenient for models involving latent variables
-i.e. when you believe some continuous latent variable is affecting a set of dichotomous variables, it's more convenient to use probit
e.g. want to measure depression. conceptualize it as a continuous variable. ask 20 Qs that are yes or no. want to do a factor analysis with an unobserved, latent variable 'depression' and 20 dichotomous indicators. Would assume each indicator is a probit function of the latent variable

how do we standardize coefficients in linear model? what's the problem with standardizing coefficients in logistic regression?

multiply b times (standard dev. of x / standard dev. of y)
-in the case of logit models, there's no agreement on what the denominator should be

what is one simple way to standardize coefficients from logit, probit, or comp-loglog?

multiply b times the standard deviation of x, divided by the standard deviation of the error term in the latent variable model
-the standard deviation of the error term in the latent variable model is a fixed constant
-in the probit model, it's 1
-in the logit model, it's about 1.81 (π/√3)
-in the comp-log-log model, it's about 1.28 (π/√6)
-so the standardized coefficients will be approximately the same for all 3 models
-note: these standardized coefficients will tend to be higher than if you used a linear model

what are the problems with/limitations of Hosmer-Lemeshow Test?

no theory to justify it, only simulation evidence. This wouldn't be a fatal flaw, but the statistic also frequently behaves in strange ways:
-it can be very sensitive to the number of groups
-adding non-significant terms to the equation may greatly improve H-L fit (e.g. adding an interaction)
-adding statistically significant terms to the equation may worsen H-L fit

what is complete separation?

occurs when there is some linear combination of the predictor variables such that whenever the combination is above 0 (or some specific #) then y=1, and whenever the combination is below 0, then y=0, i.e. perfect prediction
e.g. when x =< 3, y = 0
     when x > 3, y = 1

what is an advantage of odds over probabilities in terms of comparing people?

odds are a more natural scale for multiplicative comparisons
-ex. if someone has a prob. of 0.6 of voting, it would be absurd to say someone else's probability of voting was twice as great (b/c 0.6x2 = 1.2, and you can't have a prob. greater than 1). No problem on the odds scale, though

what values can odds take on?

odds varies between 0 and +infinity as π varies between 0 and 1

