Stats


Key concepts of data analysis

Descriptive statistics - describing observed data; Inferential statistics - drawing conclusions about the relationship between two (bivariate) or more variables (multivariate regressions)

Describe the different types of skewness

Positive skew (peak on the left, long right tail) - mode < median < mean; Symmetrical distribution - mean = median = mode; Negative skew (peak on the right, long left tail) - mean < median < mode

What are applications of hypothesis testing?

t-test - compares means; χ²-test (chi-squared) - compares frequency distributions; F-test/ANOVA - equality of variances

How can the statistical significance of the estimated variables be tested? How can the t-value be calculated?

T-test - Does the independent variable have a statistically significant effect on the dependent variable, or is the effect zero? - H0: ß1=0 (no effect); H1: ß1≠0 (effect) - t-value = ß1/SE, with SE = standard error of the estimated coefficient (a measure of how precisely ß1 is estimated) - The higher the absolute t-value, the stronger the evidence that the effect differs from zero (reject H0) - Significance is judged by comparing the t-value with critical values

How can a time dummy be implemented in a regression and how can the intercepts be interpreted?

Selection of a base year (2000: ß0) -> y=ß0+ß1*x+ß2*x²+...+ß3*y2005+ß4*y2010+u Intercepts are: 2000: ß0; 2005: ß0+ß3; 2010: ß0+ß4 Interpretation of the intercepts: in 2005: ß3% fewer smokers, in 2010: ß4% fewer smokers (always in comparison to the base year 2000)

Explain Serial correlation/ Autocorrelation of error terms

Serial- /Autocorrelation - Error terms are positively correlated over time - Example: education spendings in t-1 are related to education spendings in t - One reason for autocorrelation: omitted variables correlate across periods - Example: Sales of a product increase over time -> Population growth

How can correlation be measured in descriptive statistics? And how is it scaled?

- Covariance = linear relationship between two variables - Correlation = standardized covariance = covariance divided by the product of the standard deviations, scaled between -1 and 1 - r = -1: perfect negative correlation - r = 0: no (linear) correlation - r = 1: perfect positive correlation
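
The definition above can be sketched in a few lines of Python. This is only an illustration using the standard library; the two small data series are hypothetical and chosen so that the relationship is perfectly linear.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Correlation = covariance divided by the product of the standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # the variance is the covariance of x with itself
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: y rises (or falls) exactly in step with x
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

print(correlation(x, y_up))    # close to +1
print(correlation(x, y_down))  # close to -1
```

Unlike the covariance, the result does not depend on the units of x and y, which is why it is the scaled measure.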

Describe the main character of a Logit Model

- For binary dependent variables (alternative: Probit) - Uses the logistic function (Probit: standard normal cumulative distribution) and gives the response probability Prob(yi=1/Xi) - Estimation with maximum likelihood - OLS assumptions are violated (homoscedasticity) - Sign of coefficients indicates direction of relationship (i.e. probability to observe y=1 increases/decreases) - Interpretation not intuitive (log-odds) - Significance of coefficients with Wald test - (Pseudo) R² is calculated differently (e.g. Nagelkerke's R²) - Residuals behave differently than in OLS - Multicollinearity is checked similarly (VIF)

What's the difference between R² and adjusted R²?

- In multiple regression models, adding a variable will never decrease R² (not good for model comparisons) - Adjusted R² takes the number of variables in a model into consideration - Adding a variable may decrease adjusted R² (it penalizes you for adding variables that do not improve the fit)

What indicates the correlation?

- Indicates relationships between variables - Indicates the strength of the relationships

What are the characteristics of pooled cross-section data?

- Individual observations across time (each time period, different individuals) - No concerns with OLS assumptions (as with cross-sectional data, we would need to check for heteroscedasticity of error terms etc.) - Only problem: population may have different distributions in different time periods (solved by year dummy)

Describe the maximum likelihood method

- Likelihood of observing the data we measure is maximized - so that the estimated parameter best fits to the observed data - Iterative procedure

What additional assumption to the Gauss-Markov-Theorem must be made?

- Normality assumption: the unobserved error is normally distributed in the population; necessary to perform tests based on t statistics and F statistics (make statistical inference) - No serial correlation/autocorrelation: error terms are not correlated over time

What are characteristics of time series data? How can time be modeled?

- Only one individual - Regular intervals with a temporal ordering - the past can affect the future - but the future cannot affect the past (natural order) - N= the number of observations in the time periods Two ways to model time 1. The static model for immediate effects 2. Finite distributed lag model for lagged effects

What are the components of a Boxplot?

- Outlier - Maximum - 3rd Quartile (75th percentile) - Median - 1st Quartile (25th percentile) - Minimum

What are the problems of omitted variable bias?

- Overestimate the strength of an effect - Underestimate the strength of an effect - Change the sign of the effect - Mask an effect that actually exists This occurs if - the omitted variable correlates with the dependent variable - the omitted variable correlates with at least one independent variable in the model

How can the Goodness of fit be measured and how is it scaled? What does the result 0,67 express?

- R²=SSR (explained sum of squares)/ SST (total sum of squares) - R²= explained variation/ Overall variation - Scaled between 0 and 1 - 67% of the variation of the dependent variable is explained by the independent variables

What are essential things to interpret within regression outputs?

- Sign/ size of coefficient - Statistical significance (t-values), p-values) - Goodness of fit (R2/R2adjusted) - Overall model significance (F- Test) - Residuals (plotting residuals: close to zero is good)

- After the unusually warm weather in the last few decades, many people suspect that this has affected ecosystems in Germany causing them to release carbon to the atmosphere. The long-term mean of net ecosystem carbon exchange (NEE) is given as 20. - We have measurements of NEE from a number of localities (in units of grams of carbon per square metre per year). The data are shown below. A negative value represents a net uptake of carbon by the ecosystem, a positive value net release of carbon. 15, -10, 62, 53, -3, 8, 19, 16 Suggest an appropriate hypothesis test

- One-sample t-test (comparing the sample mean with the long-term mean of 20) - select significance level and perform test
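
A minimal sketch of the t-statistic for this exercise, using only the Python standard library. It assumes a one-sample t-test against the long-term mean of 20; the critical value would still have to be looked up in a t-table for 7 degrees of freedom.

```python
import math
import statistics

nee = [15, -10, 62, 53, -3, 8, 19, 16]  # NEE measurements from the exercise
mu0 = 20                                # long-term mean under H0

n = len(nee)
sample_mean = statistics.mean(nee)
sample_sd = statistics.stdev(nee)          # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)  # standard error of the mean
t_value = (sample_mean - mu0) / standard_error
degrees_of_freedom = n - 1

print(sample_mean, t_value, degrees_of_freedom)
```

The eight values sum to 160, so the sample mean is exactly 20 and the t-value is 0: this sample gives no evidence against H0 at any significance level.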

When to use the Poisson distribution? What is the Poisson distribution? What are the assumptions, how are the coefficients estimated, and how can they be interpreted?

- Used for nonnegative integer values (0,1,2,3,...) (not normally distributed) - Gives the probability of a certain number of occurrences given the predictors: Pr(y/x)=exp(-z)*z^y/y!, where z=exp(ß0+ß1*xi1+ß2*xi2+...) is the expected count, i.e. the exponential of the linear predictor - Assumption of the Poisson distribution: variance is equal to the mean, Var(y/x)=E(y/x) - (Quasi) maximum likelihood estimation

What are the p-values and how can they be interpreted?

- The p-value is the probability of observing sample results at least as extreme as the ones obtained if the null hypothesis is true - p-values are always linked to a null hypothesis - High p-values: your sample results are consistent with a true null hypothesis - Low p-values: your sample results are unlikely under a true null hypothesis (evidence against H0)

What does the F-test do? and how can it be interpreted?

- tests whether my linear regression model provides a better fit to the data than a model that contains no independent variables (intercept only model) - H0: Intercept only model fits the data as well as my model - H1: my model fits the data better - Decision with p-value: p-value < significance level: reject H0

Give the type of curves that are calculated by the following quadratic terms: 1. -x+x² 2. x-x²

1. Convex curves 2. Concave curves

Key concepts in descriptive data analysis

1. Histogram 2. Minimum and Maximum 3. The range 4. Percentile 5. The (arithmetic) mean 6. The mode 7. The median 8. Boxplot 9. mean deviation 10. Variance 11. Standard deviation 12. Skewness 13. Kurtosis 14. Correlation

What's the 1. rule of the Gauss-Markov-Theorem? How can the problem be solved?

1. Linear in parameters: linear relationship between the dependent and (several) independent variables: y=ß0+ß1x1+ß2x2+ß3x3+...+u If this does not hold: - Put it into a linear framework: logarithmic transformation, e.g. Cobb-Douglas production function (Y=K^a*L^b -> log(Y)=a*log(K)+b*log(L)) - Use a different estimation method

Explain the 1 recipe for a regression analysis

1. Start with assumptions/ theory about relationships between variables Which factor is influenced by which variable? (H0;H1)

Mention the different recipes for a regression analysis

1. Start with assumptions/ theory about relationships between variables 2. Obtain data and operationalize variables 3. Create regression equation and decide on estimation method 4. Perform regression and interpretation 5. Model diagnostics 6. Adapt model and start over with 4

What are the steps of hypothesis testing?

1. State hypothesis - Null hypothesis H0: default position, no relationship/difference (we typically want to prove it wrong) - Alternative hypothesis HA: assuming a relationship/difference 2. Select test - depends on data and question 3. Choose significance level - a probability threshold below which the null hypothesis will be rejected 4. Perform test, compare with critical values, reject/accept null hypothesis - compare p-value with significance level - accept (fail to reject) H0 if p-value > significance level - reject H0 if the p-value is smaller than the significance level 5. Interpretation Potential errors - Type I error (rejecting a true H0): minimized by decreasing alpha (significance level) - Type II error (not rejecting a false H0, probability ß): minimized by increasing the difference assumed under HA, increasing sample size, increasing alpha

What role play zeros in modeling counting data? What models are best for the different cases?

Two cases: 1. Zero inflation: lots of zeros 2. Zero truncation: no zeros Models: 1. Zero inflation: typically consider two processes that generate zeros: certain zeros and true-count-process zeros - E.g. number of fish caught: certain zeros = those who never went fishing, true-count-process zeros = those who went fishing but did not catch a fish - You can use zero-inflated Poisson or negative binomial regressions that use two steps 2. Zero truncation: still use basic models (Poisson, negative binomial) but account for zero truncation

How many different types of point patterns do you know?

1. regular pattern 2. random pattern 3. clustered pattern

Explain the 2 recipe for a regression analysis

2. Obtain data and operationalize variables Is your data a random sample of the population? Data preparation - Identify gross errors/ outliers (Start with graphical investigation) Preliminary model investigation - Identify functional form/ important interactions (Scatterplots) Operationalize variables - Friendliness can be measured by frequency of smiles - Innovativeness can be measured by number of patents

What's the 2. rule of the Gauss-Markov-Theorem? And tell the rule of thumb to avoid a violation of this assumption Can the sample selection based on dependent variable?

2. Random sample of size n - the larger the sample size, the less you have to worry... - ... and the more independent variables you include, the larger your sample size should be - Rule of thumb: at least 10 observations per variable that you include (e.g. including tip, smiling, gender would need at least 30 observations) No, sample selection based on the dependent variable is always problematic. But there are non-random samples that cause no problems... - Sample selection based on the independent variable (exogenous sample selection) - Includes stratified samples if stratification is by an independent variable

What's the 3. rule of the Gauss-Markov-Theorem?

3. The x−values of the sample vary Var(x) > 0 (no perfect collinearity) Multicollinearity: Correlation among the independent variables in a multiple regression model - Your independent variables should (ideally) be independent of each other - If one independent variable varies this should have no effect on other independent variables - Multicollinearity can be tested for with variance inflation factor (VIF) - Exclude perfectly collinear variables (e.g. same variable in different unit)

Explain the 3 recipe for a regression analysis. What needs to be balanced?

3. Create regression equation and decide on estimation method Common problems - Including irrelevant variables: overspecification - Omitted variable bias: underspecification - Misspecification Gold standard: theory-driven variable selection; data-driven alternative: stepwise regression - Forward selection: start with no variables, add variables and only keep significant ones... - Backward selection: estimate the most complex model with all available covariates, then remove the insignificant ones Trade-off between - Including as many predicting variables as possible - And keeping it simple (have many degrees of freedom)

What's the 4. rule of the Gauss-Markov-Theorem?

4. Error terms have an expected value of zero: E(u/x)=0 Violated if endogeneity is present (an independent variable is correlated with the error term), which is caused by a) Functional form misspecification: functional form can be tested with the RESET test (regression specification error test) b) Omitted variable (underspecification, missing variable bias) - Ideally: try to include the variable or a proxy variable - If previous data is available: use a lagged variable (from a prior year) c) Simultaneity: one or more of the independent variables are jointly determined with the dependent variable

What's the 5. rule of the Gauss-Markov-Theorem? And how can the errors be detected and solved?

5. Homoscedasticity: error terms have the same variance Var(u/x) = 𝜎² and are not correlated Detecting heteroscedastic errors - Residual plots (graphical tool) - Statistical tests for heteroscedasticity: Breusch-Pagan test: regresses squared OLS residuals on the independent variables; White test: also involves regressing squared OLS residuals on the independent variables One way to solve: data transformation (e.g. log transformation)

Explain the 5 recipe for a regression analysis. How can the errors be detected?

5. Model diagnostics Examination of model assumptions Perform a number of tests - Functional form (RESET test) - Check collinearity - Detecting heteroscedastic errors Detecting heteroscedastic errors - Residual plots (graphical tool) - Statistical tests for heteroscedasticity, e.g. Breusch-Pagan test, White test If heteroscedasticity is detected: variable transformation, weighted least squares, or heteroscedasticity-robust standard errors

Explain the 6 recipe for a regression analysis.

6. Adapt model and start over with 4 - Incorporating nonlinearities (Logging of variables, Polynomials, ...) - Deal with severe multicollinearity

What's the time constant individual characteristics in panel data?

Advantage of panel data: - You can "control" for individual characteristics because you observe the same individuals with the First differenced estimation - First differenced estimation: Difference each variable over time and then apply OLS - a time constant effect can be assumed (by observing the same individuals across time)

Why does the Poisson regression rarely fit? What model is the better choice in this case?

Assumption violation (Var=E): Var(y/x)>E(y/x) (=Overdispersion) Negative binomial regression - Generalization of poisson regression - Adds a parameter that allows the variance to exceed the mean - Extends poisson distribution (allowing for greater variance) to negative binomial distribution - Coefficients also estimated with Maximum Likelihood - Interpretation the same as in poisson regression model - Negative binomial regression models is what you typically would use to handle patent data

What needs to be taken into account by analyzing time series data (problems) and how can it be solved?

Autocorrelation Time trends: - Causal inference needs to recognize these trends (otherwise our estimates are biased) - Solution: 1. include a time trend t in the regression equation (linear trend as below; an exponential trend can be captured by logging y) - y(t)=ß0+ß1*x1(t)+ß2*x2(t)+ß3*t+u(t) 2. Detrend the time series data Seasonality: - Causal inference needs to recognize these patterns - Solution: include a time dummy for the season of interest

What does the variance measure?

Average of squared deviations between the individual values and the mean of the dataset: Var = sum of (xi - avg. x)^2 / n - Describes how strongly individual values differ from the mean value
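
As a small illustration, the variance formula (and the standard deviation as its square root) can be computed by hand in Python. The data values are hypothetical; the division by n follows the formula in this card, whereas many software packages divide by n - 1 for samples.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset

mean = sum(values) / len(values)
# Average of the squared deviations from the mean (division by n)
variance = sum((x - mean) ** 2 for x in values) / len(values)
standard_deviation = math.sqrt(variance)

print(mean, variance, standard_deviation)

# Cross-check against the standard library's population variance
print(statistics.pvariance(values))
```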

How can ordinal information in a regression be included? and how can it be interpreted?

Categories with order - eg. credit ratings CR [0,1,2,3,4]: 4 is better than 3, but is the difference the same? - therefore include 4 different dummies (n-1) with 0 as base category Interpretation of CR1-CR4 Difference in dependent variable (other factors fixed) between a municipality with a credit rating of zero and a credit rating of 1 (2, 3, 4 respectively)

What types of data do you know (structure)?

Cross-sectional data - Samples of individuals, households, firms, countries... at a given point in time Time series data - Observation on a variable/ a set of variables over time, usually regular intervals Pooled cross section - Combining cross-sectional data for two or more time periods Panel/ longitudinal data - Following individuals over time

What is the Kurtosis? What differences does it explain?

Degree of peakedness in a data distribution = degree to which data values are concentrated around the mean - High kurtosis: distinct peak near the mean, decline rapidly - Normal distribution - Low kurtosis: flat top near the mean

How is the range calculated?

Difference between maximum and minimum

What is measured by interaction terms? Give an example: housing prices depend on size, bedroom, the interaction between size and bedroom, and bathrooms. Calculate the marginal effect of bedrooms on price. What information do we get?

Effect of one variable may depend on another variable: price=ß0+ß1*sqrft+ß2*bdrms+ß3*sqrft*bdrms+ß4*bthrms+u (Partial) marginal effect of bdrms on price (holding all other factors fixed): derivative with respect to bdrms = ß2+ß3*sqrft If ß3>0 -> an additional bedroom increases house prices more for larger houses

Define the finite distributed lag model

Finite distributed lag model - For lagged effects - reflecting one impact (x) over time for one individual - y(t)=ß0+ß1x(t)+ß2x(t-1)+ß3x(t-2)+...+u(t) with ß1 = immediate effect and ß1+ß2+ß3 = overall effect

How can dummies be implemented in interaction terms? Use the example: wages, marital status, gender

Four groups: female= 0, married=0 female=1, married=0 female=0, married=1 female=1, married=1 log(wage)= ß0+ß1female+ß2married+ß3female*married+u

- After the unusually warm weather in the last few decades, many people suspect that this has affected ecosystems in Germany causing them to release carbon to the atmosphere. The long-term mean of net ecosystem carbon exchange (NEE) is given as 20. - We have measurements of NEE from a number of localities (in units of grams of carbon per square metre per year). The data are shown below. A negative value represents a net uptake of carbon by the ecosystem, a positive value net release of carbon. 15, -10, 62, 53, -3, 8, 19, 16 Formulate H0 and HA

H0: Mean = 20 (hypothesis of "no difference") - i.e. the sample comprising of NEE values for different localities is drawn from a population of NEE values with the long-term mean of 20 HA: Mean ≠ 20, alternative hypothesis - i.e. the sample comprising of NEE values for different localities, differ from the long-term mean of 20.

What does the degree of freedom explain?

How much can your information vary when estimating parameters - the more variables you include and the smaller your sample size, the less freedom your estimated parameters have - You always want many degrees of freedom if there is not enough freedom to vary -> overspecification

100 families were asked for their annual consumption (cons) and income (inc) (both in USD). The following regression equation was estimated cons = -124,84 + 0,853 inc ; n=100, R²=0,0692 Interpret the intercept, discuss its sign and the absolute value

Intercept (=ß0): -124,84 Meaning: The constant term in regression analysis is the value at which the regression line crosses the y-axis. The constant is also known as the y-intercept. The regression equation shows us that the negative intercept is -124,84. Using the traditional definition for the regression constant, if income is zero, the expected mean consumption is -124,84. (=Minimum of model) Interpretation: In this case the sign doesn't make any sense, because a negative consumption if the income equals 0 isn't possible. To have any chance of interpreting the constant, this all zero data point must be within the observation space of your dataset. A portion of the estimation process for the y-intercept is based on the exclusion of relevant variables from the regression model. When you leave relevant variables out, this can produce bias in the model. Bias exists if the residuals have an overall positive or negative mean. In other words, the model tends to make predictions that are systematically too high or too low. The constant term prevents this overall bias by forcing the residual mean to equal zero.

What's the difference in interpretation between OLS and non-linear models?

Interpretation in OLS - Coefficients give you marginal effects - non-linear independent variables (logged variables can be interpreted as elasticity) - Interaction terms and quadratic variables (first derivative gives you marginal effects plugging in values helps in interpretation) Interpretation in non-linear models - Big drawback: not intuitive - Signs can be interpreted - Size of effects interpretable by factor change (exp(ß)) - Marginal effects: via first derivative, but in the non-linear case not that straight-forward - In non-linear regression models we need more observations

Summarize the functional forms involving Logarithms

Level-level: y and x -> Interpretation: If X increases by one unit, Y will change by β1 units (marginal effects) Level-log: y and log(x) -> If X increases by one percent, Y will change by β1/100 units Log-level: log(y) and x -> If X increases by one unit, Y will change by 100*β1 percent Log-log: log(y) and log (x) -> If X increases by one percent, Y will change by β1 percent (elasticity)

What does the elasticity measure? Try to interpret: coffee (quantity)=0,06-800log(price)

Measures the responsiveness of one variable to changes in another variable (e.g. log-log Model) -> the higher the elasticity, the stronger the response - In a log-log model the coefficient (ß) is the elasticity Interpretation: If the price increases by 1 %, the demand in coffee quantity will decrease by 8 coffees.
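
The coffee example is a level-log model, so the "divide the coefficient by 100" rule is an approximation. The sketch below compares the rule of thumb with the exact change implied by the equation for a 1 % price increase; the variable names are illustrative only.

```python
import math

beta = -800.0  # coefficient on log(price) from the coffee example

# Rule of thumb for level-log models: a 1 % increase in x changes y by beta / 100 units
approximate_change = beta / 100

# Exact change: beta * [log(1.01 * price) - log(price)] = beta * log(1.01)
exact_change = beta * math.log(1.01)

print(approximate_change)  # -8.0
print(exact_change)        # slightly smaller in absolute value than 8
```

For small percentage changes the two numbers are nearly identical, which is why the rule of thumb is usually sufficient.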

What is the mode used for?

Most frequent value in a dataset

Why is the OLS so important and how is it connected to the Gauss-Markov Theorem?

OLS = estimation method that yields consistent estimators for ß0 to ßn (under certain assumptions) Gauss-Markov Theorem: if the following assumptions apply in a linear regression model...: 1. Linear in parameters 2. Random sample of size n 3. The x-values of the sample vary, Var(x) > 0 (no perfect collinearity) 4. Error terms have an expected value of zero, E(u/x)=0 5. Homoscedasticity: error terms have the same variance Var(u/x) = 𝜎² and are not correlated ...then OLS provides the Best Linear Unbiased Estimator (BLUE)

Try to interpret the variable experience in 10 years? and where is the turning point? wage= 3,73+0,298*experience-0,0061*experience²

Option 1: Plug in values to interpret the effect: 10 years of experience: 0.298*10-0.0061*10²=2.37 (5 years: 1.3375, so 2.37-1.3375 -> 5 more years are worth about 1.03) Option 2: Calculate the first derivative and plug in the years: 0.298-2*0.0061*10=0.176 (-> approximate effect of going from 10 to 11 years) Turning point: set the first derivative to 0 and solve: 0=0.298-2*0.0061*experience -> experience ≈ 24.4 years
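
The calculation for the quadratic wage equation can be reproduced with a short Python sketch. The coefficients come from the question; the function names are chosen here only for illustration.

```python
B_EXPER = 0.298       # coefficient on experience
B_EXPER_SQ = -0.0061  # coefficient on experience squared

def experience_contribution(years):
    """Part of the predicted wage that is due to experience (intercept excluded)."""
    return B_EXPER * years + B_EXPER_SQ * years ** 2

def marginal_effect(years):
    """First derivative of the wage equation with respect to experience."""
    return B_EXPER + 2 * B_EXPER_SQ * years

# Turning point: the experience level at which the marginal effect is zero
turning_point = -B_EXPER / (2 * B_EXPER_SQ)

print(experience_contribution(10))                               # about 2.37
print(experience_contribution(10) - experience_contribution(5))  # about 1.03
print(marginal_effect(10))                                       # about 0.176
print(turning_point)                                             # about 24.4 years
```

Because the squared term is negative, the curve is concave: the marginal effect shrinks with every additional year and turns negative after the turning point.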

How can regressions outputs be interpreted?

Overall model quality - Goodness of fit - F-Test Interpretation of coefficients - Simple regression model - Multiple regression model - Statistical significance Own practice

Explain the effect of over- and underspecification

Overspecification - Inclusion of irrelevant (too many) variables: even though they do not have an effect - No serious problem: does not bias my coefficients as there is no effect - But reduces degrees of freedom and thus can be problematic Underspecification - Inclusion of too few variables - You underfit your model and encounter an "omitted variable bias"

State the difference between PCA and Factor Analysis

PCA reproduces data structure as well as possible - "Which umbrella term can we use to summarize a set of variables?" Factor analysis explains variable correlations, looks for latent variables - "What is the common reason for strong correlations between a set of variables?"

What are the characteristics of a panel data? y(i,t)= ß0 + ß1 * d(t) + ß2 * x(i,t) + a(i) + u(i,t) Explain the model on 1. what is d(t) 2. intercept if t=1 and intercept if t=2 3. What is a? 4. what type of errors can occur?

Panel data (longitudinal data) - The same observations across time: questions being asked to the same individual each year - Panel data has both a cross-section (i=many individuals) and a time dimensions denoted by the subscripts (t) 1. d(t) is a fixed (no change between i) time dummy, that equals 0 in the first period and 1 in the second time period, in order to allow the intercept to change 2. if t=1: ß0, if t=2: ß0+ß1 3. Unobserved time-constant factor that does not change over time (=fixed effects) 4. - errors that vary across time (=idiosyncratic error u(i,t)) - errors that are constant across time a(i)

What's the structure of space data?

Points - Pairs of coordinates (x, y), representing events, individuals, cities or any other discrete object defined in space - Examples: location of a crime, location of a household - R can handle shape files and lat/ long data Polygons - Sequence of connected points where the first point is the same as the last - Examples: administrative units such as US counties, NUTS regions... - R can handle shape files Lines - Open polygon: sequence of connected points that does not result in a closed shape - Examples: roads, rivers... Grids - Divides the study region into a set of identical, regular-spaced, discrete elements (pixels) where each pixel records values - Example: remote sensed data

What are percentiles?

Positions of values in an ordered dataset

For what stands PCA? Why is it necessary what's the basic idea and the main application?

Principal Component Analysis Why? - PCA is a tool for dimensionality reduction - Especially for variables that are correlated with each other (contain duplicate information) Basic idea - Transform the data set into linearly uncorrelated "principal components (PC)" - each component captures as much of the variance in the data as possible - First PC has the largest variance, second has the second largest... - A so-called rotation method is used to load each variable on factors by common variance (weighted) Main applications: - Exploratory data analysis - Used to address multicollinearity in regression models - But variables constructed with PCA are hard to interpret in regressions beyond the existence and direction of a relationship

What are key sampling methods and on which do we need to focus on?

Probability sampling - Each unit in the population has a known, nonzero chance of being included in the sample - Inferences can be made about the population Non-probability sampling - Sample is not representative of the population - Practiced for reasons of cost, timeliness, convenience Focus on probability sampling, as we are interested in inferential statistics

What function does quadratic variables have? Mention a famous example for diminishing marginal effects.

Quadratic variables for non-linear relationships capture diminishing or increasing marginal effects Famous example for diminishing marginal effects: Kuznets curve - Hypothesis that as economies develop, inequality first rises and then decreases over time (inverted U-shaped parabola)

How can the coefficient of the Logit model be interpreted? Use as an example: 1. ß<0 1. ß>0 1. ß=0 What are logodds? How can the effect of ß be interpreted?

The sign of a coefficient gives the direction of the relationship: 1. ß<0: the probability of Y=1 decreases (odds are multiplied by exp(ß)<1) 2. ß>0: the probability of Y=1 increases (odds are multiplied by exp(ß)>1) 3. ß=0: the probability remains the same (exp(ß)=1) Coefficients in the logit model = log-odds: 1. Estimate the probability (p = x/sum of x) 2. Use the probability to compute the odds (odds = p/(1-p)) 3. Log-odds = log(odds) Interpreting the size of effects (ß) - ß is the change in the log-odds of Y=1 for a one-unit increase in x - the larger the odds = exp(log-odds), the larger the probability of being in the respective group
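
The chain probability -> odds -> log-odds, and the way back, can be sketched as follows. The coefficient value is hypothetical and the helper names are chosen for this illustration; nothing here is fitted to data.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def log_odds(p):
    """Logit: natural logarithm of the odds."""
    return math.log(odds(p))

def probability(logit):
    """Inverse of the logit (logistic function): log-odds back to a probability."""
    return 1 / (1 + math.exp(-logit))

beta = 0.5                   # hypothetical logit coefficient
odds_ratio = math.exp(beta)  # factor by which the odds change per one-unit increase in x

p_before = 0.30
p_after = probability(log_odds(p_before) + beta)  # probability after x rises by one unit

print(odds(0.8))          # about 4, i.e. "4 to 1"
print(odds_ratio)         # greater than 1 because beta > 0
print(p_before, p_after)  # the probability increases, but not by a fixed amount
```

The example illustrates why logit coefficients are not marginal effects: the same ß shifts the probability by different amounts depending on the starting probability.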

How can the poisson regression model be interpreted?

Signs of coefficients give the direction of the relationship Interpreting the size of effects - Option 1: factor change exp(ß) - Option 2: percentage change 100*(exp(ß)-1)
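
Both interpretation options follow directly from exp(ß), with the percentage change computed as 100*(exp(ß)-1). The coefficient below is hypothetical and only serves to illustrate the arithmetic.

```python
import math

def factor_change(beta):
    """Factor by which the expected count changes for a one-unit increase in x."""
    return math.exp(beta)

def percentage_change(beta):
    """Percentage change of the expected count for a one-unit increase in x."""
    return 100 * (math.exp(beta) - 1)

beta = 0.05  # hypothetical Poisson regression coefficient

print(factor_change(beta))      # about 1.051
print(percentage_change(beta))  # about 5.1 percent
```

For small coefficients the percentage change is close to 100*ß, but the exact formula should be used once ß is larger.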

How can the coefficients be interpreted? - in a simple linear regression? - in a multiple linear regression?

Simple linear regression: If x increases by one unit, y will change on average by ß1 units Multiple linear regression: If x1 increases by one unit, y will change on average by ß1 units if other factors are held constant. If x2 increases by one unit, y will change on average by ß2 units if other factors are held constant (ceteris paribus)

Why does space need to be taken into account?

Spatial heterogeneity - Magnitude and direction of effects may vary across space - Solution: control for them, e.g. regional dummies Spatial dependence (=autocorrelation) - Variable at one location is determined by the same variable at a different location - If such dependences exist: spatial autocorrelation - Solution: spatial regression models -> Space can be included in all cross-sectional and panel data

What does the standard deviation measure

Square root of the variance: S = sqrt(sum of (xi - avg. x)^2/n)

How can residuals be interpreted?

Symmetry of residuals (ideally: normally distributed): median should be close to zero, 1Q and 3Q roughly similar in absolute value

What is the VIF? How is it calculated?

Tests for variance inflation of individual coefficients by auxiliary regression (=regress individual variable on all other variable) Vif(i)= 1/(1-R²(i)) VIF>5 -> sign for multicollinearity VIF>10 -> extreme multicollinearity (would make you worry) If you detect strong multicollinearity (5-10) - Drop highly multicollinear variables - Perform PCA of highly multicollinear variables
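
A minimal sketch of the VIF formula. It assumes the R² of the auxiliary regression has already been obtained elsewhere; the function name and the example R² values are illustrative.

```python
def vif(r_squared_aux):
    """Variance inflation factor from the R² of the auxiliary regression.

    r_squared_aux is the R² obtained when one independent variable is
    regressed on all other independent variables.
    """
    if not 0 <= r_squared_aux < 1:
        raise ValueError("auxiliary R² must lie in [0, 1)")
    return 1 / (1 - r_squared_aux)

print(vif(0.0))  # 1.0: the variable is unrelated to the other regressors
print(vif(0.8))  # about 5: the threshold for multicollinearity named above
print(vif(0.9))  # about 10: the threshold for extreme multicollinearity
```

In other words, the thresholds of 5 and 10 correspond to 80 % and 90 % of a regressor's variation being explained by the other regressors.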

What is the median used for?

The data value in the middle of the ordered data value

What is a marginal effect?

The effect on the dependent variable that results from changing an independent variable by a small amount. The marginal effect can be calculated by the first derivative. Interpretation of coefficients obtained by OLS = marginal effects interpretation

The variable output is the maize harvest in kg per year, the variable rain is the rainfall in millimetres per year. A simple model that estimates the harvest dependent on rainfall is given as Output = ß0 + ß1 rain + u (u being the error term) Define the error term (u) and give the formula to calculate the error term.

The error term u(i) captures the deviation between observations and the estimated function. - the influence of excluded variables, non-linearity of data and other "noise" - u(i)= y(i) - ß0 - ß1 x(i)

What is a histogram used for?

The frequency distribution of one variable

For what are logging variables? What function do they have and what else do you know?

The use of logs is helpful when the marginal effect changes in size when measured in units (the effect is related to percent changes rather than unit changes) - May address model misspecification (wrong functional form) -> Small values that are close together are spread further out -> Large values that are spread out are brought closer together - Only works with strictly positive data Sometimes we'll have a marginal effect that we think will change in direction at some point. In this case, we'll switch to a polynomial, or quadratic, functional form where we include a squared term of the independent variable(s).

What can be said about outliers and how can they be handled?

Unusual (influential) observations may strongly affect OLS estimates - OLS is very susceptible: it minimizes the sum of squared residuals, so large residuals receive a lot of weight Sources of unusual observations - Data entry error (too many zeros, wrong unit...) -> check data - "True" extreme value -> decision lies with the researcher Detecting outliers - Visual inspection (scatterplot, boxplot...) - Different systematic approaches based on standard diagnostic statistics (DFFITS, Cook's distance...)
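
As a rough numerical counterpart to inspecting a boxplot, the sketch below flags values outside 1.5 times the interquartile range. This is the common boxplot whisker rule, swapped in here for simplicity; it is not DFFITS or Cook's distance, and the function name and data are hypothetical.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]


# Hypothetical data with one suspicious entry (perhaps a zero too many).
harvest = [10, 12, 11, 13, 12, 11, 100]
flagged = iqr_outliers(harvest)
print(flagged)
```

A flagged value is only a candidate: whether it is a data entry error or a true extreme value still has to be checked.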

Why do we need year dummy interactions? Interpret the following example for: 1. intercept (1978) 2. intercept (1985) 3. wage differential between men and women in 1978 and 1985 4. the variable experience Example, changes in wage since 1978: log(wage)=ß0+ß1*y85+ß2*educ+ß3*y85*educ+ß4*exper+ß5*exper²+ß6*female+ß7*y85*female+u

Year dummy interactions - Show whether the effect of an independent variable changes over time 1. 1978: ß0 2. 1985: ß0+ß1 3. 1978: ß6; 1985: ß6+ß7 4. Experience has the same effect in both years in this model (it is not interacted with y85)

How can the adjusted R² be calculated?

adj. R²= 1 - (1-R²)(n-1)/(n-k-1) with k = number of independent regressors and n = number of observations
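
A minimal sketch of the calculation, using the standard form 1 - (1-R²)(n-1)/(n-k-1). The function name and the input values are illustrative assumptions.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n observations and k independent regressors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Illustrative values: R² = 0.0692, 100 observations, one regressor.
adj = adjusted_r_squared(0.0692, n=100, k=1)
print(round(adj, 4))
```

The adjusted value is slightly below the plain R² because the formula penalizes each additional regressor.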

100 families were asked for their annual consumption (cons) and income (inc) (both in USD). The following regression equation was estimated: cons = -124,84 + 0,853 inc ; n=100, R²=0,0692 What is the estimated consumption with an income of 3000?

cons=-124,84+0,853*3000 cons=2434,16
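
The same arithmetic as a short sketch, with the estimated coefficients written with decimal points. The function name is a hypothetical choice.

```python
def predict_consumption(inc):
    """Fitted consumption from the estimated equation cons = -124.84 + 0.853 * inc."""
    return -124.84 + 0.853 * inc


cons_3000 = predict_consumption(3000)
print(round(cons_3000, 2))
```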

What kinds of variables can be used as independent and dependent variables in a linear regression?

independent: - metric - binary - (multi-)categorical - logarithms - quadratic terms dependent: - always metric

What does OLS mean and what is it used for?

Ordinary least squares - A type of linear least squares method for estimating the unknown parameters (ß0 and ß1) in a linear regression model (under certain assumptions) - Estimates a linear function - Error term u captures the deviation between observations and the estimated function ("residuals") - Minimizes the sum of the squared differences between the observed values of the dependent variable in the given dataset and those predicted by the linear function (= residuals)
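
For a single regressor, the minimization has a well-known closed-form solution: the slope is the ratio of the summed cross-deviations to the summed squared deviations of x. The sketch below applies it to hypothetical rainfall and harvest data; the function and variable names are illustrative.

```python
def ols_simple(x, y):
    """Closed-form OLS estimates (b0, b1) for y = b0 + b1*x + u."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: rainfall (mm per year) and maize harvest (kg per year).
rain = [400.0, 500.0, 600.0, 700.0, 800.0]
output = [1350.0, 1600.0, 1900.0, 2050.0, 2400.0]

b0, b1 = ols_simple(rain, output)
# Residuals: observed value minus fitted value.
residuals = [yi - b0 - b1 * xi for xi, yi in zip(rain, output)]
print(b0, b1, residuals)
```

With an intercept in the model, the residuals sum to zero up to rounding, which is a quick check that the fit is correct.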

Define the static model of time

Static model of time - A change in x at time t has an immediate effect on y at time t (subscript t denotes time) - We have only one individual but different observations (x) across time - y(t)=ß0+ß1x1(t)+ß2x2(t)+...+u(t)

Give the mathematical definition of a logit model

The model describes the probability that the dependent variable takes on the value 1 given the covariates: Logistic (response) function: G(z)= exp(z)/(1+exp(z)) with z as linear predictor z=ß0+ß1x+... An equivalent formulation is given by the multiplicative model for the odds: G(z)/(1-G(z)) =exp(ß0)*exp(ß1xi1)*exp(ß2xi2) Transformation into a linear regression form: ln(G(z)/(1-G(z))) =ß0+ß1xi1+ß2xi2
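
The two directions of the transformation can be sketched as follows: the logistic function maps the linear predictor z to a probability, and the log-odds map that probability back to z. The function names and coefficient values are hypothetical.

```python
import math


def logistic(z):
    """Logistic response function G(z) = exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1.0 + math.exp(z))


def log_odds(p):
    """Logit transformation ln(p / (1 - p)), the inverse of G."""
    return math.log(p / (1.0 - p))


def predicted_probability(b0, b1, x):
    """P(y = 1 | x) for a linear predictor z = b0 + b1*x."""
    return logistic(b0 + b1 * x)


# Hypothetical coefficients: z = -1.0 + 0.5 * 2.0 = 0, so the probability is 0.5.
p = predicted_probability(-1.0, 0.5, 2.0)
print(p, log_odds(p))
```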

State the different names for y and x

y - Dependent variable - Explained variable - Response variable - Predicted variable - Regressand x - Independent variable - Explanatory variable - Control variable - Predictor variable - Regressor

How can dummies be included in regression analysis? How can they be interpreted? How many dummy variables can be included? Example: wage=ß0+δ0female+ß1educ+u

δ0: interpreted as the gender discrimination effect (wage difference between women and men with the same education) For women: female=1, wage=ß0+δ0*1+ß1educ For men: female=0, wage=ß0+ß1educ Always one dummy variable less than categories: n-1 dummies for n categories (including both male and female would cause perfect collinearity)
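
A minimal sketch of the n-1 rule: one category is chosen as the base and gets no dummy of its own, so its effect is absorbed by the intercept. The helper names and the coefficient values are hypothetical.

```python
def make_dummies(values, base):
    """One 0/1 dummy per category except the base category (n-1 dummies)."""
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError("base category must occur in the data")
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != base
    }


def predict_wage(b0, d0, b1, female, educ):
    """Fitted wage from wage = b0 + d0*female + b1*educ."""
    return b0 + d0 * female + b1 * educ


gender = ["female", "male", "female", "male"]
dummies = make_dummies(gender, base="male")
print(dummies)

# With hypothetical coefficients, the gap at equal education equals d0.
gap = predict_wage(2.0, -1.0, 0.5, 1, 10) - predict_wage(2.0, -1.0, 0.5, 0, 10)
print(gap)
```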

