Chapter 14: Linear Regression
maximum likelihood principle — what are its two assumptions?
(1) the spread of the points around the line is of similar magnitude along the entire range of the data, and (2) the distribution of these points about the line is normal. If these criteria are met, least-squares regression will provide the best (most likely) estimates of a0 and a1, and a standard deviation for the regression line can be determined as s_y/x = sqrt(Sr / (n - 2)), the *standard error of the estimate*
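The formula above can be sketched in Python; the data and the fitted coefficients below are made up for illustration, not taken from the book.

```python
import math

# Standard error of the estimate: s_y/x = sqrt(Sr / (n - 2)),
# where Sr is the sum of squared residuals about the regression line.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.9]      # made-up data
a0, a1 = 0.08, 1.98                # hypothetical fitted intercept and slope

# residuals e_i = y_i - (a0 + a1*x_i)
Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
n = len(x)
s_yx = math.sqrt(Sr / (n - 2))     # n - 2: two parameters were estimated
```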
(1) why is the equation for the standard error of the estimate (s_y/x) divided by n - 2? (2) and what does the subscript y/x mean?
(1) divided by n - 2 because two degrees of freedom were lost when calculating Sr: it is based on the estimated coefficients a0 and a1. Also, there is no such thing as the "spread of data" around a straight line connecting only two points; just as with n - 1, when n = 2 the equation yields a meaningless result of infinity. (2) the subscript means the error is for a predicted value of y corresponding to a particular value of x
A number of additional properties of this fit can be elucidated by examining more closely the way in which residuals were computed. (properties of fit)
* comparing ways in which residuals were calculated (card 56-57) * the standard error of the estimate
linearization of saturation growth rate equation
*inverting*: y = αx/(β + x) becomes 1/y = 1/α + (β/α)(1/x), which is linear in 1/x
linearization exponential equation
*ln*: y = α e^(βx) becomes ln y = ln α + βx, which is linear in x
linearization of power equation
*log*: y = αx^β becomes log y = log α + β log x, which is linear in log x
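The power-equation transformation can be sketched in Python; the data are generated from assumed values α = 2, β = 1.5, so the log-log fit recovers them exactly.

```python
import math

# Power model y = alpha * x**beta linearizes to:
#   log10(y) = log10(alpha) + beta * log10(x)
alpha_true, beta_true = 2.0, 1.5       # assumed "unknown" parameters
x = [1.0, 2.0, 4.0, 8.0]
y = [alpha_true * xi ** beta_true for xi in x]

# ordinary linear least squares on (log10 x, log10 y)
X = [math.log10(xi) for xi in x]
Y = [math.log10(yi) for yi in y]
n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(a * b for a, b in zip(X, Y))
sxx = sum(a * a for a in X)
a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope = beta
a0 = (sy - a1 * sx) / n                          # intercept = log10(alpha)
beta = a1
alpha = 10 ** a0
```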
*process* for determining values for a0 and a1: -- the actual values of a0 and a1 are in next card
*page 338* -- set the partial derivatives of Sr with respect to a0 and a1 equal to 0. -- note that the summation of a0 is n*a0, where n is the number of samples -- this yields the normal equations n*a0 + (Σx)a1 = Σy and (Σx)a0 + (Σx^2)a1 = Σxy -- solve for a0 and a1 simultaneously
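The normal-equation process above can be sketched in Python; the data are made up to lie exactly on y = 1 + 2x, so the solution is a0 = 1, a1 = 2.

```python
# Solve the normal equations for y ~ a0 + a1*x:
#   n*a0    + (sum x)*a1   = sum y
#   (sum x)*a0 + (sum x^2)*a1 = sum xy
x = [0, 1, 2, 3]
y = [1.0, 3.0, 5.0, 7.0]           # made-up data: exactly y = 1 + 2x
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
a0 = sy / n - a1 * sx / n                        # intercept: ybar - a1*xbar
```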
variance
- the square of the standard deviation: s^2 = St/(n - 1) (an alternative equation follows this card). I think the alternative version yields more subtractive cancellation
standard deviation
- the square root of the total *sum of the squares of the residuals* between the data points and the mean, divided by n minus one: s_y = sqrt(St/(n - 1))
strategies for fitting a "best" line through the data:
1) minimize the sum of the residual errors for all the available data -- 3 different ways
If a quantity is normally distributed, the range defined by ȳ - s_y to ȳ + s_y will encompass approximately ____% of the total measurements
68
the range defined by ȳ - 2s_y to ȳ + 2s_y will encompass approximately ___%.
95
*NONLINEAR STARTS HERE* *NOTE THAT THIS IS STILL LINEAR REGRESSION BUT WITH TRANSFORMATIONS*
after linearization, a0 corresponds to (a transform of) α and a1 to β -- e.g., for the exponential model, a0 = ln α and a1 = β
Normal Distribution
A function that represents the distribution of variables as a symmetrical bell-shaped graph.
Monte Carlo simulation
An analytical method that simulates a real-life system by randomly generating values for variables.
power equation when is it used
It is very frequently used to fit experimental data when the underlying model is not known.
What is nonlinear regression? *NOTE THAT THIS IS STILL LINEAR REGRESSION BUT WITH TRANSFORMATIONS*
Linear regression is predicated on the relationship between the variables being linear, which is not always the case, so nonlinear regression was created. Here, nonlinear models are transformed so that linear regression can still be applied.
for a perfect fit, what do the coefficient of determination (r^2) and the sum of the squares of the residuals around the regression line (Sr) equal, signifying that the line explains 100% of the variability of the data?
Sr = 0, r^2 = 1
improvement or error reduction due to describing the data in terms of a straight line rather than as an average value is represented by
St - Sr
OKAY so, the square of the residual can represent various things, but two in particular are defined by the two equations behind this card.
The first equation represents the square of the *vertical distance* between the data and one measure of central tendency: the *straight line*. The second equation represents the square of the discrepancy between the data and a single estimate of central tendency: the *mean*. (more on next card -- picture)
(2/2)
These results indicate that 88.05% of the original uncertainty has been explained by the linear model.
saturation growth rate equation -- when is it used?
This model is particularly well-suited for characterizing population growth under limiting conditions: the growth rate levels off, or "saturates," as x increases
why use degrees of freedom?
This nomenclature derives from the fact that the sum of the quantities upon which St is based (i.e., ȳ - y1, ȳ - y2, ..., ȳ - yn) is zero. Consequently, if ȳ is known and n - 1 of the values are specified, the remaining value is fixed. Thus, only n - 1 of the values are said to be freely determined. Another justification for dividing by n - 1 is that *there is no such thing as the spread of a single data point.* For the case where n = 1, Eqs. (14.3) and (14.5) yield a meaningless result of infinity.
1-b: to remove the effect of signs seen in the previous example, we can take the absolute value of the discrepancies (*minimize the sum of the absolute values of the residuals*)
any straight line falling within the dashed lines will minimize the sum of the absolute values of the residuals, so this criterion does not yield a unique solution: more than one line can minimize the sum equally well
spread of data around the mean (s_y) vs. spread of data around the regression line (s_y/x, AKA the standard error of the estimate)
if s_y/x < s_y, the regression line represents an improvement over characterizing the data by the mean alone
what does a normal distribution histogram say about the data?
a bell shape says that most of the data are grouped close to the mean value
polyfit()
built-in MATLAB function that fits a least-squares nth-order polynomial to data >> p = polyfit(x, y, n) *inputs* where x and y are the vectors of the independent and the dependent variables, respectively, and n = the order of the polynomial. *outputs* The function returns a vector p containing the polynomial's coefficients ordered from the highest power down to the constant term
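A rough NumPy analogue of the same call, for comparison; np.polyfit uses the identical coefficient ordering (highest power first). The data are made up.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # made-up data: exactly y = 2x + 1
p = np.polyfit(x, y, 1)              # first-order fit: p[0]=slope, p[1]=intercept
```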
exponential model when is it used
characterizes quantities that increase (positive β1) or decrease (negative β1) at a rate that is directly proportional to their own magnitude. ex: population growth, radioactive decay
sum of the squares of the residuals
the sum of the squared deviations between the predicted values and the actual empirical values of the data
variance pt.2
doesn't require precomputation of the mean and yields identical results to the previous equation: s^2 = (Σy^2 - (Σy)^2/n) / (n - 1)
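Both variance forms can be checked against each other in Python with a small made-up sample:

```python
# Two algebraically equivalent variance formulas:
#   two-pass: s2 = sum((y - ybar)^2) / (n - 1)
#   one-pass: s2 = (sum(y^2) - (sum(y))^2 / n) / (n - 1)
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
n = len(y)
ybar = sum(y) / n
s2_two_pass = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
s2_one_pass = (sum(yi * yi for yi in y) - sum(y) ** 2 / n) / (n - 1)
```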
error e is calculated:
e = y - a0 - a1x. The residual e is the discrepancy between the true value of y and the approximate value, a0 + a1x, predicted by the linear equation.
(T/F) if r^2 (coefficient of determination) is close to 1, your line is a good fit
false; it is possible to obtain a relatively high value of r^2 when the underlying relationship between y and x is not even linear (proven on the next card)
randn()
generates numbers that have a *normal* distribution, that is mean 0 and standard deviation of 1.
mean, median, mode, range, variance, and standard deviation in MATLAB
if s is a single vector, these work as expected; if s is a matrix, they return a *row* vector containing the arithmetic mean/mode/etc. for each *column* of s. *NOTE* The mode function only returns the first of the most frequently occurring values.
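A NumPy sketch of the same column-wise behavior (the matrix is made up); axis=0 mimics MATLAB's default of operating down each column:

```python
import numpy as np

s = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
col_means = np.mean(s, axis=0)       # one value per column
col_medians = np.median(s, axis=0)
```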
when will standard deviation be low?
if the sum of the squares of the residuals is low; that is, the data are close to the mean
1-a: minimize the *sum of the residual errors* for all the available data (*minimize the sum of the residuals*)
an inadequate criterion because any straight line passing through the midpoint of the line connecting the points (except a perfectly vertical line) results in a minimum value of zero, since positive and negative errors cancel
how can a data set's histogram be made to approach a normal distribution?
increase sample size
A big sum of the squares of the residuals
indicates a looser fit of the data to the mean
A small sum of the squares of the residuals
indicates a tight fit of the data to mean
correlation coefficient
r is just the square root of the coefficient of determination r^2 (the equation on the previous card); another equation is in the next card
median AKA (50th percentile)
is the midpoint of a group of data
mode
is the value that occurs most frequently.
why isn't the range considered reliable?
it is highly sensitive to the sample size and is very sensitive to extreme values (outliers)
The location of the center of distribution of data can be measured in what ways? which is most common?
mean, mode, median; the mean is most common
Linear Least-Squares Regression
a method to determine the best coefficients in a linear model for a given data set by minimizing the sum of the squares of the *estimate residuals* (Sr)
1- c *minimax criterion*
minimizes the maximum distance that an individual point falls from the line; very ill-suited for regression because it is overly sensitive to outliers
overall which approach should be taken to find the best fit line through data points?
minimizing the sum of the squares of the residuals
the bin with the most frequency is often referred to as
modal class interval. In this picture it would be the bin from 6.6 to 6.64. One could also say the mode is the midpoint of the bin (6.62), but the latter is more appropriate
if r^2 = 0 (coefficient of determination), so that Sr = St, the fit represents
no improvement
coefficient of determination (r^2)
the difference between the sum of the squares of the data residuals around the mean (St) and the sum of the squares of the estimate residuals around the regression line (Sr), normalized by St: r^2 = (St - Sr)/St. It is normalized to St because the magnitude of St is scale dependent. It provides a handy measure of goodness of fit and represents the percentage of the original "uncertainty" explained by the model.
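A Python sketch of St, Sr, and r^2; the data and the fitted line (a0 = 0, a1 = 2) are made up:

```python
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # made-up data, nearly y = 2x
a0, a1 = 0.0, 2.0               # hypothetical fitted coefficients
n = len(y)
ybar = sum(y) / n
St = sum((yi - ybar) ** 2 for yi in y)                         # about the mean
Sr = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))   # about the line
r2 = (St - Sr) / St             # fraction of original uncertainty explained
```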
(2/3)
percent variation = (range / (2 × mean)) × 100
rand()
produces a set of random numbers that are *uniformly* distributed between 0 and 1
s_y/x -- standard error of the estimate
quantifies the spread of the data around the regression line
The degree of spread of the data set can be measured in what ways? which is most common?
range, standard deviation, variance, coefficient of variation (c.v.); the standard deviation is most common
coefficient of variation
the ratio of the standard deviation to the mean; provides a *normalized measure* of the spread
a = Z\y
returns a0, a1, a2, a3, .... Make sure Z is [ones(size(x)) x x.^2 x.^3], with one column per basis function (depending on how many powers you have); the backslash operator solves the overdetermined system in the least-squares sense
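A NumPy sketch of the same idea, with np.linalg.lstsq standing in for MATLAB's backslash; the quadratic data are made up:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x + 0.5 * x ** 2           # made up: exact quadratic
# design matrix: one column per basis function [1, x, x^2]
Z = np.column_stack([np.ones_like(x), x, x ** 2])
a, *_ = np.linalg.lstsq(Z, y, rcond=None)  # a = [a0, a1, a2]
```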
degrees of freedom
sample size minus 1 (n-1)
Sr (S r) is what?
sum of the squares of the residuals around the regression line (the residual error that remains after regression). *AKA* unexplained sum of the squares
hist() function
syntax: [n, x] = hist(y, x), where n = the number of elements in each bin, x = a vector specifying the midpoints of the bins, and y = the vector being analyzed. The outputs n and x are optional: hist(y) alone produces a histogram with 10 bins (the default)
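A rough NumPy analogue for the binning step (np.histogram returns bin edges rather than midpoints; the data are made up):

```python
import numpy as np

y = np.array([1, 1, 2, 2, 2, 3, 9])          # made-up data
counts, edges = np.histogram(y, bins=4, range=(0, 12))
# counts = elements per bin; edges = bin boundaries 0, 3, 6, 9, 12
```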
arithmetic mean
the location of the center of the distribution of the data
if r^2<0 then
the model is even worse than simply picking the mean
data distribution
the shape with which the data are spread around the mean.
S t (St) is what?
the total sum of the squares around the mean for the dependent variable This is the magnitude of the residual error associated with the dependent variable prior to regression.
example using rand() (14.2) page 332 (1/3) Monte Carlo Simulation
unsure what equation they are using to calculate percent variation; it appears to be range/(2 × mean) × 100
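A minimal Python sketch of such a uniform draw, using hypothetical values (mean 60, percent variation 10%), not the book's actual example numbers:

```python
import random

random.seed(42)                          # reproducible
mean_val = 60.0                          # hypothetical mean
pct_var = 10.0                           # hypothetical percent variation
half_range = pct_var / 100.0 * mean_val  # range / 2
# uniform samples in [mean - range/2, mean + range/2]
samples = [mean_val + (2.0 * random.random() - 1.0) * half_range
           for _ in range(1000)]
```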
polyval()
used to compute a value of a polynomial from its coefficients: y = polyval(p, x), where p = the polynomial coefficients and y = the best-fit value at x. For example, >> y = polyval(a, 45) y = 641.8750
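A NumPy sketch of the same evaluation (the coefficients are made up, not the a from the book's example):

```python
import numpy as np

p = [2.0, -3.0, 1.0]            # made-up coefficients: 2x^2 - 3x + 1
y = np.polyval(p, 2.0)          # evaluate at x = 2
```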
histograms
used to observe and display the data distribution. Uses bins (sorting the data into intervals). The units of measurement are plotted on the *abscissa* (x axis) and the frequency of *occurrence* on the ordinate (y axis)
when is a minimax criterion useful?
well-suited for fitting a simple function to a complicated function page 337
mode is useful when...
when dealing with discrete or coarsely rounded data, but not for continuous data
when will standard deviation be high?
when the individual measurements are spread widely around the mean (sum of the squares of the residuals is high -- i.e. numerator is high).
values of a1, a0
a1 = (nΣxy - ΣxΣy) / (nΣx^2 - (Σx)^2) and a0 = ȳ - a1·x̄, where x̄ and ȳ are the means of x and y, respectively
A curve that minimizes the discrepancy between the data points and the curve is:
y = a0 + a1x + e, where a0 and a1 are coefficients representing the intercept and the slope, respectively, and e is the error, or residual, between the model and the observations