Exam 1 Data Analytics
uses of forecasts
accounting: cost/profit. finance: cash flow. human resources: hiring/recruiting. etc.
big data
data characterized by 4Vs
types of quantitative data
discrete and continuous
race would be _________ scale
nominal
student letter grades would be _______ scale
ordinal
data types
qualitative and quantitative
ratio scale
quantitative data measurement. values are ordered. it DOES have an absolute zero (the 0 is meaningful). doubling is meaningful. example: weight, maps
measures of dispersion
range, interquartile range, variance, standard deviation
nominal scale
used for qualitative variables. distinct categories (mutually exclusive). no inherent order or ranking. examples: eye color, gender (can't say brown eyes are better than blue)
interval scale
used for quantitative measurement. values can be ordered. difference of 1 unit has same meaning for all values of the variable. arbitrary 0 point (zero has no meaning). doubling is not meaningful. examples: temperature, IQ score
histogram
useful way to illustrate the frequency distribution of continuous data. -y axis is frequency of occurrence and x axis is the value or range of values. useful to visualize skewness in data. excel function: =FREQUENCY(data_array, bins_array) to count the number of values in a given range
how walmart uses data analytics for customer retention and acquisition
uses social media to stay acquainted with market trends
where does data come from
various sources
when to use line graphs
when your x axis is continuous, like a period of time
phase 6: deployment
*plan deployment* (develop a strategy for deployment, document procedure for deployment). *plan monitoring and maintenance* (helps to avoid unnecessarily long periods of incorrect usage results). *produce final report* (summary of project) *review project* (assess what went right and wrong, what needs to be improved).
descriptive analytics
-Highlights features and characteristics of a data set by using a summary -Typically used to convert a large amount of data into a small amount of information that is easier to understand -Answers the question of what happened -Retrospective analysis of historic data -describes or summarizes raw historical data, provides info on *what HAS happened*, helps learn from past behaviors and plan for future actions, uses data aggregation, data visualization, and data mining to provide insights. examples: sales reports, finance reports, etc.
big data facts
We generate 2.5 million terabytes of data every day. 90% of the world's data was created in the last 2 years. 80% of the world's data is unstructured. Data captured by industry doubles every 1.2 years. Today's data centers occupy an area of land equal to the size of almost 6,000 football fields.
pivot table
an interactive worksheet that allows you to summarize large amounts of data
IQR for outliers
arrange data in order. calculate Q1 and Q3. calculate IQR (Q3 - Q1). compute: -upper limit = Q3 + 1.5 x IQR -lower limit = Q1 - 1.5 x IQR *anything outside the above calculated limits is an OUTLIER*
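The IQR rule above can be sketched in Python. This is a minimal illustration, not part of the course material: the data values are made up, and `statistics.quantiles` with `method="inclusive"` is used on the assumption that it corresponds to Excel's QUARTILE.INC convention. Other quartile conventions can give slightly different limits.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_limit, upper_limit, outliers) using the 1.5 x IQR rule."""
    # Assumption: "inclusive" quartiles, intended to mirror Excel's QUARTILE.INC.
    q1, _median, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Anything outside the two limits is flagged as an outlier.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [10, 12, 13, 15, 16, 18, 19, 21, 95]  # illustrative data only
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Here Q1 = 13 and Q3 = 19, so the limits are 4 and 28 and only the value 95 is flagged.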
phase 5: evaluation
assess how well the model performs. evaluate model, evaluate results, review process
arithmetic mean
average. common measure for the mid-point of a set of values -can be badly affected by outliers. excel function: =AVERAGE(num1,num2,...)
strategic decisions with demand forecasting
based on long term forecasts (several years) example: facility and capacity planning
two dimensional table
both row and column field eg: what is the expenditure by each employee for each category?
three dimensional table
can use multiple row/column filter fields eg: what is the expenditure by each employee for each month under each category?
ordinal scale
categories with implied order. no arithmetic operations. example: car size: compact, mid-size, full-size. or strongly agree to strongly disagree.
descriptive statistics measures
central tendency measures of dispersion
phase 2: data understanding
collect and collate data, explore data with summary statistics and visualization, verify the quality with missing values, outliers, inconsistent or incorrect values
what is data?
collection of facts, observations or other information related to a particular question or problem Numbers, characters, symbols, images, etc., May contain different content May originate from various sources Meaningless unless interpreted by a human or machine
qualitative unstructured
data not in a traditional database. difficult and costly to analyze. example: customer reviews on amazon or posts on instagram and facebook.
qualitative structured
data that resides in a fixed field within a file. info with a high degree of organization. easy to store and analyze. example: a person's gender, race, ethnicity
veracity
data uncertainty managing the reliability and predictability of inherently imprecise data types.
3 key enablers of descriptive analytics
descriptive statistics data visualization summarizing data into tables
types of business analytics
descriptive, predictive, prescriptive
how to handle outliers
drop outlier records (compute mean without outlier values). cap your data (you have a lower and upper limit). assign a new value for outliers (if the outlier is due to a mistake in your data). try a transformation (percentile version of original data)
why analytics?
gain valuable insights on data generated. faster and smarter decisions. competitive advantage. high demand. skill shortage.
scatter plot
graph in which two variables are plotted along two axes. useful for revealing the presence of any correlation. example: A real estate agent surveyed a community and obtained data on house size and selling price. He wants to determine if the two variables are correlated. Use a scatter plot to visually represent the relationship between the two variables
variance
how far away the values are from the mean smaller variance=closer the scores are to mean
forecast
inference of what is likely to happen in the future -best estimate of random variable based on available information Different from estimating probability distribution of demand
qualitative data
measures of TYPES, name symbol, or number code. descriptions. can be observed. time consuming to analyze. example: the painting is blue and green, large brush strokes, smells musty
quantitative data
measures of VALUES expressed as numbers. data can be measured. less time consuming to analyze. example: the painting is 10" by 14" and costs $300.
tactical decisions with demand forecasting
medium range forecast (yearly) example: aggregate planning, inventory policy, labor needs, production scheduling
median
middle value when a variable's values are ranked in order. point that divides a distribution into two equal halves. the 50th percentile. excel function: =MEDIAN(num1,num2,...) the BEST central measure for skewed and ordinal data. If 'n' denotes the number of data values and 'n' is odd, the median is found at position (n+1)/2. If 'n' is even: find the value at position n/2, find the value at position n/2 + 1, then average the two values to get the median
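The position rule for the median can be checked with a short Python sketch. The function name and sample values are hypothetical; positions in the card are 1-based, so the code subtracts 1 when indexing.

```python
import statistics

def median_by_position(values):
    """Median via the (n+1)/2 rule for odd n and the n/2, n/2 + 1 rule for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: value at 1-based position (n+1)/2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [7, 1, 5, 3, 9]   # illustrative data
even_sample = [7, 1, 5, 3]     # illustrative data
print(median_by_position(odd_sample), median_by_position(even_sample))
```

Both results agree with `statistics.median` on the same samples.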
mode
most common or most frequently occurring value in a series of data. excel function: =MODE(num1,num2,...) -there could be NO mode in a data set or there could be more than one mode (two modes is called a bimodal series, multiple modes is called a multimodal series) -LEAST used measure of central tendency -BEST central measure for *nominal data*
data measurement scales
nominal ordinal interval ratio
what does data look like?
numbers, characters, symbols, images, etc.
pareto chart
based on Pareto's observation that 80% of income in Italy was received by 20% of the Italian population. *most of the results are determined by a small number of causes.* graph in ranking order from most frequent to least frequent.
one dimensional table
only row field eg: what is the total expenditure by each employee?
cancer stages would be ______ scale
ordinal
relationship between variables of scatterplots
positive linear: mostly going up from left to right. negative linear: mostly going down from left to right. curvilinear: not a straight line but some type of curved relationship. no relationship: no pattern to the data points
predictive analytics
predicts or forecasts future outcomes, provides info on *what MIGHT happen*, allows companies to make informed decisions, uses machine learning algorithms, a subset of artificial intelligence. examples: will MU win the next football game? price of google stock tomorrow?
prescriptive analytics
prescribes the best course of action for a given situation, provides recommendations for *what TO do*, allows companies to take advantage of the predictions, uses optimization and simulation models to provide the best course of action. example: best strategies for promotion given sales forecast, or optimal price given sales forecast
how walmart uses big data analytics
processes over 40 petabytes of data per day. more than 1M transactions per hour. collects 2.5 PB of data from 1M customers every hour. second largest in-memory platform in the world.
CRISP-DM
provides framework for devising, creating, building, testing and deploying data analytics solutions
box plot
provides a snapshot of the data. useful tool for visualizing and identifying outliers. very versatile. outliers are points that lie beyond the ends of the whiskers. max is the highest point at one end of the line, min is the lowest point at the other end. upper quartile: 25% of data is larger than this value. median: 50% of data is larger than this value. lower quartile: 25% of data is smaller than this value
to find IQR manually
put data into ascending order, find the middle value for the first half of the data (Q1) and the middle value for the second half of the data (Q3). then subtract the first value from the second value (IQR = Q3 - Q1).
height would be ________ scale
ratio
how walmart uses data for customized recommendation system
recommends products which would suffice your budget based on previous purchase history and customers like you
data analytics
science of finding patterns and obtaining insights from raw data using algorithmic or mechanical process. qualitative and quantitative data used
operational decisions with demand forecasting
short-term forecast, low variability day-to-day operations
elements of good forecasts
should be timely. should be reliable. accurate. equal chance of being over and under (should not be biased in one direction). easy to use and simply understood
range
spread or distance between the lowest and highest values of a variable. highest value minus the lowest value. excel function: =MAX(num1,num2,...) - MIN(num1,num2,...)
standard deviation
square root of variance excel function: =STDEV.S(num1,num2,...)
why do we need standard deviation?
standard deviation is expressed in the *same units* as the mean (st. dev is good for interpretation) variance is expressed in squared terms (more useful for statistics terms, and better for developing theoretical models)
common features of forecasts
they are mostly wrong more accurate at the aggregate level (i.e. sales of product categories vs SKUs) short term forecasts are more accurate than long term forecasts forecasts are dynamic and change always garbage-in-garbage-out
phase 1: business understanding
understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
when can data be meaningless
unless it is interpreted by a human or machine
80% of worlds data is ________________
unstructured
quantitative discrete
values belonging to the set are *distinct and separate* example: number of students in a class, number of iphones sold by apple
quantitative continuous
values belonging to the set can take *any* value within a *finite/infinite interval* example: persons height or weight distance traveled by a car
quartile
values that divide data set into 4 groups containing approximately equal number of observations
walmart labs analyze
what customers buy, what is trending on twitter, local events, how weather affects buying patterns. goals: improve shopping experience, increase sales and revenue
when to use pie charts
when percentages add up to 100
when to use horizontal bars
when the categories have lengthy descriptions
advantages of delphi
-*provides*: statistical feedback for each round, statistical group response, anonymous -*avoids*: dominant authoritative figures, bandwagon effect, persuasiveness -*used to be time consuming, but with electronic communication it can now be done quickly*
useful excel functions
-LARGE(range, k) - returns the kth largest number in a range -SMALL(range, k) - returns the kth smallest number in a range -PERCENTILE(range, k) - returns the 100*kth percentile in a range -PERCENTRANK(range, value) - calculates the percentage of elements in "range" below "value" -COUNTIF(range, criteria) - counts the number of cells in a range that meet the criteria -SUMIF(range, criteria, sum_range) - adds up the entry in the sum_range column for every row in which the cell in the range meets the desired criteria
forecasting software
-automatic (most expensive) -semi-automatic (moderate cost) -manual (cheap/free) user has to have more knowledge of forecasting with this
delphi method
-based on expert opinion -panel of experts that are anonymous with different background and expertise -facilitator-obtains responses and presents them
qualitative forecasting
-based on subjective opinions -strategic decisions
incorporating seasonality
-constant level forecasting methods assume demand values in various periods form a stationary time series -some cases may have seasonal series (retail sales high during christmas) -essential to incorporate seasonality in the forecasts -use constant methods in conjunction with seasonality index pertaining to each period
testing equal variance assumption
-create a scatter plot of residuals versus the predicted (fitted) values -in excel: insert-chart-scatterplot -residuals in y axis and predicted values in x-axis -if the vertical width of the scatter doesnt appear to increase or decrease across the fitted values, then we can assume that the variance in the error terms is constant
drawbacks to exponential smoothing
-difficulty in choosing alpha (practical value between 0.1 and 0.4) -lags behind continuing trend, however, method can be modified for trend and seasonal variations
drawbacks
-dominance of authoritative figures -bandwagon effect -persuasiveness of some individuals
4 qualitative methods
-executive committee consensus -survey of sales force -customer surveys -delphi method
delphi process
-facilitator sends a survey to panelists -panelists respond to facilitator, who looks for a consensus -facilitator prepares a statistical summary -summary is shared with panel members and panelists are asked to change or revise opinions until a general consensus is reached
executive committee consensus
-forecasts for products and services are determined by a group of senior execs. -final forecast communicated to other employees -top-down
key principles of descriptive analytics
-have to use real facts and real data -does not use assumptions or derived data which cloud the description -deals only with the PAST (future belongs to predictive analytics) -calculations made for a descriptive analytics report should be marked clearly
best practices for descriptive analytics
-if you can use a single number instead of a chart, do so -only include what is necessary -know your charts and when to use them ---Use line graphs when the items on the x-axis are continuous, e.g. a period of time. ---Use horizontal bars when the categories have lengthy descriptions. ---Pie chart percentages should add to 100 -remove unnecessary chart elements (like unnecessary grid lines)
mean squared error (MSE)
-larger errors get penalized more due to squaring -computed as the average of the squares of the errors of all time periods -also ignores the sign of the error
forecast error measures
-mean absolute deviation (MAD) -mean squared error (MSE) -Standard deviation of error (STD) -bias
Mean absolute deviation (MAD)
-measures the dispersion of the forecast errors (et) -computed as average of absolute value of the errors of all time periods -*ignore sign of error*
bias
-measures whether the forecast is overestimating or underestimating the actual demand over the forecast horizon -calculated as the sum of all the errors -*CANNOT ignore minus sign for bias*
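The four error measures can be computed together in a short Python sketch. It assumes the error convention used elsewhere in these cards (error = forecast - demand) and the definition of STD as the square root of MSE; the forecast and demand numbers are made up for illustration.

```python
import math

def forecast_error_measures(forecasts, demands):
    """Compute bias, MAD, MSE and STD from paired forecasts and actual demands."""
    # Error convention assumed from the cards: e_t = F_t - D_t.
    errors = [f - d for f, d in zip(forecasts, demands)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n   # sign ignored
    mse = sum(e ** 2 for e in errors) / n   # sign ignored, large errors penalized
    return {
        "bias": sum(errors),                # sign kept on purpose
        "MAD": mad,
        "MSE": mse,
        "STD": math.sqrt(mse),
    }

# Illustrative numbers only.
measures = forecast_error_measures([100, 110, 120, 130], [90, 115, 120, 140])
print(measures)
```

With these numbers the errors are 10, -5, 0 and -10, so the bias is negative (the method under-forecasts overall) even though MAD and MSE are positive.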
testing independence assumption
-plot residuals (in y axis) against any time variable (i.e. order of observation) or with any independent variable -select data tab-data analysis-regression -check residual plots to plot residuals against each of the independent variables -if the residuals are randomly scattered over the observations, then they are independent
real estate example
-real estate agent wants to list a house for sale -agent wondering how much to list the house -you estimate the value by looking at other houses in same neighborhood -you plot data between square footage and house price, and we find there is a positive linear relationship
survey of salesforce
-regional sales people provide forecast -then reviewed by upper management -bottom up
customer surveys
-scientifically designed surveys -survey results tabulated at corporate level and forecasts prepared -grass-roots approach -common for new products
uses of forecast errors
-select a method by retrospective testing on past data -select the parameters of a particular method -monitor how well the selected method is performing
testing normality assumption
-select data tab-data analysis-regression -check normal probability plots to display the normality plot in output -if the plot is approx linear, then the errors are normally distributed -if the plot is curved, the residuals are skewed and normality is not satisfied
time series forecasting
-set of values for a sequence of random variables over time -goal: given observed demand for t periods, determine forecasted demand for period t+1 -assume future is related to past
advantages to exponential smoothing
-to forecast, you only need last periods actual demand and its forecast. -reacts more quickly to changes in data compared to averaging methods
steps in forecasting process (6)
1. Determine the purpose of the forecast 2. Establish a time horizon 3. Select a forecasting technique 4. Gather and analyze data 5. Prepare the forecast 6. Monitor the forecast
computing variance
1. Take each value and subtract the mean (deviation) 2. Square each of those deviations 3. Sum up all the squared deviations (sum of squares) 4. For a sample, divide the sum of squares by n-1 (the number of observations in the sample minus 1); for a population, divide by N. excel function for sample variance: =VAR.S(num1,num2,...)
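The variance steps translate directly into Python. This is a minimal sketch with made-up data; the `sample` flag is a hypothetical parameter that switches between dividing by n-1 (sample) and by N (population).

```python
import math
import statistics

def variance_steps(values, sample=True):
    """Variance following the four steps: deviations, squares, sum, divide."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]      # step 1
    squared = [d ** 2 for d in deviations]       # step 2
    sum_of_squares = sum(squared)                # step 3
    divisor = n - 1 if sample else n             # step 4
    return sum_of_squares / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
sample_var = variance_steps(data)
population_var = variance_steps(data, sample=False)
sample_sd = math.sqrt(sample_var)  # standard deviation = square root of variance
print(sample_var, population_var, sample_sd)
```

For this data the sum of squares is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7.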
steps for descriptive analytics
1. collect relevant data 2. conduct analysis in accordance to the key principles 3. present data clearly *without* influencing the readers (descriptive leaves the data interpretation to the reader) 4. provide regular and consistent reports
steps for incorporating seasonality in forecasting
1. compute the seasonality index (SI) for each seasonal period 2. remove seasonality from the demand data 3. select an appropriate time series forecasting method 4. apply the method to the deseasonalized demand to get a deseasonalized forecast 5. add seasonality back to get the actual forecast: Ft = deseasonalized forecast x SI(t)
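The five steps can be sketched end to end in Python. This is an illustration under stated assumptions: the quarterly demand values are made up, the averaging method is chosen as the constant-level forecasting method in steps 3 and 4, and the function and variable names are hypothetical.

```python
def seasonal_forecast(demand, season_length):
    """Forecast the next full season using seasonality indices."""
    n = len(demand)
    overall_avg = sum(demand) / n

    # Step 1: SI = average demand in a seasonal period / overall average demand.
    si = []
    for s in range(season_length):
        period_values = demand[s::season_length]
        si.append((sum(period_values) / len(period_values)) / overall_avg)

    # Step 2: remove seasonality by dividing each demand by its period's SI.
    deseasonalized = [d / si[t % season_length] for t, d in enumerate(demand)]

    # Steps 3-4: averaging method (assumed here) on the deseasonalized demand.
    level = sum(deseasonalized) / n

    # Step 5: add seasonality back for each period of the next season.
    forecasts = [level * si[(n + k) % season_length] for k in range(season_length)]
    return si, forecasts

quarterly_demand = [40, 100, 160, 100, 60, 100, 140, 100]  # two illustrative years
si, forecasts = seasonal_forecast(quarterly_demand, season_length=4)
print(si, forecasts)
```

The same deseasonalized level is reused for every future period and only the seasonality index changes, which matches the multi-period rule for forecasting under seasonality.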
4 assumptions for linear regression (LINE assumptions)
1. linear: the relationship between the independent and dependent variables is linear 2. independence: errors are independent of each other 3. normality: errors are normally distributed 4. equal variance: errors have constant variance. One can use regression ONLY IF the LINE assumptions are satisfied
data captured by industry doubles every _______ years
1.2
90% of worlds data was created in the last ____ years
2
what does 1st quartile mean?
25% of your data is less than that value.
what does 3rd quartile mean?
50% of data is between Q1 value and Q3 value also 75% of the values are less than that value, and 25% of the data is more than that value.
todays data centers occupy an area of land equal to the size of almost __________ football fields
6,000
deseasonalized demand (Dbar t, written as D with a bar on top)
Dbar(t) = D(t) / SI(t), where SI(t) is the seasonality index for period t
Seasonality Index (SI)
= average demand during a seasonal period / overall average of demand for all periods
data analytics methodology
CRISP-DM
sales affected by promotions
DV: sales IV: promotions
4 Vs
Volume, Velocity, Variety, Veracity
forecast errors
accuracy depends on forecast errors. error for period t = forecast for period t - demand for period t, or et = Ft - Dt
averaging method
all past data are averaged to get the new forecast
slope of estimated regression line
b1 = sum[(xi - xbar)(yi - ybar)] / sum[(xi - xbar)^2]
intercept of estimated regression line
bo = ybar - (b1 x xbar)
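The slope and intercept formulas can be applied by hand in Python. This is a minimal sketch with made-up x and y values; Excel's SLOPE and INTERCEPT functions should return the same numbers for the same data.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # b1 = sum[(xi - xbar)(yi - ybar)] / sum[(xi - xbar)^2]
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    # b0 = ybar - b1 * xbar
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]  # illustrative independent variable
y = [2, 4, 5, 4, 5]  # illustrative dependent variable
b0, b1 = least_squares_line(x, y)

# Residuals e = y - y_hat; for a least squares line they sum to zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, sum(residuals))
```

For this data the slope is 0.6 and the intercept is 2.2, and the residuals sum to zero up to floating-point rounding.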
I or D: caffeine affects your appetite
caffeine= IV appetite= DV
systematic component
constant level. seasonality. trend (growth or decline). seasonality and trend
phase 4: modeling
core of project. select modeling technique, generate test design, build model
testing linearity assumptions
create a scatter plot of dependent variable (y) and standardized residuals -in excel: insert- chart - scatter plot -if chart appears to be linear, then the linearity assumption is satisfied -otherwise, linear regression might not be an appropriate tool
volume
data at scale terabytes to petabytes of data
variety
data in many forms structured, unstructured, text, multimedia
velocity
data in motion analysis of streaming data to enable decisions within fractions of a second
standardized residuals
divide each error by the standard deviation of all the errors
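A small Python sketch of this calculation follows. It assumes the sample standard deviation (`statistics.stdev`) for "the standard deviation of all the errors"; the standardized residuals reported by Excel's regression tool may use a slightly different scaling, so treat this as an approximation of the idea rather than a replica of Excel's output. The residual values are made up.

```python
import statistics

def standardize_residuals(residuals):
    """Divide each residual by the standard deviation of all residuals."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(residuals)
    return [e / sd for e in residuals]

residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]  # illustrative residuals
standardized = standardize_residuals(residuals)
print(standardized)
```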
estimating population parameters
drawing a sample from population of interest
sum of errors should be
equal to zero
residuals in simple regression
the error (e), also called the residual, is the observed value minus the value estimated by the regression equation: e = y - y hat
after getting slope and intercept from excel, put them into the regression line equation
estimated house price = intercept + slope x house size
extrapolation
estimating or predicting beyond the observation range -may not be appropriate to use regression in our example the range of house size is between 1050 and 3570 sq ft -for example, it may not be appropriate to predict the house price using our model when the house size is 500 sq ft
using excel for simple linear regression
excel function for slope =slope(y-range,x-range) excel function for intercept =intercept(y-range,x-range) excel function for r-squared =RSQ(y-range,x-range)
testing for significance: f-test
f test is used to determine whether a significant relationship exists between the DV and the set of all the IVs f test is referred to as the test for overall significance
interpretation of slope
for every unit increase in the independent variable (x), we expect on average the dependent variable (y) to increase (or decrease) by the slope (b1) -if b1 is positive then y increases -if b1 is negative then y decreases
uses of marketing analytics
gaining new customers and retaining customers
simple linear regression
has ONE independent variable -quantify relationship between 2 quantitative variables -predict new observations
qualitative variable with more than 2 categories
e.g. eye color: blue, brown, green. create one indicator column per category but leave one category out: if the other indicators are all 0, the observation must belong to the left-out category (called the reference category)
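The coding scheme can be sketched in Python. This is a hypothetical helper, not a library function: it builds one 0/1 indicator column for every category except the chosen reference category, and the `is_` column names are made up for illustration.

```python
def indicator_columns(values, reference):
    """Build 0/1 indicator columns for every category except the reference."""
    categories = sorted(set(values) - {reference})
    return {
        f"is_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

eye_color = ["blue", "brown", "green", "brown"]  # illustrative observations
columns = indicator_columns(eye_color, reference="green")
print(columns)
```

The third observation has 0 in both columns, which identifies it as the reference category (green) without needing a third column.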
interpretation of intercept
if the independent variable (x) is 0, then we expect on average the dependent variable (y) to be equal to the intercept (bo) -oftentimes it is nonsensical or impossible to have x=0 -do not interpret the intercept in such situations
making inferences about coefficients
in hypothesis testing for coefficients: if the p-value for coefficient b1 is <= 0.05, then the slope is significant (not equal to 0), there exists a linear relationship between x and y, and x can be used to predict y. if the p-value for coefficient b1 is > 0.05, the slope is NOT significant, and there is not enough evidence of a linear relationship
Making inferences about regression model
in hypothesis testing... if the p-value of the F-test (Significance F in the excel output) for the regression model is <= 0.05, then -reject the null hyp -the regression eq is significant for predicting the outcome variable. If the p-value of the F-test is > 0.05, -fail to reject the null hyp -there is not enough evidence to show that the regression model is significant
qualitative Independent variables
in many situations, we must work with qualitative IVs such as gender, method of payment, etc i.e. x2 might represent gender where x2=0 indicates male and x2=1 indicates female in this case, x2 is called an indicator variable
multiple regression
in most cases, more than one independent variable (x) can be used to explain the variation in dependent variable (y) statistical method to summarize linear relationship between 1 DV and many IVs is called multiple linear regression
interpreting coefficients
in multiple regression we interpret each regression coeff as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other IVs are held constant
which is independent and which is dependent: relationship between duration of sleep and test scores
independent: sleep dependent: test score
interpretation of intercept
intercept: predicted value of the dependent variable when the IV is 0 (x=0). not always meaningful to interpret the intercept. for example: predicted house price = 254820.60 + 140.65 x house size. it DOES NOT make sense to interpret the predicted house price to be $254,820.60 when the house size is 0 sq ft
when to interpret slopes
interpret slope only when -slope is significant (p value is less than 0.05), and -LINE assumptions are satisfied (because the normal probability plot has a linear trend) ***make sure to explain WHY slope is significant and WHY LINE assumptions are satisfied. if slope is not significant, then we cannot use linear regression to quantify the relationship. therefore, interpretation of slope becomes meaningless
SAT scores would be ______ scale
interval
what is a best fit line
line that minimizes the total error between observed and estimated value error (e) = observed value - estimated value e= y- y^
best fitting line
line with least sum of squared errors is called the least squares line (or best fitting line)
simple linear regression motivation
managerial decisions are often based on relationship between 2 or more variables
ybar
mean value for the dependent variable
xbar
mean value for the independent variable
last value or naive method
most recent observed value is the new forecast -given t periods of demand data, the forecast for t+1 is F(t+1) = D(t), so to forecast the value for the 7th period, you use the 6th period value
weighted moving average
most recent values are given higher weights in the moving average -multiply each value by its weight and add them up, e.g. forecast for period 7 = 0.5 x D6 + 0.3 x D5 + 0.2 x D4
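The naive and weighted moving average forecasts can be compared with a short Python sketch. The demand series is made up, and the function name and the convention of listing weights from most recent to oldest are assumptions for this illustration.

```python
def weighted_moving_average(demand, weights):
    """Forecast the next period; weights are listed most-recent first."""
    # Take the most recent len(weights) values, newest first.
    recent_first = demand[::-1][:len(weights)]
    return sum(w * d for w, d in zip(weights, recent_first))

demand = [100, 90, 105, 95, 110, 120]  # illustrative D1..D6

naive_f7 = demand[-1]                                       # F7 = D6
wma_f7 = weighted_moving_average(demand, [0.5, 0.3, 0.2])   # 0.5*D6 + 0.3*D5 + 0.2*D4
print(naive_f7, wma_f7)
```

Here the weighted forecast is 0.5 x 120 + 0.3 x 110 + 0.2 x 95 = 112, which sits between the naive forecast and a plain average of the last three values.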
blood type would be ___________ scale
nominal
components of an observation
observed demand = systematic component + random component
regression model
observed value = linear component + random component y = Bo + B1x + error y is dependent variable Bo is y-intercept B1 is slope x is independent variable error variable is random component Bo + B1x is linear component
m-period moving average
only the last 'm' values of past data are averaged for forecast example: 3-period moving average for period 7 is the average of values 6, 5, and 4
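A minimal Python sketch of the m-period moving average, using a made-up demand series. Setting m to the full length of the series gives the averaging method, where all past data are averaged.

```python
def moving_average_forecast(demand, m):
    """Forecast the next period as the average of the last m observed values."""
    if m < 1 or m > len(demand):
        raise ValueError("m must be between 1 and the number of observations")
    return sum(demand[-m:]) / m

demand = [100, 90, 105, 95, 110, 120]  # illustrative D1..D6

f7_three_period = moving_average_forecast(demand, 3)            # average of D4, D5, D6
f7_averaging = moving_average_forecast(demand, len(demand))     # averaging method
print(f7_three_period, f7_averaging)
```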
measure for best fit line
option 1: minimize sum of magnitudes (absolute values) of errors -take absolute value of errors option 2: minimize sum of squared errors -square the error values
hypothesis testing of slope
problem: is the house size (x) linearly related to the house price (y)? y hat = bo + b1x. in other words, is the coefficient of x significant enough to quantify the relationship with y? null hypothesis Ho: slope B1 is 0. alternate hypothesis Ha: slope B1 is NOT 0
scatterplots
relationship between variables: positive linear, negative linear, curvilinear, and no relationship
examples of simple linear regression
retail industry= sales based on promotions healthcare= length of stay based on blood pressure
hypothesis testing for individual significance
same as simple regression for overall significance. multiple regression: Ho: coefficient B1 is 0. Ha: coeff B1 is NOT 0. in multiple regression, we do hypothesis testing for each coeff to know its individual significance
multiple regression model
y = Bo + B1x1 + B2x2 + ... + Bnxn + error, where B1 is the coefficient of variable x1, B2 is the coefficient of variable x2, etc. all assumptions of simple regression extend to multiple regression. the regression equation minimizes the sum of squared errors. unlike simple regression, the formula for multiple regression uses matrix algebra and we rely on computer software to perform the calculations
constant level demand pattern
see screen shot
constant level with seasonality
see screen shot
constant level with seasonality and trend
see screen shot
constant level with trend
see screen shot
using excel for hypothesis testing
select data tab data analysis regression specify x range, y range, and click OK check labels output in new worksheet
obtaining errors (residuals) using excel
select data tab select data analysis select regression check "residuals" check "standardized residuals"
hypothesis testing for overall significance
simple regression: Null Hyp Ho: slope (B1) is 0 Alt H Ha: slope is NOT 0 Multiple regression: Ho: all coeffs are 0 (B1=B2=...Bn=0) Ha: one or more of the coeffs is NOT zero
coding qualitative variable to indicator variable
since qualitative variable only has 2 categories (Yes or No), we just need to create one indicator variable lets call indicator variable "Grad.Degree" and it is coded as follows: -0 if individual *does not* have a grad degree -1 if individual *does* have a grad degree
interpretation of slope
slope: estimated change in the dependent variable (y) for each unit increase in the independent variable (x). e.g. predicted house price = 254820.60 + 140.65 x house size -if the size of the house increases by 1 sq foot, we would expect the house price to increase by $140.65
standard deviation of forecast error (STD)
square root of MSE
linear regression
statistical method to summarize and study relationships between two quantitative variables -model the relationship by fitting a straight line (linear equation) to observed data
qualitative types
structured vs unstructured
variance for a population=
sum of squares/N
variance for a sample=
sum of squares / (n - 1)
phase 3: data preparation
usually takes 80-90% of the project time. collection, consolidation, cleaning, data selection, transformations.
IQR indicates
the extent to which the central 50% of values within the dataset are dispersed
testing for significance: t-test
the f-test shows overall significance; the t test is used to determine whether each of the individual IVs is significant. a separate t test is conducted for each of the IVs in the model. we refer to each of these t tests as a test for individual significance
if bias is greater than zero,
the method is overestimating
if bias is less than zero
the method is underestimating (this is preferred to over forecasting)
example of equal variance
if the spread of the residuals increases as the fitted value increases: UNEQUAL variance. if the spread of the residuals remains constant as the fitted value increases: EQUAL variance
prediction using regression
using the linear model to predict the value of the response variable for a given value of the independent variable is called prediction. estimated house price = 254820.60 + 140.65 x house size. if house size is 2100 sq ft: estimated house price = 254820.60 + 140.65 x 2100 = 550,185.60
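The prediction can be reproduced in Python with the coefficients from the house price example. The guard against extrapolation is an added illustration: it uses the observed house size range of 1050 to 3570 sq ft given in the extrapolation card, and raising an error (rather than, say, printing a warning) is a design choice made for this sketch.

```python
INTERCEPT = 254820.60   # bo from the house price example
SLOPE = 140.65          # b1 from the house price example
MIN_SIZE, MAX_SIZE = 1050, 3570  # observed range of house sizes (sq ft)

def predict_price(house_size_sqft):
    """Estimated house price = intercept + slope x house size."""
    if not MIN_SIZE <= house_size_sqft <= MAX_SIZE:
        # Outside the observed range the model may not be appropriate.
        raise ValueError("house size is outside the observed range (extrapolation)")
    return INTERCEPT + SLOPE * house_size_sqft

price = predict_price(2100)
print(round(price, 2))
```

Calling `predict_price(500)` raises an error instead of returning a number, which mirrors the advice not to predict for a 500 sq ft house with this model.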
method 2 for using excel for simple linear regression
using trend line in excel -plot a scatter plot of x and y (highlight all data in excel) -right click on scatter plot and choose "add trend line" -linear fitted line should be selected already -check "display equation on chart" -check "display r-squared value on chart"
yi
value of dependent variable for the ith observation
xi
value of independent variable for the ith observation
dependent variable
variable we wish to predict or explain denoted as 'y' and plotted in y axis in scatter plot example: house price
independent variable
variables used to predict or explain the dependent variable -often denoted as x example: house size
exponential smoothing method
variant of weighted moving average with weights decreasing exponentially -uses ALL data points -most popular method. F(t+1) = alpha x D(t) + (1 - alpha) x F(t). if alpha is not known/given: use 2/(t+1) where t is the number of known demand periods
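The smoothing recursion can be sketched in Python. One assumption is needed that the card does not state: the first forecast F1 is initialized to the first demand D1. The demand values are made up, and the default alpha of 2/(t+1) follows the rule given in the card.

```python
def exponential_smoothing(demand, alpha=None):
    """Return forecasts F1..F(t+1) using F(t+1) = alpha*D(t) + (1 - alpha)*F(t)."""
    t = len(demand)
    if alpha is None:
        alpha = 2 / (t + 1)      # rule from the card when alpha is not given
    forecasts = [demand[0]]      # assumption: initialize F1 = D1
    for d in demand:
        forecasts.append(alpha * d + (1 - alpha) * forecasts[-1])
    return forecasts

forecasts = exponential_smoothing([100, 110, 120], alpha=0.5)  # illustrative demand
print(forecasts)
```

Only the latest demand and the latest forecast are needed at each step, and with a rising demand series the forecasts trail behind the actual values, which is the lag the cards list as a drawback.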
linear equation
y=mx+b straight line on 2-D plot slope is m y-intercept is b
estimated regression equation
y^ = bo +b1x y^ is estimated value bo is intercept of the estimated regression line b1 is slope of the estimated regression line
r-squared
The amount of variability (variance) in the DV which is explained by the predictor variables (reported as a %age)
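A common way to compute r-squared is 1 - SSE/SST, where SSE is the sum of squared errors and SST is the total sum of squared deviations of y from its mean. That formula is not stated in the card, so treat it as an added assumption. The data and fitted values below are made up (they come from fitting y hat = 2.2 + 0.6x to x = 1..5).

```python
def r_squared(y, y_hat):
    """Share of the variability in y explained by the fitted values."""
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
    return 1 - sse / sst

y = [2, 4, 5, 4, 5]                  # illustrative observed values
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]    # fitted values from y_hat = 2.2 + 0.6 * x
r2 = r_squared(y, y_hat)
print(f"{r2:.0%}")
```

Here SSE = 2.4 and SST = 6, so about 60% of the variability in y is explained by the model.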
forecasting for multiple periods: under seasonality
F(t+n) = Fbar(t+1) x SI(t+n), for n = 1, 2, 3, ... here t+n and t+1 are subscripts (not multiplications) and Fbar(t+1) is the deseasonalized forecast
forecasting for multiple periods: constant level model
F(t+n) = F(t+1), for n = 1, 2, 3, ... In other words, the best forecast for t+2, t+3, etc. is F(t+1).
amount of fertilizer impacts crop yield
IV: fertilizer DV: crop yield
when to interpret coefficients
ONLY when the coefficient is significant (when the p-value for bi is less than 0.05) and the LINE assumptions are satisfied. if a coeff is not significant in multiple regression, then we must remove it from the regression equation and re-run the analysis. it cannot be interpreted
interquartile range
Q3 - Q1. Q2 = median. gives you the center 50% of values. excel function for quartiles: =QUARTILE.INC(array, quart) where quart is 1, 2, 3, or 4. IQR = QUARTILE(array,3) - QUARTILE(array,1)
forecasting methods
Qualitative: judgmental and market research -based on subjective judgments, past data is not needed -used for strategic decisions. Quantitative: statistical methods -assume the past represents the future
measures of central tendency
-mean -median -mode