Exam 1 Data Analytics

uses of forecasts

accounting: cost/profit; finance: cash flow; human resources: hiring/recruiting; etc.

big data

data characterized by 4Vs

types of quantitative data

discrete and continuous

race would be _________ scale

nominal

student letter grades would be _______ scale

ordinal

data types

qualitative and quantitative

ratio scale

quantitative data measurement. values are ordered. it DOES have an absolute zero (the 0 is meaningful). doubling is meaningful. example: weight, maps

measures of dispersion

range, interquartile range, variance, standard deviation

nominal scale

used for qualitative variables. distinct categories (mutually exclusive). no inherent order or ranking. examples: eye color, gender (can't say brown eyes are better than blue)

interval scale

used for quantitative measurement. values can be ordered. difference of 1 unit has same meaning for all values of the variable. arbitrary 0 point (zero has no meaning). doubling is not meaningful. examples: temperature, IQ score

histogram

useful way to illustrate the frequency distribution of continuous data
-y axis is the frequency of occurrence and x axis is the value or range of values
-useful to visualize skewness in data
-excel function: =FREQUENCY(data_array, bins_array) counts the number of values in each given range

how walmart uses data analytics for customer retention and acquisition

uses social media to stay acquainted with market trends

where does data come from

various sources

when to use line graphs

when your x axis is continuous, like a period of time

phase 6: deployment

*plan deployment* (develop a strategy for deployment, document procedure for deployment). *plan monitoring and maintenance* (helps to avoid unnecessarily long periods of incorrect usage results). *produce final report* (summary of project) *review project* (assess what went right and wrong, what needs to be improved).

descriptive analytics

-highlights features and characteristics of a data set by using a summary
-typically used to convert a large amount of data into a small amount of information that is easier to understand
-answers the question of *what HAS happened*: a retrospective analysis of raw historical data
-helps learn from past behaviors and plan for future actions
-uses data aggregation, data visualization, and data mining to provide insights
-examples: sales reports, finance reports, etc.

big data facts

-we generate 2.5 million terabytes every day
-90% of the world's data was created in the last 2 years
-80% of the world's data is unstructured
-data captured by industry doubles every 1.2 years
-today's data centers occupy an area of land equal to the size of almost 6,000 football fields

pivot table

an interactive worksheet that allows you to summarize large amounts of data

IQR for outliers

1. arrange data in order
2. calculate Q1 and Q3
3. calculate IQR = Q3 - Q1
4. compute: upper limit = Q3 + 1.5 x IQR and lower limit = Q1 - 1.5 x IQR
*anything outside the calculated limits is an OUTLIER*
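
The outlier rule on this card can be sketched in Python. This is only an illustration: the function name `iqr_outliers` and the sample numbers are made up, and `method="inclusive"` is chosen on the assumption that it mirrors the interpolation used by Excel's QUARTILE.INC.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_limit, upper_limit, outliers) using the 1.5 x IQR rule."""
    # quantiles() sorts the data internally and returns Q1, Q2, Q3 for n=4
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # anything outside the two limits is flagged as an outlier
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [10, 12, 13, 14, 15, 16, 18, 19, 60]  # hypothetical sample
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

For this sample Q1 is 13 and Q3 is 18, so the limits are 5.5 and 25.5 and only the value 60 falls outside them.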

phase 5: evaluation

assess how well the model performs. evaluate model, evaluate results, review process

arithmetic mean

the average; a common measure for the mid-point of a set of values
-can be badly affected by outliers
-excel function: =AVERAGE(num1,num2,...)

strategic decisions with demand forecasting

based on long term forecasts (several years) example: facility and capacity planning

two dimensional table

both row and column field eg: what is the expenditure by each employee for each category?

three dimensional table

can use multiple row/column filter fields eg: what is the expenditure by each employee for each month under each category?

ordinal scale

categories with implied order. no arithmetic operations. examples: car size (compact, mid-size, full-size), or a scale from strongly agree to strongly disagree.

descriptive statistics measures

measures of central tendency and measures of dispersion

phase 2: data understanding

collect and collate data, explore data with summary statistics and visualization, verify the quality with missing values, outliers, inconsistent or incorrect values

what is data?

collection of facts, observations, or other information related to a particular question or problem
-numbers, characters, symbols, images, etc.
-may contain different content
-may originate from various sources
-meaningless unless interpreted by a human or machine

qualitative unstructured

data not in a traditional database. difficult and costly to analyze. example: customer reviews on amazon or posts on instagram and facebook.

qualitative structured

data that resides in a fixed field within a file. info with a high degree of organization. easy to store and analyze. example: a person's gender, race, ethnicity

veracity

data uncertainty: managing the reliability and predictability of inherently imprecise data types.

3 key enablers of descriptive analytics

descriptive statistics, data visualization, summarizing data into tables

types of business analytics

descriptive, predictive, prescriptive

how to handle outliers

-drop outlier records (compute mean without outlier values)
-cap your data (you have a lower and upper limit)
-assign a new value for outliers (if the outlier is due to a mistake in your data)
-try a transformation (percentile version of original data)

why analytics?

gain valuable insights on data generated. faster and smarter decisions. competitive advantage. high demand. skill shortage.

scatter plot

graph in which two variables are plotted along two axes; useful for revealing the presence of any correlation. example: a real estate agent surveyed a community and obtained data on house size and selling price. He wants to determine if the two variables are correlated, so he uses a scatter plot to visually represent the relationship between the two variables

variance

how far away the values are from the mean. smaller variance = scores are closer to the mean

forecast

inference of what is likely to happen in the future
-best estimate of a random variable based on available information
-different from estimating the probability distribution of demand

qualitative data

measures of TYPES, name symbol, or number code. descriptions. can be observed. time consuming to analyze. example: the painting is blue and green, large brush strokes, smells musty

quantitative data

measures of VALUES expressed as numbers. data can be measured. less time consuming to analyze. example: the painting is 10" by 14" and costs $300.

tactical decisions with demand forecasting

medium range forecast (yearly) example: aggregate planning, inventory policy, labor needs, production scheduling

median

middle value when a variable's values are ranked in order; the point that divides a distribution into two equal halves (the 50th percentile)
-excel function: =MEDIAN(num1,num2,...)
-the BEST central measure for skewed and ordinal data
-if n (the number of data values) is odd, the median is the value at position (n+1)/2
-if n is even, find the values at positions n/2 and n/2 + 1 and average them to get the median
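
A minimal sketch of the position rule for the median, assuming the data fits in a Python list; `median_by_position` is a made-up name used only for illustration.

```python
def median_by_position(values):
    """Median found by position in the ranked data, as described on the card."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the value at 1-based position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_median = median_by_position([7, 1, 3])       # ranked: 1, 3, 7
even_median = median_by_position([4, 1, 3, 2])   # ranked: 1, 2, 3, 4
print(odd_median, even_median)
```

Python lists are 0-based, which is why each 1-based position from the card is shifted down by one in the code.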

mode

most common or most frequently occurring value in a series of data
-excel function: =MODE(num1,num2,...)
-there could be NO mode in a data set, or there could be more than one mode (two modes is called a bimodal series, multiple modes is called a multimodal series)
-LEAST used measure of central tendency
-BEST central measure for *nominal data*

data measurement scales

nominal ordinal interval ratio

what does data look like?

numbers, characters, symbols, images, etc.

pareto chart

Pareto observed that 80% of income in Italy was received by 20% of the Italian population. *most of the results are determined by a small number of causes.* the graph ranks categories in order from most frequent to least frequent.

one dimensional table

only row field eg: what is the total expenditure by each employee?

cancer stages would be ______ scale

ordinal

relationship between variables of scatterplots

-positive linear: mostly going up from left to right
-negative linear: mostly going down from left to right
-curvilinear: not a straight line but some type of curve relationship
-no relationship: no pattern to data points

predictive analytics

predicts or forecasts future outcomes, provides info on *what MIGHT happen*, allows companies to make informed decisions, uses machine learning algorithms, a subset of artificial intelligence. examples: will MU win the next football game? price of google stock tomorrow?

prescriptive analytics

prescribes the best course of action for a given situation, provides recommendations for *what TO do*, allows companies to take advantage of the predictions, uses optimization and simulation models to provide the best course of action. example: best strategies for promotion given sales forecast, or optimal price given sales forecast

how walmart uses big data analytics

-processes over 40 petabytes of data per day
-more than 1M transactions per hour
-collects 2.5 PB of data from 1M customers every hour
-second largest in-memory platform in the world

CRISP-DM

provides framework for devising, creating, building, testing and deploying data analytics solutions

box plot

provides a snapshot of data; a very versatile tool for visualizing and identifying outliers
-outliers are points that are not included in the plot
-max is the highest point at one end, min is the lowest point at the other end of the line
-upper quartile: 25% of data is larger than this value
-median: 50% of data is larger than this value
-lower quartile: 25% of data is smaller than this value

to find IQR manually

put data into ascending order, find the middle value of the first half of the data (Q1) and the middle value of the second half (Q3), then subtract the first value from the second: IQR = Q3 - Q1.

height would be ________ scale

ratio

how walmart uses data for customized recommendation system

recommends products which would suffice your budget based on previous purchase history and customers like you

data analytics

science of finding patterns and obtaining insights from raw data using algorithmic or mechanical process. qualitative and quantitative data used

operational decisions with demand forecasting

short-term forecast, low variability day-to-day operations

elements of good forecasts

-should be timely
-should be reliable
-should be accurate
-equal chance of being over and under (should not be biased in one direction)
-easy to use and simply understood

range

spread or distance between the lowest and highest values of a variable: highest value minus the lowest value. excel function: =MAX(num1,num2,...) - MIN(num1,num2,...)

standard deviation

square root of variance excel function: =STDEV.S(num1,num2,...)

why do we need standard deviation?

standard deviation is expressed in the *same units* as the mean (st. dev is good for interpretation) variance is expressed in squared terms (more useful for statistics terms, and better for developing theoretical models)

common features of forecasts

-they are mostly wrong
-more accurate at the aggregate level (i.e. sales of product categories vs SKUs)
-short term forecasts are more accurate than long term forecasts
-forecasts are dynamic and always change
-garbage-in-garbage-out

phase 1: business understanding

understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives

when can data be meaningless

unless it is interpreted by a human or machine

80% of worlds data is ________________

unstructured

quantitative discrete

values belonging to the set are *distinct and separate* example: number of students in a class, number of iphones sold by apple

quantitative continuous

values belonging to the set can take *any* value within a *finite/infinite interval* example: persons height or weight distance traveled by a car

quartile

values that divide data set into 4 groups containing approximately equal number of observations

walmart labs analyze

what customers buy, what is trending on twitter, local events, and how weather affects buying patterns, in order to improve the shopping experience and increase sales and revenue

when to use pie charts

when percentages add up to 100

when to use horizontal bars

when the categories have lengthy descriptions

advantages of delphi

-*provides*: statistical feedback for each round, statistical group response, anonymous -*avoids*: dominant authoritative figures, bandwagon effect, persuasiveness -*used to be time consuming, but with electronic communication it can now be done quickly*

useful excel functions

-LARGE(range, k): returns the kth largest number in a range
-SMALL(range, k): returns the kth smallest number in a range
-PERCENTILE(range, k): returns the 100*kth percentile in a range
-PERCENTRANK(range, value): calculates the percentage of elements in "range" below "value"
-COUNTIF(range, criteria): counts the number of cells in a range that meet the criteria
-SUMIF(range, criteria, sum_range): adds up the entries in the sum_range column for every row in which the cell in the range meets the desired criteria

forecasting software

-automatic (most expensive) -semi-automatic (moderate cost) -manual (cheap/free) user has to have more knowledge of forecasting with this

delphi method

-based on expert opinion -panel of experts that are anonymous with different background and expertise -facilitator-obtains responses and presents them

qualitative forecasting

-based on subjective opinions -strategic decisions

incorporating seasonality

-constant level forecasting methods assume demand values in various periods form a stationary time series -some cases may have seasonal series (retail sales high during christmas) -essential to incorporate seasonality in the forecasts -use constant methods in conjunction with seasonality index pertaining to each period

testing equal variance assumption

-create a scatter plot of residuals versus the predicted (fitted) values
-in excel: insert-chart-scatterplot
-residuals in y axis and predicted values in x-axis
-if the vertical width of the scatter doesn't appear to increase or decrease across the fitted values, then we can assume that the variance in the error terms is constant

drawbacks to exponential smoothing

-difficulty in choosing alpha (practical value between 0.1 and 0.4) -lags behind continuing trend, however, method can be modified for trend and seasonal variations

drawbacks

-dominance of authoritative figures -bandwagon effect -persuasiveness of some individuals

4 qualitative methods

-executive committee consensus -survey of sales force -customer surveys -delphi method

delphi process

-facilitator sends a survey to panelists
-panelists respond to facilitator and look for a consensus
-facilitator prepares a statistical summary
-summary is shared with panel members, and panelists are asked to change or revise opinions until a general consensus is reached

executive committee consensus

-forecasts for products and services are determined by a group of senior execs. -final forecast communicated to other employees -top-down

key principles of data analytics

-have to use real facts and real data -does not use assumptions or derived data which cloud the description -deals only with the PAST (future belongs to predictive analytics) -calculations made for a descriptive analytics report should be marked clearly

best practices for descriptive analytics

-if you can use a single number instead of a chart, do so
-only include what is necessary
-know your charts and when to use them
---use line graphs when the items on the x-axis are continuous, e.g. a period of time
---use horizontal bars when the categories have lengthy descriptions
---pie chart percentages should add to 100
-remove unnecessary chart elements (like unnecessary grid lines)

mean squared error (MSE)

-larger errors get penalized more due to squaring
-computed as the average of the squared errors over all time periods
-also ignores the sign of the error

forecast error measures

-mean absolute deviation (MAD) -mean squared error (MSE) -Standard deviation of error (STD) -bias

Mean absolute deviation (MAD)

-measures the dispersion of the forecast errors (et) -computed as average of absolute value of the errors of all time periods -*ignore sign of error*

bias

-measures whether the forecast is overestimating or underestimating the actual demand over the forecast horizon -calculated as the sum of all the errors -*CANNOT ignore minus sign for bias*
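
The error measures in this set (MAD, MSE, STD, bias) can be sketched together in Python. The forecast and demand numbers are invented for illustration, and the error follows the set's convention et = Ft - Dt.

```python
import math

def forecast_error_measures(forecasts, demands):
    """Return MAD, MSE, STD and bias using e_t = F_t - D_t."""
    errors = [f - d for f, d in zip(forecasts, demands)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n   # sign of the error is ignored
    mse = sum(e ** 2 for e in errors) / n   # larger errors are penalized more
    std = math.sqrt(mse)                    # square root of MSE
    bias = sum(errors)                      # sign is kept on purpose
    return {"MAD": mad, "MSE": mse, "STD": std, "bias": bias}

forecasts = [100, 110, 120, 130]  # hypothetical forecasts F_t
demands = [90, 115, 120, 140]     # hypothetical observed demand D_t
measures = forecast_error_measures(forecasts, demands)
print(measures)
```

Here the errors are 10, -5, 0 and -10, so the bias is negative, which under this sign convention means the method is underestimating demand.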

testing independence assumption

-plot residuals (in y axis) against any time variable (i.e. order of observation) or with any independent variable -select data tab-data analysis-regression -check residual plots to plot residuals against each of the independent variables -if the residuals are randomly scattered over the observations, then they are independent

real estate example

-real estate agent wants to list a house for sale -agent wondering how much to list the house -you estimate the value by looking at other houses in same neighborhood -you plot data between square footage and house price, and we find there is a positive linear relationship

survey of salesforce

-regional sales people provide forecast -then reviewed by upper management -bottom up

customer surveys

-scientifically designed surveys -survey results tabulated at corporate level and forecasts prepared -grass-roots approach -common for new products

uses of forecast errors

-select a method by retrospective testing on past data -select the parameters of a particular method -monitor how well the selected method is performing

testing normality assumption

-select data tab-data analysis-regression
-check normal probability plots to display the normality plot in output
-if the plot is approx linear, then the errors are normally distributed
-if the plot is curved, the residuals are skewed and normality is not satisfied

time series forecasting

-set of values for a sequence of random variables over time
-goal: given observed demand for t periods, determine forecasted demand for period t+1
-assume future is related to past

advantages to exponential smoothing

-to forecast, you only need last periods actual demand and its forecast. -reacts more quickly to changes in data compared to averaging methods

steps in forecasting process (6)

1. Determine the purpose of the forecast 2. Establish a time horizon 3. Select a forecasting technique 4. Gather and analyze data 5. Prepare the forecast 6. Monitor the forecast

computing variance

1. Take each value and subtract the mean (deviation)
2. Square each of those deviations
3. Sum up all the squared deviations (sum of squares)
4. Divide the sum of squares by n-1 (the number of observations in the sample minus 1)
excel function: =VAR.S(num1,num2,...)
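
The four steps can be followed line by line in a short Python sketch. The function name and the sample scores are made up, and the result is compared against `statistics.variance`, which also divides by n-1.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance computed with the four steps on the card."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]   # step 1: subtract the mean
    squared = [d ** 2 for d in deviations]    # step 2: square each deviation
    sum_of_squares = sum(squared)             # step 3: sum of squares
    return sum_of_squares / (n - 1)           # step 4: divide by n - 1

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
variance = sample_variance(scores)
std_dev = math.sqrt(variance)      # standard deviation = square root of variance
print(variance, std_dev, statistics.variance(scores))
```

For these scores the mean is 5 and the sum of squares is 32, so the sample variance is 32/7; dividing by N = 8 instead would give the population variance of 4.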

steps for descriptive analytics

1. collect relevant data 2. conduct analysis in accordance to the key principles 3. present data clearly *without* influencing the readers (descriptive leaves the data interpretation to the reader) 4. provide regular and consistent reports

steps for incorporating seasonality in forecasting

1. compute the seasonality index (SI) for each seasonal period
2. remove seasonality from the demand data
3. select an appropriate time series forecasting method
4. apply the method to the deseasonalized demand to get a deseasonalized forecast
5. put seasonality back to get the actual forecast: Ft = deseasonalized forecast x SI(t)
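
A minimal sketch of the five steps, assuming quarterly data (4 seasonal periods) and using the simple averaging method as the forecasting method in step 3. The demand values and function name are invented for illustration.

```python
def seasonal_indices(demand, season_length):
    """SI = average demand in a seasonal period / overall average demand."""
    overall_average = sum(demand) / len(demand)
    indices = []
    for s in range(season_length):
        same_season = demand[s::season_length]  # e.g. every first quarter
        indices.append((sum(same_season) / len(same_season)) / overall_average)
    return indices

SEASONS = 4
demand = [80, 100, 120, 100, 90, 110, 130, 110]  # two hypothetical years

si = seasonal_indices(demand, SEASONS)                               # step 1
deseasonalized = [d / si[t % SEASONS] for t, d in enumerate(demand)] # step 2
# steps 3 and 4: averaging method applied to the deseasonalized demand
deseasonalized_forecast = sum(deseasonalized) / len(deseasonalized)
# step 5: multiply by each period's index to put seasonality back
forecast = [deseasonalized_forecast * si[s] for s in range(SEASONS)]
print([round(f, 2) for f in forecast])
```

Any constant-level method (moving average, exponential smoothing) could replace the averaging step; only the deseasonalize and reseasonalize steps are specific to seasonality.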

4 assumptions for linear regression (LINE assumptions)

1. linear: the relationship between the independent and dependent variables is linear
2. independence: errors are independent of each other
3. normality: errors are normally distributed
4. equal variance: errors have constant variance
One can use regression ONLY IF the LINE assumptions are satisfied

data captured by industry doubles every _______ years

1.2

90% of world's data was created in the last ____ years

2

what does 1st quartile mean?

25% of your data is less than that value.

what does 3rd quartile mean?

50% of data is between Q1 value and Q3 value also 75% of the values are less than that value, and 25% of the data is more than that value.

todays data centers occupy an area of land equal to the size of almost __________ football fields

6,000

deseasonalized demand (Dt with a bar on top of D)

= Dt / SI(t), where SI(t) is the seasonality index for period t

Seasonality Index (SI)

= average demand during a seasonal period / overall average of demand for all periods

data analytics methodology

CRISP-DM

sales affected by promotions

DV: sales IV: promotions

4 Vs

Volume, Velocity, Variety, Veracity

forecast errors

accuracy depends on forecast errors. error for period t = forecast for period t - demand for period t, or et = Ft - Dt

averaging method

all past data are averaged to get the new forecast

slope of estimated regression line

b1 = sum[(xi - xbar)(yi - ybar)] / sum[(xi - xbar)^2]

intercept of estimated regression line

bo= ybar-b1xbar
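
The slope and intercept formulas can be checked with a small Python sketch. The function name `least_squares` and the data points are made up for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # numerator: sum of (xi - xbar)(yi - ybar)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    # denominator: sum of (xi - xbar) squared
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 2, 3, 4, 5]   # hypothetical independent variable
y = [3, 5, 7, 9, 11]  # hypothetical dependent variable (exactly y = 1 + 2x)
b0, b1 = least_squares(x, y)
print(b0, b1)
```

Because the sample points lie exactly on y = 1 + 2x, the sketch recovers an intercept of 1 and a slope of 2, the same values Excel's INTERCEPT and SLOPE functions would be expected to give.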

I or D: caffeine affects your appetite

caffeine= IV appetite= DV

systematic component

-constant level
-seasonality
-trend (growth or decline)
-seasonality and trend

phase 4: modeling

core of project. select modeling technique, generate test design, build model

testing linearity assumptions

create a scatter plot of dependent variable (y) and standardized residuals -in excel: insert- chart - scatter plot -if chart appears to be linear, then the linearity assumption is satisfied -otherwise, linear regression might not be an appropriate tool

volume

data at scale terabytes to petabytes of data

variety

data in many forms structured, unstructured, text, multimedia

velocity

data in motion analysis of streaming data to enable decisions within fractions of a second

standardized residuals

divide each error by the standard deviation of all the errors
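
A short sketch of this definition in Python. It assumes the sample standard deviation (`statistics.stdev`) is the one intended; regression software may scale standardized residuals slightly differently. The observed and predicted values are invented.

```python
import statistics

def standardized_residuals(observed, predicted):
    """Divide each error by the standard deviation of all the errors."""
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    spread = statistics.stdev(residuals)  # assumption: sample standard deviation
    return [e / spread for e in residuals]

observed = [12, 9, 9, 10]     # hypothetical y values
predicted = [10, 10, 10, 10]  # hypothetical y_hat values
std_res = standardized_residuals(observed, predicted)
print([round(r, 3) for r in std_res])
```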

estimating population parameters

drawing a sample from population of interest

sum of errors should be

equal to zero

residuals in simple regression

error (e) (residuals) is the difference between the observed value - the estimated value in the regression equation, e= y- y hat

after getting slope and intercept from excel, put them into the regression line equation

estimated house price = intercept + slope x house size

extrapolation

estimating or predicting beyond the observation range
-may not be appropriate to use regression
-in our example the range of house size is between 1050 and 3570 sq ft
-for example, it may not be appropriate to predict the house price using our model when the house size is 500 sq ft

using excel for simple linear regression

excel function for slope =slope(y-range,x-range) excel function for intercept =intercept(y-range,x-range) excel function for r-squared =RSQ(y-range,x-range)

testing for significance: f-test

f test is used to determine whether a significant relationship exists between the DV and the set of all the IVs f test is referred to as the test for overall significance

interpretation of slope

for every unit increase in the independent variable (x), we expect on average the dependent variable (y) to increase (or decrease) by the slope (b1)
-if b1 is positive then y increases
-if b1 is negative then y decreases

uses of marketing analytics

gaining new customers and retaining customers

simple linear regression

has ONE independent variable -quantify relationship between 2 quantitative variables -predict new observations

qualitative variable with more than 2 categories

e.g. eye color: blue, brown, green. create an indicator column for all but one category; the left-out category (called the reference column) can be inferred when all the other indicator columns are 0
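
A small sketch of coding a three-category variable with one category left out as the reference. The helper name `indicator_columns`, the choice of green as the reference, and the sample values are all assumptions made for illustration.

```python
def indicator_columns(values, reference):
    """Build one 0/1 column per category, skipping the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

eye_colors = ["blue", "brown", "green", "brown"]  # hypothetical observations
columns = indicator_columns(eye_colors, reference="green")
print(columns)
```

A row with 0 in both the blue and brown columns must be green, which is why the third column is not needed.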

interpretation of intercept

if the independent variable (x) is 0, then we expect on average the dependent variable (y) to be equal to the intercept (bo)
-oftentimes it is nonsensical or impossible to have x=0
-do not interpret the intercept in such situations

making inferences about coefficients

in hypothesis testing for coefficients:
-if the p-value for coefficient b1 is <= 0.05, then the slope is significant (not equal to 0), there exists a linear relationship between x and y, and x can be used to predict y
-if the p-value for coefficient b1 is > 0.05, the slope is NOT significant, and a linear relationship does NOT exist

Making inferences about regression model

in hypothesis testing...
-if the p-value of the f-test for the regression model is <= 0.05, then reject the null hyp: the regression eq is significant for predicting the outcome variable
-if the p-value of the f-test for the regression model is > 0.05, fail to reject the null hyp: there is not enough evidence to show that the regression model is significant

qualitative Independent variables

in many situations, we must work with qualitative IVs such as gender, method of payment, etc i.e. x2 might represent gender where x2=0 indicates male and x2=1 indicates female in this case, x2 is called an indicator variable

multiple regression

in most cases, more than one independent variable (x) can be used to explain the variation in dependent variable (y) statistical method to summarize linear relationship between 1 DV and many IVs is called multiple linear regression

interpreting coefficients

in multiple regression we interpret each regression coeff as follows: bi represents an estimate of the change in y corresponding to a 1-unit increase in xi when all other IVs are held constant

which is independent and which is dependent: relationship between duration of sleep and test scores

independent: sleep dependent: test score

interpretation of intercept

intercept: predicted value of the dependent variable when the IV is 0 (x=0). not always meaningful to interpret the intercept. for example: predicted house price = 254820.60 + 140.65 x house size. it DOES NOT make sense to interpret the predicted house price to be $254,820.60 when the house size is 0 sq ft

when to interpret slopes

interpret slope only when
-slope is significant (p value is less than 0.05), and
-line assumptions are satisfied (for example, because the normal probability plot has a linear trend)
***make sure to explain WHY the slope is significant and WHY the line assumptions are satisfied
if the slope is not significant, then we cannot use linear regression to quantify the relationship. therefore, interpretation of the slope becomes meaningless

SAT scores would be ______ scale

interval

what is a best fit line

line that minimizes the total error between observed and estimated values. error (e) = observed value - estimated value, or e = y - y^

best fitting line

line with least sum of squared errors is called the least squares line (or best fitting line)

simple linear regression motivation

managerial decisions are often based on relationship between 2 or more variables

ybar

mean value for the dependent variable

xbar

mean value for the independent variable

last value or naive method

most recent observed value is the new forecast
-given t periods of demand data, the forecast for period t+1 is Ft+1 = Dt (t+1 is a subscript)
-so to forecast the value for the 7th period, you use the 6th period value

weighted moving average

most recent values are given higher weights in the moving average
-multiply each value by its weight and add the results, e.g. 0.5 x D6 + 0.3 x D5 + 0.2 x D4
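
The weighted moving average from this card, sketched in Python with the same weights (0.5, 0.3, 0.2). The demand values D1 to D6 and the function name are made up.

```python
def weighted_moving_average(demand, weights):
    """Forecast the next period; weights are listed from most recent to oldest."""
    recent_first = demand[::-1][:len(weights)]  # D6, D5, D4 for three weights
    return sum(w * d for w, d in zip(weights, recent_first))

demand = [10, 20, 30, 40, 50, 60]  # hypothetical D1..D6
weights = [0.5, 0.3, 0.2]          # most recent value gets the highest weight
forecast_wma = weighted_moving_average(demand, weights)
print(forecast_wma)
```

With these numbers the forecast is 0.5 x 60 + 0.3 x 50 + 0.2 x 40 = 53.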

blood type would be ___________ scale

nominal

components of an observation

observed demand = systematic component + random component

regression model

observed value = linear component + random component
y = Bo + B1x + error
-y is the dependent variable
-Bo is the y-intercept
-B1 is the slope
-x is the independent variable
-the error variable is the random component
-Bo + B1x is the linear component

m-period moving average

only the last 'm' values of past data are averaged for forecast example: 3-period moving average for period 7 is the average of values 6, 5, and 4
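
A minimal sketch of the m-period moving average, using invented demand for periods 1 to 6 to forecast period 7.

```python
def moving_average_forecast(demand, m):
    """Average only the last m observed values to forecast the next period."""
    last_m = demand[-m:]
    return sum(last_m) / m

demand = [10, 20, 30, 40, 50, 60]  # hypothetical demand for periods 1..6
forecast_ma = moving_average_forecast(demand, m=3)
print(forecast_ma)  # average of periods 4, 5 and 6
```

Setting m to the full length of the data gives the averaging method, and m = 1 gives the last value method.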

measure for best fit line

option 1: minimize sum of magnitudes (absolute values) of errors -take absolute value of errors option 2: minimize sum of squared errors -square the error values

hypothesis testing of slope

problem: is the house size (x) linearly related to the house price (y)? y hat = bo + b1x. in other words, is the coefficient of x significant to quantify the relationship with y?
-null hypothesis Ho: slope B1 is 0
-alternate hypothesis Ha: slope B1 is NOT 0

scatterplots

relationship between variables: positive linear, negative linear, curvilinear, and no relationship

examples of simple linear regression

retail industry= sales based on promotions healthcare= length of stay based on blood pressure

hypothesis testing for individual significance

simple regression: same as the test for overall significance. multiple regression: Ho: coefficient B1 is 0; Ha: coeff B1 is NOT 0. in multiple regression, we do hypothesis testing for each coeff to know its individual significance

multiple regression model

y = Bo + B1x1 + B2x2 + ... + Bnxn + error, where B1 is the coefficient of variable x1, B2 is the coefficient of variable x2, etc.
-all assumptions of simple regression extend to multiple regression
-regression line minimizes the sum of squared errors
-unlike simple regression, the formula for multiple regression uses matrix algebra and we rely on computer software to perform calculations

constant level demand pattern

see screen shot

constant level with seasonality

see screen shot

constant level with seasonality and trend

see screen shot

constant level with trend

see screen shot

using excel for hypothesis testing

1. select data tab, data analysis, regression
2. specify x range and y range
3. check labels
4. choose output in new worksheet and click OK

obtaining errors (residuals) using excel

1. select data tab
2. select data analysis
3. select regression
4. check "residuals"
5. check "standardized residuals"

hypothesis testing for overall significance

simple regression:
-null hyp Ho: slope (B1) is 0
-alt hyp Ha: slope is NOT 0
multiple regression:
-Ho: all coeffs are 0 (B1=B2=...=Bn=0)
-Ha: one or more of the coeffs is NOT zero

coding qualitative variable to indicator variable

since qualitative variable only has 2 categories (Yes or No), we just need to create one indicator variable lets call indicator variable "Grad.Degree" and it is coded as follows: -0 if individual *does not* have a grad degree -1 if individual *does* have a grad degree
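
The coding rule on this card, sketched in Python. The Yes/No responses are invented, and `code_grad_degree` is a hypothetical helper name.

```python
def code_grad_degree(answers):
    """Code Yes/No answers as the indicator variable Grad.Degree (1 or 0)."""
    # 1 if the individual does have a grad degree, 0 if not
    return [1 if a.strip().lower() == "yes" else 0 for a in answers]

responses = ["Yes", "No", "No", "Yes"]  # hypothetical survey answers
grad_degree = code_grad_degree(responses)
print(grad_degree)
```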

interpretation of slope

slope: estimated change in the dependent variable (y) for each unit increase in the independent variable (x). e.g. predicted house price = 254820.60 + 140.65 x house size: if the size of the house increases by 1 sq foot, we would expect the house price to increase by $140.65

standard deviation of forecast error (STD)

square root of MSE

linear regression

statistical method to summarize and study relationships between two quantitative variables
-model the relationship by fitting a straight line (linear equation) to observed data

qualitative types

structured vs unstructured

variance for a population=

sum of squares/N

variance for a sample=

sum of squares/(n-1)

phase 3: data preparation

takes usually over 80-90% of the project time. collection, consolidation, cleaning, data selection, transformations.

IQR indicates

the extent to which the central 50% of values within the dataset are dispersed

testing for significance: t-test

the f-test shows overall significance; the t test is used to determine whether each of the individual IVs is significant. a separate t test is conducted for each of the IVs in the model. we refer to each of these t tests as a test for individual significance

if bias is greater than zero,

the method is overestimating

if bias is less than zero

the method is underestimating (this is preferred to over forecasting)

example of equal variance

if the spread of the residuals increases as the fitted value increases: UNEQUAL variance. if the spread of the residuals remains constant as the fitted value increases: EQUAL variance

prediction using regression

using the linear model to predict the value of the response variable for a given value of the independent variable is called prediction. estimated house price = 254820.60 + 140.65 x house size. if house size is 2100 sq ft: estimated house price = 254820.60 + 140.65 x 2100 = 550,185.60
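
The same prediction can be sketched in Python using the coefficients from this card. The range check uses the 1050 to 3570 sq ft observation range given on the extrapolation card; the function name and the choice to raise an error outside that range are assumptions for illustration.

```python
INTERCEPT = 254820.60   # bo from the estimated regression equation
SLOPE = 140.65          # b1: price change per additional sq ft
MIN_SIZE, MAX_SIZE = 1050, 3570  # observed range of house sizes (sq ft)

def predict_price(house_size_sqft):
    """Estimated house price = intercept + slope x house size."""
    if not MIN_SIZE <= house_size_sqft <= MAX_SIZE:
        # predicting outside the observed range would be extrapolation
        raise ValueError("house size is outside the observed range")
    return INTERCEPT + SLOPE * house_size_sqft

price = predict_price(2100)
print(f"{price:,.2f}")  # 550,185.60
```

Calling `predict_price(500)` raises an error instead of returning a number, reflecting that the model may not be appropriate for a 500 sq ft house.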

method 2 for using excel for simple linear regression

using trend line in excel
-plot a scatter plot of x and y (highlight all data in excel)
-right click on the scatter plot and choose "add trend line"
-linear fitted line should be selected already
-check "display equation on chart"
-check "display r-squared value on chart"

yi

value of the dependent variable for the ith observation

xi

value of the independent variable for the ith observation

dependent variable

variable we wish to predict or explain denoted as 'y' and plotted in y axis in scatter plot example: house price

independent variable

variables used to predict or explain the dependent variable -often denoted as x example: house size

exponential smoothing method

variant of weighted moving average with weights decreasing exponentially
-uses ALL data points
-most popular method
-Ft+1 = alpha x Dt + (1 - alpha) x Ft (t+1 and t are subscripts)
-if alpha is not known/given: use 2/(t+1) where t is the number of known demand periods
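
A minimal sketch of the smoothing formula. The set does not say how to start the recursion, so this example assumes the first forecast equals the first observed demand; the demand values and alpha are invented.

```python
def exponential_smoothing(demand, alpha, initial_forecast=None):
    """Apply F(t+1) = alpha * D(t) + (1 - alpha) * F(t) across all periods.

    Returns the forecasts for periods 1..t+1; the last entry is the
    forecast for the next, not yet observed, period.
    """
    # assumption: seed the first forecast with the first observed demand
    forecast = demand[0] if initial_forecast is None else initial_forecast
    forecasts = [forecast]
    for d in demand:
        forecast = alpha * d + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

demand = [100, 120, 110]  # hypothetical D1..D3
forecasts = exponential_smoothing(demand, alpha=0.5)
print(forecasts)  # F1..F4
```

Only the previous period's demand and forecast are needed at each step, which is the advantage listed for this method elsewhere in the set.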

linear equation

y=mx+b straight line on 2-D plot slope is m y-intercept is b

estimated regression equation

y^ = bo +b1x y^ is estimated value bo is intercept of the estimated regression line b1 is slope of the estimated regression line

r-squared

The amount of variability (variance) in the DV which is explained by the predictor variables (reported as a %age)
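
One common way to compute r-squared for a fitted line is 1 - SSE/SST, where SSE is the sum of squared errors and SST is the total sum of squares around ybar. The sketch below fits a least squares line to invented data and then applies that ratio; the function name is hypothetical.

```python
def r_squared(x, y):
    """Fit the least squares line, then return 1 - SSE / SST."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    # SSE: variation left unexplained by the line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: total variation of y around its mean
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]  # hypothetical independent variable
y = [2, 4, 5, 4, 5]  # hypothetical dependent variable
r2 = r_squared(x, y)
print(f"{r2:.0%} of the variability in y is explained by x")
```

For this data SSE is 2.4 and SST is 6, so r-squared is 0.6, which would be reported as 60%.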

forecasting for multiple periods: under seasonality

Ft+n = Fbar(t+1) x SI(t+n), for n = 1, 2, 3, ... (t+n and t+1 are subscripts on F and Fbar, not multiplications)

forecasting for multiple periods: constant level model

Ft+n = Ft+1, for n =1,2,3... In other words, the best forecast for t+2, t+3, etc is F(t+1).

amount of fertilizer impacts crop yield

IV: fertilizer DV: crop yield

when to interpret coefficients

ONLY when the coefficient is significant (when the p-value for bi is less than 0.05) and the line assumptions are satisfied. if a coeff is not significant in multiple regression, then we must remove it from the regression equation and re-run the analysis. it cannot be interpreted

interquartile range

IQR = Q3 - Q1 (Q2 = median); gives you the center 50% of values
-excel function for quartiles: =QUARTILE.INC(array, quart) where quart is 1, 2, 3, or 4
-IQR = QUARTILE(array,3) - QUARTILE(array,1)

forecasting methods

Qualitative: judgmental and market research
-based on subjective judgments, past data is not needed
-used for strategic decisions
Quantitative: statistical methods
-assume the past represents the future
