Statistics (COMM 162)
Binomial probability in Excel:
-In =BINOM.DIST(x, n, p, cumulative), use 1 (TRUE) for a cumulative probability, P(X ≤ x) -Use 0 (FALSE) for the exact probability of a single value, P(X = x)
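A minimal Python sketch of the difference between an exact and a cumulative binomial probability. The function names and the example numbers (10 trials, p = 0.5) are made up for illustration and are not from the course.

```python
from math import comb

def binom_pmf(x, n, p):
    """Exact probability P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """Cumulative probability P(X <= x): the sum of the exact probabilities."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# Hypothetical example: 10 trials with success probability 0.5
exact = binom_pmf(3, 10, 0.5)       # corresponds to cumulative = 0 (FALSE)
cumulative = binom_cdf(3, 10, 0.5)  # corresponds to cumulative = 1 (TRUE)
print(exact, cumulative)
```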
Making a pivot table in Excel
Insert -> PivotTable -> Create PivotTable (drop the variable in to make count, relative, and cumulative frequency tables)
Graphs for nominal and ordinal data
Begin by deciding the type of data we want to analyze and what questions we are trying to answer (this affects the types of graphs we use)
Random variables mean/variances:
Random variables have mean or expected values and variances
Shapes of histograms
-Symmetry -Skewness: negatively skewed if the tail is on the left and positively skewed if the tail is on the right -Modality: number of peaks in the data (unimodal or bimodal)
The Poisson distribution
When working with Poisson problems we are concerned with the probability of a given number of events happening over an interval of time -> to calculate this in Excel: =POISSON.DIST(x, mean, cumulative) -> x is the number of events, mean (lambda) is the expected number of events, cumulative is a logical value: for P(X = x) use 0 or FALSE, for P(X ≤ x) use 1 or TRUE
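A rough Python sketch of the same calculation by hand, using the Poisson formula P(X = x) = e^(-lambda) * lambda^x / x!. The function names and the example (lambda = 3, x = 2) are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Exact probability P(X = x) when the expected number of events is lam."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x, lam):
    """Cumulative probability P(X <= x)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Hypothetical example: on average 3 events per interval
p_exact = poisson_pmf(2, 3.0)    # like cumulative = FALSE
p_at_most = poisson_cdf(2, 3.0)  # like cumulative = TRUE
print(p_exact, p_at_most)
```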
Population
collection of individuals or items under consideration in statistical study Ie. All families living in Ontario
Correlation
correlation doesn't imply causation (need to consider other variables) Ie. An increase in sales after launching an ice cream in summer isn't only the result of the marketing campaign
Uniform distribution:
equally sized intervals have an equal probability of occurring -Characterized by a and b, the lower and upper bounds of the variable
Data
facts, numerical or categorical, collected together for reference or information
Snowball:
first respondent refers a colleague or acquaintance and that person refers someone else, etc.
Chebyshev's rule
for any number k ≥ 1, at least 100(1 - 1/k^2)% of the observations in any data set are within k standard deviations of the mean; the percentage is typically conservative, in that the actual percentages often considerably exceed the stated lower bound -Use Excel to find the mean and standard deviation -Descriptive statistics in Excel: Data -> Data Analysis -> Descriptive Statistics -> Summary statistics
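A small Python sketch of the Chebyshev lower bound for a few values of k. The function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum % of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 100 * (1 - 1 / k**2)

bound_2 = chebyshev_lower_bound(2)  # at least 75% within 2 standard deviations
bound_3 = chebyshev_lower_bound(3)  # at least about 88.9% within 3
print(bound_2, bound_3)
```

Compare with the empirical rule: for data that is roughly normal, the actual percentages (95% and 99.7%) are well above these bounds.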
Qualitative data statistics
for data that is classified as nominal or ordinal -Nominal: frequency table, bar graph, pie chart, Pareto chart -Ordinal: frequency table, bar graph -Pie charts are bad for ordinal data because you don't know where to start
Histogram
for interval/ratio data; shows the distribution, whereas ogives highlight the proportion of observations that lie below each of the class limits -A lot of techniques rely on normal distributions; histograms help verify whether this distribution pattern is present in a data set
Central limit theorem:
if samples are drawn from a population with a given population average and standard deviation, the sample mean will be approximately normally distributed for a sufficiently large sample (n greater than or equal to thirty), regardless of the shape of the original population -Used to answer questions about averages -Remember: if the population distribution is normal we do not need to invoke the central limit theorem; the sample mean is normally distributed regardless of sample size n
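A simulation sketch of the central limit theorem in Python. The skewed population (exponential with mean 1 and standard deviation 1), the seed, and the number of samples are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # sample size
num_samples = 2000      # number of samples drawn (assumed value)

# Skewed population: exponential with mean 1 and standard deviation 1
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

centre = mean(sample_means)   # should be close to the population mean, 1
spread = stdev(sample_means)  # should be close to 1 / sqrt(30), about 0.18
print(centre, spread)
```

Even though the population is strongly skewed, a histogram of sample_means would look roughly bell shaped, centred near the population mean.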
Sampling distribution (of means):
if we take a lot of random samples of the same size from a given population, the variation of the means from sample to sample will follow a predictable pattern
The four measures of spread:
range, variance/standard deviation, percentiles (quartiles/interquartile range), coefficient of variation
Population proportion
ratio of members of a population with a particular characteristic to the total members of the population
Judgement:
the researcher samples based on who they think would be fitting for the study; used when there is a limited number of experts in an area -No cap or quota here
Quota:
sample units are chosen based on a pre-specified characteristic until a set number (quota) is reached Ie. Survey students at Smith by staying at Starbucks until the required number meeting the condition is reached
Systematic sampling:
sample units are selected from the population according to a random starting point and a fixed, periodic interval -Only the starting point is random; the remaining units follow at a fixed period -Procedure: decide on a sample size n, divide the population (N) into n groups of k = N/n items, randomly select one item from the first group, then select the following items at a uniform interval (every kth item) -Advantages: easy, comparable to SRS, importantly you can enumerate the population because you are choosing every kth item -Disadvantage: assumes the population is in random order, and there could be periodicity Ie. Taking every tenth person on a plane
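The procedure can be sketched in Python as follows. The function name and the 300-seat example are assumptions, and the sketch assumes N is a multiple of n (otherwise the last few items are never selected).

```python
import random

def systematic_sample(population, n, rng=random):
    """Pick a random start in the first group, then every kth item."""
    N = len(population)
    if n < 1 or n > N:
        raise ValueError("sample size must be between 1 and N")
    k = N // n                # interval between selected items
    start = rng.randrange(k)  # random position within the first group
    return [population[start + i * k] for i in range(n)]

seats = list(range(1, 301))  # hypothetical flight with 300 seats
sample = systematic_sample(seats, 30, random.Random(1))
print(sample)
```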
Simple random sampling:
sampling procedure for which each possible sample of a given size is equally likely to be obtained -Basic random sampling -Two types: with replacement and without replacement (with replacement means an item can appear in the sample more than once) -Advantage: simple to use -Disadvantages: difficult to enumerate the population or get access to specific items -Procedure of SRS: number your population, generate n random numbers (Excel: =RAND(), =RANDBETWEEN(a, b)) Ie. Flight of 300 and generate 30 random seats
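A short Python sketch of the flight example, showing both types of simple random sampling. The seed is arbitrary and only there to make the run repeatable.

```python
import random

rng = random.Random(42)  # arbitrary seed
seats = range(1, 301)    # seats numbered 1 to 300

# Without replacement: each seat can be chosen at most once
without_replacement = rng.sample(seats, 30)

# With replacement: the same seat may appear more than once
with_replacement = [rng.randint(1, 300) for _ in range(30)]

print(sorted(without_replacement))
print(sorted(with_replacement))
```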
Self-selection:
sampling units decide whether or not they are willing to participate Ie. Advertising a study and sampling whoever responds
Population mean
the sum of the values in the population divided by the population size
Probability
a value between 0 and 1 (inclusive) that describes the likelihood/chance of an event happening 1. For any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1 2. The sum of the probabilities of all outcomes in a sample space equals one: ΣP(A_k) = 1
Normal and uniform distributions; probability density
-Normal: unimodal, symmetric distribution characterized by the mean and sigma -Uniform: the probability density of any value between a and b is f(x) = 1/(b - a)
Parameter
unknown numerical summary of population (average Ontario car price)
Statistics
various tools that will be covered during this course
Data type hierarchy
you can move from a higher level to a lower level, but not the reverse (always collect data at as high a level as you can) -Ie. Move from interval data to ordinal for grades (percentages become letter grades such as A+)
Centre of normal shape:
-At the centre of a perfect bell curve distribution mean, median, and mode are all the same -Normal shape distributions occur in many physical circumstances ie. Height -Normal distributions are needed for many stat calculations (symmetrical, not skewed, unimodal)
Mean vs. median:
-Do not take one at face value; important to consider both numbers as when data is skewed they are different -Mean > median when data is positively skewed
What can go wrong with graphs
-Don't look for shape, center, and spread -Don't make a histogram of a categorical variable -Have large sample -Have consistent scales
Probability density calculations (uniform distribution):
-Expected value: (a + b)/2 -Variance: (b - a)^2/12 -Standard deviation: (b - a)/√12 -If part of the range to calculate falls below a, remove that part -Do not use Excel; do by hand -If the data are uniform, use Excel to find the min and max as the parameters a and b
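A minimal Python sketch of these calculations for a uniform distribution on a to b. The standard deviation is taken as the square root of the variance, (b - a)/√12. The function names and the example bounds (a = 2, b = 8) are illustrative assumptions.

```python
from math import sqrt

def uniform_summary(a, b):
    """Mean, variance, and standard deviation of a uniform distribution on [a, b]."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    sd = (b - a) / sqrt(12)  # square root of the variance
    return mean, variance, sd

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi); any part of the range outside [a, b] is dropped."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

mean, variance, sd = uniform_summary(2, 8)
p = uniform_prob(0, 5, 2, 8)  # the part below a = 2 contributes nothing
print(mean, variance, sd, p)
```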
Standard deviation
-Most commonly used measure to describe variability -The square root of the variance -The more variety in the data, the larger the standard deviation
Cluster vs. Stratified:
-Strata involve homogeneous measures (members of a group have something in common) -Clusters have heterogeneous measures (we haven't deliberately separated based on characteristics) Ie. Strata: assume 1st, 2nd, 3rd, and 4th years are assigned to separate floors (first year, second year, third year, fourth year), then take a sample from each floor. Cluster: set the floors as clusters when each floor emulates the population better (heterogeneous mix)
Poisson distribution properties:
-The events are independent -Two events cannot occur at the same time -The probability of an event in an interval is the same for all equal-sized intervals -The probability of an event is proportional to the size of the interval
Deviations and spread
-The most important measures of variability are expressed in terms of deviations of individual values from the mean of the dataset -We do not look at the sum of the deviations (it is always zero) but rather at functions of them, such as squared deviations
Analysis of scatter plots: Different Relationships
1. Linear/nonlinear 2. Positive/negative (negative means the variables move in different directions) 3. Strong/moderate/weak (how close the points are to the trend line) 4. Outliers: points that don't follow the trend
Making a histogram in excel
1. Plug into the formula for the number of bins you want (number of bins = 1 + 3.3 log(n), where n is the number of observations) 2. Use the formula to determine the width/interval size of each bin (class width = (max - min)/number of desired bins) 3. In Excel create a bin array (in order) 4. Tools -> Data Analysis -> Histogram 5. Input the data points and bin range 6. Make sure to include the title when selecting
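Steps 1 and 2 can be sketched in Python. This assumes the log is base 10 and that the number of bins is rounded up to a whole number; the notes do not state either, and the data values are made up.

```python
from math import ceil, log10

def number_of_bins(n):
    """Number of bins = 1 + 3.3 log(n), rounded up (rounding is an assumption)."""
    return ceil(1 + 3.3 * log10(n))

def class_width(data, bins):
    """Class width = (max - min) / number of desired bins."""
    return (max(data) - min(data)) / bins

bins_for_100 = number_of_bins(100)     # 1 + 3.3 * 2 = 7.6, rounded up
width = class_width([10, 20, 50], 8)   # (50 - 10) / 8
print(bins_for_100, width)
```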
Sample framework for central limit theorem:
1. What is the distribution of sample mean (x bar) 2. Use that to come up with confidence interval for population average (parameters)
Frequency tables
1. Frequency distribution table: lists the categories and the number of times each occurs in a data set 2. Relative frequency distribution table: lists the categories and the proportion of the time each occurs -The frequency divided by the total number, as a fraction or percentage 3. Cumulative frequency table: shows the accumulated frequency
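A small Python sketch that builds all three tables at once. The survey responses are made up for illustration.

```python
from collections import Counter

responses = ["A", "B", "A", "C", "B", "A"]  # hypothetical categorical data

counts = Counter(responses)
total = sum(counts.values())

table = []   # rows of (category, frequency, relative frequency, cumulative frequency)
running = 0
for category in sorted(counts):
    running += counts[category]
    table.append((category, counts[category], counts[category] / total, running))

for row in table:
    print(row)
```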
Inverse calculations:
Allow us to determine the X or Z value below which a given percentage (area) of the distribution will lie
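A minimal Python sketch of an inverse calculation for the normal distribution, using the standard library's NormalDist. The example distribution (mean 100, standard deviation 15) is an assumption.

```python
from statistics import NormalDist

# Z value with 95% of the standard normal distribution below it
z_95 = NormalDist().inv_cdf(0.95)

# X value with 95% of a normal distribution below it (hypothetical mean and sd)
x_95 = NormalDist(mu=100, sigma=15).inv_cdf(0.95)

print(z_95, x_95)
```

The two answers are linked by the Z-score formula: x_95 equals 100 + 15 * z_95.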
Excel commands
The bar graph/pie charts you can make vertical or horizontal by using tools in excel (used for nominal graphs)
Empirical rule:
The rule gives the approximate % of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve
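The three percentages can be checked against the standard normal curve with a short Python sketch (the variable names are illustrative).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(100 * area, 1))
```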
Charts/Quantitative data
Use the same graphs for both interval and ratio data 1. Line chart: helpful for time-series data as it illustrates trends, identifies seasonality, and supports forecasts along the timeline (Ie. stock market) 2. Scatter plot: models the relationship between two quantitative variables -There is a dependent and an independent variable; there can be multiple independent variables but only one dependent variable
Statistics
a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data Data -> Statistics -> Information
Event:
a collection of one or more outcomes of an experiment
Normal distribution
a distribution in statistics ranging from negative to positive infinity that arises from averaging or summing other random variables -Denoted by X ~ N(mean, standard deviation); these are all you need for calculations -> the mean determines the location of the graph along the x-axis -> the standard deviation determines the spread -Probabilities cannot easily be calculated by hand, so use Excel
Random variable (rv):
a function that assigns a numerical value to each outcome of an experiment; denoted X
Sample:
a portion of the population selected for analysis; we sample to make generalizations about a population -Easier way to learn about a parameter of the population and draw conclusions -Make a sampling frame: a list of the population (what are we going to sample from)
Sample space:
the set of all possible outcomes of a random experiment
Cross-sectional data
a sample of data that does not depend on time Ie. Amount of sales for specific time from different vendors
Cluster sampling:
a technique in which the entire population is divided into groups (clusters) and a random sample of these clusters is selected -Procedure: divide the population into clusters, each representative of the population, then obtain an SRS of clusters -Advantages: useful when you cannot enumerate the population, cost effective, useful when the population is widely scattered -Disadvantage: less efficient than a stratified sample (needs a larger sample) Ie. Divide all cities by geographic unit -> used when the characteristic the population is divided by is not related to what is being studied
Discrete Random variable:
a variable that takes on a countable number of values -P(x) is often described in tables -Ie. Total number of students, shoe size, etc.
Population
all items of interest (all cars in Ontario)
Random experiment:
an experiment or process that leads to one of several possible outcomes -> outcome is result (denoted by X1, X2,Xk)
Non-probability sampling:
any sampling method where units do not have a known chance of being selected; non-random -Advantages: easier and faster than random sampling -Disadvantages: no way to ensure how representative the sample is -Conclusion: avoid if possible because we cannot accurately draw conclusions
Quartiles/Interquartile
the values that divide the data into four equally sized groups -Often used in the context of the interquartile range (where the middle 50% of the data lies) -> Q3 - Q1 -The 25th percentile is Q1, the 50th percentile is Q2, etc. -In Excel: =QUARTILE.INC(array, quart) -> select the quartile you want
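A short Python sketch of quartiles and the interquartile range. It uses the standard library's "inclusive" method, which to my understanding interpolates the same way as QUARTILE.INC; treat that equivalence as an assumption. The data are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

# n=4 splits the data into four groups and returns the three cut points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # range covered by the middle 50% of the data

print(q1, q2, q3, iqr)
```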
Continuous Random variable:
can take on any value on a scale, Ie. 1.22 (described by probability density functions) -Ie. Weight of cans, distance traveled
Qualitative Data
categorical data characteristics and descriptors that can't be easily measured, but can be observed subjectively Ie. Major, place of birth, eye color
Qualitative Nominal Data
categorical where order is not important Ie. Car type, ID
Qualitative Ordinal Data
categorical where ranking is important for category Ie. Team standing, rank in class
Time-series data
data over a time period Ie. Number of sale over four years for one vendor
Panel Data
data that blends cross-sectional and time-series data producing a table Ie. Sales for different vendors over four year time
Probability distributions:
describe how likely it is for an RV to take on different values: P(X = x)
Binomial random variable:
describes the number of successes -Only two possible outcomes: success or failure -The constant probability of success is denoted by p and the probability of failure by q = 1 - p -n is the fixed number of independent trials -Notation for the number of successful trials in a binomial distribution: X ~ B(n, p)
Sampling error:
difference between the sample and the population that exists only because of the observations that happened to be selected -> increasing the sample size reduces this error (the sample gets closer to the population) -Other errors: response bias from survey design, incomplete sampling frames, non-response bias (where non-participants' responses would differ from participants')
Standard normal distribution
the normal distribution with mean 0 and standard deviation 1: Z ~ N(0, 1) -We can transform any normally distributed random variable X into the standard normal distribution using the Z-score formula (shifting the mean and rescaling by the standard deviation)
Sampling variability (error):
each time we take a random sample from a population we are likely to get a different set of individuals and calculate different statistics -We get a different sample every time, so we get a different estimate of the parameter
Census:
information on the entire population -Time consuming, costly, impractical, borderline impossible -Instead, use a sample to represent the population
Mean:
is the average of the observations -In Excel: =AVERAGE(data) -The mean is the measure most affected by outliers -Population: μ = Σx/N Sample: x̄ = Σx/n
Variance:
is the average of the squared deviations from the mean -Takes into account the deviations of all data points -Population: σ^2 = Σ(x - μ)^2/N Sample: s^2 = Σ(x - x̄)^2/(n - 1) -In Excel, population variance: =VAR.P(array) -In Excel, sample variance: =VAR.S(array) -Remember that the units are squared
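A minimal Python sketch of the two variance formulas, written out by hand so the difference in the divisor (N versus n - 1) is visible. The function names and data are illustrative.

```python
def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical observations, mean 5
pop_var = population_variance(data)  # 32 / 8
samp_var = sample_variance(data)     # 32 / 7
print(pop_var, samp_var, pop_var ** 0.5)
```

The standard deviation is the square root of whichever variance is used.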
Range
is the difference between the smallest and largest values -Tells us the least about the data -Excel: =MAX(array) - MIN(array)
Sample
items selected from population; sub-set of population (1000 Ontario cars)
Information
knowledge obtained about a particular fact --> to know something about data we need to utilize statistics
Statistic
known numerical summary of sample (average car price of sample)
Measures of central location
mean, median, mode
Kurtosis:
measure of peakedness relative to the normal distribution: -When equal to zero it is mesokurtic: the same as the normal (standard) distribution -When > 0 it is leptokurtic, which means a higher peak and heavier tails -> more volatile, higher chance of extreme returns -When < 0 it is platykurtic, which means a lower peak -> chance of extreme returns is low, safer
Coefficient of skewness:
measures how symmetrical a data set is -When symmetric, mean = mode and the coefficient is zero -When < 0, negatively skewed, where mean < mode -When > 0, positively skewed, where mean > mode
Convenience:
members of population are chosen based on ease of access ex. Friends, co-workers
Median:
middle value of distribution -Used because not skewed as much as mean by outliers -Median in excel: =Median(data)
Mode:
most frequently occurring observation -May not be unique (bimodal/multimodal) and may not occur near the center -Mode in Excel: =MODE.SNGL(data)
Quantitative Data
numerical data where numbers measure things objectively Ie. Age, weight, temperature
Quantitative Interval Data
numerical data where zero is arbitrary Ie. Temperature, size
Quantitative Ratio Data
numerical data where zero signifies zero Ie. #of patients seen or calls
Stratified sampling:
separate the population into mutually exclusive, homogeneous sets (strata) based on a specific characteristic, then draw an SRS from each stratum -Procedure: divide the population into strata (subpopulations), obtain an SRS from each stratum proportional to its size (it does not always have to be proportional, depending on research goals); all members obtained are your sample -Advantages: ensures representation, required sample size is smaller so cheaper Ie. Divide students by faculty, then sample (or by gender, occupation, etc.) -> common characteristic
Continuous Distributions:
outcomes that are measured as opposed to counted -Probabilities are defined over intervals (the probability of a single point is zero) -Modelled by, for example, the uniform and normal distributions
Types of sampling:
primary and secondary -Survey and interviews are most similar -Once you choose your sampling type you choose how to sample
Percentiles:
provide information about the position of particular values relative to the entire data set (divide data into 100 pieces to show you what percentage of a value lies below and above) -50th percentile is median not mean
Coefficient of variation
provides a unit-independent measure of variability that can be used across datasets with different mean values -Most useful for comparing two or more data sets -The less variability, the smaller the percentage (for investments you want less risk)
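A small Python sketch, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The two data sets are made up to show that rescaling the units does not change the result.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * stdev(xs) / mean(xs)

prices_a = [10, 12, 14]     # hypothetical data in one unit
prices_b = [100, 120, 140]  # the same data in a unit ten times smaller

cv_a = coefficient_of_variation(prices_a)
cv_b = coefficient_of_variation(prices_b)
print(cv_a, cv_b)
```

Both data sets give the same percentage even though their standard deviations differ by a factor of ten.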
Probability sampling strategies
random sampling -Items in the sample have known probabilities of selection -Use a device that generates random numbers: Excel or tables -Used to eliminate human judgement in sample selection
Sample mean
the arithmetic average value of the responses on a variable
Pareto chart
bar chart showing the frequency of categories in descending order together with their cumulative impact (factors on the left are more important than factors on the right) -The Pareto principle or 80/20 rule says that roughly 80% of the effects come from 20% of the causes
Sample proportion
the number of cases falling into one category of the variable divided by the number of cases in the sample
Z-score:
the number of standard deviations your original random variable value x is away from the mean: Z = (x - μ)/σ
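A minimal Python sketch of the Z-score calculation and the probability it leads to. The example values (x = 130, mean 100, standard deviation 15) are assumptions.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

z = z_score(130, 100, 15)      # (130 - 100) / 15
p_below = NormalDist().cdf(z)  # area below z on the standard normal curve
print(z, p_below)
```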