Statistics (COMM 162)

Binomial probability in Excel:

-In =BINOM.DIST(x, n, p, cumulative), use 1 (TRUE) for the cumulative probability P(X ≤ x) -Use 0 (FALSE) for the probability of an exact value P(X = x)
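As a rough illustration of what the cumulative flag changes, the sketch below computes both forms of a binomial probability in Python. The function names and the 10-trial, p = 0.5 example are hypothetical and chosen only for illustration; they are not from the course material.

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ B(n, p); corresponds to cumulative = 0 (FALSE)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for X ~ B(n, p); corresponds to cumulative = 1 (TRUE)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# Hypothetical example: 10 trials with success probability 0.5
exact_five = binom_pmf(5, 10, 0.5)    # 252/1024
at_most_five = binom_cdf(5, 10, 0.5)  # 638/1024
```

The cumulative value is simply the sum of the exact-value probabilities from 0 up to x.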

Making a pivot table in Excel

Insert -> PivotTable -> Create PivotTable (drop the variable in to make count, relative, and cumulative frequency tables)

Graphs for nominal and ordinal data

Begin by deciding the type of data we want to analyze and what questions we are trying to answer (this affects the types of graphs we use)

Random variables mean/variances:

Random variables have means (expected values) and variances

Shapes of histograms

Symmetry -Skewness: negatively skewed if the tail is on the left and positively skewed if the tail is on the right -Modality: peaks in the data (unimodal or bimodal)

The Poisson distribution

When working with Poisson problems we are concerned with the probability of a number of events happening over time -> to calculate this in Excel: =POISSON.DIST(x, mean, cumulative) -> x is the number of events, mean (lambda) is the expected number of events, cumulative is a logical value: for P(X = x) use 0 or FALSE, for P(X ≤ x) use 1 or TRUE
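The same two probabilities can be sketched by hand in Python, which may help when checking an Excel result. The function names and the example values (lambda = 3, x = 2) are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x): probability of exactly x events when lam events are expected."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x: int, lam: float) -> float:
    """P(X <= x): probability of at most x events (the cumulative form)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Hypothetical example: 3 events expected per interval
exactly_two = poisson_pmf(2, 3.0)   # e^-3 * 9 / 2
at_most_two = poisson_cdf(2, 3.0)   # e^-3 * (1 + 3 + 4.5)
```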

Population

collection of individuals or items under consideration in a statistical study e.g. all families living in Ontario

Correlation

correlation does not imply causation (need to consider other variables) e.g. higher sales after launching an ice cream in summer are not only the result of the marketing campaign

Uniform distribution:

equally sized intervals have equal probability of occurring -Characterized by a and b as the lower and upper bounds of the variable

Data

facts, numerical and categorical, collected together for reference or information

Snowball:

first respondent refers a colleague or acquaintance and that person refers someone else, etc.

Chebyshev's rule

for any number k ≥ 1, at least 100(1 - 1/k²)% of the observations in any data set are within k standard deviations of the mean; the percentage is typically conservative in that the actual percentages often considerably exceed the stated lower bound -Use Excel to find these numbers -Descriptive statistics in Excel: Data -> Data Analysis -> Descriptive Statistics -> Summary statistics
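A minimal sketch of the bound itself, assuming a hypothetical helper name, shows how quickly the guaranteed percentage grows with k:

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the rule is stated for k >= 1")
    return 100 * (1 - 1 / k**2)

within_two_sd = chebyshev_lower_bound(2)    # at least 75% within 2 standard deviations
within_three_sd = chebyshev_lower_bound(3)  # at least about 88.9% within 3 standard deviations
```

These are lower bounds for any data set, so they are noticeably smaller than the percentages seen for bell-shaped data.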

Qualitative data statistics

for data that is classified as nominal or ordinal -Nominal: frequency table, bar graph, pie chart, Pareto chart -Ordinal: frequency table, bar graph -Pie charts are bad for ordinal data because you don't know where to start

Histogram

for interval/ratio data; shows the distribution, whereas ogives highlight the proportion of observations that lie below each of the class limits -A lot of techniques rely on normal distributions; histograms verify whether this distribution pattern is present in a data set

Central limit theorem:

if samples are drawn from a population with a given population average and standard deviation, the sample mean will be approximately normally distributed for a sufficiently large sample (n greater than or equal to thirty) regardless of the shape of the original population -Answers questions about averages -Remember that if the population distribution is normal we do not need to invoke the central limit theorem; the sample mean is normally distributed regardless of sample size n
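The theorem can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a strongly skewed shape chosen for the example), and the comparison value sigma/sqrt(n) for the spread of the sample mean is a standard result that these notes do not state.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(draw, n: int, repeats: int) -> list[float]:
    """Draw `repeats` samples of size n and return the mean of each sample."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(repeats)]

# Assumed population: exponential with mean 1 and standard deviation 1 (right-skewed)
means = sample_means(lambda: random.expovariate(1.0), n=30, repeats=5000)

centre = statistics.fmean(means)  # expected to be close to the population mean, 1
spread = statistics.stdev(means)  # expected to be close to 1 / sqrt(30)
```

A histogram of `means` should look roughly bell-shaped even though the population itself is skewed.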

Sampling distribution (of means):

if we take a lot of random samples of the same size from a given population, the variation of the means from sample to sample will follow a pattern

The four measures of spread:

range, variance/standard deviation, percentiles (quartiles/interquartile range), coefficient of variation

Population proportion

ratio of members of a population with a particular characteristic to the total members of the population

Judgement:

the sample is chosen based on who the researcher thinks would be fitting for the study; used when there is a limited number of experts in the area -No cap or quota here

Quota:

sample units are chosen on a pre-specified characteristic until a set number is reached e.g. to survey students at Smith, stay at Starbucks until the required number meeting the condition is reached

Systematic sampling:

sample units are selected from the population according to a random starting point and a fixed, periodic interval -Selected units are not random but periodic -Procedure: decide on the sample size n, divide the population (N) into n groups of m = N/n items, randomly select one item from the first group, then select the following items at a uniform interval of m -Advantages: easy, comparable to SRS, importantly you can enumerate the population because you are choosing every kth item -Disadvantage: assumes the population order is random, and there could be periodicity e.g. taking every tenth person on a plane
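The procedure can be sketched in a few lines of Python. The function name and the 300-seat example are hypothetical, and the interval is taken as the whole-number part of N/n, which is an assumption for cases where N is not an exact multiple of n.

```python
import random

def systematic_sample(population: list, n: int, rng=random) -> list:
    """Pick a random start in the first group, then every m-th item after it."""
    m = len(population) // n   # fixed interval between selected items
    start = rng.randrange(m)   # random position within the first group of m items
    return [population[start + i * m] for i in range(n)]

# Hypothetical example: 30 seats from a plane with seats numbered 1 to 300
seats = list(range(1, 301))
picked = systematic_sample(seats, 30, random.Random(1))
```

Only the starting point is random; every later pick is fixed by the interval, which is why hidden periodic patterns in the population order can bias the sample.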

Simple random sampling:

sampling procedure for which each possible sample of a given size is equally likely to be obtained -Basic random sampling -Two types: with replacement and without replacement (replacement means an item can appear in the sample more than once) -Advantage: simple to use -Disadvantages: difficult to enumerate the population or get access to specific items -Procedure of SRS: number your population, generate n random numbers (Excel: =RAND(), =RANDBETWEEN(a, b)) e.g. a flight of 300 passengers, generate 30 random seats
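The flight example can be sketched in Python to show the difference between the two types. The seat numbering and the seed are assumptions made for the illustration.

```python
import random

rng = random.Random(42)       # fixed seed so the sketch is repeatable
seats = list(range(1, 301))   # number the population: seats 1 to 300

# Without replacement: each seat can appear at most once in the sample
without_replacement = rng.sample(seats, k=30)

# With replacement: the same seat may be drawn more than once
with_replacement = rng.choices(seats, k=30)
```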

Self-selection:

sampling units decide whether or not they are willing to participate e.g. advertising a survey

Population mean

the sum of the values in the population divided by the population size

Probability

the value between 0 and 1 (inclusive) that describes the likelihood/chance of an event happening
1. For any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1
2. The sum of the probabilities of all outcomes in a sample space equals one: Σ P(Ak) = 1

Normal and uniform distributions; probability density

-Normal: unimodal, symmetric distribution characterized by the mean and sigma (standard deviation) -Uniform: the probability density of any value between a and b is f(x) = 1/(b - a)

Parameter

unknown numerical summary of population (average Ontario car price)

Statistics

various tools that will be covered during this course

Data type hierarchy

you can move from a higher level to a lower level, but not the reverse (always collect data at as high a level as you can) -e.g. move from interval data to ordinal for grades (percentages go to letter grades such as A+)

Centre of normal shape:

-At the centre of a perfect bell curve distribution mean, median, and mode are all the same -Normal shape distributions occur in many physical circumstances ie. Height -Normal distributions are needed for many stat calculations (symmetrical, not skewed, unimodal)

Mean vs. median:

-Do not take one at face value; important to consider both numbers as when data is skewed they are different -Mean > median when data is positively skewed

What can go wrong with graphs

-Don't look for shape, center, and spread -Don't make a histogram of a categorical variable -Have large sample -Have consistent scales

Probability density calculations (uniform distribution):

-Expected value: (a + b)/2 -Variance: (b - a)²/12 -Standard deviation: (b - a)/√12 -If part of the range being calculated falls below a, remove that part -Do not use Excel; do by hand -For uniform data in Excel, find the min and max to use as the parameters a and b
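A short sketch of these hand calculations for a uniform distribution follows. The function names and the bounds a = 2, b = 8 are hypothetical, and clipping the requested range to the bounds is one interpretation of the note about ranges that fall below a.

```python
from math import sqrt

def uniform_summary(a: float, b: float) -> tuple[float, float, float]:
    """Return (expected value, variance, standard deviation) for a uniform on [a, b]."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance, sqrt(variance)

def uniform_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi); any part of the range outside [a, b] contributes nothing."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

mean, variance, sd = uniform_summary(2, 8)  # 5.0, 3.0, sqrt(3)
p_below_five = uniform_prob(0, 5, 2, 8)     # only 2..5 counts: 3/6 = 0.5
```

Note that the standard deviation equals (b - a)/sqrt(12), the square root of the variance.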

Standard deviation

-Most commonly used measure of variability -The square root of the variance -The more variety in the data, the larger the standard deviation

Cluster vs. Stratified:

-Strata involve homogeneous measures (the group has something in common) -Clusters have heterogeneous measures (we haven't deliberately separated based on characteristics) e.g. Strata: assume 1st, 2nd, 3rd, and 4th years are each assigned to their own floor (first year, second year, third year, fourth year), then take a sample from each. Cluster: set clusters as floors when each floor emulates the population (heterogeneous mix)

Poisson distribution properties:

-The events are independent -Two events cannot occur at the same time -The probability of an event in an interval is the same for all equal-sized intervals -The probability of an event is proportional to the size of the interval

Deviations and spread

-The most important measures of variability are expressed in terms of deviations of individual values from the mean of the dataset -We do not look at the sum of the deviations (it is always zero) but rather at functions of them

Analysis of scatter plots: Different Relationships

1. Linear/nonlinear
2. Positive/negative (negative means the variables move in different directions)
3. Strong/moderate/weak (how close the points are to the trend line)
4. Outliers: points that don't fit the trend

Making a histogram in excel

1. Plug into the formula for the number of bins you want (number of bins = 1 + 3.3 log(n), where n is the number of observations)
2. Use the formula to determine the width/interval size of each bin (class width = (max - min)/number of desired bins)
3. In Excel create a bin array (in order)
4. Tools -> Data Analysis -> Histogram
5. Input the data points and bin range
6. Make sure to include the title when selecting
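The first three steps can be sketched in Python before moving to Excel. This assumes the logarithm in the bin formula is base 10 and that the result is rounded up to a whole number of bins; both are assumptions, as are the function names and sample values.

```python
from math import ceil, log10

def number_of_bins(n: int) -> int:
    """Number of bins = 1 + 3.3 log(n), rounded up (base-10 log assumed)."""
    return ceil(1 + 3.3 * log10(n))

def class_width(data: list[float], bins: int) -> float:
    """Class width = (max - min) / number of desired bins."""
    return (max(data) - min(data)) / bins

def bin_array(data: list[float], bins: int) -> list[float]:
    """Ordered upper limits of each bin, as used for the Excel bin range."""
    width = class_width(data, bins)
    return [min(data) + width * i for i in range(1, bins + 1)]

bins_for_100 = number_of_bins(100)           # 1 + 3.3 * 2 = 7.6, rounded up to 8
limits = bin_array([10, 25, 40, 90], bins=4) # width 20 -> [30.0, 50.0, 70.0, 90.0]
```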

Sample framework for central limit theorem:

1. What is the distribution of sample mean (x bar) 2. Use that to come up with confidence interval for population average (parameters)

Frequency tables

1. Frequency distribution table: lists the categories and the number of times each occurs in a data set -The number of times a given category occurs in the data set (total number)
2. Relative frequency distribution table: lists the categories and the proportion of times each occurs -The fraction of the total, in percentages
3. Cumulative frequency table: shows the accumulated frequency

Inverse calculations:

Allow us to determine the X or Z value below which a given percentage of the area lies

Excel commands

The bar graph/pie charts you can make vertical or horizontal by using tools in excel (used for nominal graphs)

Empirical rule:

The rule gives the approximate percentage of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve
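These percentages can be reproduced from the standard normal curve. The sketch below uses Python's `statistics.NormalDist`; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

one_sd = share_within(1)    # about 0.68
two_sd = share_within(2)    # about 0.95
three_sd = share_within(3)  # about 0.997
```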

Charts/Quantitative data

Use the same graphs for both interval and ratio data
1. Line chart: helpful for time-series data as it illustrates trends, identifies seasonality, and supports forecasts along the timeline (e.g. stock market)
2. Scatter plot: models the relationship between two quantitative variables -There is a dependent and an independent variable; there can be multiple independent variables but only one dependent

Statistics

a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data Data -> Statistics -> Information

Event:

a collection of one or more outcomes of an experiment

Normal distribution

a distribution in statistics ranging from negative to positive infinity that arises from averaging or summing other random variables -Denoted by X ~ N(mean, standard deviation); these are all you need for calculations -> the mean determines the location of the graph along the x-axis -> the standard deviation determines the spread -Probabilities cannot easily be calculated by hand, so use Excel

Random variable (rv):

a function that assigns a numerical value to each outcome of an experiment; denoted X

Sample:

a portion of the population selected for analysis; we sample to make generalizations about a population -Easy way to learn about a parameter of the population and draw conclusions -Make a sampling frame: a list of the population (what are we going to sample)

Sample space:

the set of all possible outcomes of a random experiment

Cross-sectional data

a sample of data that does not depend on time e.g. amount of sales at a specific time from different vendors

Cluster sampling:

a technique in which the entire population is divided into groups and a random sample of these clusters is selected -Procedure: divide the population into clusters, each representative of the population, then obtain an SRS of clusters -Advantages: useful when you cannot enumerate the population, cost effective, useful when the population is widely scattered -Disadvantage: less efficient than a stratified sample (needs a larger sample) e.g. divide all cities by geographic unit -> used when the characteristic the population is divided by is not related to the study itself

Discrete Random variable:

a variable that takes on a countable number of values -P(x) is often described in tables -e.g. total number of students, shoe size, etc.

Population

all items of interest (all cars in Ontario)

Random experiment:

an experiment or process that leads to one of several possible outcomes -> an outcome is a result (denoted by X1, X2, ..., Xk)

Non-probability sampling:

any sampling method where selection is non-random, so some units may have no chance of being selected -Advantages: easier and faster than random sampling -Disadvantages: no way to ensure how representative the sample is -Conclusion: avoid if possible because we cannot accurately draw conclusions

Quartiles/Interquartile

are the values that divide the data into four equally sized groups -Often used in the context of the interquartile range (where the middle 50% of the data lies) -> Q3 - Q1 -The 25th percentile is Q1, the 50th percentile is Q2, etc. -In Excel: =QUARTILE.INC(array, quart), where quart selects the quartile you want
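A rough Python equivalent is sketched below. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates between the smallest and largest observations and should line up with Excel's inclusive quartile function; treat that correspondence as an assumption to verify on your own data. The data values are hypothetical.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

# n=4 splits the data into four groups and returns the three cut points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1  # spread of the middle 50% of the data
```

For these nine values the cut points fall exactly on observations, so Q2 is also the median.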

Continuous Random variable:

can take on any value on a scale (described by probability density functions) -e.g. 1.22 -Weight of cans, distance travelled

Qualitative Data

categorical data: characteristics and descriptors that can't be easily measured, but can be observed subjectively e.g. major, place of birth, eye color

Qualitative Nominal Data

categorical where order is not important e.g. car type, ID

Qualitative Ordinal Data

categorical where ranking is important for the category e.g. team standing, rank in class

Time-series data

data over a time period e.g. number of sales over four years for one vendor

Panel Data

data that blends cross-sectional and time-series data, producing a table e.g. sales for different vendors over a four-year period

Probability distributions:

describe how likely it is for a RV to take on different values, P(X = x)

Binomial random variable:

describes the number of successes -Only two possible outcomes: success or failure -The constant probability of success is denoted by p and the probability of failure by q = 1 - p -n is the fixed number of independent trials -Notation for the number of successful trials in a binomial distribution: X ~ B(n, p)

Sampling error:

difference between the sample and the population that exists only because of the observations that happened to be selected for the sample -> increasing the sample size reduces this error (closer to population) -Other errors: response bias from survey design, incomplete sampling frames, non-response bias (where non-participants' responses would differ from participants')

Standard normal distribution

the normal distribution with mean 0 and standard deviation 1, Z ~ N(0, 1) -We can transform any normally distributed random variable X into the standard normal distribution using the Z-score formula (shift by the mean and scale by the standard deviation)

Sampling variability (error):

each time we take a random sample from a population we are likely to get a different set of individuals and calculate different statistics -Get a different sample every time, and so a different estimate of the parameter

Census:

information on entire population -Time consuming, cost, impractical, borderline impossible -Sample to represent a population

Mean:

is the average of the observations -Mean in Excel: =AVERAGE(data) -The mean is the measure most affected by outliers -Pop: μ = Σx/N Sample: x̄ = Σx/n

Variance:

is the average of the squared deviations -Takes into account the deviation of all data points -Pop: σ² = Σ(x - μ)²/N Sample: s² = Σ(x - x̄)²/(n - 1) -In Excel population variance: =VAR.P(array) -In Excel sample variance: =VAR.S(array) -Remember that the units are squared
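The only difference between the two formulas is the divisor, which the sketch below makes explicit. The function names and data values are hypothetical; `statistics.pvariance` and `statistics.variance` are the library counterparts.

```python
from statistics import fmean, pvariance, variance

def population_variance(values: list[float]) -> float:
    """Sum of squared deviations from the mean, divided by N."""
    mu = fmean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values: list[float]) -> float:
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = fmean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical observations, mean = 5
pop_var = population_variance(data)   # 32 / 8 = 4.0
samp_var = sample_variance(data)      # 32 / 7, about 4.57
```

Dividing by n - 1 always gives the larger value, and the gap shrinks as the sample grows.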

Range

is the difference between the smallest and largest values -Tells us the least about the data -Excel: =MAX(array) - MIN(array)

Sample

items selected from population; sub-set of population (1000 Ontario cars)

Information

knowledge obtained about a particular fact --> to know something about data we need to utilize statistics

Statistic

known numerical summary of sample (average car price of sample)

Measures of central location

mean, median, mode

Kurtosis:

measure of peakedness relative to the normal distribution: -When equal to zero it is mesokurtic (matches the normal distribution) -When > 0 it is leptokurtic, which means a higher peak, heavier tails, and more volatility -When < 0 it is platykurtic -> the chance of extreme returns is low, so it is safer, but the peak is lower

Coefficient of skewness:

measures how symmetrical a data set is -When symmetric, the mean = mode and the coefficient is zero -When < 0, negatively skewed, where mean < mode -When > 0, positively skewed, where mean > mode

Convenience:

members of the population are chosen based on ease of access e.g. friends, co-workers

Median:

middle value of distribution -Used because not skewed as much as mean by outliers -Median in excel: =Median(data)

Mode:

most frequently occurring observation -May not be unique (bimodal/multimodal) and may not occur near the center -Mode in Excel: =MODE.SNGL(data)

Quantitative Data

numerical data that measures things objectively e.g. age, weight, temperature

Quantitative Interval Data

numerical data where zero is arbitrary e.g. temperature, size

Quantitative Ratio Data

numerical data where zero signifies zero e.g. number of patients seen or calls

Stratified sampling:

separate the population into mutually exclusive homogeneous sets (strata) based on a specific characteristic, then draw an SRS from each stratum -Procedure: divide the population into strata (subpopulations), obtain an SRS from each that is proportional to its size (it does not always have to be proportional, depending on research goals), and all members obtained are your sample -Advantages: ensures representation, required sample size is smaller so cheaper e.g. divide students by faculty then sample (gender, occupation, etc.) -> common characteristic

Continuous Distributions:

outcomes that are measured as opposed to counted -Probabilities are defined over intervals (the probability of a single point is zero) -Can be modelled by, for example, the uniform and normal distributions

Types of sampling:

primary and secondary -Survey and interviews are most similar -Once you choose your sampling type you choose how to sample

Percentiles:

provide information about the position of particular values relative to the entire data set (divide the data into 100 pieces to show what percentage of values lie below and above a given value) -The 50th percentile is the median, not the mean

Coefficient of variation

provides a unit-independent measure of variability that can be used across datasets with different mean values -Most useful for comparing two or more data sets -The less variety, the smaller the percentage (for investments you want less risk)

Probability sampling strategies

random sampling -Items in the sample have known probabilities of selection -Use a device that generates random numbers: Excel or tables -Used to eliminate human judgement in sample selection

Sample mean

the arithmetic average value of the responses on a variable

Pareto chart

shows the frequency of categories ordered by cumulative impact (factors on the left are more important than factors on the right) -The Pareto principle or 80/20 rule says that 80% of the effects come from 20% of the causes -> relationship rule

Sample proportion

the number of cases falling into one category of the variable divided by the number of cases in the sample

Z-score:

the number of standard deviations your original random variable x is away from the mean: Z = (x - μ)/σ
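A minimal sketch of the calculation, with a hypothetical function name and made-up values (a score of 85 where the mean is 70 and the standard deviation is 10):

```python
from statistics import NormalDist

def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(85, 70, 10)             # (85 - 70) / 10 = 1.5
share_below = NormalDist().cdf(z)   # proportion of a normal distribution below z
```

Once a value is expressed as a z-score, the standard normal distribution can be used to find the proportion of observations below it.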

