STA1LS Semester 2, 2017 Revision
A function that assigns a unique numerical value to each outcome in a sample space is known as a: Select one: a. parameter b. unique number generator c. simple event d. random variable
d. random variable
To compute P(A)
Add up the probabilities of each outcome, or simple event in A
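A minimal sketch of this rule, assuming a fair six-sided die and the event A = "roll an even number" (both are illustrative choices, not taken from the lecture notes):

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, so each simple event has probability 1/6.
probabilities = {face: Fraction(1, 6) for face in range(1, 7)}

# Event A: the roll is even.
event_a = {2, 4, 6}

# P(A) is the sum of the probabilities of the simple events in A.
p_a = sum(probabilities[outcome] for outcome in event_a)
print(p_a)  # 1/2
```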
Experiment
An activity in which there are at least 2 possible outcomes and the result of the activity cannot be predicted with absolute certainty - e.g. Coin toss = Heads, Tails - e.g. Diabetes Test = Positive, Negative
Complement of an Event
The complement of an event A, denoted A', is the set of all outcomes in the same sample space that do not belong to A.
Name and Describe the 2 Correlation Co-efficients
Are used to numerically summarise the relationship between 2 quantitative variables and include: 1. Linear correlation coefficient (Pearson product-moment correlation coefficient): - Measures the strength of the LINEAR relationship between 2 quantitative variables 2. Spearman's rank correlation coefficient: - Measures the strength of the monotonic relationship between 2 quantitative variables (monotonic: varying in such a way that it either never decreases or never increases)
Limiting Relative Frequency
As the number of trials increases, the relative frequencies tend to stabilize and become almost constant - The number they settle on is the limiting relative frequency - The probability of an event A, P(A), is equal to the limiting relative frequency of event A.
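A small simulation can make the stabilising behaviour visible. This is only a sketch, assuming a fair coin (true probability of heads 0.5) and a fixed random seed chosen for repeatability:

```python
import random

random.seed(1)  # assumed seed, only so the run is repeatable

heads = 0
for toss in range(1, 10_001):
    heads += random.random() < 0.5  # True counts as 1 (a head)
    if toss in (10, 100, 1_000, 10_000):
        # Relative frequency of heads after this many tosses
        print(toss, heads / toss)

relative_frequency = heads / 10_000
```

The printed relative frequencies should wander for small numbers of tosses and settle near 0.5 as the number of tosses grows.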
Bimodal Distribution
- Has two peaks, not common and may occur with population mixing
Histograms (Describe, sample size, and detects)
- It's a bar chart, horizontal axis = data classes, vertical axis = frequencies, height of bars = frequency values - Detects clusters, shape, location, spread and outliers - Sample size n > 30 - It's built from a frequency distribution table
Unimodal distribution - Describe and Define
- It's a distribution with one peak - It's symmetric if a line down the middle splits it into two mirror-image halves - It's skewed if it's not symmetric - Unimodal distributions include the normal (bell-shaped, or normal curve) distribution - Approximates many populations
Box Plots
- It's a graph that consists of a line extending from the minimum value to the maximum value and a box with lines drawn at the first quartile, the median and the third quartile - WEAKNESS: Doesn't identify clusters
Sample Standard Deviation
- It's the square root of the sample variance - s = √(s²) - Is easier to interpret than the variance because it is in the same units as the original data - (Repeat write formulas in lecture notes - WK 1)
Sample Mean
- Measures the central location of the data - Found by adding the values of each observation in the sample and dividing by the total sample size - SEE LECTURE SLIDES WK 1
Sample Variance
- Measures the variability in the sampled data, expressed in squared units - Roughly interpreted as the average squared difference between the data and the sample mean - The larger the sample variance, the greater the variability in the data (REPEAT WRITE FORMULA IN LECTURE NOTES - WK1)
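The sample mean, sample variance and sample standard deviation can be worked through on a small data set. This is a sketch using made-up data and assuming the usual n - 1 divisor for the sample variance (check it against the WK 1 formulas):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # assumed sample, for illustration only
n = len(data)

# Sample mean: add the observations and divide by the sample size.
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sd = math.sqrt(variance)

print(mean, round(variance, 3), round(sd, 3))
```

The variance is in squared units, while the standard deviation is back in the units of the data.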
Multimodal Distribution
- Has more than one peak; usually refers to a distribution with more than 2 distinct peaks - Very rare
State the notations for estimates and parameters (draw symbols)
- The sample mean, \bar{x}, is an estimate of the population mean, μ - The sample variance, s², is an estimate of the population variance, σ² - The sample standard deviation, s, is an estimate of the population standard deviation, σ - Greek Letters = Population Parameters - English Letters = Sample statistics
Relative Frequency Histogram
- Used to compare histograms with different sample sizes - Bar chart, horizontal axis = data classes, vertical axis = relative frequencies - Multiply the relative frequency by 100 to obtain the percentage
Numerical summaries depend on ........
- Whether or not the distribution is bell-shaped
Properties of the Linear Correlation Coefficient
Values lie between -1 and +1 - r = 1 when all (x, y) points fall on an increasing line (perfect positive linear relationship) - r = -1 when all (x, y) points fall on a decreasing line (perfect negative linear relationship) - r = 0 corresponds to a scatter plot cloud of points with no linear orientation
What are the 3 measures of spread ?
Variance, standard deviation, and IQR
Experts give a certain poor performing professional sports team an 80% chance of winning at least one game during their upcoming regular season. Based on this prediction, what is the probability that they will not win any games? Select one: a. 0.20 b. 0.15 c. 0.80 d. 0.50
a. 0.20
Suppose we want to explore whether a change in gender (male, female) explains a change in weight (kilograms). Is weight an explanatory or response variable? Is weight a categorical or quantitative variable? Select one: a. response, quantitative b. explanatory, categorical c. response, categorical d. explanatory, quantitative
a. response, quantitative
Define and Describe a Scatter Plot
Is a graphical summary for 2 quantitative variables - It's a plot of paired (x, y) data with a horizontal axis (x) and a vertical axis (y) - Each individual pair (x, y) is plotted as a single point - DOES NOT PROVIDE THE STRENGTH OF THE RELATIONSHIP BETWEEN 2 VARIABLES - The explanatory variable = x-axis - The response variable = y-axis
Sample Space
Is a list of all possible outcomes of an experiment, denoted by S, e.g. S = {E1, E2, E3, ...} where E1, E2, and E3 are possible outcomes of the experiment
Define Correlation coefficient
Is a numerical summary that measures both the STRENGTH and DIRECTION of the relationship between 2 QUANTITATIVE variables - It's a unit-free numerical summary
Binomial Experiment
Is a statistical experiment that has the following properties: 1. The experiment consists of n trials 2. Each trial can only have one of two mutually exclusive outcomes, Denoted (S) Success or (F) Fail 3. The outcomes of the trials are independent 4. The probability of a success, p is constant from trial to trial
Spearman's correlation coefficient
Is calculated by first ranking the data for each quantitative variable and then applying the linear correlation coefficient formula to the rankings - 0.8 to 1: Strong - 0.5 to 0.8: Moderate - 0 to 0.5: Weak
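The "rank first, then apply the linear formula" recipe can be sketched as follows. The data are made up (y = x³, which is monotonic but not linear), the helper names `pearson` and `ranks` are hypothetical, and the ranking helper assumes there are no tied values:

```python
import math

def pearson(x, y):
    # Linear (Pearson) correlation coefficient for paired data.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    # Rank 1 = smallest value; assumes no ties, to keep the sketch short.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # assumed data: y = x**3

r = pearson(x, y)                     # strength of the LINEAR relationship
r_s = pearson(ranks(x), ranks(y))     # Spearman: Pearson applied to the ranks
print(round(r, 3), r_s)
```

Because y always increases with x, the Spearman coefficient is 1 even though the linear coefficient is slightly below 1.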
Response Variable
Is the focus of a question in a study or experiment.
How can a PROBABILITY distribution for a DISCRETE random variable be shown ?
It can be : - An itemized listing - A Table - A Graph - Or a function
Relative Frequency
The relative frequency of an event is the number of times the event occurs divided by the total number of times the experiment is conducted
Define Interval Probability and how it is calculated
It's the probability that a random variable X takes on a value between a and b (where a and b are constants) - Calculated: Adding the P(X = x) values over the interval
Define Upper Tail Probability and how it is Calculated
It's the probability that a random variable X takes on a value greater than or equal to a (where a is a constant) - Calculated: Adding the P(X = x) values in the upper tail, see slide 4-15 for an example
Define Cumulative Probability and how it is calculated
It's the probability that a random variable X takes on a value less than or equal to a (where a is a constant) - Calculated: Adding the P(X = x) values up to and including a, see slide 4-13 for an example
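Interval, upper tail and cumulative probabilities all come from adding P(X = x) values. A sketch, assuming X is the number of heads in 3 fair coin tosses and treating "between a and b" as including both end points (the function names are illustrative):

```python
from fractions import Fraction

# Assumed pmf: X = number of heads in 3 fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cumulative(a):
    # P(X <= a): add P(X = x) for every x up to and including a.
    return sum(p for x, p in pmf.items() if x <= a)

def upper_tail(a):
    # P(X >= a): add P(X = x) for every x from a upwards.
    return sum(p for x, p in pmf.items() if x >= a)

def interval(a, b):
    # P(a <= X <= b): add P(X = x) for every x in the interval.
    return sum(p for x, p in pmf.items() if a <= x <= b)

print(cumulative(1), upper_tail(2), interval(1, 2))  # 1/2 1/2 3/4
```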
Joint Probability Table
Joint probability is a measure of two events happening at the same time, and can only be applied to situations where more than one observation can occur at the same time
What are the 3 measures of Central Location ?
Mean, median, mode
Mutually Exclusive Events
No 2 outcomes in the sample space can occur simultaneously on any 1 trial in the experiment - A and B have no elements in common, they are disjoint or mutually exclusive: A ∩ B = Ø
Quantitative Variable
Numbers representing counts or measurements - Has 2 types: discrete and continuous
Continuous
Numbers that can take on infinitely many values within a range, e.g. Body temperature
Data
Observed values of variables
Multiplication Rule
P(A ∩ B) = P(A) P(B|A) The rule of multiplication applies to the following situation. We have two events from the same sample space, and we want to know the probability that both events occur.
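As a worked instance of the multiplication rule, here is a sketch assuming two cards are drawn without replacement from a standard 52-card deck, with A = "first card is an ace" and B = "second card is an ace" (an illustrative example, not one from the slides):

```python
from fractions import Fraction

p_a = Fraction(4, 52)          # P(A): 4 aces among 52 cards
p_b_given_a = Fraction(3, 51)  # P(B|A): 3 aces left among the remaining 51 cards

# Multiplication rule: P(A ∩ B) = P(A) P(B|A)
p_a_and_b = p_a * p_b_given_a
print(p_a_and_b)  # 1/221
```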
A listing of all the possible outcomes from an experiment using set notation is called the: Select one: a. sample space. b. experimental outcome list. c. sample point. d. event.
a. sample space
\bar{x} denotes the: a. population mean b. sample mean c. sample variance d. population variance
b. sample mean
\tilde{x} denotes the a. sample mean b. population mean c. sample median d. sample standard deviation
c. Sample median
If Ø is defined as the empty set, then P(Ø) must by definition be: Select one: a. between 0.5 and 1. b. less than 0. c. more than 0. d. equal to 0.
d. equal to 0
The variance, standard deviation and interquartile range are all measures of a. sample size b. central location c. central tendency d. spread
d. spread
A random variable that can take on an infinite number of values within the limits the variable ranges is said to be: Select one: a. discrete. b. compact. c. predictable. d. continuous
d. continuous
Suppose the random variable X denotes the number of shoppers who use the "self scan" aisle at the local supermarket during the day. X is a ________________ random variable. Select one: a. compact b. predictable c. continuous d. discrete
d. discrete
Law of Total Probability
Is a fundamental rule relating marginal probabilities to conditional probabilities. It expresses the total probability of an outcome which can be realized via several distinct events, hence the name. See 3-36 for the formula
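A sketch of the rule for the simplest partition, an event D and its complement D'. The numbers are made up for a hypothetical diabetes screening test, and the form used here, P(+) = P(+|D) P(D) + P(+|D') P(D'), is the standard two-event version; compare it with the formula on the slide:

```python
# Assumed (made-up) inputs for a hypothetical screening test.
p_disease = 0.10             # P(D)
p_pos_given_disease = 0.90   # P(+ | D)
p_pos_given_healthy = 0.05   # P(+ | D')

# Complement rule: P(D') = 1 - P(D)
p_healthy = 1 - p_disease

# Law of total probability over the partition {D, D'}.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy
print(round(p_pos, 3))  # 0.135
```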
Population Standard Deviation
Is equal to the square root of the variance - Expressed in the same units as the variable of interest
Addition Rule
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
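The addition rule can be checked against a direct count. A sketch assuming a fair six-sided die, with A = "even" and B = "greater than 3" (illustrative events; the helper `prob` is hypothetical):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # assumed: fair die, equally likely outcomes
a = {2, 4, 6}                      # A: roll is even
b = {4, 5, 6}                      # B: roll is greater than 3

def prob(event):
    # Equally likely outcomes: P(event) = outcomes in event / outcomes in S.
    return Fraction(len(event), len(sample_space))

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
by_rule = prob(a) + prob(b) - prob(a & b)

# Direct count of the union, for comparison.
by_count = prob(a | b)
print(by_rule, by_count)  # 2/3 2/3
```

Subtracting P(A ∩ B) stops the outcomes 4 and 6 from being counted twice.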
Probability Density Function
For continuous random variables, which can take on infinitely many values, probabilities are computed over an interval of values (as the area under the density curve) rather than at a single value
Dot Plots (Detect, sample size, help identify)
- For small sample sizes: <10 - Detects clusters - Helps identify the original population group under study
Relative frequency histogram ( Describe)
- A bar chart, horizontal axis = data classes, vertical axis = relative frequencies, bar height = relative frequency values - It's built from a frequency distribution table - Relative frequency values must be between 0 and 1
Modified Boxplots (WRITE THE FORMULAS FOR UPPER AND LOWER FENCES)
- Can be done in SPSS and manually - Identifies mild and extreme outliers - Step 1: Find quartiles, median, and IQR - Step 2: Find the 2 inner fences and the 2 outer fences (lower and upper) - IFL = Q1 - 1.5 x (IQR) - IFH = Q3 + 1.5 x (IQR) - OFL = Q1 - 3 x (IQR) - OFH = Q3 + 3 x (IQR)
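The fence calculation can be sketched as below, taking the lower inner fence as Q1 - 1.5 x IQR. The classification used here (mild = beyond an inner fence but within the outer fences, extreme = beyond an outer fence) is the usual convention and is assumed rather than quoted from the slides; the function names and the quartile values are illustrative:

```python
def fences(q1, q3):
    # Inner fences sit 1.5 IQRs beyond the quartiles, outer fences 3 IQRs.
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

def classify(value, q1, q3):
    (inner_low, inner_high), (outer_low, outer_high) = fences(q1, q3)
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "not an outlier"

# Assumed quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
print(fences(10, 20))  # ((-5.0, 35.0), (-20, 50))
print(classify(40, 10, 20), "|", classify(60, 10, 20))
```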
Name 2 Numerical Summaries for Quantitative Variables
1- Central Location (Central Tendency) 2- Spread (Dispersion)
Name 3 types of Graphical Summaries for Quantitative Variables and suitable sample size
1- Dot plots - small samples 2-Histograms and relative frequency histograms - Large samples 3- Box plots - large samples
What 2 measures of data are used for a bell-shaped distribution ?
1- Mean: To indicate central location 2- Sample variance / standard deviation: To indicate the spread of the data
What 2 measures are used with skewed data ?
1- Median (Central Location) 2- IQR (Spread)
State the terms involved with the 5-number summary (Write out formulas for Q1 and Q3)
1- Min: The minimum value of the data set 2- Q1: The 25th Percentile, separates the bottom 25% from the upper 75% of sorted values 3- Median: Denoted Q2, separates the upper and bottom halves; it is the middle value in the sorted set of values, at position (n+1)/2; for an even number of values, average the two middle values 4- Q3: The 75th Percentile, separates the bottom 75% from the upper 25% of sorted values: Q3 = n+(n+1/4)(n2-n1) 5- Max: The maximum value of the data set - IQR = Q3-Q1
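A sketch of the 5-number summary using Python's standard library. Quartile conventions differ between textbooks and software; `statistics.quantiles` with `method="exclusive"` uses positions based on n + 1, which may or may not match the lecture formula, so treat the quartile values as one convention among several. The data are made up:

```python
import statistics

data = sorted([12, 5, 14, 3, 8, 13, 7])  # assumed sample, n = 7

# Quartiles Q1, Q2 (median), Q3 using the (n + 1)-based "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1  # range of the middle 50% of the data

print(five_number_summary, iqr)  # (3, 5.0, 8.0, 13.0, 14) 8.0
```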
What are the 5 characteristics of distribution? (Quantitative Variables)
1- Shape: Symmetrical, bell-shaped , skewed 2- Centre: A typical value ( mean, median, mode) - Where most of the data lies 3- Spread (Dispersion): Variation in the data (standard deviation, IQR, and range) 4- Clusters: Groups of observations that give rise to a bi-modal or multi-modal appearance 5- Outliers: A data point that is not consistent with the rest of the data set
What 2 graphical summaries that show the relationship between the explanatory and response variable ?
1. Dot Plots (n<10) 2. Box Plots (n>10)
If the data in all groups is approximately bell-shaped then what measurements do we use for central location and spread ?
1. Mean = Central Location 2. Standard deviation = Spread
If at least one of the groups of data has a skewed distribution then what measurements do we use for central location and spread ?
1. Median = Central Location 2. IQR = Spread
Name the 3 measure of a population
1. Population mean 2.Population variance 3.Population standard deviation - All can be calculated using the PROBABILITY DISTRIBUTION of a DISCRETE RANDOM VARIABLE
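All three population measures can be computed from a probability distribution. A sketch assuming X is the number of heads in 2 fair coin tosses, and using the standard definitions μ = Σ x P(X = x) and σ² = Σ (x - μ)² P(X = x) (check these against the lecture formulas):

```python
import math

# Assumed pmf: X = number of heads in 2 fair coin tosses.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# Population mean: probability-weighted average of the values.
mu = sum(x * p for x, p in pmf.items())

# Population variance: probability-weighted average squared deviation from mu.
sigma_squared = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Population standard deviation: square root of the variance.
sigma = math.sqrt(sigma_squared)

print(mu, sigma_squared, round(sigma, 4))  # 1.0 0.5 0.7071
```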
Properties of Probability
1. Some events are more likely to occur than others 2. For an event A, we assign a number that conveys the likelihood of occurrence of A 3. This number is called the probability of the event A, denoted P(A)
What are the 2 requirements for a pdf ?
1. The pdf f(x) can't be negative for any value of x 2. The total area underneath the curve has to equal 1 - See 5-5 and 5-7 for 2 good examples
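Both requirements can be checked numerically for a candidate density. This is a sketch with an assumed density f(x) = 2x on [0, 1] (not one of the slide examples) and a simple midpoint-rule area approximation; the function names are illustrative:

```python
def f(x):
    # Assumed example density: f(x) = 2x on [0, 1], and 0 elsewhere.
    return 2 * x if 0 <= x <= 1 else 0.0

def area(a, b, steps=10_000):
    # Midpoint-rule approximation of the area under f between a and b.
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Requirement 1: f(x) is never negative (checked on a grid from -1 to 2).
non_negative = all(f(i / 100) >= 0 for i in range(-100, 201))

# Requirement 2: the total area under the curve equals 1.
total_area = area(0, 1)

# A probability is an area over an interval, here P(0 <= X <= 0.5).
p_interval = area(0, 0.5)

print(non_negative, round(total_area, 6), round(p_interval, 6))
```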
Properties of Probability
4. Given the sample space, the probabilities assigned to the outcomes must satisfy 2 basic requirements - The probability of an outcome must lie between 0 and 1: 0 ≤ P(Ei) ≤ 1 for all i - The sum of the probabilities of all of the outcomes in the sample space must equal 1
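The two requirements translate directly into a small check. A sketch with a hypothetical helper name `is_valid_assignment` and made-up probability assignments:

```python
import math

def is_valid_assignment(probabilities):
    # Requirement 1: every outcome's probability lies between 0 and 1.
    in_range = all(0 <= p <= 1 for p in probabilities.values())
    # Requirement 2: the probabilities of all outcomes sum to 1.
    sums_to_one = math.isclose(sum(probabilities.values()), 1.0)
    return in_range and sums_to_one

fair_coin = {"Heads": 0.5, "Tails": 0.5}    # assumed valid assignment
bad_sum = {"Heads": 0.7, "Tails": 0.5}      # sums to 1.2
bad_range = {"Heads": 1.2, "Tails": -0.2}   # sums to 1 but is out of range

print(is_valid_assignment(fair_coin), is_valid_assignment(bad_sum), is_valid_assignment(bad_range))
```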
Pareto Chart
A bar chart for categorical data, where bars are arranged in order according to frequencies or relative frequencies - Bars are arranged, descending from the tallest (at the left) to the smallest (at the right)
Variable
A characteristic of a subject, e.g. the height of students
Random Variable
A function that assigns a unique numerical value to each outcome in a sample space - Values cannot be predicted with certainty - Usually denoted by capital letters, e.g. X and Y
What graph is used to summarize the relationship between 2 categorical variables ?
A stacked bar graph : Is a graph that is used to break down and compare parts of a whole. Each bar in the chart represents a whole, and segments in the bar represent different parts or categories of that whole.
Categorical Variable
A variable whose values fall into non-numeric categories - Has 2 types: nominal and ordinal
Explanatory Variable
A variable that might cause or explain a change in the response variable
Continuous Probability Distributions / pdf
Completely describes the random variable and is used to compute probabilities associated with the random variable - Probability density function (pdf), or density curve: the probability distribution for a continuous random variable
Union of Events
Denoted A ∪ B, is the event that consists of all outcomes in the sample space that are in A or B or in both
Intersection of Events
Denoted A ∩ B, is the joint event that consists of all outcomes in the sample space that are in both A and B
Binomial Random Variable
Describes the outcome of a Binomial Experiment - Maps each outcome of a Binomial experiment to its number of successes, a value between 0 and n
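Probabilities for a binomial random variable can be sketched with the standard binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The formula is not written out in these cards, so compare it with the lecture slides; n = 5 and p = 0.5 are assumed values:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k): probability of exactly k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.5  # assumed: 5 trials, success probability 0.5

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]  # X takes values 0..n

p_two = pmf[2]               # P(X = 2)
p_at_least_one = 1 - pmf[0]  # complement rule: P(X >= 1) = 1 - P(X = 0)

print(p_two, p_at_least_one, sum(pmf))  # 0.3125 0.96875 1.0
```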
What are the 3 things that can show the distribution of the data set ?
Dot plots, histograms and boxplots: Detect shape, location, spread and outliers Dot plots and histograms: Detect Clusters Boxplots: Don't detect clusters
Simple Events
E1, E2 and E3 are called simple events - Its each individual trial in an experiment - You can't breakdown simple events down further
Frequency distribution table
For numerical data, it summarises and displays classes, frequencies, relative frequencies, and cumulative relative frequencies - SEE LECTURE SLIDES EXAMPLE WK-1 e.g. - Frequency interpretation: 4 observations fall between 130 and 134, excluding 134 - Relative frequency interpretation: 10% of the observations fall between 130 and 134, excluding 134 - Cumulative relative frequency interpretation: 85% of the observations are less than 134
Describe the numerical and graphical summaries for Categorical Variables (SEE WK 1 LECTURE NOTES)
Frequency and Relative Frequency Table 1- Count the frequency 2- Calculate the relative frequency 3- Store this all in a frequency table
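The three steps can be sketched for a categorical variable as follows, using made-up blood-type data:

```python
from collections import Counter

# Assumed categorical data (blood types of 10 people).
blood_types = ["A", "B", "A", "O", "A", "B", "AB", "O", "A", "O"]
n = len(blood_types)

# Step 1: count the frequency of each category.
counts = Counter(blood_types)

# Steps 2 and 3: compute relative frequencies and store both in a table.
table = {category: (freq, freq / n) for category, freq in counts.items()}

for category, (freq, rel_freq) in table.items():
    print(f"{category:<3} {freq:>2} {rel_freq:.2f}")
```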
Nominal
Has no order to it, characterised by data containing names, labels, and categories, e.g. Gender
Ordinal
Has some order to it, characterised by data containing names, labels, and categories, e.g. T-shirt size
Describe IQR and what it measures
IQR, or the Interquartile Range, is the difference between the 3rd and 1st quartile - IQR = Q3-Q1 - The IQR provides the range of the middle 50% of the data - Measures variation, like the standard deviation
Binomial Random Variable in Statistical Sampling
If the sample size is smaller than population successes then it is a binomial distribution
Continuous (Random Variable)
If the set of all possible values can take on an infinite number of values within the variable ranges i.e. Height, weight - Associated with measuring
Discrete (Random Variable)
If the set of all possible values is finite or countably infinite, e.g. Number of people taking STA1LS - Associated with counting
What is probability mass function (pmf) and how is it found ?
Pmf is denoted P(X = x) and is the probability that a discrete random variable X is equal to some specific value x - To find pmf look at x (NOT X) and outcomes then add the probabilities associated with outcomes - See slide 4-10 for properties of a valid probability distribution for a Discrete Random Variable
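The "look at each value x, then add the probabilities of the matching outcomes" procedure can be sketched as below, assuming an experiment of 2 fair coin tosses with X = number of heads:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed experiment: 2 fair coin tosses, so 4 equally likely outcomes.
sample_space = list(product("HT", repeat=2))
p_outcome = Fraction(1, len(sample_space))

# For each outcome, find the value x that X assigns to it (number of heads)
# and add that outcome's probability to P(X = x).
pmf = Counter()
for outcome in sample_space:
    x = outcome.count("H")
    pmf[x] += p_outcome

for x in sorted(pmf):
    print(x, pmf[x])
```

The resulting probabilities each lie between 0 and 1 and add up to 1.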
Population Variance
Population variance (σ²) tells us how data points in a specific population are spread out. It is the average of the squared distances from each data point in the population to the mean. - Expressed in squared units - Expected value of the squared deviations about the population mean
The Complement Rule (For a Binomial Random Variable)
States that the sum of the probabilities of an event and its complement must equal 1, or for the event A, P(A) + P(A') = 1 see 4-34 for formula
Complement Rule
The Complement Rule states that the sum of the probabilities of an event and its complement must equal 1, or for the event A, P(A) + P(A') = 1
What kind of table is used to summarize the relationship between 2 categorical variables ?
Two-way table: - A table that covers all contingencies for the combination of the 2 variables - Explanatory Variable = Rows - Response Variable = Columns
Define and describe Probability
The extent to which an event is likely to occur, measured by the ratio of the favorable cases to the whole number of cases possible. - It provides a link so that the sample statistics can be used to make inferences about the population parameters: - Population Parameters: Most difficult to measure i.e. the height of everyone at a University - Sample statistic i.e. mean height : Gives an estimation of population parameter
Population Mean
The population mean is an average of a group characteristic. The population mean symbol is μ. μ = (ΣX) / N - The population mean may not be a possible value of the random variable
Empty set
The probability = 0 - P(Ø) = 0 - That is, the empty set contains no outcomes
Probability of an Event
The probability of an event A is a number between 0 and 1 (including those end points) that measures the likelihood that A will occur - If the probability of an event is close to 1 then the event is likely to occur - If the probability of an event is close to 0 then the event is not likely to occur
Cumulative Probability / Cumulative distribution function (cdf)
The probability that a random variable X takes on a value less than or equal to a (where a is a constant) - cdf: see 4-32 for formula - see 4-35 and 4-36 for Cumulative Difference Rule I and II
Conditional Probability Rule
The probability that event A occurs, given that event B has occurred, is called a conditional probability. The conditional probability of A, given B, is denoted by the symbol P(A|B). - See Slide 3-29 for formula
Independence
Two events are independent when the occurrence of one does not affect the probability of the occurrence of the other. P(A|B) = P(A) or P(B|A) = P(B)
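Independence can be checked by comparing P(A|B) with P(A). A sketch assuming a fair six-sided die, with A = "even" and B = "at most 4", and using the standard conditional probability formula P(A|B) = P(A ∩ B) / P(B) (compare with slide 3-29); the helper `prob` is hypothetical:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # assumed: fair die, equally likely outcomes
a = {2, 4, 6}                      # A: roll is even
b = {1, 2, 3, 4}                   # B: roll is at most 4

def prob(event):
    # Equally likely outcomes: P(event) = outcomes in event / outcomes in S.
    return Fraction(len(event), len(sample_space))

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = prob(a & b) / prob(b)

# A and B are independent when knowing B does not change the probability of A.
independent = p_a_given_b == prob(a)
print(p_a_given_b, prob(a), independent)  # 1/2 1/2 True
```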
Mutually Exclusive Events
Two events are mutually exclusive if they have no sample points in common. P(A∩B) = 0
Discrete
Whole numbers that can be counted, e.g. Number of leaves on a plant