Stats Mid Session
Data discovery
- Allow decision makers to interactively organise or visualise data and perform preliminary analyses
- Can be used to take a closer look at historical or status data, quickly review data for unusual values or outliers, or to construct visualisations for management presentations
- In simplest form, involves drill-down: revealing the data that underlies a higher-level summary, eg: clicking on a merchandise entry could reveal more detailed info, eg: a table of sales by 'lands'. This summary can be drilled down further to reveal sales by each store in the theme park
Ethical issues
- Ethical considerations arise when deciding what results to include in a report
- Should document both good and bad results
- When making oral presentations and compiling written reports, need to give results in a fair, objective and neutral manner
- Unethical behaviour occurs when you wilfully choose an inappropriate summary measure to distort the facts in order to support a particular position
- Unethical behaviour occurs when you selectively fail to report pertinent findings because they would be detrimental to the support of a particular position
Framework to min errors of thinking and analysis
- Need to follow a framework/plan to min possible errors of thinking and analysis
- Eg: the following tasks help apply statistics to bus decision making:
  o 1) Define the data you want to study in order to solve a problem or meet an objective
  o 2) Collect data from appropriate sources
  o 3) Organise data collected by developing tables
  o 4) Visualise data collected by developing charts
  o 5) Analyse data collected to reach conclusions and present those results
    § NOTE: usually do this in the order listed. Must do first 2 to have meaningful outcomes, but order of other 3 can change or appear inseparable
    § When applying statistics to decision making, should be able to identify all 5 tasks, and should verify that you did first 2 before other 3
Numerical descriptive measures
- Numerical measures can be used to summarise and describe numerical data
- They are precise, objectively determined, and easy to manipulate, interpret and compare
- Allow for careful analysis of data
- Can describe a data set by describing its central tendency, variation and shape
Statistical independence
- Occurrence of event doesn't affect occurrence of second event
Relative frequency and percentage distributions
- Often more useful as shows the proportion or percentage of data that falls into each class
- When comparing two or more samples with different sample sizes, use these
- Relative frequency distribution: summary table for numerical data which gives proportion of data values in each class
- Obtained by dividing frequency in each class by total number of values
- From this, a % distribution can be obtained by multiplying each relative frequency by 100
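Eg: the divide-by-total step can be sketched in Python (the category labels and data are made up for illustration):

```python
from collections import Counter

def relative_frequencies(values):
    # proportion of data values in each class/category = frequency / total n
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}

data = ["A", "A", "B", "C", "A", "B"]
rel = relative_frequencies(data)
pct = {k: v * 100 for k, v in rel.items()}  # % distribution: proportion x 100
```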
Simple random sample
- One where each item in frame has equal chance of being selected
- Every sample of a fixed size has same chance of selection as every other sample of that size
- It is most elementary random sampling technique; forms basis for other random sampling techniques
- Use n to represent sample size and N to represent frame size
- Number every item in frame from 1 to N. Chance you will select any particular member of frame on first draw is 1/N
- Simpler to use but generally less efficient than other, more sophisticated probability sampling methods
- May not be a good rep of pop's underlying characteristics
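Eg: a minimal Python sketch of drawing a simple random sample (the frame size N = 800 and n = 40 are example numbers, not from any data set):

```python
import random

N, n = 800, 40                    # frame size and sample size (example numbers)
frame = list(range(1, N + 1))     # items numbered 1 to N
random.seed(0)                    # seed only for a reproducible illustration
sample = random.sample(frame, n)  # each item has an equal chance; no repeats
```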
Numerical descriptive measures for a population
- Population summary measures are called parameters
- Population mean: mean calculated from population data
- Population variance and standard deviation:
  o Population variance: variance calculated from population data
  o Population standard dev: standard dev calculated from pop data
  o They measure variation in a pop about the mean
  o Pop standard dev is square root of pop variance
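Eg: Python's statistics module has population versions of these measures (the data values are made up for illustration):

```python
from statistics import pstdev, pvariance

population = [2, 4, 4, 4, 5, 5, 7, 9]  # treat this list as the whole population
var = pvariance(population)            # population variance (divides by N)
sd = pstdev(population)                # pop standard dev = square root of variance
```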
Probability distribution for discrete random variable
- Probability distribution for a discrete random variable is a mutually exclusive list of all possible numerical outcomes of the random variable, with a prob of occurrence associated with each outcome
- For a prob distribution for a discrete random variable:
  o All probs must be between 0 and 1
  o The sum of probs must equal 1
- Random variable: represents a possible numerical value from an uncertain event
- Discrete random variables: can only assume a countable number of values
Joint probability
- Probability of an occurrence described by two or more characteristics
- Eg: prob you get a head on the first coin toss and a head on the second coin toss
Survey error: coverage error
o Occurs when all items in frame don't have equal chance of being selected. Causes selection bias
o Occurs if certain groups of items are excluded from frame so have no chance of being selected in sample
o If frame inadequate due to certain groups not properly included, any random probability sample selected will provide an estimate of characteristics of frame, not actual population
o Eg: computer-based surveys are useful where subjects all have Internet access. Coverage errors could result if unemployed, elderly or Indigenous communities are not included in frame due to lack of internet access
o Can result in selection bias and becomes ethical issue if particular groups/individuals are purposely excluded from frame so survey results are skewed, indicating a position more favourable to survey's sponsor
Select samples with replacement or without replacement
o Sampling with replacement: after you select an item you return it to the frame, where it has same probability of being selected again
o Sampling without replacement: once you select an item, it cannot be selected again. Chance you select any particular item in frame on first draw is 1/N. Chance you will select any item not previously selected on second draw is now 1 out of N - 1. Process continues until you have selected desired sample of size n
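Eg: the two schemes map directly onto two standard-library functions (the frame here is a made-up list of 10 items):

```python
import random

frame = [f"item{i:02d}" for i in range(1, 11)]  # N = 10 (hypothetical frame)
random.seed(1)
with_repl = random.choices(frame, k=8)    # with replacement: repeats possible
without_repl = random.sample(frame, k=8)  # without replacement: 1/N, then 1/(N-1), ...
```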
Range
o Simplest numerical descriptive measure of variation in a set of data
o Range = X largest - X smallest
o Measures total spread of data
o Based only on two extreme values and ignores all other values, so doesn't take into account how data are distributed
o Distorted by very high or low values, so care needed when using it
Sampling process begins by defining the frame
o The frame: listing of items that make up population
o Frames are data sources such as population lists, directories or maps
o Samples are drawn from these frames
o Inaccurate/biased results can occur if frame excludes certain groups of pop
Median
o The value that splits an ordered set of data into 2 equal parts
o Isn't affected by outliers, so may be a better measure of central tendency when there are extreme values
o Middle value in set of data that has been ordered from lowest to highest
o Position of median in the ordered data = (n + 1) / 2 (this gives the ranked position, not the value itself)
o If odd number of values, median is middle ranked value
o If even number of values, median is mean of two middle ranked values
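Eg: the odd/even rule above, sketched in Python:

```python
def median(values):
    s = sorted(values)  # order data from lowest to highest first
    n = len(s)
    # (n + 1) / 2 gives the ranked POSITION of the median, not its value
    if n % 2 == 1:
        return s[n // 2]                    # odd n: middle ranked value
    return (s[n // 2 - 1] + s[n // 2]) / 2  # even n: mean of two middle values
```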
Tables of random numbers:
o Used for selecting the sample
o Consists of series of digits listed in randomly generated sequence
o Because numeric system uses 10 digits (0, 1, 2...9), chance you will randomly generate any particular digit is equal to probability of generating any other digit
  § This probability is 1 out of 10
  § So, if sequence of 800 digits is generated, would expect about 80 of them to be digit 0, 80 to be digit 1, etc
o Because every digit or sequence of digits in table is random, table can be read horizontally or vertically
o Margins of table designate row numbers and column numbers. Digits themselves are grouped into sequences of 5 in order to make reading table easier
o First need to assign code numbers to individual members of frame. Then get random sample by reading table of random numbers and selecting those individuals from frame whose assigned code numbers match digits found in table
Mode
o Value in data set that appears most frequently
o Extreme values don't affect it
o Use it only for descriptive purposes as it is more variable from sample to sample than either mean or median
o Often no mode or several modes
o If have two modes, data set is bimodal
Data def
observed values of variables, collection or set of values. Without it, can't get stats
Sample def
portion of population selected for analysis. They represent a portion/subset of population
Non-probability samples
- Judgement sample
- Quota sample
- Chunk sample
- Convenience sample
Ordinal scale
- Level of measurement
  o Data from categorical variable
  o Classifies data into distinct categories where ranking is implied
  o Eg: answers like very happy, happy, not happy, very not happy
  o Stronger than nominal scaling but still relatively weak form of measurement as it doesn't account for the amount of difference between categories. Ordering implies only which category is greater/better, but not by how much
Nominal scale
- Level of measurement
  o Data from categorical variable
  o Classifies data into various distinct categories in which no ranking is implied
  o Weakest form of measurement because no ranking
  o Eg: political party affiliation, gender, hair colour
Ratio scale
- Level of measurement
  o Data from numerical variable, applies to discrete and continuous
  o An ordered scale where difference between measurements involves a true 0 point. 0 represents the absence of the phenomenon being considered
  o Eg: length, weight, age, salary, height, profit and loss
  o Highest level of measurement
Interval scale
- Level of measurement
  o Data from numerical variable, applies to discrete and continuous
  o An ordered scale where difference between measurements is a meaningful quantity but doesn't involve a true 0 point
  o Eg: shoe size (as shoe size doesn't have a true zero), temperature, calendar time
Collecting data
- Managing a bus effectively requires collecting appropriate data
- Important that correct inferences are drawn from research, and appropriate stat methods assist in making the right decision
- Usually data collected from a sample
Measures of central tendency
- Many data sets have a distinct central tendency, with data values grouped around a point
- Mean:
  o Arithmetic mean most common measure of central tendency
  o Uses all data values, so affected by outliers (if there are outliers, take care when using mean as measure of central tendency)
  o Calculated by adding all values of data set then dividing by number of values in set
  o X bar used to represent the mean of a sample
Data cleaning
- May find irregularities in values you collect
- Undefined value: in a categorical variable, a value that doesn't represent one of the categories defined for the variable
- Impossible value: in a numerical variable, a value that falls outside the defined range of possible values for the variable
- Outliers: found in a numerical variable without a defined range of possible values; values that seem excessively different from most values. May or may not be errors, but demand a second review
- Missing values: values not able to be collected, and therefore not available for analysis, eg: record a non-response to a survey question as a missing value
- When you spot an irregularity, may have to 'clean' the data collected
Expected value of discrete random variable
- Measure of central tendency; describes the centre of a prob distribution
- It is the mean of a discrete random variable
- To calculate it, multiply each outcome X by corresponding probability P(X) and then sum these products
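Eg: the sum-of-products calculation in Python (the outcomes and probs below are a made-up distribution):

```python
outcomes = [0, 1, 2, 3]           # possible values of X (made-up distribution)
probs = [0.1, 0.2, 0.4, 0.3]      # P(X): each between 0 and 1, summing to 1
assert abs(sum(probs) - 1.0) < 1e-9
expected_value = sum(x * p for x, p in zip(outcomes, probs))  # mean of X
```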
Covariance and its application in finance
- Measure of strength of relationship between two random variables, X and Y
- Positive covariance = positive relationship
- Neg covariance = neg relationship
- If two variables are independent, their covariance is 0
Z scores
- Measure of relative standing that takes into consideration both mean and standard dev
- Represents distance between a given observation and the mean, expressed in standard devs
- An outlier will have a large Z score, either positive or neg
- Useful in identifying outliers
- As a general rule, a value is said to be an outlier if its Z score is less than -3.0 or greater than +3.0, ie: the value is more than 3 standard devs below or above the mean
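Eg: the Z score and the +/- 3.0 rule sketched in Python (the cutoff and sample data are illustrative; the rule works best with reasonably large samples):

```python
from statistics import mean, stdev

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]  # distance from mean in standard devs

def outliers(values, cutoff=3.0):
    # flag values whose Z score is below -cutoff or above +cutoff
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > cutoff]
```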
Kurtosis
- Measures relative concentration of values in centre of distribution compared with the tails
- Based on differences around the mean raised to the fourth power
Coefficient of correlation
- Measures relative strength of linear relationship between two numerical variables
- Values range from -1 for a perfect negative linear correlation to +1 for a perfect positive linear correlation
- Perfect means that, if points are plotted in a scatter diagram, all the points will lie in a straight line
- With sample data, sample coefficient of correlation r can be calculated
- Correlation alone cannot prove there is a causal effect, ie: that the change in value of one variable caused the change in the other. A strong correlation can be produced simply by chance, by the effect of a third variable not considered in the calculation, or by a cause and effect relationship
- The closer the coefficient of correlation is to +1 or -1, the stronger the linear relationship
- When coefficient of correlation is near 0, there is little or no linear relationship between the two numerical variables
- Sign of coefficient of correlation indicates whether data are positively or negatively correlated
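Eg: a hand-rolled Python sketch of the sample coefficient of correlation r (sum of cross-products over the root of the product of sums of squares):

```python
from math import sqrt

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation of X and Y
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # always between -1 and +1
```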
Systematic sample
- Method involves selecting first element randomly then choosing every kth element after
- Divide N items in frame into n groups of k items where k = N / n
  o k = size of selection interval
  o N = pop size
  o n = sample size
- Round k to nearest integer
- To select a systematic sample, choose first item at random from first k items in frame. Then select remaining n - 1 items by taking every kth item thereafter from entire frame
- Eg: to take a systematic sample of n = 40 from a pop of N = 800 employees, partition frame of 800 into 40 groups, each of which contains 20 employees. Select a random number from first 20 individuals and include every 20th individual after first selection. Eg: if first number selected is 008, subsequent selections are 028, 048, 068, 088, 108.....768, 788
- If frame consists of a listing of prenumbered cheques, sales receipts or invoices, a systematic sample is faster and easier to take than a simple random sample
- Convenient mechanism for collecting data from phone directories, class rosters and consecutive items coming off an assembly line
- Simpler to use but generally less efficient than other, more sophisticated probability sampling methods
- More possibilities for selection bias and lack of representation of pop characteristics than from simple random samples. To overcome potential disproportionate representation of specific groups in a sample, can use stratified or cluster sampling methods
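Eg: the N = 800, n = 40 example above sketched in Python (function name is illustrative):

```python
import random

def systematic_sample(frame, n):
    k = round(len(frame) / n)    # selection interval k = N / n, rounded
    start = random.randrange(k)  # random start within the first k items
    return frame[start::k][:n]   # then every kth item thereafter

random.seed(7)                   # seed only for a reproducible illustration
frame = list(range(1, 801))      # N = 800 employees
sample = systematic_sample(frame, 40)
```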
Continuous Probability Distributions
- A continuous random variable is a variable that can assume any value on a continuum (can assume an infinite number of values), ie: a measurement, eg: weight, height, time
- A normal distribution is a continuous probability distribution, ie: not discrete
- Continuous distributions are fundamental to understanding and applying inferential statistics techniques
- Continuous probability distributions include normal, uniform and exponential
- NOTE: the binomial distribution (see below) is a discrete distribution
Binomial Distribution
- A mathematical model (a mathematical expression representing a variable of interest)
- Means you can easily calculate exact probability of occurrence of any particular outcome of the random variable
- Binomial distribution: discrete probability distribution, where the random variable is the number of successes in a sample of n observations, from either an infinite population or sampling with replacement
- One of most important and widely used discrete probability distributions
- Arises when the discrete random variable is the number of successes in the sample of n observations
- The binomial distribution has 4 essential properties:
  o 1) The sample consists of a fixed number of observations, n
  o 2) Each observation is classified into one of two mutually exclusive and collectively exhaustive categories
    § Generally called 'success' and 'failure'
    § Probability of success is p, probability of failure is 1 - p
  o 3) The probability of an observation being classified as a success, p, is constant from observation to observation. Thus, the probability of an observation being classified as a failure, 1 - p, is also constant for all observations
    § Constant probability for each observation
    § Eg: prob of getting a tail is same each time we toss the coin
  o 4) The outcome (ie: success or failure) of any observation is independent of the outcome of any other observation. To ensure independence, the observations can be randomly selected either from an infinite population without replacement or from a finite population with replacement
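Eg: the binomial probability of exactly x successes can be sketched in Python using the combinations function:

```python
from math import comb

def binomial_pmf(n, p, x):
    # P(X = x): number of ways to place x successes among n observations,
    # times p^x for the successes and (1 - p)^(n - x) for the failures
    return comb(n, x) * p**x * (1 - p)**(n - x)
```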
Probability
- A numerical value that represents the chance, likelihood or possibility that a particular event will occur
- Given either as a proportion or fraction whose value lies between 0 and 1, inclusive
- Event: each possible outcome of a variable
- Event with no chance of occurring has prob of 0
- Event sure to occur has prob of 1
Coefficient of variation
- A relative measure of variation that is expressed as a %
- Measures the scatter in the data relative to the mean
- Useful when comparing 2 or more sets of data that have different units of measurement, or when the scale of the data sets is substantially different
- Often used in the stock market to see whether an investment is risky or not
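Eg: standard dev relative to the mean as a %, sketched in Python (function name is illustrative):

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    # sample standard deviation as a percentage of the mean
    return stdev(values) / mean(values) * 100
```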
Five number summary
- A way of describing numerical data
- Five number summary:
  o Consists of 5 statistics: smallest, Q1, median, Q3, largest
  o Characterises a sample/pop reasonably well
  o Useful for exploratory data analysis
  o Provides a way to determine the shape of distribution
Box and whisker plots
- A way of describing numerical data
  o Provides graphical rep of data based on 5 number summary
  o Shows range, IQR and quartiles
'Big data'
- Advances in info tech allow bus to collect, process and analyse very large volumes of data
- Def: large data sets characterised by their volume, velocity and variety
- Exists as both structured and unstructured data
Recoding variables
- After collecting data, may discover that you need to reconsider the categories defined for a categorical variable
- Recoded variable: variable that has been assigned new values that replace the original ones. Supplements or replaces original variable in analysis, eg: when defining households by location, the suburb/town recorded might be replaced by a new variable such as postcode
- When recoding, be sure that category definitions cause each data value to be placed in only one category, aka mutually exclusive (two events cannot occur simultaneously)
- Ensure the set of categories created for new, recoded variables includes all data values being recoded, aka collectively exhaustive (a set of events such that one of the events must occur)
- If recoding a categorical variable, can preserve one or more of the original categories, as long as recoded values are both mutually exclusive and collectively exhaustive
Contingency tables
- Aka cross-classification table
- Presents data for two categorical variables
- Rows contain categories of one variable and columns the categories of the other
- Intersections of each row and column category, called cells, contain joint responses, ie: data that are in the row category and also in the column category
- Depending on type constructed, cells may contain frequency, percentage of overall total, percentage of row total or percentage of column total
- To construct one, classify or sort data into one of r x c possible cells in the table, where r is number of row categories and c is number of column categories
- Note: cells must be mutually exclusive and exhaustive so each data value belongs to only one cell
- For further exploration of possible patterns/relationships, can construct contingency tables based on %
  o To do this, convert cell frequencies into % based on one of following 3 totals: overall total, row totals or column totals
Frequency distributions
- Allows you to condense a set of data
- A summary table where data is arranged into numerically ordered classes/intervals
- To construct one, first select appropriate number of classes and suitable class width (distance between upper and lower boundaries of a class)
- Classes should be exhaustive and mutually exclusive, so that any one data value belongs to only one class
- Number of classes chosen depends on amount of data. Should have at least 5 classes, but no more than 15
- Each class should be of equal width. To determine required width, divide range of data by required number of classes
  o Choose a class width that simplifies reading and interpretation of distribution and resultant graphs
- Centre of each class, called class mid-point, is halfway between lower and upper boundary of class
- Allows you to draw conclusions about major characteristics and shape of data
- Number of observations in each ordered class or interval becomes corresponding frequency of that class or interval
- To get % (aka relative frequency), divide frequency by total then x 100
- Cumulative frequency: running total of the frequencies, ie: 7, 7 + 2 = 9, etc
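Eg: counting values into equal-width classes, sketched in Python (the data, class width and boundaries below are made up for illustration):

```python
def frequency_distribution(values, lower, width, n_classes):
    # count each value into its equal-width class; classes are
    # [lower, lower+width), [lower+width, lower+2*width), ...
    counts = [0] * n_classes
    for v in values:
        i = int((v - lower) // width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

data = [3, 7, 12, 14, 18, 22, 27, 29]
counts = frequency_distribution(data, lower=0, width=10, n_classes=3)
cumulative = [sum(counts[:i + 1]) for i in range(len(counts))]  # running total
```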
Bus analytics applications: descriptive analytics
- Bus analytics: skills, tech and practices for continuous iterative exploration and investigation of past bus performance to gain insight and drive bus planning
- Analytics represents an evolution of pre-existing statistical methods combined with advances in info systems and techniques from management science
- Descriptive, predictive and prescriptive analytics form 3 broad categories of analytic methods
  o Descriptive analytics: explores bus activities that have occurred or are occurring
    § One of the main goals is giving decision makers the ability to collect, combine, organise and visualise data for day to day, or minute by minute, monitoring of business in the present, rather than business activity in the past
    § Real time monitoring useful for bus that handles perishable inventory (inventory that will disappear after a particular event takes place, eg: seat at a concert). By constantly monitoring sales, a promoter can use a dynamic pricing model where ticket prices fluctuate in near real time based on whether sales are exceeding or failing to meet predicted demand
    § Real time monitoring also useful for bus that manages flows of people/objects that can be adjusted in near real time, eg: monitoring flows of fans into a stadium and re-directing them to ease congestion
  o Predictive analytics: identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics
  o Prescriptive analytics: investigates what should occur and prescribes best course of action for future
Portfolio expected return
- Covariance, expected return and standard dev of the sum of two random variables can be applied to the study of investment portfolios, where investors combine assets into portfolios to reduce their risk
- Portfolio: combined investment in two or more assets
- Objective is to maximise the return while minimising the risk
- For such portfolios, rather than studying the sum of two random variables, each investment is weighted by the proportion of assets assigned to that investment
- Portfolio expected return: measure of central tendency; the mean return on investment
Covariance
- Covariance is a measure of strength and direction of linear relationship between 2 numerical variables (X and Y)
- Positive value indicates a positive linear relationship between the 2 variables; neg value indicates a neg relationship
- Value of 0 indicates no linear relationship between the variables
- Linear relationship can be graphed as a straight line, sloping upwards if positive and downwards if neg
- As covariance can have any value, it's difficult to use it as a measure of relative strength of a linear relationship. A better measure is the coefficient of correlation
Data formatting
- Data collected may be formatted in more than one way, eg: tables, contents of standard forms, continuous data stream
- Data can exist in either structured or unstructured form
- Structured data: data that follow some organising principle or plan, usually a repeating pattern. Tables and forms are structured, eg: once you identify that the second column of a table contains family names, you know all entries in the second column contain family names
- Unstructured data: follows no repeating pattern
- Electronic format: data in a form that can be read by a computer
- Encoding: representing data by numbers or symbols to convert data into usable form
What are the levels of measurement
- Data described in levels of measurement
- 4 levels of measurement: nominal, ordinal, interval, ratio scales
Dashboards
- Descriptive analytics method to present up to the minute operational status about a bus
- Provides info in a visual form that is easy to comprehend and review
- Contain summary tables and charts, plus newer or more novel forms of info presentation that can summarise data
- Gauges: visual display of data inspired by the speedometer in a car. Can consume a lot of visual space in a dashboard
- Bullet graph: horizontal bar chart inspired by a thermometer
  o Gauges are a popular choice in bus, but most info design specialists prefer bullet graphs because they foster direct comparison of each measurement
- Treemaps:
  o Dashboards may contain them
  o Help users visualise two variables, one of which must be categorical
  o Useful when categories can be grouped to form a multilevel hierarchy or tree
- Gauges, bullet graphs and treemaps use colour to rep the value of a second variable. But avoid using colour spectrums that run from red to green
- Sparklines: a descriptive analytics method that summarises time-series data as small, compact graphs designed to appear as part of a table. One of the descriptive analytics methods that dashboards can contain
Hypergeometric distribution
- Discrete probability distribution where the random variable is the number of successes in a sample of n observations from a finite population without replacement
- The random sample is selected without replacement from a finite population. Thus, the outcome of one observation is dependent on the outcomes of prior observations
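Eg: the hypergeometric probability of x successes can be sketched with combinations (here A denotes the number of successes in the population; parameter names are illustrative):

```python
from math import comb

def hypergeometric_pmf(N, A, n, x):
    # P(X = x): x successes in a sample of n drawn without replacement
    # from a population of N items containing A successes
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)
```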
Cumulative percentage polygons (ogives)
- Displays variable of interest along horizontal axis and cumulative percentages (percentiles) on vertical axis
- Percentile: value below which a given % of observations in a data set fall
Shape
- Distribution is symmetrical if lower and upper halves of graph are mirror images. Median and mean equal each other
- Distribution is skewed to right (positively skewed) if there is a long tail to the right. Indicates that most values are concentrated in lower portion of distribution. Usually mean is greater than median
- Distribution is skewed to left (negatively skewed) if there is a long tail to the left, ie: most values concentrated in upper portion of distribution. Usually mean is less than median
Ethical issues and probability
- Ethical issues can arise when any statements relating to prob are presented to the public, particularly when part of an advertising campaign for a product or service
- Many people are not comfortable with numerical concepts and tend to misinterpret the meaning of prob. Sometimes misinterpretation is not intentional, but ads may unethically try to mislead customers
- Eg: a Lotto commercial saying 'We won't stop until we have made everyone a millionaire' is a deceptive and possibly unethical application of probability. Misleading as, in a lifetime, no one can be certain of becoming a millionaire by winning Lotto
- A statement in an investment newsletter promising a 90% probability of a 20% annual return on investment is an example of a potentially unethical application of prob. To make the claim ethical, the author needs to (a) explain the basis on which this prob estimate rests, (b) provide the probability statement in another format, such as 9 chances in 10, and (c) explain what happens to the investment in the 10% of cases in which a 20% return is not achieved (eg: is the entire investment lost?)
The Chebyshev Rule
- For heavily skewed or non-bell shaped data sets, use this rule
- States that, for all data sets, pop or sample, the percentage of values within k standard devs of the mean must be at least (1 - 1/k^2) x 100%
- Can use this rule for any value of k greater than 1
- Consider k = 2. Rule states at least 75% of values must be within + or - 2 standard deviations of mean
- Gives the % of values that must at least be within a given distance from the mean
- Very general and applies to any distribution
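Eg: the (1 - 1/k^2) x 100% bound in Python (function name is illustrative):

```python
def chebyshev_min_percentage(k):
    # at least this % of values lie within k standard devs of the mean (k > 1)
    return (1 - 1 / k**2) * 100
```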
Cluster sample
- Frame is divided into representative groups (or clusters), then all items in randomly selected clusters are chosen
- Cluster: a naturally occurring grouping, such as a geo area, eg: postcodes, electorates
- Divide N items in frame into several clusters so that each cluster is representative of entire population. Then take a random sample of clusters and study all items in each selected cluster
- Often more cost effective than simple random sampling, particularly if pop is spread over a wide geo region. Often requires larger sample size to produce results as precise as those from simple random or stratified sampling
Cumulative percentage distributions
- Gives % of values that are less than a certain value
- A percentage distribution is used to form the corresponding cumulative % distribution
Summary Table
- Gives frequency, proportion or percentage of data in each category, which allows you to see differences between categories
- Lists categories in one column and frequency, percentage or proportion in separate column(s)
Misusing graphs and ethical issues
- Good graphical displays should present data in a clear, understandable way
- Many graphs are incorrect, misleading or unnecessarily complicated
- One principle of good graphs is that, when using 3D icons, frequency/quantity must be proportional to volume
- Good graphs should be properly scaled along each axis and clearly labelled
- Often improper use of vertical and horizontal axes leads to distortions in presenting data
- Vertical axis on a good graph should usually begin at zero
- The graph should not contain chartjunk
- Any 2D graph should contain a scale for each axis
- All axes should be properly labelled, and the graph should have a title
- Simplest possible graph should be used for a given set of data
- Eg: a 3D pie chart is unwise as it can complicate a viewer's interpretation of the data
- Doughnut, radar and surface charts may look visually striking, but in most cases obscure the data
- Inappropriate graphs raise ethical concerns, especially if they, deliberately or not, present a false impression of the data
Histograms
- Graphical rep of a frequency, relative frequency or % distribution
- The area of each rectangle represents the class frequency, relative frequency or percentage
- Horizontal axis is divided into intervals corresponding to the classes
- Rectangles are constructed above these intervals, the heights of which measure the frequency, relative frequency or % of data values in the class
- Vertical axis is either frequency, relative frequency or percentage
Scatter diagrams
- Graphical rep of relationship between two numerical variables
- Plotted points represent given values of the independent variable and corresponding dependent variable
- Independent variable on horizontal (x) axis, dependent variable on vertical (y) axis
- May show a positive linear relationship or a negative relationship
Decision Trees
- Graphical rep of simple and joint probabilities as vertices of a tree
- Aka tree diagram
Time series plot
- Graphical rep of value of a numerical variable over time
- Used to study patterns in the value of a variable over time
- Displays time period on horizontal axis and variable of interest on vertical axis
Pie charts
- Graphical representation of a summary table
- Often used for qualitative data
- Each category represented by a slice of a circle whose area represents the proportion or percentage share of the category relative to the total of all categories
- Is a circle used to represent the total, divided into slices
- If observing the portion of the whole that lies in a particular category is most important, use a pie chart
- Should be no more than 8 categories. If there are more than 8, merge the smaller categories into a category called 'other'
Bar Charts
- Graphical representation of a summary table
- Often used for qualitative data
- Length of each bar represents proportion, frequency or percentage of data values in a category
- Each category is represented by a bar
- If comparison of categories is most important, use a bar chart
Identifying sources of data
- Identifying most appropriate source of data is a critical aspect of statistical analysis
- If biases, ambiguities or other errors flaw the data being collected, then analysis won't produce accurate info
- 5 important sources of data are: data distributed by an org or individual, a designed experiment, a survey, an observational study, data collected by ongoing bus activities
- Data sources classified as either primary or secondary sources
o Primary: when the data being collected is used for analysis by the one collecting it
o Secondary: when another org/person has collected the data and then you use it
- Observational study: researcher observes behaviour directly, usually in its natural setting, eg: focus group
Stratified sample
- Items randomly selected from each of several populations, or strata
- Strata: subpopulations composed of items with similar characteristics in a stratified sampling design
- First subdivide the N items in the frame into separate subpopulations, or strata
- Select a simple random sample from each stratum, in proportion to the size of the strata, and combine the results from the separate simple random samples
- More efficient than either simple random or systematic sampling because you are assured of representation of items across the entire population
- Homogeneity of items within each stratum provides greater precision in estimates of underlying population parameters
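The subdivide-then-sample-in-proportion steps above can be sketched in Python. This is a minimal illustration, not from the notes: the employee strata and the `stratified_sample` helper are hypothetical.

```python
import random

def stratified_sample(strata, total_n, seed=0):
    """Draw a simple random sample from each stratum,
    in proportion to the stratum's share of the population."""
    rng = random.Random(seed)
    pop_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        k = round(total_n * len(items) / pop_size)  # proportional allocation
        sample[name] = rng.sample(items, k)
    return sample

# Hypothetical frame: 60 full-time and 40 part-time employees
frame = {"full_time": list(range(60)), "part_time": list(range(60, 100))}
s = stratified_sample(frame, total_n=10)
print({k: len(v) for k, v in s.items()})  # {'full_time': 6, 'part_time': 4}
```

Combining the two simple random samples afterwards gives the overall stratified sample of 10 items.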
Events and sample spaces
- Random experiment: precisely described scenario that leads to an outcome that cannot be predicted with certainty, eg: toss a coin twice and record whether heads or tails occur
- An event: specified by one or more outcomes of a random experiment. Event said to have occurred if one of the outcomes specified has occurred, eg: when rolling a die, the event of an even number consists of 3 outcomes: 2, 4, 6
- Sample space: collection of all possible outcomes, eg: rolling a die has 6 simple events: 1, 2, 3, 4, 5, 6
- Simple event:
o An event specified by a single outcome of a random experiment
o Denoted A
o An outcome from a sample space with one characteristic
- Joint event:
o Event described by two or more characteristics
o Can be a simple event, eg: in the experiment of tossing a coin twice, the simple event HH has two characteristics, H on the first toss and H on the second
o Involves two or more characteristics simultaneously
o Denoted A∩B
- Complement of event A:
o Includes all simple events not included in event A, eg: when tossing a coin, the complement of a head is a tail
o Denoted A'
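The coin-toss experiment above can be enumerated directly; a small Python sketch (not part of the notes) showing a joint event and a complement as set operations:

```python
from itertools import product

# Sample space for tossing a coin twice: all 4 possible outcomes
sample_space = set(product("HT", repeat=2))

A = {o for o in sample_space if o[0] == "H"}   # event: head on first toss
B = {o for o in sample_space if o[1] == "H"}   # event: head on second toss
joint = A & B                                  # A ∩ B: the simple event HH
complement_A = sample_space - A                # A': tail on first toss

print(joint)         # {('H', 'H')}
print(complement_A)  # {('T', 'H'), ('T', 'T')}
```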
Sampling
- Rather than taking a complete census of the pop, statistical sampling procedures focus on collecting a small representative group of the larger pop. Results used to estimate characteristics of the entire pop
- Draw a sample because it is less time consuming and less costly to admin than a census, and less cumbersome and more practical to admin than a census
Probability samples
- Simple random sample
- Systematic sample
- Stratified sample
- Cluster sample
Survey errors
- Surveys subject to potential errors
- Good survey research design attempts to reduce these survey errors, often at considerable cost
- Coverage error, non-response error, sampling error, measurement error
Pitfalls in numerical descriptive measures and ethical issues
- The next step is analysis and interpretation of the calculated statistics
- Analysis is objective, but interpretation is subjective
- Avoid errors that may arise either in the objectivity of analysis or in the subjectivity of interpretation
- Objectivity in data analysis means reporting the most appropriate descriptive summary measures for a given data set
- Data interpretation is subjective as different people form different conclusions when interpreting analytical findings
- Must attempt to present findings in a fair, neutral and transparent manner
Marginal probability
- The probability P(A) of an occurrence of an event A described by a single characteristic
- Marginal probability of an event is the sum of joint probabilities
- Mutually exclusive: two events are mutually exclusive if the two events cannot occur simultaneously, eg: heads and tails on a coin toss
- Collectively exhaustive: a set of events is collectively exhaustive if one of the events must occur, eg: heads and tails on a coin toss
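The "sum of joint probabilities" point can be shown with a tiny Python example. The customer characteristics and the probability values are hypothetical, made up for illustration:

```python
# Hypothetical joint probabilities for two characteristics of a customer:
# where they buy (online/store) and age group (under30/over30)
joint = {
    ("online", "under30"): 0.20,
    ("online", "over30"):  0.30,
    ("store",  "under30"): 0.15,
    ("store",  "over30"):  0.35,
}

# Marginal probability of buying online: sum the joint probabilities
# over every value of the other characteristic
p_online = sum(p for (channel, age), p in joint.items() if channel == "online")
print(p_online)  # 0.5
```

Note the four joint events are mutually exclusive and collectively exhaustive, so all the joint probabilities sum to 1.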
Standard error
- The standard dev divided by square root of sample size
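As a one-line formula, standard error = standard deviation / √n; a minimal sketch with made-up numbers:

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return std_dev / math.sqrt(n)

# eg: standard dev of 10 with a sample of 25 observations
print(standard_error(10, 25))  # 2.0
```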
Side-by-side bar charts
- To display results of contingency table data, construct this
Evaluating survey worthiness
- To identify surveys that lack objectivity or credibility, must critically evaluate what you read and hear by examining the worthiness of the survey
- First, must evaluate the purpose of the survey, why conducted and for whom, eg: an opinion poll conducted to satisfy curiosity is mainly for entertainment
- Second, determine whether it is based on a probability or a non-probability sample
o Only way to make correct statistical inferences from a sample to a pop is through use of a probability sample
o Surveys that use non-probability sampling methods are subject to serious, perhaps unintentional, bias that may render results meaningless
Poisson Distribution
- Used to calculate probs when counting number of times a particular event occurs in an interval of time or space if:
o 1) Probability an event occurs in any interval is the same for all intervals of the same size
o 2) Number of occurrences of the event in one interval is independent of the number in any other non-overlapping interval
o 3) Probability of two or more occurrences of the event in an interval approaches 0 as the interval becomes smaller
- If these properties hold, then the average or expected number of occurrences over any interval is proportional to the size of the interval
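Under these properties, the probability of exactly x occurrences when the expected number per interval is λ is P(X = x) = e^(−λ) λ^x / x!. A minimal Python sketch of that formula (the customer-arrival numbers are made up):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = e**(-lam) * lam**x / x!  for a Poisson distribution
    with expected number of occurrences lam per interval."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# eg: on average 3 customers arrive per minute; prob exactly 2 arrive
p = poisson_pmf(2, 3.0)
print(round(p, 4))  # 0.224
```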
Measures of variation
- Variation measures the spread or dispersion of values in a data set
- Includes range, interquartile range, variance and standard deviation
Polygons
- When comparing two or more sets of data, construct polygons on the same axes
- Percentage polygon: constructed by plotting the % for each class above the respective class mid-point and joining the mid-points by straight lines. Graph extended at each end to classes with frequency of 0 so the polygon starts and ends on the horizontal axis
Calculating numerical descriptive measures from a frequency distribution
- When have frequency distribution and raw data not available, you can calculate approximations of mean and standard dev by assuming all values within each class are located at class mid-point
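The mid-point approximation described above can be sketched in Python. The class intervals and frequencies here are hypothetical, purely for illustration:

```python
import math

# Hypothetical frequency distribution: (class interval, frequency)
classes = [((0, 10), 4), ((10, 20), 8), ((20, 30), 6), ((30, 40), 2)]

n = sum(f for _, f in classes)
# Assume every value within a class is located at the class mid-point
mean = sum(f * (lo + hi) / 2 for (lo, hi), f in classes) / n
var = sum(f * ((lo + hi) / 2 - mean) ** 2 for (lo, hi), f in classes) / (n - 1)
std = math.sqrt(var)

print(mean)  # 18.0 (approximation to the true sample mean)
```

These are only approximations; with the raw data available you would always compute the exact measures instead.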
Organising numerical data
- When have a large amount of raw numerical data, first step is to present data as an ordered array or stem-and-leaf plot
- These are of limited use when have very large quantities of data or data is highly variable. In this case, use tables and graphs to condense/present data
- Ordered arrays:
o Numerical data sorted by order of magnitude, ie: smallest to largest
o Eg: 2, 4, 8, 10, 15, etc
o Provides some signals about variability within the range, may help identify outliers
o If data set is large or data highly variable, ordered array less useful
- Stem-and-leaf displays:
o Graphical rep of numerical data
o Partitions each data value into a stem portion and a leaf portion
o Allows you to see how data is distributed and where concentrated
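The stem/leaf partitioning can be sketched in a few lines of Python. The two-digit marks data is hypothetical, and stems are taken as the tens digit (a common convention, though the notes don't fix one):

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Partition each value into a stem (tens) and a leaf (units) portion."""
    plot = defaultdict(list)
    for value in sorted(data):      # sorting also yields the ordered array
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

marks = [23, 25, 31, 31, 36, 42, 47, 47, 48, 54]  # hypothetical data
for stem, leaves in stem_and_leaf(marks).items():
    print(stem, "|", " ".join(map(str, leaves)))
# 2 | 3 5
# 3 | 1 1 6
# 4 | 2 7 7 8
# 5 | 4
```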
Counting Rules
- When there are a large number of possible outcomes and it is difficult to determine the exact number, rules for counting the number of possible outcomes have been developed
- Counting rule 1:
o If any of k different mutually exclusive and collectively exhaustive events can occur on each of n trials, the number of possible outcomes is equal to k to the power of n
o Eg: toss a coin 5 times, the number of different possible outcomes is 2 to the power of 5, ie: 2 x 2 x 2 x 2 x 2 = 32
- Counting rule 2:
o Allows the number of possible events to differ from trial to trial
o If there are k1 events on the first trial, k2 events on the second trial, ..., and kn events on the nth trial, then the number of possible outcomes is k1 x k2 x ... x kn
o Eg: at one stage, standard NSW number plates consisted of 3 letters and 3 numbers. How many possible number plates are there? 26 x 26 x 26 x 10 x 10 x 10 = 17,576,000, ie: 26 to the power of 3 x 10 to the power of 3
- Counting rule 3:
o Involves calculating the number of ways that a set of items can be arranged in order
- Counting rule 4:
o Number of ways in which a subset of an entire group of items can be arranged in order
o Each possible ordered arrangement is called a permutation
- Counting rule 5:
o When interested in the number of ways that X items can be selected from n items, irrespective of order
o Each unordered selection is called a combination
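The five counting rules map directly onto small Python expressions (the rule 3-5 numbers below are made-up examples, not from the notes):

```python
import math

# Rule 1: k**n outcomes - toss a coin (k = 2) five times (n = 5)
print(2 ** 5)             # 32

# Rule 2: k1 x k2 x ... x kn - NSW plates: 3 letters then 3 digits
print(26 ** 3 * 10 ** 3)  # 17576000

# Rule 3: n! orderings of a set of n items, eg: n = 4
print(math.factorial(4))  # 24

# Rule 4: permutations - ordered arrangements of 3 items chosen from 5
print(math.perm(5, 3))    # 60

# Rule 5: combinations - unordered selections of 3 items from 5
print(math.comb(5, 3))    # 10
```

`math.perm` and `math.comb` require Python 3.8+; on older versions use `math.factorial` ratios instead.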
Empirical Rule
- You can use the empirical rule to examine variability in bell-shaped distributions, both pop and sample
- The rule states that for bell-shaped distributions:
o Approx 68% of values are within a distance of + or - 1 standard dev from the mean. That is, approx 68% of the data values have Z scores between -1 and 1
o Approx 95% of values are within a distance of + or - 2 standard devs from the mean. That is, approx 95% of the data values have Z scores between -2 and 2
o Approx 99.7% of values are within a distance of + or - 3 standard devs from the mean. That is, approx 99.7% of data values have Z scores between -3 and 3
- Helps identify outliers when analysing a set of numerical data
- As a general rule, values outside the middle 95% (beyond 2 standard devs from the mean) are potential outliers, and those beyond 3 standard devs from the mean are almost always considered outliers
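The Z-score outlier check can be sketched in Python. The data values below are hypothetical, and the |Z| > 2 cutoff is the "potential outlier" rule from above:

```python
import statistics

data = [48, 50, 51, 49, 52, 50, 47, 53, 50, 95]  # hypothetical sample
mean = statistics.mean(data)
std = statistics.stdev(data)

# Z score = (value - mean) / standard deviation; flag potential
# outliers as values more than 2 standard devs from the mean
outliers = [x for x in data if abs((x - mean) / std) > 2]
print(outliers)  # [95]
```

Note the empirical rule only applies to roughly bell-shaped data; for skewed data the Chebyshev-style bounds are safer.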
Types of variables
1) Categorical
o Aka qualitative variables
o Yield categorical responses, eg: yes, no, male, female
2) Numerical
o Aka quantitative variables
o Yield numerical responses, eg: height in cms
o There are 2 types of numerical variables:
§ 1) Discrete variables: produce numerical responses that arise from a counting process, a finite number of integers
§ 2) Continuous variables: produce numerical responses that arise from a measuring process, eg: height
2 branches of statistics
1) Descriptive statistics: focuses on collecting, summarising and presenting a set of data, eg: surveys, tables, graphs and characterisation of data
2) Inferential statistics: uses sample data to calculate statistics that provide estimates of characteristics of the entire pop. Drawing conclusions about a pop based on sample data, ie: estimating a parameter based on a statistic
- Both are applicable to managing a bus
Statistics def
Branch of mathematics that examines ways to process and analyse data. Provides procedures to collect and transform data in ways that are useful to bus decision makers. A statistic is a numerical measure that describes a characteristic of a sample.
Variables def
Characteristics/attributes that can be expected to differ from one individual to another, eg: age, country of birth, weight, etc.
Population def
Consists of all members of a group about which you want to draw a conclusion
Parameter def
Numerical measure that describes a characteristic of a pop. Info from the whole pop is needed to calculate a parameter
Sampling process: once select frame, draw a sample from frame
o 2 kinds of samples: non-probability and probability samples
o Non-probability sample:
§ Select items or individuals without knowing their probabilities of selection
§ Common type is convenience sampling: items selected based on the fact that they are easy, inexpensive or convenient to sample
§ Judgement sample: get opinions of preselected experts in the subject matter as to who should be included in the survey
§ Also includes quota sample and chunk sample
§ Advantages: convenience, speed and lower cost
§ Disadvantages: lack of accuracy due to selection bias and poorer capacity to provide generalised results
§ Should restrict use of non-prob sampling methods to situations where want a rough approx at low cost to satisfy curiosity about a particular subject
o Probability sample:
§ Select items based on known probabilities
§ Whenever possible, should use this method
§ Includes simple random sample, systematic sample, stratified sample, cluster sample (see below for info)
§ Vary in cost, accuracy and complexity
§ Samples based on these methods allow you to make unbiased inferences about the pop of interest
§ In practice, often difficult/impossible to take a probability sample. But should work towards achieving it and acknowledge any potential biases that may exist
3 approaches to assigning a prob to an event
o A priori classical probability
§ Prob of an event based on prior knowledge of the process involved
§ Eg: already know a deck has 52 cards and a die has 6 faces
§ In the simplest case, each outcome is equally likely and the chance of occurrence of the event is X / T (X = number of ways the event occurs, T = total number of possible outcomes)
§ Eg: prob of selecting a black card from a deck is 26/52 = 0.5
§ Number of ways the event occurs and total number of possible outcomes are known
o Empirical classical probability
§ Probabilities based on observed data, not prior knowledge of a process
§ Eg: proportion of registered voters who prefer a certain political candidate, proportion of students who have a part-time job, eg: survey them
o Subjective probability
§ Differs from person to person
§ Eg: development team assigns a prob of 0.6 for chance of success but managing director is less optimistic and says 0.3
§ Assignment of prob usually based on a combination of an individual's prior knowledge, personal opinion and analysis of the particular situation
§ Useful when making decisions in situations where cannot use a priori classical or empirical classical prob
Statistical methods in different bus areas
o Accounting: use stat methods to select samples for auditing and to understand cost drivers in cost accounting
o Finance: uses stat methods to choose between alternative portfolio investments and track trends in financial measures over time
o Management: use stat methods to improve quality of products manufactured or services delivered by an org
o Marketing: use stat methods to estimate proportion of customers who prefer one product over another and draw conclusions about what advertising strategy might be most useful in increasing sales
Contingency Tables
o Aka cross-classification table
o Represents sample space for a joint event classified by two characteristics
o Each cell represents the joint event satisfying given values of both characteristics
Variance and standard deviation
o Both take into account how all values in the data set are distributed
o Variance: measure of variation based on squared deviations from the mean
o Standard deviation: square root of the variance
o Measure the average scatter around the mean, how larger values fluctuate above it and how smaller values are distributed below it
o Based on the difference between each data value and the mean
o Sum of squares = sum of squared deviations. Because of this, neither variance nor standard deviation can be negative
o Only 0 if there is no variation, that is, all values are equal
o Sample standard dev is the more useful measure because, unlike sample variance, the value is expressed in the same units of measurement as the original sample data
o Square root of sample variance is sample standard dev
o For most data sets, the majority of data values lie within 1 standard dev of the mean
- The more spread out/dispersed the data, the larger the range, IQR, variance and standard dev
- The more concentrated/homogeneous the data, the smaller the range, IQR, variance and standard dev
- If values are all the same, range, IQR, variance and standard dev will all = 0
- None of the measures of variation (range, IQR, standard dev and variance) can ever be negative
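The sum-of-squares steps above can be sketched in Python with a small made-up data set:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)
mean = sum(data) / n

# Sum of squares: squared deviations from the mean (never negative)
sum_sq = sum((x - mean) ** 2 for x in data)

sample_variance = sum_sq / (n - 1)       # divide by n - 1 for a sample
sample_std = math.sqrt(sample_variance)  # same units as the original data

print(mean, sample_variance)  # 5.0 4.571428571428571
```

Dividing by n - 1 rather than n is the usual sample (not population) convention, matching Python's `statistics.variance`.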
Survey error: measurement error
o Difference between survey results and true value of what is being measured
o Process of getting the measurement often governed by what is convenient, not what is needed
o Can occur due to wording of a question. Questions should be clear, not ambiguous or leading
o 3 sources of measurement error: ambiguous wording of questions, the halo effect and respondent error
o Halo effect occurs when respondent feels an obligation to please the interviewer. Proper interviewer training can minimise the halo effect
o Respondent error occurs due to overzealous or underzealous effort by the respondent. Can minimise this by carefully scrutinising data and calling back individuals whose responses seem unusual, and by establishing a program of random call-backs to determine the reliability of responses
o Becomes an ethical issue in one of 3 ways: 1) survey sponsor chooses leading questions that guide responses in a particular direction, 2) interviewer, through mannerisms and tone, purposely creates a halo effect or guides responses in a particular direction, and 3) respondent, having a disdain for the survey process, wilfully provides false info
Interquartile range
o Difference between the third and first quartiles in a set of data
o IQR = Q3 - Q1
o More meaningful measure of variation than the range as it ignores extreme values by finding the range of the middle 50% of the ordered array
o Not affected by extreme values (called a resistant summary measure)
Survey error: sampling error
o Difference in results for different samples of the same size
o Reflects heterogeneity, or 'chance differences', from sample to sample, based on the probability of certain individuals/items being selected in particular samples
o Can reduce sampling error by taking larger sample sizes, but this increases the cost of conducting the survey
o Becomes an ethical issue if findings are purposely presented without reference to sample size and margin of error, so that the sponsor can promote a viewpoint that might otherwise be truly insignificant
o Will always occur but can be minimised by a larger or more representative sample
Quartiles
o Divide a set of data into quarters, ie: 4 equal parts
o Q1 divides the lower 25% of values from the other 75%
o Q2 is the median: 50% of values below it and 50% above
o Q3 has 75% of values below it and 25% above
o Q1, Q2, Q3 are the 25th, 50th and 75th percentiles
o If the position result is a whole number, the quartile is equal to that ranked value
o If the result is a fractional half, the quartile is equal to the mean of the corresponding ranked values, eg: if the Q1 position is 2.5, Q1 is halfway between the second and third ranked values
o If the result is neither a whole number nor a half, round the result to the nearest whole number and select that ranked value, eg: if the Q1 position is 2.75, round it to 3, and Q1 = the third ranked value
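The ranked-value rule above can be sketched in Python, taking the position of quartile q as (n + 1) x q / 4 in the ordered array (a common textbook convention; the data set is made up):

```python
def quartile(data, q):
    """Quartile via the ranked-value rule: whole-number position -> that
    ranked value; fractional half -> mean of the two neighbouring ranked
    values; otherwise round to the nearest rank."""
    ordered = sorted(data)
    pos = (len(ordered) + 1) * q / 4
    if pos == int(pos):                    # whole number
        return ordered[int(pos) - 1]
    if pos % 1 == 0.5:                     # fractional half
        return (ordered[int(pos) - 1] + ordered[int(pos)]) / 2
    return ordered[round(pos) - 1]         # round to nearest rank

data = [11, 9, 16, 4, 15, 2, 12, 8, 7]     # hypothetical, n = 9
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
print(q1, q2, q3)  # 5.5 9 13.5
print(q3 - q1)     # interquartile range: 8.0
```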
Venn diagram
o Graphical rep of a sample space
o Shows various events as unions and intersections of circles
o To construct one, events A and B must be defined and the value of the intersection of A and B must be determined in order to divide the sample space into its parts
Geometric mean
o Measures the average rate of change of a variable over n periods
o The geometric mean rate of return measures the average return on an investment over time
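The geometric mean rate of return is ((1 + R1)(1 + R2)...(1 + Rn))^(1/n) - 1; a minimal Python sketch with made-up yearly returns:

```python
def geometric_mean_return(returns):
    """((1 + R1) * (1 + R2) * ... * (1 + Rn)) ** (1 / n) - 1"""
    product = 1.0
    for r in returns:
        product *= 1 + r
    return product ** (1 / len(returns)) - 1

# eg: an investment gains 50% in year 1, then loses 30% in year 2
rg = geometric_mean_return([0.50, -0.30])
print(round(rg, 4))  # 0.0247
```

The arithmetic mean of the same returns is 10%, which overstates the true average growth; the geometric mean correctly reflects compounding.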
Survey error: non-response error
o Occurs due to failure to collect info on all items chosen for the sample. Causes non-response bias
o Research has shown individuals in upper and lower socioeconomic classes tend to respond less frequently to surveys than the middle class
o Need to follow up on non-responders and make several attempts to persuade them to complete the survey
o Mode of response used affects the rate of response. Personal and phone interviews usually produce a higher response rate than mail surveys, but at a higher cost
o Can lead to non-response bias and becomes an ethical issue if the sponsor knowingly designs the survey in a way that particular groups/individuals are less likely to respond