stats midterm
independent variable
(cause/explanatory) -> variable that is controlled or manipulated by the researcher. Influences the dependent variable
Dependent variable
(effect/response) -> cannot be changed by the researcher, but changes in response to the effect of the independent variable
KURTOSIS
*measure of the tail-ness or peaked-ness of our data*. Descriptive method used to describe the shape of a data distribution. Measures the tail-ness and peaked-ness of a distribution relative to a normal distribution. Describes the degree to which values are clustered in the tails and peak of a distribution. More emphasis on tail-ness than peaked-ness
What is tabulation?
- A useful way to organize and present data through tables. The process of creating tables or placing data into tabular format for statistical analysis is known as tabulation. Involves systematically arranging data into columns or rows to understand or explain the problem
Inferential Statistics
- provides various methods that can be used to draw conclusions from sample data, then infer the results to a population (inductive reasoning - infer the characteristics of the population based on what you know about the sample, moves from specific to general). -Purpose is to use sample results to draw conclusions about the population. -Focus is on comparing, testing and predicting the future of the data. -Language of probability is often used because inferences cannot be absolutely certain. -The process of making a conclusion about the population based on evidence obtained from the sample is called statistical inference. -Examples of methods of inferential statistics include hypothesis testing, probability, confidence interval estimation, and regression analysis
What is Data?
-Data are a collection of raw facts and figures -All the data in a study are referred to as the dataset -A single value in data is called a datum or data point
What are the 4 scales of data measurement
-Nominal scale, ordinal scale, interval scale, ratio scale
Statistics as a Scientific Process:
-Scientific inquiry provides a method of examining things and requires evidence by gathering information and subjecting it to statistical analysis. -Hypothesis-gather data-analyze. -Statistics provide systematic and scientific ways of collecting, organizing, summarizing, presenting and analyzing data, as well as providing means to draw valid conclusions and make decisions based on analyzed information.
Sampling and inferences
-Statistical conclusions are derived from a sample because studying an entire population takes too much expense and time. -Selecting a good representative sample is called sampling. Summary information gathered from a sample is a statistic. A summary derived from a population is called a parameter. -Through statistical inference, we can generalize the information obtained from a sample (statistics) to the entire population (parameters). Thus, through statistical inference, we can make conclusions about a population based on evidence from a sample.
Scales of measurement
-Used to classify data -Concept of measurement is central to visualizing statistical analysis -Measuring data gives context for statistical analysis -Scale of measurement refers to the degree of quantification - as some data differs only in quality, while other data differs in quantity
Numerical data
-Values that can be quantified -Data that differ in quantity can be subdivided into discrete or continuous variables. -Discrete variables can only be whole numbers (integers, obtained by counting) and can assume only a finite number of values; continuous variables can be fractions (obtained by measuring) and can assume an infinite number of values.
inductive reasoning vs deductive reasoning
-inductive= using facts and experiments -deductive= using logic and basic knowledge -both are used in the scientific method
descriptive statistics
-provides various methods to summarize, describe and present sample data without drawing conclusions about the larger population (deductive reasoning - deduce the characteristics of the sample based on existing theories about the population, moves from general to specific). -Purpose is to organize data through tables, graphs, etc. -Uses a deductive approach to analyze data, which does not allow you to make conclusions beyond the analyzed data. -Examples: mean, median, mode, range, skewness, kurtosis and standard deviation
Categorical data
-qualitative data - not numerical -Values are placed in distinct and separate categories based on some qualitative attributes -Data values differ in quality but not quantity -Only way to statistically summarize this data is to perform a frequency count or find a percent to determine the mode
How to create a group frequency table
1- Number of groups. 2- Class width - find the difference between the highest and lowest values and divide the result by the number of groups. 3- Class interval - start with lowest value and add equal increments. 4- Frequency - count the values that fall within each group/category. 5- Percent - divide the frequency of each group by the total frequency and multiply the result by 100. 6- Cumulative percent - add the percent value to all previous percent values below the specified group
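The six steps above can be sketched in Python. This is only an illustration: the function name and the sample grades are made up, and putting the highest value into the last class is an assumption of the sketch so that no case is left out.

```python
def grouped_frequency_table(values, num_groups):
    """Return rows of (lower, upper, frequency, percent, cumulative percent)."""
    low, high = min(values), max(values)
    width = (high - low) / num_groups                    # step 2: class width
    # step 3: class boundaries, starting at the lowest value
    edges = [low + i * width for i in range(num_groups + 1)]
    total = len(values)
    rows = []
    cumulative_count = 0
    for i in range(num_groups):
        lower, upper = edges[i], edges[i + 1]
        if i == num_groups - 1:
            # assumption: the last class also holds the highest value
            freq = sum(1 for v in values if v >= lower)
        else:
            # step 4: count values from the lower limit up to (not including) the upper limit
            freq = sum(1 for v in values if lower <= v < upper)
        cumulative_count += freq
        percent = freq / total * 100                          # step 5
        cumulative_percent = cumulative_count / total * 100   # step 6
        rows.append((lower, upper, freq,
                     round(percent, 1), round(cumulative_percent, 1)))
    return rows


# Hypothetical grades, grouped into 4 classes
grades = [50, 52, 55, 61, 64, 68, 70, 73, 79, 90]
for row in grouped_frequency_table(grades, 4):
    print(row)
```

The last row's cumulative percent should always reach 100, which is a quick check that every value was counted once.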
what are the three steps to data to decision?
1- transform raw data into information that is organized, structured, and can be placed in a context to explore relationships and connections between variables. 2- transforming information to knowledge, which provides learning through logical reasoning, reveals patterns, etc. 3- transforming knowledge into wisdom by providing critical insights to help address the issues and problems faced.
Standard deviation STEPS BY LINDSEY
1. Put the values into a table 2. Find the mean of the dataset 3. Subtract the mean from each value 4. Square the result from step 3 5. Repeat for each value in the dataset 6. Calculate the total of the squared deviations 7. To find the variance, divide the total by (n-1), where n is the number of values in the dataset 8. To find the standard deviation, take the square root of the variance
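A minimal Python sketch of the sample standard deviation: each deviation from the mean is squared, the squares are totalled and divided by (n-1) to get the variance, and the square root of the variance is the standard deviation. The function name and the sample values are illustrative only.

```python
import math


def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n
    # deviation of each value from the mean, squared
    squared_deviations = [(x - mean) ** 2 for x in values]
    # variance: total of squared deviations divided by (n - 1)
    variance = sum(squared_deviations) / (n - 1)
    # standard deviation: square root of the variance
    return math.sqrt(variance)


print(sample_standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))
```

For these values the mean is 5 and the squared deviations total 32, so the variance is 32/7 and the standard deviation is about 2.14.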
What is a data dashboard?
A data visualization tool that visually tracks, analyzes and displays key performance indicators (KPIs), metrics and key data points. Goal is to provide timely summary information that is easy to read, understand and interpret in order to improve business operations and processes. Provides an objective view of performance metrics. Connects to files, attachments, services. Displays in the form of tables, line charts, bar charts and gauges. Most efficient way to track multiple data sources. Provides a central location for businesses to monitor and analyze performance in real time. Designed to help decision makers, executives, and senior leaders establish targets, set goals, and understand why something happened so they can implement changes
Multivariate
A third variable is introduced to explain the relationship between the two variables. Eg., control table
Constant
A value that does not change
Five Number Summary
Any data set can be reduced into 5 key numbers. They are the lowest number (LN), first quartile (Q1), median (MD), third quartile (Q3) and highest number (HN).
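A short Python sketch of the five number summary. It assumes the quartiles are located with the (n+1) position rule and interpolated between neighbouring ranked values, which is what `statistics.quantiles` does with `method="exclusive"`; other textbooks and tools may use slightly different quartile rules. The sample data are made up.

```python
import statistics


def five_number_summary(values):
    """Return (LN, Q1, MD, Q3, HN) for a list of numbers."""
    # quartile positions follow the (n + 1) rule, with interpolation
    q1, md, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return (min(values), q1, md, q3, max(values))


data = [15, 20, 35, 40, 50, 60, 70, 80, 90]
ln, q1, md, q3, hn = five_number_summary(data)
print(ln, q1, md, q3, hn)
print("interquartile range:", q3 - q1)
```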
coefficient of variation
CV is preferred over the standard deviation when comparing datasets measured in different units or with very different means. It measures the spread of data relative to the central value. Expressed as a percentage rather than in the data's units; represents the ratio of the standard deviation to the mean. CALCULATED -> standard deviation divided by the mean, times 100. Formula: CV = (std / mean) x 100. It tells you, as a percentage, how scattered the values are around the mean. Interpret by saying the values vary BLANK percent from the average
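A small Python sketch of the coefficient of variation, using the sample standard deviation from the standard library. The function name and the data are illustrative. Rescaling the data (for example, changing units) leaves the CV unchanged, which is why it suits comparisons across different units.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, times 100 (a percentage)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


values = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(coefficient_of_variation(values), 1), "percent")
# Same data in a unit 10 times smaller: the CV stays the same
print(round(coefficient_of_variation([x * 10 for x in values]), 1), "percent")
```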
Data types and Measurements
Categorical data leads to nominal and ordinal measurements, Numerical data leads to interval and ratio
What charts (2) are best used to show data comparison?
Clustered bar graph, Stacked bar chart
Bivariate
Data is summarized for two variables with the purpose of exploring the relationship between these variables eg., contingency table
What are the 4 types of data analytics
Descriptive analytics, Diagnostic analytics, Predictive analytics, Prescriptive analytics
How to find the cumulative percentage of a data set
Divide the cumulative frequency (the running total of cases up to the point of interest) by the total sample size, then multiply by 100 to find the cumulative percentage. In the example, 25 days divided by 59 days equals 0.423729, or 42.3729 percent
Importance of graphs
Effective way to visualize, summarize and describe statistical information. Shows trends. Seen as a picture of the data. Used when its important to illustrate a specific trend in the data or general relationship among groups. Can be used to discuss an issue, reinforce a critical point, summarize data, identify clusters, describe variation and define skewness of a distribution
What are the 4 types of applied statistical research
Exploratory, descriptive, explanatory, evaluation
How to calculate percentile
Formula: (n + 1) x k / 100, where n is the number of cases in the dataset and k is the specific percentile in question. 1- Rank values from smallest to largest. 2- Apply the formula. 3- Count to the ranked value (the answer from the formula)
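The percentile steps can be sketched in Python as follows. One assumption goes beyond the notes: when the formula gives a position between two ranks (such as 2.5), this sketch interpolates between the two neighbouring ranked values. Positions outside the data are clamped to the first or last rank. The function name and data are made up.

```python
def percentile(values, k):
    """Value at the k-th percentile using position = (n + 1) * k / 100."""
    ranked = sorted(values)                  # step 1: rank smallest to largest
    n = len(ranked)
    position = (n + 1) * k / 100             # step 2: apply the formula (1-based rank)
    position = min(max(position, 1), n)      # keep the position inside the data
    lower_rank = int(position)               # whole part of the position
    fraction = position - lower_rank
    if lower_rank >= n:
        return ranked[-1]
    # step 3: count to the ranked value; interpolate if the position is fractional
    below = ranked[lower_rank - 1]
    above = ranked[lower_rank]
    return below + fraction * (above - below)


data = [15, 20, 35, 40, 50, 60, 70, 80, 90]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```

With nine values, the 50th percentile sits at position 5 (the median), while the 25th sits at position 2.5, halfway between the 2nd and 3rd ranked values.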
types of statistical variables
Independent, dependent and constant
Skewness
Describes the direction of asymmetry in a data distribution. Three cases: left-skewed distribution, right-skewed distribution and zero-skewed (normal) distribution
Measures of Shape
Measures the direction/frequency of outliers away from the central value
Data reclassification
Numerical data can be reclassified and transformed into categorical data - grouping. Eg - instead of listing ages 1-100, you can have groups like 1-5, 5-10, etc. They are then labels with no quantitative meaning
Univariate
Refers to a tabular format where data is summarized for a single variable - eg., simple frequency table
What charts (2) are best for data relationships?
Scatterplot, trendline
What charts (7) are best used for data visualisation?
Simple bar chart, Pie chart, Histogram, Time-series, Ogive, Line chart, Pareto chart
Types of mode
Single mode - unimodal distribution. Two modes - bimodal distribution. Three or more modes - multimodal distribution
Quantitative meaning
Some categorical data is written as numbers, but they are not numerical values because they have no quantitative meaning. This is because it makes no sense to perform mathematical operations on them. Eg: social insurance numbers, drivers license numbers, telephone numbers - makes no sense to find average, mode, etc.
Sources of data
Surveys, observations, experiment, published by govs., etc, technology
What is the cumulative percentage?
The cumulative frequency column lists each frequency added to the total of its predecessors; the cumulative percentage expresses that running total as a percentage of all cases
What are the two types of tables?
Univariate and Bivariate
Stacked bar graph
Used to visualize bivariate data. Helpful in exploring relationships or comparing cases across various categories of the second variable
What are the components of a dataset?
Variables, observation, data list, datum, elements
What are the 4 big v's of big data?
Volume, Variety, Velocity and veracity
Time series
a line chart which shows sequences of data points collected over a successive time interval. Goal is to identify and analyze trends so that one can predict or forecast future outcomes. Helps to determine skewness
Key metrics for assessing data quality are:
accuracy, completeness, timeliness, validity, reliability, relevance, unbiased, limitation
Right-skewed distribution
aka positively skewed. Characteristics -> long tail to the right of the distribution, stretching towards the right side of the graph, extreme values lie on the right, bulk of data is below the mean, mean is more than the median, Excel skewness coefficient is more than +0.5
Platykurtic distribution
also known as negative kurtosis - distribution with less variation than a normal distribution, thinner tails, and less data around the centre. Small changes are more common and large changes are less likely due to the thinner tails. In Excel its excess kurtosis coefficient will be less than -0.5: between -0.5 and -1.0 for a moderate negative kurtosis, and less than -1.0 for a high negative kurtosis
Left-skewed distribution
also known as negatively skewed. Characteristics: a long tail to the left of the distribution, stretching towards the negative direction/left side of the graph, the extreme values lie to the left of the distribution, the mean is less than the median, the bulk of the data are above the mean, computed Excel skewness coefficient is less than -0.5. Happens when the mean is less than the median, so the mean minus the median gives a negative result
Mesokurtic data
also known as the zero kurtosis, describes a distribution with similar characteristics as a normal distribution - in Excel it will be between -0.5 and +0.5
Weighted mean
considers the weight or frequency assigned to each value in the distribution. The weight indicates the relative importance of each value in the dataset. Multiply each data value by its respective weight or frequency, then sum the results and divide the total by the sum of the weights (or total frequency) to obtain the weighted mean. *think weighted as in assignments*
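A minimal Python sketch of the weighted mean, using the "weighted assignments" idea. The grades and weights are invented for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    weighted_total = sum(v * w for v, w in zip(values, weights))
    return weighted_total / sum(weights)


# Hypothetical course: assignment 20%, midterm 30%, final 50%
grades = [80, 90, 70]
weights = [20, 30, 50]
print(weighted_mean(grades, weights))
```

Here the simple mean of the three grades would be 80, but the weighted mean is 78 because the lowest grade carries the largest weight.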
Percentage table
constructed from a frequency table by relating the frequency count of each category to the total count to determine the fraction or percentage of cases that belong to each category
Ogive chart
constructed like a line chart, but instead of having percentages plotted at each interval, cumulative percentages are plotted instead. Shows cumulative trend of data distribution to depict percentages of cases below a specific point
Line chart
continuous line formed by placing a dot over the midpoint of each class interval against the percent value of the class. Can reveal the shape of distribution in the form of clustering, spread, skewness, outliers, gaps and peaks
Interval scale
establishes a quantitative difference between values with arbitrary zero as the starting point (eg., IQ or temperature level). The data items are numerical, and can be quantified and ranked from lowest to highest. We can establish the actual interval difference between two values - we can say what is higher/lower than another data item while knowing how much the difference is.
Ratio scale
establishes a quantitative difference between variables with true zero as the starting point (eg., age and income). Can be quantified. Can be ranked from lowest to highest. Can establish actual numerical interval differences between the two values to determine how much one item is greater or lesser than another. Can also establish a ratio difference to determine how many times a value is more or less than another.
Ordinal scale
establishes differences between the values and these values can be ranked in order. (eg., income groups - low, middle, high). Items can be separated and ranked in order from lowest to highest on a value scale. Cannot establish the actual interval difference between 2 data items - we can say one group is higher/lower than another, but not by how much
Nominal scale
establishes differences between the values by category with no relation to order (eg- male and female). Thus, items can be categorized, but not ranked
Types of quantitative graphs
histogram, line chart, ogive, time series, scatterplot, trendline
Explanatory research
identify causes and effects of phenomena. To explain and predict how one phenomenon will change in response to change in another phenomenon. Eg., what factors are related to teenage pregnancy
Control table
introduces a third variable in a multivariate analysis to explore whether the presence of a third variable could explain the relation between the other 2 variables *most common*
Diagnostic analytics
involves using analytical tools to understand why something happened in the past - find the root causes of events but provide no insight
Median
middle value of a ranked observation. Rank values from least to greatest then find the middle. Cannot find the median for nominal data because it cannot be ranked
Grouped frequency table
organizes and simplifies a large set of numerical data into class intervals to find out the number and percentage of cases that fall within each class/group interval *most common*. Used to tabulate numerical data into convenient groups for the purpose of analysis. Eg., grouping student grades such as 50-60, 60-70, etc. A 'less than' is applied to the upper boundary of each class - so 50-60 actually represents cases of 50 or more, but less than 60. The class intervals must be continuous with no gap between successive groups, and the data in each group should be less than the upper limit of their class interval. No overlap between groups, and all groups have equal width. Provides information to describe the proportion of cases that belong to each class interval, as well as the percentage of cases that are less than the upper limit of each class interval (cumulative percentage). Between 5 and 10 groups are recommended
Types of qualitative graphs
pie chart, bar chart, stacked bar graph, clustered bar graph, pareto chart
Leptokurtic distribution
also known as positive kurtosis - more variation than a normal distribution, curve is peaked at the centre with heavy, fat tails at the extreme ends, more extreme values in the tails, more data in the tails, less in the shoulders, and more around the centre. Excel excess kurtosis coefficient of more than +0.5: between +0.5 and +1.0 for a moderate positive kurtosis, and more than +1.0 for a high positive kurtosis
Contingency table
provides a summary for bivariate data showing the relation between 2 variables by exploring how one affects the other *most common*. Used to tabulate categorical data. Examining this table helps to explore and understand whether the independent variable influences the dependent variable. The frequency counts should be converted into percentages based on their respective column totals, row totals, and overall total. To explore the cause-and-effect relationship, it is proper to have the independent variable placed on top of the table, and the dependent variable on the side of the table
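A rough Python sketch of turning contingency table counts into column percentages, with the independent variable defining the columns. The function name and the after-school program data are hypothetical, made up only to show the calculation.

```python
from collections import Counter


def column_percentages(pairs):
    """pairs: list of (independent value, dependent value), one per case.

    Returns {(independent, dependent): percent of that column's total}.
    """
    cell_counts = Counter(pairs)                       # frequency in each cell
    column_totals = Counter(ind for ind, _ in pairs)   # total per independent value
    return {
        (ind, dep): round(count / column_totals[ind] * 100, 1)
        for (ind, dep), count in cell_counts.items()
    }


# Hypothetical cases: program attendance (independent) vs. offence (dependent)
cases = (
    [("program", "no offence")] * 8 + [("program", "offence")] * 2
    + [("no program", "no offence")] * 5 + [("no program", "offence")] * 5
)
for cell, percent in column_percentages(cases).items():
    print(cell, percent)
```

Comparing percentages across the columns (here 20 percent versus 50 percent with an offence) is what suggests whether the independent variable is related to the dependent one.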
Simple frequency table
provides a summary of univariate data showing the number of observations in each of the several categories *most common*
Pareto table
provides frequency count and percentage of cases for each category in a descending order with the objective to identify a limited number of tasks that produce a significant overall effect - helps to identify the top portion of issues to be addressed to resolve a majority of cases
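A minimal Python sketch of a Pareto table. The cumulative percent column is an addition of this sketch (it makes the "top portion of issues" easy to spot), and the complaint data are invented.

```python
from collections import Counter


def pareto_table(observations):
    """Rows of (category, frequency, percent, cumulative percent), largest first."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    # most_common() returns categories in descending order of frequency
    for category, freq in counts.most_common():
        cumulative += freq
        rows.append((category, freq,
                     round(freq / total * 100, 1),
                     round(cumulative / total * 100, 1)))
    return rows


# Hypothetical customer complaints
complaints = ["late"] * 5 + ["damaged"] * 3 + ["wrong item"] * 2
for row in pareto_table(complaints):
    print(row)
```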
What is percentile?
provides info about how a person or thing relates to a larger group. A percentile tells you what percentage of the scores are less than the data point you're analyzing
Standard deviation
quantifies the amount of variation in a dataset. Based on the difference between each observation and the mean. Tabular format is preferred
relative variability
range, interquartile range, variance and standard deviation are absolute measures while coefficient of variation is a relative measure of variation
Interquartile range
refers to the difference between the first and third quartile. Defines the middle half of the data, thereby avoiding the effect of outliers or extreme values. Still doesn't consider all the values in a dataset
Range
refers to the difference between the highest and lowest value in a dataset. To find: subtract the lowest value from the highest value. Easiest and simplest to calculate. Disadvantage is that it is based only on the two extreme values, so it is sensitive to outliers and ignores the values in between - can mislead conclusions about variability
Mode
refers to the value that appears most frequently. Used when wanting to find the most common value. A measure of central tendency - the least powerful one. Mode can be applied to all types of data and all scales of measurement
Clustered bar graph
represents discrete values for more than one item that shares the same category. Instead of stacking each item on top of one another, they are placed side by side to compare the data for many categories side by side
Simple mean
represents the average value in the data distribution. Most commonly used to summarize a distribution. Calculated by summing all the values in a dataset and then dividing this by the total number of observations in the dataset. Cannot be calculated for categorical data. When a dataset has an outlier, the mean is pulled towards it and may not represent the data accurately
Cumulative table
shows the number or proportion or percentage of cases with values less than the upper limit of each class
Types of tables
simple frequency table, percentage table, contingency table, grouped frequency table, cumulative table, control table, pareto table
Descriptive analytics
the use of data to understand past and current business performance (without explaining why) and make informed decisions
Grouped mean
used to find an average for data that are grouped (grouped data). Calculate the midpoint (eg., 50-60 midpoint is 55) for each class interval, multiply each midpoint by the frequency assigned to that class interval, then sum the results and divide by the total frequency
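A short Python sketch of the grouped mean: each class midpoint is multiplied by its frequency, the products are summed, and the sum is divided by the total frequency. The class intervals and frequencies below are made up.

```python
def grouped_mean(classes):
    """classes: list of (lower limit, upper limit, frequency)."""
    total_frequency = sum(freq for _, _, freq in classes)
    # midpoint of each class times its frequency, summed over all classes
    weighted_sum = sum((lower + upper) / 2 * freq
                       for lower, upper, freq in classes)
    return weighted_sum / total_frequency


# Hypothetical grouped grades
classes = [(50, 60, 3), (60, 70, 3), (70, 80, 3), (80, 90, 1)]
print(grouped_mean(classes))
```

The result is an estimate, since every case in a class is treated as if it sat at the class midpoint.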
Evaluation research
used to assess the effectiveness of policies and program outcomes against a baseline. Eg., whether after-school programs reduce teenage delinquencies?
Descriptive research
used to describe a prevailing issue of interest. Provides a method measuring and describing phenomena so we have a clear idea of the issue. Eg., monitor the teenage pregnancy rate for the past 5 years
Exploratory research
used to find out if a particular issue is becoming a growing concern in a community. Using facts and figures to explore a problem. Eg., is teenage pregnancy a growing societal problem?
Bar graph
used to graphically display relative frequencies for 2 or more categories when the emphasis is on comparison.
how to calculate quartiles
used to rank data and to locate the relative position of a value in a data set. Split an entire distribution into four equal parts -> 1st quartile is the 25th percentile, second is 50th, third 75th, fourth is 100th. Difference between the third quartile and first quartile is called the interquartile range
Scatterplot
used to reveal the relationship between 2 numerical sets of data. Shows how a variable is associated with others. Dots may follow a trend which may reveal a positive or negative relationship between the variables. A trend line can be constructed to show three forms of relationships - positive (both variables increase), negative (increase in one variable decreases the other), zero relationship
Histogram
used to visualize frequency tables, with the variable of interest placed on the horizontal axis and bars drawn with no gaps in between them. Reveals different shapes of data distribution, including skewness, peaks and gaps in the data distribution.
Pie chart
used when there are a small number of categories and you want to emphasize the relative importance of a particular category to the total
Predictive analytics
uses analytical tools to tell what is most likely to happen in the future as analytical models are constructed from past data to predict future outcomes or to assess the impact of one variable on another
Prescriptive analytics
uses optimization models that yield the best course of actions, and recommend actions to be taken to produce desired outcome/eliminate future problems
Normal distribution
zero skewed. Characteristics -> symmetric tails extending equally to both left and right, extreme values lie on both ends of the distribution, bulk of data is clustered around the middle of the distribution, mean is equal to the median. The mean minus the median will be zero, which is why it's zero-skewed
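The mean-minus-median rule from the skewness cards can be sketched in Python. This only applies the rule of thumb from these notes; it is not the skewness coefficient that Excel computes, and the function name and data are illustrative.

```python
import statistics


def skew_direction(values):
    """Classify skew by the sign of (mean - median)."""
    difference = statistics.mean(values) - statistics.median(values)
    if difference > 0:
        return "right-skewed"   # mean is more than the median
    if difference < 0:
        return "left-skewed"    # mean is less than the median
    return "zero-skewed"        # mean equals the median


print(skew_direction([1, 2, 3, 4, 100]))    # extreme value on the right
print(skew_direction([1, 97, 98, 99, 100])) # extreme value on the left
print(skew_direction([1, 2, 3, 4, 5]))      # symmetric
```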