stats midterm

independent variable

(cause/explanatory) -> variable that is controlled or manipulated by the researcher. Influences the dependent variable

Dependent variable

(effect/response) -> cannot be changed by the researcher, but changes in response to the effect of the independent variable

KURTOSIS

*measure of the tail-ness or peaked-ness of our data*. Descriptive method used to describe the shape of a data distribution. Measures the tail-ness and peaked-ness of a distribution relative to a normal distribution. Describes the degree to which values are clustered in the tails and peak of a distribution. More emphasis on tail-ness than peaked-ness

What is tabulation?

- A useful way to organize and present data through tables. The process of creating tables or placing data into tabular format for statistical analysis is known as tabulation. Involves systematically arranging data into columns or rows to understand or explain the problem

Inferential Statistics

- provides various methods that can be used to draw conclusions from sample data, then generalize the results to a population (inductive reasoning - infer the characteristics of the population based on what you know about the sample, moves from specific to general). -Purpose is to use sample results to draw conclusions about the population. -Focus is on comparing, testing and predicting. -Language of probability is often used because inferences cannot be absolutely certain. -Process of making a conclusion about the population based on evidence obtained from the sample is called statistical inference. -Examples of methods of inferential statistics include hypothesis testing, probability, confidence interval estimation, and regression analysis

What is Data?

-Data are a collection of raw facts and figures -All the data in a study are referred to as the dataset -A single value in the data is called a datum or data point

What are the 4 scales of data measurement

-Nominal scale, ordinal scale, interval scale, ratio scale

Statistics as a Scientific Process:

-Scientific inquiry provides a method of examining things and requires evidence by gathering information and subjecting it to statistical analysis. -Hypothesis-gather data-analyze. -Statistics provide systematic and scientific ways of collecting, organizing, summarizing, presenting and analyzing data, as well as providing means to draw valid conclusions and make decisions based on analyzed information.

Sampling and inferences

-Statistical conclusions are derived from a sample because of the expense and time needed to study an entire population. -Selecting a good representative sample is called sampling. Summary information gathered from the sample is a statistic. A summary derived from a population is called a parameter. -Through statistical inference, we can generalize the information obtained from a sample (statistics) to the entire population (parameters). Thus, through statistical inference, we can make conclusions about a population based on evidence from a sample.

Scales of measurement

-Used to classify data -Concept of measurement is central to visualizing statistical analysis -Measuring data gives context for statistical analysis -Scale of measurement refers to the degree of quantification - as some data differs only in quality, while other data differs in quantity

Numerical data

-Values that can be quantified -Data that differ in quantity can be subdivided into discrete or continuous variables. -Discrete variables can only take whole-number values (integers, obtained by counting) and can assume only a finite number of values; continuous variables can take fractional values (obtained by measuring) and can assume an infinite number of values.

inductive reasoning vs deductive reasoning

-inductive= using facts and experiments -deductive= using logic and basic knowledge -both are used in the scientific method

descriptive statistics

-provides various methods to summarize, describe and present sample data without drawing conclusions about the larger population (deductive reasoning - deduce the characteristics of the sample based on existing theories about the population, moves from general to specific). -Purpose is to organize data through tables, graphs, etc. -Uses a deductive approach to analyze data, which does not allow you to make conclusions beyond the analyzed data. -Examples: mean, median, mode, range, skewness, kurtosis and standard deviation

Categorical data

-qualitative data - not numerical -Values are placed in distinct and separate categories based on some qualitative attributes -Data values differ in quality but not quantity -Only way to statistically summarize this data is to perform a frequency count or find a percent to determine the mode

How to create a group frequency table

1- Number of groups - decide how many groups to use. 2- Class width - find the difference between the highest and lowest values and divide the result by the number of groups. 3- Class interval - start with the lowest value and add equal increments. 4- Frequency - count the values that fall within each group/category. 5- Percent - divide the frequency of each group by the total frequency and multiply the result by 100. 6- Cumulative percent - add each group's percent to the percents of all the groups before it
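
The six steps can be sketched in Python. This is a minimal illustration, not a standard routine: the function name `grouped_frequency_table` and the row layout are assumptions, each class is treated as "lower bound up to but not including upper bound", and the single highest value is placed in the last class so that it is not dropped.

```python
def grouped_frequency_table(values, num_groups):
    """Build a grouped frequency table as a list of row dictionaries."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / num_groups  # step 2: class width
    if width == 0:
        raise ValueError("all values are identical; no classes to build")

    # Step 4: count the values that fall within each class interval.
    counts = [0] * num_groups
    for v in values:
        # Assumption: the maximum value is assigned to the last class.
        index = min(int((v - lowest) / width), num_groups - 1)
        counts[index] += 1

    total = len(values)
    rows = []
    cumulative = 0.0
    for i, freq in enumerate(counts):
        percent = freq * 100 / total   # step 5: percent
        cumulative += percent          # step 6: cumulative percent
        rows.append({
            "lower": lowest + i * width,        # step 3: class interval
            "upper": lowest + (i + 1) * width,
            "frequency": freq,
            "percent": percent,
            "cumulative_percent": cumulative,
        })
    return rows


# Hypothetical grades, grouped into 5 classes of width 10.
for row in grouped_frequency_table([50, 55, 62, 68, 71, 75, 79, 83, 90, 100], 5):
    print(row)
```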

what are the three steps to data to decision?

1- transform raw data into information that is organized, structured, and can be placed in a context to explore relationships and connections between variables. 2- transforming information to knowledge, which provides learning through logical reasoning, reveals patterns, etc. 3- transforming knowledge into wisdom by providing critical insights to help address the issues and problems faced.

Standard deviation STEPS BY LINDSEY

1. Put the values into a table 2. Find the mean of the dataset 3. Subtract the mean from each value 4. Square the answer from step 3 5. Repeat for each value in the dataset 6. Calculate the total of the squared differences 7. To find the variance - divide the total by (n-1), where n represents the number of values in the dataset 8. To find the standard deviation, take the square root of the variance
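
The sample standard deviation squares each deviation from the mean, sums the squares, divides by (n-1), and only then takes a square root. A minimal Python sketch of that calculation (the function names are illustrative, not from the course):

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 2, 3, 4, 5]            # hypothetical values
print(sample_variance(data))      # squared deviations 4+1+0+1+4 = 10; 10 / 4 = 2.5
print(sample_std_dev(data))
```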

What is a data dashboard?

A data visualization tool that visually tracks, analyzes and displays key performance indicators KPI, metrics and key data points. Goal is to provide timely summary information that is easy to read, understand and interpret in order to improve business operations and processes. Provides an objective view of performance metrics. Connects to files, attachments, services. Displays in the form of tables, line charts, bar charts and gauges. Most efficient way to track multiple data sources. Provides a central location for businesses to monitor and analyze performance in real time. Designed to help decision makers, executive, and senior leaders establish targets, set goals, and understand why something happened so they can implement changes

Multivariate

A third variable is introduced to explain the relationship between the two variables. Eg., control table

Constant

A value that does not change

Five Number Summary

Any data set can be reduced into 5 key numbers. They are the lowest number (LN), first quartile (Q1), median (MD), third quartile (Q3) and highest number (HN).
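
A short Python sketch of the five-number summary. Quartile conventions differ between textbooks and software; this one relies on `statistics.quantiles` with its default "exclusive" method (Python 3.8+), which is an assumption here and may not match the method used in class. The function name is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest) for a dataset."""
    # quantiles(n=4) returns the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


low, q1, med, q3, high = five_number_summary([10, 20, 30, 40, 50, 60, 70])
print(low, q1, med, q3, high)
print("interquartile range:", q3 - q1)
```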

coefficient of variation

CV is preferred over the standard deviation when comparing datasets measured in different units or with very different values -> measures the spread of data relative to the central value. Expressed as a percentage rather than in units; represents the ratio of the standard deviation to the mean. CALCULATED -> standard deviation, divided by the mean, times 100. Formula - (std / mean) x 100. Tells you, as a percentage, how scattered the values are around the mean. Interpret by saying the values vary by BLANK percent from the average
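
A minimal Python sketch of the CV formula, assuming the sample standard deviation (the n-1 version) is intended; the function name is illustrative. The second call shows why CV suits comparisons across units: rescaling the data (for example metres to millimetres) leaves the CV unchanged.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, times 100 (a percentage)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]            # hypothetical data
heights_mm = [h * 1000 for h in heights_m]       # same data, different unit
print(coefficient_of_variation(heights_m))
print(coefficient_of_variation(heights_mm))
```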

Data types and Measurements

Categorical data leads to nominal and ordinal measurements, Numerical data leads to interval and ratio

What charts (2) are best used to show data comparison?

Clustered bar graph, Stacked bar chart

Bivariate

Data is summarized for two variables with the purpose of exploring the relationship between these variables eg., contingency table

What are the 4 types of data analytics

Descriptive analytics, Diagnostic analytics, Predictive analytics, Prescriptive analytics

How to find the cumulative percentage of a data set

Divide the cumulative frequency (the number of times the event occurred up to and including that point) by the total sample size, then multiply by 100 to find the cumulative percentage. In the example, 25 days divided by 59 days equals 0.423729 or 42.3729 percent

Importance of graphs

Effective way to visualize, summarize and describe statistical information. Shows trends. Seen as a picture of the data. Used when its important to illustrate a specific trend in the data or general relationship among groups. Can be used to discuss an issue, reinforce a critical point, summarize data, identify clusters, describe variation and define skewness of a distribution

What are the 4 types of applied statistical research

Exploratory, descriptive, explanatory, evaluation

How to calculate percentile

Formula: (n+1) x k/100 -> n is the number of cases in a dataset -> k is the specific percentile in question. 1- Rank values from smallest to largest. 2- Apply formula. 3- Count to the ranked value (the answer from the formula)
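
A minimal Python sketch of the (n+1) x k/100 rule. The card does not say what to do when the formula gives a fractional position; linear interpolation between the two neighbouring ranked values is assumed here, and positions outside 1..n are clamped to the ends. The function name is illustrative.

```python
def percentile(values, k):
    """Value at the k-th percentile using the (n + 1) * k / 100 position rule."""
    ranked = sorted(values)                      # step 1: rank smallest to largest
    n = len(ranked)
    position = (n + 1) * k / 100                 # step 2: 1-based rank position
    position = max(1.0, min(float(n), position)) # assumption: clamp to valid ranks
    lower = int(position)                        # whole part of the position
    fraction = position - lower
    if lower >= n:
        return float(ranked[-1])
    # Step 3: count to the ranked value, interpolating if the position is fractional.
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


scores = [10, 20, 30, 40, 50, 60, 70]   # hypothetical data, n = 7
print(percentile(scores, 50))           # position (7 + 1) * 50 / 100 = 4 -> 4th value
```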

types of statistical variables

Independent, dependent and constant

Skewness

Describes the direction in which a data distribution leans. Left-skewed distribution, right-skewed distribution and normal (zero-skewed) distribution

Measures of Shape

Measures the direction/frequency of outliers away from the central value

Data reclassification

Numerical data can be reclassified and transformed into categorical data - grouping. Eg - instead of listing ages 1-100, you can have groups like 1-5, 5-10, etc. They are then labels with no quantitative meaning

Univariate

Refers to a tabular format where data is summarized for a single variable - eg., simple frequency table

What charts (2) are best for data relationships?

Scatterplot, trendline

What charts (7) are best used for data visualisation?

Simple bar chart, Pie chart, Histogram, Time-series, Ogive, Line chart, Pareto chart

Types of mode

Single mode - unimodal distribution. Two modes - bimodal distribution. Three or more modes - multimodal distribution

Quantitative meaning

Some categorical data is written as numbers, but they are not numerical values because they have no quantitative meaning. This is because it makes no sense to perform mathematical operations on them. Eg: social insurance numbers, drivers license numbers, telephone numbers - makes no sense to find average, mode, etc.

Sources of data

Surveys, observations, experiments, published sources (e.g., government publications), technology

What is the cumulative percentage?

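The running total of the percentages: each category's percent is added to the percents of all the categories before it. The cumulative frequency column does the same with counts, listing the total of each frequency added to its predecessors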

What are the two types of tables?

Univariate and Bivariate

Stacked bar graph

Used to visualize bivariate data. Helpful in exploring relationships or comparing cases across various categories of the second variable

What are the components of a dataset?

Variables, observation, data list, datum, elements

What are the 4 big v's of big data?

Volume, Variety, Velocity and veracity

Time series

a line chart which shows sequences of data points collected over a successive time interval. Goal is to identify and analyze trends so that one can predict or forecast future outcomes. Helps to determine skewness

Key metrics for assessing data quality are:

accuracy, completeness, timeliness, validity, reliability, relevance, unbiased, limitation

Right-skewed distribution

also known as positively skewed. Characteristics -> long tail stretching to the right side of the graph, extreme values lie on the right, bulk of data is below the mean, mean is more than the median, Excel skewness coefficient is more than +0.5

Platykurtic distribution

also known as negative kurtosis - distribution with less variation than a normal distribution, less tailed, less data around the centre, small changes are more common and large changes are less likely due to the thinner tails. In Excel its excess kurtosis coefficient will be less than -0.5: between -0.5 and -1.0 if it's a moderate negative kurtosis, and less than -1.0 if it's a high negative kurtosis

Left-skewed distribution

also known as negatively skewed. Characteristics: a long tail to the left of the distribution, stretching towards the negative direction/left side of the graph, the extreme values lie to the left of the distribution, the mean is less than the median, the bulk of the data are above the mean, computed Excel skewness coefficient is less than -0.5. Happens when the mean is less than the median - the mean minus the median gives a negative result
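
The mean-minus-median rule of thumb can be sketched in Python. This is only a quick direction check under that rule, not a skewness coefficient such as the one Excel reports; the function name and the `tolerance` parameter are assumptions added for illustration.

```python
import statistics


def skew_direction(values, tolerance=1e-9):
    """Classify skew direction from the sign of (mean - median)."""
    difference = statistics.mean(values) - statistics.median(values)
    if difference > tolerance:
        return "right-skewed"   # mean pulled above the median by high extremes
    if difference < -tolerance:
        return "left-skewed"    # mean pulled below the median by low extremes
    return "zero-skewed"


print(skew_direction([1, 2, 3, 4, 100]))     # one extreme value on the right
print(skew_direction([1, 97, 98, 99, 100]))  # one extreme value on the left
print(skew_direction([1, 2, 3, 4, 5]))       # symmetric
```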

Mesokurtic data

also known as the zero kurtosis, describes a distribution with similar characteristics as a normal distribution - in Excel it will be between -0.5 and +0.5

Weighted mean

considers the weight or frequency assigned to each value in the distribution. The weight indicates the relative importance of each value in the dataset. Multiply each data value by its respective weight or frequency, then sum the results and divide the total by the sum of the weights (total frequency) to obtain the weighted mean. *think weighted as in assignments*
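
A minimal Python sketch of the weighted mean: multiply each value by its weight, add the products, and divide by the sum of the weights. The function name and the grade example are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical course: assignment worth 20%, midterm 50%, final 30%.
grades = [80, 90, 70]
weights = [20, 50, 30]
print(weighted_mean(grades, weights))   # (1600 + 4500 + 2100) / 100 = 82.0
```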

Percentage table

constructed from a frequency table by relating the frequency count of each category to the total count to determine the fraction or percentage of cases that belong to each category

Ogive chart

constructed like a line chart, but instead of having percentages plotted at each interval, cumulative percentages are plotted instead. Shows cumulative trend of data distribution to depict percentages of cases below a specific point

Line chart

continuous line formed by placing a dot over the midpoint of each class interval against the percent value of the class. Can reveal the shape of distribution in the form of clustering, spread, skewness, outliers, gaps and peaks

Interval scale

establishes a quantitative difference between values with arbitrary zero as the starting point (eg., IQ or temperature level). The data items are numerical, and can be quantified and ranked from lowest to highest. We can establish the actual interval difference between two values - we can say what is higher/lower than another data item while knowing how much the difference is.

Ratio scale

establishes a quantitative difference between variables with true zero as the starting point (eg., age and income). Can be quantified. Can be ranked from lowest to highest. Can establish actual numerical interval differences between the two values to determine how much one item is greater or lesser than another. Can also establish a ratio difference to determine how many times a value is more or less than another.

Ordinal scale

establishes differences between the values and these values can be ranked in order. (eg., income groups - low, middle, high). Items can be separated and ranked in order from lowest to highest on a value scale. Cannot establish the actual interval difference between 2 data items - we can say one group is higher/lower than another, but not by how much

Nominal scale

establishes differences between the values by category with no relation to order (eg- male and female). Thus, items can be categorized, but not ranked

Types of quantitative graphs

histogram, line chart, ogive, time series, scatterplot, trendline

Explanatory research

identify causes and effects of phenomena. To explain and predict how one phenomenon will change in response to change in another phenomenon. Eg., what factors are related to teenage pregnancy

Control table

introduces a third variable in a multivariate analysis to explore whether the presence of a third variable could explain the relation between the other 2 variables *most common*

Diagnostic analytics

involves using analytical tools to understand why something happened in the past - find the root causes of events but provide no insight

Median

middle value of a ranked observation. Rank values from least to greatest then find the middle. Cannot find the median for nominal data because it cannot be ranked

Grouped frequency table

organizes and simplifies a large set of numerical data into class intervals to find out the number and percentage of cases that fall within each class/group interval *most common*. Used to tabulate numerical data into convenient groups for the purpose of analysis. Eg., grouping student grades such as 50-60, 60-70, etc. A 'less than' is applied to the upper boundary of each class - so 50-60 actually represents cases of 50 or more, but less than 60. The class intervals must be continuous with no gap between successive groups, and the data in each group should be less than the upper limit of their class interval. No overlap between groups, all groups have equal width. Provides information to describe the proportion of cases that belong to each class interval, as well as the percentage of cases that are less than the upper limit of each class interval (cumulative percentage). Between 5 and 10 groups are recommended

Types of qualitative graphs

pie chart, bar chart, stacked bar graph, clustered bar graph, pareto chart

Leptokurtic distribution

also known as positive kurtosis - more variation than a normal distribution, curve is peaked at the centre with heavy, fat tails at the extreme ends, more extreme values in the tails, more data in the tails, less in the shoulders, and more around the centre. Excel excess kurtosis coefficient of more than +0.5. Coefficient values between +0.5 and +1.0 if it's a moderate positive kurtosis, and more than +1.0 if it's a high positive kurtosis
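
The +0.5 and -0.5 cutoffs these cards give for Excel's excess kurtosis coefficient can be written as a small classifier. This sketch only encodes those rule-of-thumb cutoffs; the function name is illustrative, and a value exactly at a cutoff is treated as mesokurtic by assumption.

```python
def classify_kurtosis(excess_kurtosis):
    """Label a distribution from its excess kurtosis using the +/-0.5 cutoffs."""
    if excess_kurtosis > 0.5:
        return "leptokurtic"   # positive kurtosis: peaked, heavy tails
    if excess_kurtosis < -0.5:
        return "platykurtic"   # negative kurtosis: flatter, thin tails
    return "mesokurtic"        # close to a normal distribution


for coefficient in (1.2, 0.1, -0.8):   # hypothetical coefficients
    print(coefficient, classify_kurtosis(coefficient))
```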

Contingency table

provides a summary for bivariate data showing the relation between 2 variables by exploring how one affects the other *most common*. Used to tabulate categorical data. Examining this table helps to explore and understand whether the independent variable influences the dependent variable. The frequency counts should be converted into percentages based on their respective column totals, row totals, and overall total. To explore the cause-and-effect relationship, it is proper to have the independent variable placed on top of the table, and the dependent variable on the side of the table
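
One way to convert the frequency counts into column percentages, sketched in Python. With the independent variable across the top, column percentages let you compare the dependent variable across the independent variable's categories. The nested-dictionary layout, the function name, and the tutoring data are all hypothetical.

```python
def column_percentages(counts):
    """Convert counts[row_label][column_label] into percentages of each column total."""
    column_totals = {}
    for row in counts.values():
        for column, freq in row.items():
            column_totals[column] = column_totals.get(column, 0) + freq
    # Assumption: every column has a non-zero total.
    return {
        row_label: {column: freq * 100 / column_totals[column]
                    for column, freq in row.items()}
        for row_label, row in counts.items()
    }


# Hypothetical data: independent variable (tutoring) on top, dependent (result) on the side.
counts = {
    "pass": {"tutored": 30, "not tutored": 20},
    "fail": {"tutored": 10, "not tutored": 20},
}
print(column_percentages(counts))
```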

Simple frequency table

provides a summary of univariate data showing the number of observations in each of the several categories *most common*

Pareto table

provides frequency count and percentage of cases for each category in a descending order with the objective to identify a limited number of tasks that produce a significant overall effect - helps to identify the top portion of issues to be addressed to resolve a majority of cases
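
A minimal Python sketch of a Pareto table: count each category, sort in descending order of frequency, and add percent and cumulative percent columns. The function name, the tuple layout, and the complaint data are hypothetical.

```python
from collections import Counter


def pareto_table(observations):
    """Rows of (category, frequency, percent, cumulative percent), most frequent first."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category, freq in counts.most_common():   # descending frequency
        percent = freq * 100 / total
        cumulative += percent
        rows.append((category, freq, percent, cumulative))
    return rows


# Hypothetical customer complaints.
complaints = ["late"] * 5 + ["damaged"] * 3 + ["wrong item"] * 2
for row in pareto_table(complaints):
    print(row)
```

Reading down the cumulative column shows how few categories account for most of the cases.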

What is percentile?

provides info about how a person or thing relates to a larger group. A percentile tells you what percentage of the scores are less than the data point you're analyzing

Standard deviation

quantifies the amount of variation in a dataset. Based on the difference between each observation and the mean. Tabular format is preferred

relative variability

range, interquartile range, variance and standard deviation are absolute measures while coefficient of variation is a relative measure of variation

Interquartile range

refers to the difference between the first and third quartile. Defines the middle half of the data, thereby avoiding the effect of outliers or extreme values. Still doesn't consider all the values in a dataset

Range

refers to the difference between the highest and lowest value in a dataset. To find: subtract the lowest value from the highest value. Easiest and simplest to calculate. Disadvantage is that it depends only on the two extreme values, so it is sensitive to outliers and ignores the values in between - can mislead conclusions about variability

Mode

refers to the value that appears most frequently. Used when wanting to find the most common value. Measure of central tendency - least powerful. Mode can be applied to all types of data and all scales of measurement

Clustered bar graph

represents discrete values for more than one item that shares the same category. Instead of stacking each item on top of one another, they are placed side by side to compare the data for many categories side by side

Simple mean

represents the average value in the data distribution. Most commonly used to summarize a distribution. Calculated by summing all the values in a dataset and then dividing this by the total number of observations in the dataset. Cannot be calculated for categorical data. When a dataset has an outlier, the mean is pulled toward it and may not represent a typical value

Cumulative table

shows the number or proportion or percentage of cases with values less than the upper limit of each class

Types of tables

simple frequency table, percentage table, contingency table, grouped frequency table, cumulative table, control table, pareto table

Descriptive analytics

the use of data to understand past and current business performance (without explaining why) and make informed decisions

Grouped mean

used to find an average for data that are grouped (grouped data). Calculate the midpoint (eg., 50-60 midpoint is 55) for each class interval, multiply each midpoint by the frequency assigned to that class interval, then sum these products and divide by the total frequency
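
A grouped mean is the frequency-weighted average of the class midpoints: sum of (midpoint x frequency), divided by the total frequency. A minimal Python sketch; the function name and the grade intervals are hypothetical.

```python
def grouped_mean(class_intervals, frequencies):
    """Estimate the mean of grouped data from class midpoints and frequencies."""
    if len(class_intervals) != len(frequencies):
        raise ValueError("need one frequency per class interval")
    midpoints = [(lower + upper) / 2 for lower, upper in class_intervals]
    weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_total / sum(frequencies)


# Hypothetical grade classes with midpoints 55, 65 and 75.
intervals = [(50, 60), (60, 70), (70, 80)]
frequencies = [2, 5, 3]
print(grouped_mean(intervals, frequencies))   # (110 + 325 + 225) / 10 = 66.0
```

Because every case in a class is treated as sitting at the midpoint, the result is an estimate of the mean of the raw data.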

Evaluation research

used to assess the effectiveness of policies and program outcomes against a baseline. Eg., whether after-school programs reduce teenage delinquencies?

Descriptive research

used to describe a prevailing issue of interest. Provides a method measuring and describing phenomena so we have a clear idea of the issue. Eg., monitor the teenage pregnancy rate for the past 5 years

Exploratory research

used to find out if a particular issue is becoming a growing concern in a community. Using facts and figures to explore a problem. Eg., is teenage pregnancy a growing societal problem?

Bar graph

used to graphically display relative frequencies for 2 or more categories when the emphasis is on comparison.

how to calculate quartiles

used to rank data and to locate the relative position of a value in a data set. Split an entire distribution into four equal parts -> 1st quartile is the 25th percentile, second is 50th, third 75th, fourth is 100th. Difference between the third quartile and first quartile is called the interquartile range

Scatterplot

used to reveal the relationship between 2 numerical sets of data. Shows how a variable is associated with others. Dots may follow a trend which may reveal a positive or negative relationship between the variables. A trend line can be constructed to show three forms of relationships - positive (both variables increase), negative (increase in one variable decreases the other), zero relationship

Histogram

used to visualize frequency tables, with the variable of interest placed on the horizontal axis and bars drawn with no gap between them. Reveals different shapes of data distribution, including skewness, peaks and gaps in the data distribution.

Pie chart

used when there is a small number of categories and you want to emphasize the relative importance of a particular category to the total

Predictive analytics

uses analytical tools to tell what is most likely to happen in the future as analytical models are constructed from past data to predict future outcomes or to assess the impact of one variable on another

Prescriptive analytics

uses optimization models that yield the best course of actions, and recommend actions to be taken to produce desired outcome/eliminate future problems

Normal distribution

zero skewed. Characteristics -> symmetric tails on both the left and right, extreme values lie on both ends of the distribution, bulk of data is clustered around the middle of the distribution, mean is equal to the median. The mean minus the median will be zero, which is why it's zero-skewed

