DESCRIPTIVE STATISTICS


MODE.MULT stands for

"mode multiple," and it will return every mode in the set of values selected.

Mean

(often referred to as average) is the sum of all values divided by the number of values in the dataset.

When calculating the variance of a set of values in Google Sheets, you will use the VAR function. The parameter for both of these functions (VAR and STDEV) is

Just the range of values you are interested in analyzing.

Standard deviation can tell you more than how spread out your whole dataset is. Analysts also use standard deviation to see how

Likely (or how normal) a specific value is compared to the whole dataset. They do this by creating ranges from standard deviations.

Similarly, with a negatively skewed (or left-skewed) distribution, the tail on the left side of the distribution is

Longer than on the right side. Most of the values within the dataset tend to cluster toward the right side of the x-axis.

After you identify your lower and upper halves, you can find the median for each of these halves. In this dataset: Q1, the median of the lower half = $34,000. Q2, the median of the whole dataset = $42,500. Q3, the median of the upper half = $56,000.

Lower Half: $27K, $33K, $34K, $37K, $42K. Because there are five values in the lower half, the median is the value listed in the third position: $34k. Median: $42.5K Upper Half: $43K, $54K, $56K, $66K, $1M. Because there are also five values in the upper half, the median is the value listed in the third position: $56k.
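
The median-of-halves procedure described in this card can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the helper names `median` and `quartiles` are hypothetical, and the choice to leave the middle value out of both halves when the count is odd is an assumption (other conventions exist).

```python
def median(values):
    """Middle value of a list; averages the two center values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


# Salary values taken from the cards in this set.
salaries = [27_000, 33_000, 34_000, 37_000, 42_000,
            43_000, 54_000, 56_000, 66_000, 1_000_000]
q1, q2, q3 = quartiles(salaries)
print(q1, q2, q3)  # 34000 42500.0 56000
```

Spreadsheet functions such as QUARTILE may interpolate between values, so their results can differ slightly from this by-hand method.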

You can discover how spread out your data is by using another type of descriptive statistics called

Measures of spread.

Depending on how wide the spread of values is, the measures of central tendency will be

More or less representative of the dataset.

The last measure of central tendency is the mode. Mode is the value that appears

Most frequently in the dataset. It is often referred to as the most "popular" value.

Line has a taller peak near the mean. The taller peak implies that a larger percentage of the data values fall

Near the mean. This distribution of data points is less spread out and has a smaller variance because more of its points fall near the mean.

When it comes to the shape of a distribution, it can either be symmetrical or asymmetrical. The most common distribution shapes are

Normal distribution, positively skewed distribution, and negatively skewed distribution.

Standard deviation is useful in the context of the data because

Now you can compare its value in relation to the mean. For example, an analyst could say that the values between the mean minus one standard deviation and the mean plus one standard deviation fall within one standard deviation of the mean.

Q1: $34,000 Q3: $56,000 IQR: $22,000

Now, calculate the maximum value so you can see if there are any outliers on the upper end:
Q3 + 1.5 * IQR = Maximum
$56,000 + 1.5 * $22,000 = $56,000 + $33,000 = $89,000
This means that any value more than $89,000 is an outlier. This makes the $1,000,000 salary an outlier. Because it is an extreme outlier, it is good practice to eliminate it from your calculations for further analysis.

However, outliers are not always that obvious. We've defined an outlier as a data point that is far outside the expected range of values. To remove this uncertainty, you can define an outlier using

Numbers

Statistics is the branch of mathematics that focuses

On analysis and interpretation of data.

Steps to calculate the median:

Order your list of values in ascending order. Count the number of values in your list and determine if the number of values you have is odd or even.

Descriptive statistics allows you to

Organize, display, and describe data using data visualizations, which can turn difficult-to-understand variables across a large dataset into bite-sized descriptions.

Range can be easily influenced by

Outliers and may not always be representative of the spread if there is one exceptionally high or low value in the dataset.

When it comes to skewed distributions, the median is

The best measure of central tendency to represent your dataset. The median would best represent the dataset since it is least impacted by outliers.

In addition to skewness, you should also familiarize yourself with the shape of the distribution. The shape of a distribution is described by

The distribution's peaks and its symmetry, skew, or uniformity.

Skewness is the measure of symmetry or asymmetry of a distribution. It describes if the shape of data is

The same on both sides or if the majority of data falls to the left or to the right of the center (median).

A distribution is right-skewed if the taper

is on the right side and left-skewed if the taper is on the left side.

In a positively skewed distribution, most of the values within the dataset tend to cluster toward the left side of the x-axis, so the tail on the right side is longer. You can think of the curve as being stretched to the right, which is why

it is referred to as positively skewed, right-skewed, or sometimes right-tailed.

Now that you know how to visualize distributions,

it's time to explore the intricacies of their shapes.

While measures of spread such as range, quartiles, and interquartile range are useful, they have some limitations. These measures don't take into account

Every data point contained in the dataset. To get a more representative idea of spread, you need to take into account every data point in the dataset.

A uniform distribution has a flat top because

Every group of values has an equal frequency.

Interquartile range:

A measure that describes the difference between the third quartile and the first quartile, which tells you about the range of the middle half of the values.

In statistics, the sample is the subset of the population. Anytime you do not have the entire population of the thing you are studying, you are using

A sample of that population.

There is no function in Google Sheets called "mean." However, there is an

AVERAGE function that has the same functionality. You will use the AVERAGE function when finding the mean of a set of values.

DISTRIBUTION AND HISTOGRAM TAKEAWAY

Buckets (also called bins) let you break the range of your data into smaller ranges. Histograms display the count of data points in each bucket. Grouping data into a histogram lets you display the overall spread of your data.
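
To make the idea of buckets concrete, here is a small Python sketch that counts how many data points land in each fixed-width bucket. The function name `bucket_counts`, the bucket width, and the `ages` values are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def bucket_counts(values, width):
    """Count values per bucket, keyed by each bucket's lower bound."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical ages, grouped into decade-wide buckets.
ages = [21, 25, 34, 37, 38, 42, 49, 53]
counts = bucket_counts(ages, width=10)
print(counts)  # {20: 2, 30: 3, 40: 2, 50: 1}
```

Buckets with no data points are simply absent from the result; a charting tool would normally draw them as empty bars.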

A single value from a dataset can carry plenty of meaning, but it can mean even more when

Compared to a larger dataset (such as your math test score when compared to the scores of the entire class).

Histograms are most frequently used to show the distribution of

Continuous data, which lends itself well to being grouped and visualized by intervals.

Outliers are important to investigate, but how do they fit into the larger process of data analysis?

In one instance, they can exist in the initial data given to you. In this case, they should be cleaned or dealt with accordingly when you've reached that step in the data wrangling checklist

A residual is the distance between a data point and the mean. For example, a data point of 7 and a mean of 9 would create a residual of

2. To calculate variance, you square each residual, add the squares together, and divide by the count of the numbers in your set: Variance = Sum of squared residuals / # of values in the set

In general, the larger the variance, the more spread out your data is. Variance is more useful to analysts than quartiles and interquartile ranges for understanding spread because

It takes every data point into account when calculating its value. This means variance can be more easily swayed by outliers, which may be ignored when calculating a specific quartile.

Parameters are fairly simple for this function. Google Sheets only needs

The range of values and the quartile number you want (1, 2, or 3) in the parameters. You will need to call each quartile individually, meaning that you will need to call the function three separate times to get values for Q1, Q2, and Q3.

What do outliers easily do to data?

Skew the data.

The median is the middle value in a list of values. It is very important that the list of values is

Sorted from least to greatest, or greatest to least, when calculating the median by hand.

Variance is useful to analysts because it can shed light on how

Spread out the data is.

Once you identify the center of your dataset, your next step is to find out how

Spread out your data is. This allows you to see how well the measures of central tendency represent the data.

Measures of spread help you uncover how

Spread out your data points are. In a larger spread, there are likely bigger spaces between the values within a dataset.

Standard deviation is the measurement of average distance between each value and the mean. Calculating it is as simple as:

\[ \text{Standard deviation} = \sqrt{\text{Variance}} \]
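
Because the standard deviation is the square root of the variance, it can be sketched in a couple of lines of Python. This example assumes sample data and reuses the five values from the variance worked example elsewhere in this set; `statistics.variance` is the sample (n - 1) version from the standard library.

```python
import math
from statistics import stdev, variance

# Values from the variance worked example in this set (treated as a sample).
data = [23, 12, 16, 34, 5]

sample_variance = variance(data)      # units squared
std_dev = math.sqrt(sample_variance)  # back in the original units
print(round(std_dev, 2))  # 11.07

# The standard library's stdev takes the same square root in one call.
assert math.isclose(std_dev, stdev(data))
```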

The difference between $56,000 (Q3) and $34,000 (Q1) is $22,000 (IQR). This means

That the middle 50% of the data points are within $22,000 of each other.

The best part about this QUARTILE function is

That you do not need to order your values, find medians, or split your data beforehand.

Because the range is the difference between the minimum and maximum values of a dataset, you can use

The MAX and MIN Google Sheets functions you have already learned in this course. Specifically, assuming Salary is in Column B: =MAX(B:B)-MIN(B:B)
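
The same range calculation can be sketched outside a spreadsheet. This Python example uses the built-in `max` and `min` on the salary values quoted in this set; the variable names are illustrative.

```python
# Salary values taken from the cards in this set.
salaries = [27_000, 33_000, 34_000, 37_000, 42_000,
            43_000, 54_000, 56_000, 66_000, 1_000_000]

# Range = maximum value minus minimum value.
salary_range = max(salaries) - min(salaries)
print(salary_range)  # 973000
```

The single $1,000,000 salary accounts for almost all of this range, which shows how strongly one extreme value can influence it.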

When calculating the mode of a set of values, you can use

The MODE function in Google Sheets.

You'll notice that variance is in "units squared." This is because

Each distance between a data point and the mean is squared so that it becomes a positive number, which lets differences in distance be represented fairly across all points.

Range:

The difference between the lowest and the highest value within a dataset.

In statistics, the population is

The entire group that you want to draw conclusions about.

Together, five of these numbers: minimum, lower quartile (Q1), median, upper quartile (Q3), and maximum can make up what is called

The five number summary.

You can also use quartiles to calculate

The interquartile range (IQR), which shows how spread out the middle 50% of your data is.

There are two types of variance and standard deviation formulas:

The population and sample formulas.

A five number summary puts data in order and breaks it into four equal ranges. The ranges are equal by

Their distribution and each range comprises 25% of your total data.

The first step in calculating your quartiles is

To sort the data from lowest to highest (ascending order) and locate the median. Since the employee salaries dataset has an even number of values (10), the median is calculated by averaging the two center values. Once you find the median ($42,500), you can divide the remaining data into the lower and upper halves. Lower half: $27K, $33K, $34K, $37K, $42K. Median: $42.5K. Upper half: $43K, $54K, $56K, $66K, $1M.

Data cannot have a mode if all the values appear the same number of times. There can also be multiple modes if

Two or more values tie for the highest number of appearances.

Median vs. Mean

Unlike the mean, the median is less affected by outliers and skewed data. This is because the median only uses the central value(s) for its calculation, which signifies that the only impact an outlier can have is in shifting your median number over by one position.

Calculating variance and standard deviation manually can be tricky, especially with a large dataset. Google Sheets has formulas for calculating variance and standard deviations:

VAR and STDEV.

Quartiles:

Values that divide your dataset into quarters. Similar to how the median divides the dataset in half, the quartiles split your data into four equal parts.

Standard deviation lets you take into account every data point. You find

Variance on the way to calculating standard deviation.

Descriptive statistics is a subset within statistics that allows

You to describe (or explain) and summarize the data you collect and then present it in an efficient and meaningful way.

There is no specific calculation to find the mode: you simply look for

the most frequent values. For example, take a look at this list: [11, 12, 72, 12, 49, 11, 77, 13, 12]. The mode would be 12 because it appears the most often in the list (three times).
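
Finding the mode by counting can be sketched with Python's standard `statistics` module. The first list is the one from this card; the tied list is a hypothetical example added to show the multiple-mode case.

```python
from statistics import mode, multimode

# List from the card above: 12 appears three times.
values = [11, 12, 72, 12, 49, 11, 77, 13, 12]
single_mode = mode(values)
print(single_mode)  # 12

# Hypothetical list in which two values tie for most frequent.
tied = [1, 1, 2, 2, 3]
all_modes = multimode(tied)
print(all_modes)  # [1, 2]
```

`multimode` returns every tied value, which loosely mirrors the difference between MODE and MODE.MULT in Google Sheets.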

Notice that the mean, median, and standard deviation all share the same units, while

the variance is units squared.

When calculating the median of a set of values, you can use the

MEDIAN function in Google Sheets.

When you use descriptive statistics, it is helpful to summarize a group of data using

Measures of spread and measures of central tendency and to include visualizations that represent the distribution of data.

Variance is useful to see how spread out a dataset is, but

if the data contains outliers, the variance is easily swayed. Another issue with variance is the units. Since the calculation requires units to be squared, variance is not in the same units as the other values in the dataset, meaning they cannot be directly related. To correct the units squared, you need to take the square root of the variance.

A histogram

is a chart that lets you show the distribution (shape) of a set of continuous data. It is the most commonly used chart to show a numerical variable's frequency (i.e., how many times it appears within a dataset)

The three measures of central tendency are

mean, median, and mode.

You'd want to use the population standard deviation formula when you use

population data.

The first three measures of spread you will learn are

range, quartiles, and interquartile range.

In Google Sheets, standard deviation works in the same way that variance does: STDEV only

requires the range of values as input.

The range is the simplest measure of spread, and it allows you to

see the boundaries of your dataset.

The five number summary is useful because

It can provide you with context about the spread of your data using very little information.

A normal distribution (also called bell curve) is

A distribution characterized by its bell-like shape, where most data is centered around a mean. In this shape, the measures of central tendency are all equal and will all be in the middle of the distribution.

A frequency distribution table (or frequency table) is

A table that identifies the number of times that pieces of data occur in a given sequence. Frequency distribution tables can show either qualitative or quantitative variables.

Bar Chart VS Histogram Ordering

Bar Chart: Data can be moved and reordered
Histogram: Data cannot be moved or reordered

Bar Chart VS Histogram Grouping

Bar Chart: Each data point is shown as a separate bar
Histogram: Data points are grouped and displayed according to bin value

Frequency Distribution Tables, Bar Charts, and Histograms

Bar charts feature categorical data, while histograms only display numeric data. Bar charts let you make comparisons between groups, sort the data, and make an overall assessment of groups in the data. A frequency distribution table lets you consolidate data based on quantitative variables, such as decades of people's age. Histograms let you graph continuous numerical data on the x-axis to display the overall distribution of a variable.

The normal distribution looks like a

Bell because it has a peak in the middle and tapers off on either side.

A distribution is considered skewed when it

Is not symmetrical: it has a peak on one side and a longer taper on the other.

Since the normal distribution is perfectly symmetrical, it has no

Skew. When your dataset can be represented by a normal distribution, it is best to report the mean of these values.

A histogram makes it easy to spot

Skewness and see how symmetrical your data is. By smoothing the differences from bin to bin with a curve line, it becomes easier to spot any skew, as well as the general shape (e.g. bell curve, uniform, etc.) of the distribution

These are the measures that indicate the symmetry and weight of the tails of the distributions of a dataset. The most important measure of shape is

Skewness. These measures are important because they help you to further characterize the location and variability of your data.

A distribution is said to be positively skewed (or right-skewed) when

The tail on the distribution's right-hand side is longer than the tail on the left-hand side.

It's easiest to categorize skewness of a distribution by looking at the tails. The tail of a distribution refers

To the part of the distribution that's far away from the mean. Tails are the right and left ends of the chart, where the line meets the horizontal axis.

A bimodal distribution looks like

Two bell curves combined and has two different modes at each peak.

The shapes of the distribution can help you identify

Which of the descriptive statistics will be most useful to use.

In a positively skewed distribution, the mean will usually be the most positive (largest) measure of central tendency and the closest measure to the tail, while

the mode will usually be the furthest measure from the tail.

HOW TO FIND THE MEDIAN WITH AN EVEN SET OF NUMBERS:

If your list has an even number of values, you will need to take an extra step: find the mean of the two middle values. In the salary dataset, the mean of the two middle values ($42,000 and $43,000) is $42,500, which is the median of this even-numbered dataset.
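
The even and odd cases can be handled together in a short Python sketch. The function name `median` is illustrative (the standard library's `statistics.median` does the same job), and the salary values are the ones quoted in this set.

```python
def median(values):
    """Return the middle value; average the two middle values when the count is even."""
    ordered = sorted(values)  # the list must be sorted first
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Ten salaries: an even count, so the two middle values are averaged.
salaries = [27_000, 33_000, 34_000, 37_000, 42_000,
            43_000, 54_000, 56_000, 66_000, 1_000_000]
salary_median = median(salaries)
print(salary_median)  # 42500.0
```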

An outlier is a data point that

Sticks out from the rest. It does not fit in a trend that the other data shows, or it falls outside (above/below) a range of values in which we would expect the data to fall.

The interquartile range (IQR) is calculated by

Subtracting Q1 from Q3. Since this measure is less sensitive to outliers, the interquartile range can help you understand the spread of your data without extreme data points.

The range is calculated by

Subtracting the minimum value from the maximum value of a dataset.

If the variances for the TALL and WIDE lines were 25 and 50, respectively, you could interpret them as follows:

TALL distribution: On average, the squared distance of each value from the mean is about 25 units squared.
WIDE distribution: On average, the squared distance of each value from the mean is about 50 units squared.

The mode is most useful when the variable is discrete, since there are likely multiple occurrences of one value. When you need to figure out which discrete value occurs the most, the mode will

Tell you this information. When the data is continuous, it is less likely to have repeat values because there are typically more marginal differences (e.g., 3.00 vs. 3.01) between values.

On the other hand, the solid line's bell shape is much wider. This implies

That more points fall farther away from the mean, which suggests the variance would be larger.

Quartiles are a useful measure of spread because

They are less affected by extreme data points such as outliers. If Q1 is farther away from Q2 than Q3 is from Q2, then you know that there is a greater spread among the smaller values than among the larger values

Quartiles are helpful because

They let you break down the spread of data into more digestible pieces. In addition, they let you analyze and compare the spread in each quartile to one another to better understand how the distribution of data points changes or remains the same across their range.

Although outliers skew the data, they can also tell analysts a lot of information. As an analyst, you should always question

Why the outlier is in the data: Could it be an error from when the data was collected? Maybe it really is an employee's salary — for example, the CEO's?

Using this rule, you can create ranges of plus and minus one standard deviation, two standard deviations, and three standard deviations away from your mean as follows:

68% of the data falls between mean - 1SD and mean + 1SD
95% of the data falls between mean - 2SD and mean + 2SD
99.7% of the data falls between mean - 3SD and mean + 3SD
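
The three ranges can be generated from any mean and standard deviation. The sketch below is only illustrative: `empirical_ranges` is a hypothetical helper, the mean of 100 and standard deviation of 15 are made-up inputs, and the percentages only apply when the data is approximately normally distributed.

```python
def empirical_ranges(mean, sd):
    """Return the (low, high) bounds for 1, 2, and 3 standard deviations."""
    return {
        "68%": (mean - 1 * sd, mean + 1 * sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }


# Hypothetical inputs chosen for illustration.
ranges = empirical_ranges(mean=100, sd=15)
for share, (low, high) in ranges.items():
    print(f"{share} of values fall between {low} and {high}")
```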

Many consider the interquartile range to be a more representative measure of spread. For example, a dataset like [3, 4, 6, 6, 7, 9, 12, 1569] would have

An incredibly wide range, even though most of the data points are within a few values.
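
A quick Python sketch makes the contrast visible for the dataset in this card. It assumes the median-of-halves method for quartiles; spreadsheet functions that interpolate may give slightly different quartile values.

```python
from statistics import median

# Dataset from the card above.
data = [3, 4, 6, 6, 7, 9, 12, 1569]
ordered = sorted(data)
half = len(ordered) // 2

# Lower and upper halves (for an odd count this leaves out the middle value).
lower, upper = ordered[:half], ordered[-half:]

data_range = ordered[-1] - ordered[0]
iqr = median(upper) - median(lower)
print(data_range)  # 1566
print(iqr)         # 5.5
```

The range is dominated by the single value 1569, while the interquartile range reflects how close together the middle values are.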

Sometimes outliers can be explained, and sometimes they occur by chance. Other times, they may require further investigation. As an analyst, you will need to

Ask questions when investigating why an outlier might be in your data. Most of the time, these outliers can give you insight into your data.

Bar Chart VS Histogram Type of Data

Bar Chart: Categorical variables and/or discrete variables
Histogram: Continuous numeric variables

Bar Chart VS Histogram Spacing

Bar Chart: There is a space between each bar, and the distance between each bar is irrelevant
Histogram: No space between each bar

Bar Chart VS Histogram Usage

Bar Chart: Used to compare different categories of data
Histogram: Used to display the distribution of a variable

With the empirical rule, you can make generalizations about values that fall outside these ranges. For example, a value that fell outside one standard deviation from the mean, or 68% of the center of the data, would

Be uncommon.

The mean is affected by outliers, while the median is far less affected. For this reason,

Both provide value to analysts even though they both measure central tendency.

Quartiles are the values that divide a dataset into quarters. Similar to how the median divides the dataset in half, the quartiles split

Data into four equal parts.

There are a few ways to quantitatively define outliers, and one of them is using the IQR. Using a quantitative definition, an outlier is a value that

Falls more than 1.5 IQRs below Q1 or 1.5 IQRs above Q3 in a dataset. This just means that there is a minimum and maximum value that you will calculate to determine the range of non-outliers. If a value falls outside that range, then it is considered an outlier. This removes the subjectivity of determining an outlier.
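
The 1.5 × IQR rule can be sketched as a small Python helper. The name `iqr_fences` and the parameter `k` are hypothetical; the quartiles and salary values are the ones quoted in this set.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (minimum, maximum) bounds of the non-outlier range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles from the salary example in this set.
lower, upper = iqr_fences(q1=34_000, q3=56_000)
print(lower, upper)  # 1000.0 89000.0

salaries = [27_000, 33_000, 34_000, 37_000, 42_000,
            43_000, 54_000, 56_000, 66_000, 1_000_000]
outliers = [s for s in salaries if s < lower or s > upper]
print(outliers)  # [1000000]
```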

The first step to finding the variance would be to calculate the mean, which is \(\bar{x} = 18\). Then

Find the squared distance that each value falls from the mean. Remember that:
\[ s^2 = \frac{\text{Sum of squared distances from the mean}}{\text{Number of data points}} \]
Take each data point and subtract the mean from its value. Then, square that result. Once you calculate this for every data point, add all of the results together and divide the sum by the total number of data points (\(n = 5\)). When you are using sample data, you also need to subtract one from the number of data points before dividing. Take a look at this in practice:
\[ s^2 = \frac{(23-18)^2 + (12-18)^2 + (16-18)^2 + (34-18)^2 + (5-18)^2}{5 - 1} \]
\[ s^2 = \frac{25 + 36 + 4 + 256 + 169}{5 - 1} \]
\[ s^2 = \frac{490}{4} = 122.5 \]
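
The same arithmetic can be checked with a short Python sketch. The function name `variance` and its `sample` flag are illustrative choices, not part of any course material; the flag switches between dividing by n - 1 (sample) and n (population).

```python
def variance(values, sample=True):
    """Sum of squared distances from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < 2 and sample:
        raise ValueError("sample variance requires at least two values")
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_distances) / divisor


# The five values from the worked example above (mean = 18).
data = [23, 12, 16, 34, 5]
sample_var = variance(data)
population_var = variance(data, sample=False)
print(sample_var)      # 122.5
print(population_var)  # 98.0
```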

Q1: $34,000 Q3: $56,000 IQR: $22,000

First, calculate the minimum value so you can see if there are any outliers on the low end:
Q1 - 1.5 * IQR = Minimum
$34,000 - 1.5 * $22,000 = $34,000 - $33,000 = $1,000
This means that any value less than $1,000 is an outlier. Thankfully, the lowest value in this data is $27,000, so there are no outliers on the low end.

In other cases, they appear in your results at the end of your analysis. In these instances, they do not

Get cleaned and instead become part of your findings, which turn into inferences and insights.

Measures of spread describe

How similar or varied a set of values is from the central values. These measures include range, quartiles, interquartile range, variance, and standard deviation

In addition to helping you understand the spread of your data, IQR also helps you

Identify outliers in your data.

When you are working with thousands of rows of data, quickly spotting your outliers will not be as simple. One method involves comparing the mean and the median, with the following result:

If there are no extreme outliers in a dataset, the mean and median will be similar. If outliers do exist in the dataset, then the mean and median will be very different.
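
This mean-versus-median check can be sketched with the salary values quoted in this set. The comparison is only a rough signal; how large a gap counts as "very different" is a judgment call that this sketch does not try to settle.

```python
from statistics import mean, median

# Salary values from this set, including the $1,000,000 outlier.
salaries = [27_000, 33_000, 34_000, 37_000, 42_000,
            43_000, 54_000, 56_000, 66_000, 1_000_000]

print(mean(salaries), median(salaries))  # mean is far above the median

# Same data with the extreme value left out: the two measures move close together.
trimmed = salaries[:-1]
print(round(mean(trimmed), 2), median(trimmed))
```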

HOW TO FIND THE MEDIAN WITH AN ODD SET OF NUMBERS:

If your list has an odd number of values, then you simply need to find the value that falls in the exact middle of the sorted list. If you have a large number of values, use the formula below to calculate the middle position.
\[ \text{Median position} = \frac{n + 1}{2} \]
In this formula, n represents the number of values in the list.

MODE will return the first mode it finds. You would be able to identify this quickly in this set of values because it is fairly small, but if you had a larger set of values, you would not immediately notice that the MODE function did not return all the modes. Thus,

In Google Sheets, you will use the MODE.MULT function.

Measures of central tendency are used to

Indicate and describe the central position of a group of data. These measures are important because they help you condense data, find representative values, make comparisons, and perform further statistical analysis.

Median

Is the middle value that separates the higher half from the lower half of a dataset.

Mode

Is the value in a dataset that appears most frequently.

Visually, a bar chart and a histogram look very similar, as they both use bars to display the data. So, how do you know when it's most appropriate to visualize your data with a histogram or with a bar chart?

It helps to look at the differences between the two types of charts.

The mean is the most popular measure of central tendency because

It uses every data point in its calculation. Mean is calculated using the following formula: Mean=Sum of All Values/Number of All Values
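
The formula translates directly into Python. This is a minimal sketch: the function name `mean` is illustrative, and the input values are the five numbers from the variance worked example in this set.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Values from the variance worked example in this set.
data = [23, 12, 16, 34, 5]
average = mean(data)
print(average)  # 18.0
```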

A quartile measures the spread of values above and below the median by dividing the distribution into four groups (or quarters).

Q1: median of the lower half of the data. Q2: median of the whole dataset. Q3: median of the upper half of the data. Q4: maximum value of the set of data.

When calculating the quartiles of a set of values in Google Sheets, you will use the

QUARTILE function.

Descriptive statistics will allow you to

Quantitatively and qualitatively describe the main features of a collection of information, such as where the middle of the data is, how spread out it is, and where clumps of values or anomalies may exist.

