AP Statistics Unit 1 Review


Examples of statements outlining categorical data using proportions:

"In a survey of 100 people, 50% identified as male and 50% identified as female." "In a sample of 300 customers, 20% reported having a positive experience with the company's customer service, while 80% reported a negative experience." "Of the 1000 people surveyed, 30% reported having a bachelor's degree, while 70% reported having a high school diploma or lower level of education." In each of these examples, the proportion of individuals or items in each category is described using percentages. This allows us to see the relative frequency or prevalence of each category within the data.

Mean or Median? *It's always a good idea to report both the mean and median when describing the statistical properties of a dataset, and to explain why they are different if they are not close to each other. This can help to provide a more complete picture of the distribution of the data and how it is dispersed around the center.

* If the distribution is symmetric and unimodal, the mean is often the best measure of central tendency because it takes into account all of the values in the dataset and reflects the overall trend in the data. * On the other hand, if the distribution is skewed or has outliers, the median is often a better measure of central tendency because it is resistant to the influence of these values. In right-skewed distributions, the mean is generally higher than the median, while in left-skewed distributions, the mean is generally lower than the median.

Standard Deviation or IQR?

* For roughly symmetric distributions without outliers, the IQR is generally somewhat larger than the standard deviation (for a normal distribution, the IQR is about 1.35 standard deviations), although the exact relationship depends on the characteristics of the data set. For a symmetric, unimodal distribution, the mean and median will be approximately equal, and the standard deviation and IQR provide complementary information about the dispersion of the data. In this case, it is appropriate to report the mean and standard deviation to give a sense of the center and spread of the distribution. For skewed distributions, the median is often a better measure of central tendency than the mean, because the mean can be pulled by extreme values or outliers. In this case, it is appropriate to report the median and IQR. In general, reporting a measure of center together with a measure of spread is a good plan of action, as this gives a more complete understanding of the data set. Reporting only one measure, such as the standard deviation or IQR alone, can be misleading or incomplete.

Tips

* The choice between bar graphs and pie charts depends on how many categories the variable of interest has and how large each category is. Whenever you have many categories, or a few categories with about the same frequencies, the bar graph should be your first choice. If the pie has many slices or slices of nearly the same size, it will be hard to compare the groups. * Be careful of distortions of quantity, and keep to the area principle: the area a display gives to each value should be proportional to the value it represents.

Remember that quantitative variables refer to variables that can be measured or counted and have a numerical value. Under quantitative variables are two mini-types:

1. A discrete variable can take on a countable number of values. The number of values may be finite or countably infinite, as with the counting numbers. Examples of discrete variables include the number of children in a family or in a class, the number of cars in a parking lot, and the number of votes received by a political candidate in a mayoral election. 🚗
2. A continuous variable can take on infinitely many values, but those values cannot be counted. No matter how small the interval between two values of a continuous variable, it is always possible to determine another value between them. For example, it is not possible to count the number of possible values for height, because there are an infinite number of possible values between any two given values. ♾️
* Other examples of continuous variables include the length of a piece of wood, the time it takes to run a marathon, and the temperature of a room.

Reminder

1. The mean, median, quartiles, and percentiles measure the center and position for quantitative data 2. The range, IQR, and standard deviation measure the variability for quantitative data

Box Plots

A box plot, also known as a box-and-whisker plot, graphically represents the five-number summary. It is a way to visualize the distribution of a dataset and to identify any outliers or unusual values!
* To create a box plot, you start by drawing a horizontal line called the "axis" and marking the minimum, first quartile, median, third quartile, and maximum values of the dataset along it. These marks are then used to create a box shape: one end of the box sits at the first quartile, the other end sits at the third quartile, and the line inside the box marks the median.
* The "whiskers" extend from the ends of the box to the smallest and largest values that are not outliers (the minimum and maximum when there are no outliers). Any points beyond the whiskers are considered outliers and are plotted separately.
* Using the interquartile range, or IQR, we can set up fences to detect outliers in our data: Upper fence = Q3 + 1.5 × IQR and Lower fence = Q1 - 1.5 × IQR.
* The fences are not drawn in the box plot, but they help us decide where the whiskers end. Any value beyond the fences is displayed with its own symbol, such as an asterisk, indicating that it is an outlier, something we could hardly see in most other quantitative displays.

Contingency Table (Two-Way Table) *If the proportions in the cells of the contingency table are the same across all categories of the other variable, the data are consistent with the variables being independent. If the proportions differ from one category to another (with some noticeably higher than others), then the variables might be related. For example, if you are analyzing data on the relationship between gender and income, you might find that the proportions of men and women in different income categories are different, indicating some sort of relationship between the two variables.

A contingency table is a type of table that is used to organize and (later on) analyze categorical data. It shows how the observations in a dataset are distributed among different categories of two or more variables. Contingency tables can help in understanding relationships between variables and identifying patterns or trends in the data.
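
One way to compare the categories of a two-way table is to turn each row into proportions and look at them side by side. The sketch below assumes a small table of made-up counts (the group and income labels are hypothetical); the two rows have different counts but the same proportions, which is the pattern to expect when the variables are unrelated.

```python
# Hypothetical counts, invented for illustration (not real survey data).
table = {
    "group A": {"lower income": 30, "higher income": 20},
    "group B": {"lower income": 45, "higher income": 30},
}


def row_proportions(row):
    """Convert one row of counts into proportions of that row's total."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}


# Conditional distribution of income within each group.
conditional = {group: row_proportions(row) for group, row in table.items()}

for group, proportions in conditional.items():
    summary = ", ".join(f"{cat}: {p:.2f}" for cat, p in proportions.items())
    print(f"{group} -> {summary}")
```

If the printed proportions differed noticeably between the rows, that would suggest some relationship between the two variables.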

Five Number Summaries

A five number summary provides a concise summary of a dataset. It consists of the minimum value, the first quartile (Q1), the median, the third quartile (Q3), and the maximum value of a dataset.

Frequency Tables * A frequency table is a list that shows how often each value occurs in a set of data. It shows the number of times (or frequency) that each value appears in the data. *Ex: The variable is stress on job, which assumes three categories; very, somewhat, and none. Since there is some order, stress on job can be ranked as an ordinal variable. The frequency table always reports the sum of the frequencies that makes up our sample

A frequency table for qualitative data lists all categories in one column and the number of elements that belong to each category in the next column. Tallies (e.g., ||||) can be used to count the raw data. For example: let's say we just finished sending out a survey to our AP Statistics class on how stressful being a student is as a hypothetical occupation, with our categories being "very," "somewhat," and "none" (not stressful at all). We have then collected 30 responses. How do we organize these responses and make sense of them? One way to do so is to "pile" the data by counting the number of data values in each category of interest. From there, we can organize these counts into a frequency table, which records the category names and the totals.
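
The "piling" step above can be sketched in a few lines of Python. The 30 responses here are made up for illustration; only the three category names come from the example.

```python
from collections import Counter

# Hypothetical responses (invented for illustration, 30 in total).
responses = ["very"] * 12 + ["somewhat"] * 13 + ["none"] * 5

# "Pile" the data: count how many responses fall in each category.
counts = Counter(responses)
total = sum(counts.values())

# Print the frequency table, ending with the sum of the frequencies.
print(f"{'Stress level':<14}{'Frequency':>10}")
for category in ["very", "somewhat", "none"]:
    print(f"{category:<14}{counts[category]:>10}")
print(f"{'Total':<14}{total:>10}")
```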

Histograms

A histogram is a graphical representation of a distribution of data, where the data is divided into piles called bins. The bins are created by dividing the range of the data into equal-width intervals, and the height of each bar in the histogram corresponds to the number or proportion of observations that fall within the interval represented by that bar. The width of the bars represents the interval width, and the x-axis represents the values of the data. *Unlike the bar graphs, there is no space between histogram bins. If there is a space, then that indicates an actual gap in data with no values. The height of bins represents the frequencies of the classes. Remember, always check the quantitative data assumption to verify the right graph or display.
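
As a rough sketch of how bins work, the code below counts how many values land in each equal-width interval and prints a text histogram. The data values, the starting point of 20, and the bin width of 20 are all arbitrary choices made for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start, start + width), and so on."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Illustrative data; bin start and width are assumptions for this sketch.
data = [23, 28, 35, 40, 45, 65, 68, 69, 84]
start, width, n_bins = 20, 20, 4
counts = bin_counts(data, start, width, n_bins)

# One row of '#' per bin; the row length plays the role of bar height.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:>3}-{low + width - 1:<3} {'#' * count}")
```

Changing the bin width changes the look of the histogram, so it is worth trying more than one width before describing the shape.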

Pie Charts *It's important to keep in mind that pie charts are best used to compare the relative proportions (percentages and relative frequencies, for example) of different categories. They're not as effective at showing precise values or small differences between categories. If you want to show detailed values or compare the values of multiple categories, it is usually better to use a different type of graph, such as a bar chart.

A pie chart is a circular graph that is divided into slices, with each slice representing a different category. The size of each slice is proportional to the fraction of the whole that is represented by that category. Pie charts are often used to show the relative proportions of different categories within a dataset.

Intro to Z-Scores *For this reason, z-scores are also called standardized values. In sports, z-scores can be used to compare athletes' results across events that are scored on different scales. Negative z-scores mean that the data value is below the mean, while positive z-scores mean that the data value is above the mean. The further the value is from the mean, irrespective of the sign, the more unusual the value is. *As you see, when we are standardizing data into z-scores, we are shifting them by the mean and rescaling by the standard deviation. In general, shifting data changes the center of the distribution but leaves the shape and spread unchanged. The center moves, along with other measures of position such as percentiles, the minimum, and the maximum, by the same amount. What about rescaling? You may have guessed already that when we multiply or divide every value in a data set by the same number, the shape of the distribution won't change (it will just look stretched or squeezed), but everything else will: the mean, minimum, maximum, range, IQR, and standard deviation.

A z-score, also known as a standard score, is a measure of how many standard deviations a data point is from the mean (not median) of a data set. It is calculated by subtracting the mean of the data set from the value of the data point, and then dividing the result by the standard deviation of the data set. FORMULA: z = (x - x̄) / s where z = z-score, x = a data point, x̄ = mean value, s = standard deviation For example, consider a data set with a mean of 50 and a standard deviation of 10. If a data point has a value of 70, the z-score for that data point would be calculated as follows: z-score = (70 - 50) / 10 = 2 This z-score of 2 means that the data point is 2 standard deviations above the mean of the data set. z-scores are useful for comparing values within a data set and for determining whether a value is unusual or extreme relative to the rest of the data. They can also be used to standardize data for comparison between different data sets.
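
The formula can be tried out directly. The sketch below first reproduces the worked example (mean 50, standard deviation 10, value 70) and then standardizes a small made-up data set to show that z-scores end up with mean 0 and standard deviation 1. The function name `z_score` and the data are assumptions for illustration.

```python
from statistics import mean, stdev


def z_score(x, center, spread):
    """Number of standard deviations that x lies from the mean."""
    return (x - center) / spread


# Worked example from the text: (70 - 50) / 10
print(z_score(70, 50, 10))  # 2.0

# Hypothetical data set, standardized with its own mean and sample SD.
data = [2, 4, 6, 8, 10]
x_bar = mean(data)
s = stdev(data)
z_values = [z_score(x, x_bar, s) for x in data]

print([round(z, 2) for z in z_values])
print(round(mean(z_values), 10), round(stdev(z_values), 10))
```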

Going Deeper into Data and the "W"s

As mentioned earlier, data can refer to numbers or other subjective labels, and they are useless without their context. One easy way to provide context is to answer the Ws—who, what, when, where, why (if possible), and how—of the dataset we're working with.

Bar Graphs *To create a bar graph, you first need to decide on the categories you want to include. Each category corresponds to a separate bar on the graph. The height of each bar represents the frequency or count of observations in that category. All the bars have the same width, and there is a gap between adjacent bars to distinguish them from each other.

Bar charts (or bar graphs) are used to display frequencies (counts) or relative frequencies (proportions) for categorical data. The height or length of each bar in a bar graph corresponds to either the number or proportion of observations falling within each category

For categorical variables, the choices are limited.

Bar graphs and pie charts are the most common displays when looking at data in chart-like formats.

Box Plots and Skew

Box plots can help us find important features of the distribution. The central box stretches from Q1 to Q3 and shows the middle 50% of the data. If the median (Q2) sits right in the middle of the quartiles, then the box will look symmetric. However, we should also look at the whiskers. If the whiskers have different lengths, the distribution is skewed toward the longer whisker's side. To determine whether a box plot is skewed or symmetric, you can look at the position of the median relative to the first and third quartiles. * If the median is roughly in the middle of the box, with the two halves of the box about equally wide, the distribution is symmetric. * If the median is closer to one end of the box, with the data more spread out toward the other end, the distribution is skewed.

Continuous Variables

Continuous variables take values that come in intervals. Price, weight, and age are continuous because they can assume any number within an interval. When the data values are numbers, it makes sense to find an average.

Representing a Categorical Variable with Tables

Data can be enormous and hard to understand when observed in its raw, unprocessed form. For this purpose, statistics was created to help us organize and analyze data. First, values are organized in tables; then, data are graphed in different displays. Tables are a necessary step to start analyzing data, but they may fail to highlight essential features of the data. Graphical displays are visually attractive and easy to read, and they make important patterns of the distribution easy to see.

Variables

Data is actually in plural form; it contains information about individuals or units that have characteristics, also called variables.

Statistics & Data, Data, Data

Depending on how we use data, the study of statistics is divided into two main areas: descriptive and inferential.

Dotplots *To create a dotplot, you need to plot a dot for each data value on the horizontal axis, with the position of the dot corresponding to the value of the data. You can also use different symbols or colors to distinguish between different categories or groups of data.

Dotplots are similar to stemplots, but this time each observation is represented by a dot. The position of each dot on the horizontal axis corresponds to the data value of that observation, and dots are stacked on top of each other when values are identical or nearly identical. Moreover, if you forget how to write numbers, this is the best display for you! It is simple and less time-consuming because it uses dots instead of digits. Dotplots are the first choice when we deal with a small set of data.

Example of describing categorical data

For example, a description could look like this: "Our most likely outcome was people who prefer donuts with a proportion of 0.45 and our least likely outcome was people who prefer cookies with a proportion of 0.15." Sometimes it is also beneficial with categorical data to discuss raw counts rather than proportions. However, it is more likely that the AP exam will ask you to describe a distribution of a quantitative data set rather than a categorical data set

Frequency Polygons

Frequency polygons display the distribution of quantitative data by using lines and connecting points at the midpoints of the classes for each bin. It is similar to a histogram, but instead of using bars to represent the frequencies, it uses lines that connect the points at the top of the bars. *To create a frequency polygon, you need to first create a frequency table that shows the number of observations (frequencies) for each value or interval of values in the data. The x-axis of the frequency polygon represents the values or intervals of the data, and the y-axis represents the frequencies. Then, you can plot the points at the top of each bar and connect them with lines to create the polygon. ✏️

Categorical Data

Have you ever asked a group of people whether they liked coffee? This type of survey would be an example of categorical data. The reason is that each individual chooses a category: whether they like coffee or not. Because of this separation of data, it is impossible to calculate an average coffee preference. Instead, we typically summarize categorical datasets using measures like proportions. It makes a lot more sense to make a statement like, "the proportion of people who like coffee is 0.65." Categorical data uses percentages, or proportions, to make inferences.

Quantitative Data

Have you ever wondered what the average AP score was? This is an example of quantitative data because each individual is assigned a quantity. Assigning each test taker an AP score means that each individual being measured is assigned a number. Remember: Quantitative data uses means, or averages, to make inferences!

Example of describing quantitative data

If we had a set of data regarding the number of bananas per bunch purchased, a model response may look like the following: "The mean number of bananas purchased was 5 bananas. There was one outlier when a customer purchased a bunch of 12 bananas. The shape of our data distribution was fairly symmetric. The range of bananas per bunch was 10, with the largest bunch being 12 and the smallest bunch being 2."

Describing Data

In categorical data, this process may look different. It is usually more valuable with categorical data to discuss, in context, which category was most likely to happen and which was least likely to happen.

Statistics of Spread (Interquartile Range) *However, the IQR does not capture the entire distribution of values in the data set and therefore may not fully reflect the variability of the data. Other measures such as the range, standard deviation, and variance can provide a more comprehensive view of the dispersion and variability in a data set. These measures are often used in conjunction with the IQR to provide a more complete understanding of the characteristics of a data set.

Interquartile Range (IQR) * Recall that the interquartile range (IQR) is based on the difference between the upper and lower quartiles. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a data set. IQR = upper quartile (Q3) - lower quartile (Q1). The first quartile, Q1, is the median of the half of the ordered data set from the minimum to the position of the median. The third quartile, Q3, is the median of the half of the ordered data set from the position of the median to the maximum. Q1 and Q3 form the boundaries for the middle 50% of values in an ordered data set.
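
The quartile definitions above can be sketched in code. This version assumes the common convention of leaving the median out of both halves when the number of values is odd; calculators and software sometimes use slightly different rules, so small differences in Q1 and Q3 are possible. The data values are illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: for odd n, the median itself is excluded from both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


data = [23, 28, 35, 40, 45, 65, 68, 69, 84]
q1, med, q3 = quartiles(data)
iqr = q3 - q1

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(data), q1, med, q3, max(data))
print(five_number_summary)
print("IQR =", iqr)
```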

(1) Who (Respondents, Subjects (or participants), experimental units). *It is important to consider the who of data when designing a study or analysis, as the characteristics of the units being studied can affect the results and conclusions that can be drawn.

Knowing who is involved in generating the data we have at hand provides more information about the cases (circumstances) for which (or whom) data is collected. There are a lot of ways to describe these individuals involved: 1. Respondents refer to individuals who contribute and answer surveys, providing information about themselves or their opinions on a particular topic. 2. Subjects (or participants) refer to individuals (or sometimes other types of units, such as groups or organizations) involved in experiments, where they are exposed to a treatment or intervention and the effect of that treatment is measured. 3. In addition to human subjects, data can be collected from a wide range of other types of units, such as animals, plants, or inanimate objects. These units are often referred to as experimental units.

Statistics of Center (Mean)

Mean, or average, as you learned before, is easy to calculate: we add up all the values of the variable and divide the sum by the number of values. The formula follows: x̄ = ∑x / n, where x̄ is read as "x bar"; it is the mean of the x values in the data. By the way, it doesn't need to be x; it can be y as well. Means are the best summary measures for a symmetric distribution because, as mentioned before, they are the balancing point of the distribution. However, the mean has a few drawbacks. It does not tell us about all individuals (that is why we also need summary measures of spread), and it is not resistant to outliers. The mean can easily be affected by one high value in our data set and distort our study results, leading us to make wrong decisions if we wrongly choose to report the mean instead of the median.

Statistics of Center (Median) *We only need to find its position in the ordered data, which depends on the number of values n. If n is odd, the median is the value at position (n + 1)/2. If n is even, the median is the average of the values at positions n/2 and n/2 + 1.

Median is the middle number of the ordered data. When the number of values is even, we calculate the median by finding the average of the middle two numbers. Medians are good alternatives for summarizing the center of skewed distributions or distributions with an outlier. The median is resistant to outliers. However, it is not easy to find the median from a histogram, but you don't need to do it.
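
A small sketch of finding the median by position, assuming positions are counted from 1 in the sorted data: the middle value when the count is odd, and the average of the two middle values when it is even. The function name and sample values are made up for illustration.

```python
def median_by_position(values):
    """Median found by locating the middle position(s) of the sorted data."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd n: the single middle value sits at position (n + 1) / 2.
        position = (n + 1) // 2
        return data[position - 1]  # Python lists are indexed from 0
    # Even n: average the values at positions n / 2 and n / 2 + 1.
    lower_position = n // 2
    return (data[lower_position - 1] + data[lower_position]) / 2


print(median_by_position([7, 3, 5]))     # middle of 3, 5, 7
print(median_by_position([4, 1, 3, 2]))  # average of 2 and 3
```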

Normal models

Normal models are appropriate for symmetric and unimodal distributions. The normal model has two parameters (the population mean, µ, and the population standard deviation, σ) and is often written as N(µ, σ). These parameters do not come from data but are part of the model. What variables in daily life can a normal distribution model? A lot! Here's a small sample of variables that are often treated as approximately normal: height, IQ scores, blood pressure, birth weight, body temperature, and life expectancy. Income is sometimes listed too, but real income data are usually right-skewed, so check the shape before using a normal model.

The Empirical (68-95-99.7) Rule *The EMPIRICAL RULE: "For a normal distribution, approximately 68% of the observations are within 1 standard deviation of the mean, approximately 95% of observations are within 2 standard deviations of the mean, and approximately 99.7% of observations are within 3 standard deviations of the mean."

Often we ask ourselves whether we are normal or not. If we are normal, then we should be doing about the same things as the average people do. The 68-95-99.7 rule (Empirical Rule) tells us that if we all behave normally then about 68% of the values fall within one standard deviation of the mean, about 95% of the values fall within two standard deviations of the mean, and about 99.7%—almost all—of the values fall within three standard deviations of the mean
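
The three percentages can be checked against the standard normal model. This sketch uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the variable names are arbitrary.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of the model within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.4f}")
```

The printed values are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated with "approximately."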

What is one big giveaway for quantitative data?

One big giveaway for quantitative data is that we can take the mean, or the average, of the data set. In other words, quantitative data is average-able.

Describing Data

Perhaps the biggest concept and skill of this first unit is being able to describe data. In quantitative data, this consists of four main parts: center, outliers, spread, and shape. It is also important to include context in your answer.

A Note About Outliers *Both of these methods can be useful for identifying unusual or unexpected values in a data set, but they may not be suitable for all types of data or in all situations. It is important to consider the characteristics of the data set and the goals of the analysis when deciding which method to use to identify outliers.

Previously, we've talked about what outliers are, but how do we know whether a data point is an outlier or not? There are many methods for determining outliers. Two methods frequently used in this course are:
1. Method I: 1.5 x IQR
* Using the IQR to identify outliers involves calculating the IQR for the data set and then using this value to determine which values are outside the normal range of the data.
* Specifically, values that are more than 1.5 × IQR above the third quartile (Q3) or more than 1.5 × IQR below the first quartile (Q1) are considered outliers. This method is based on the assumption that most of the values in the data set should fall within the range defined by the IQR, with only a small number of values falling outside this range.
2. Method II: Standard Deviations
* Using standard deviations to identify outliers involves calculating the mean and standard deviation for the data set and then using these values to determine which values are outside the normal range of the data. Specifically, values that are more than 2 standard deviations above or below the mean are considered outliers. This method is based on the assumption that most of the values in the data set should fall within two standard deviations of the mean, with only a small number of values falling outside this range.
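
Both methods can be sketched side by side. The data set below is made up for illustration, and the quartiles use the median-of-halves convention (median excluded from both halves when the count is odd), which is an assumption; other quartile rules can shift the fences slightly.

```python
from statistics import mean, median, stdev


def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(upper)


# Hypothetical data: bananas per bunch, invented for illustration.
data = [2, 4, 4, 5, 5, 5, 6, 6, 7, 12]

# Method I: values beyond the 1.5 x IQR fences.
q1, q3 = quartiles(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Method II: values more than 2 standard deviations from the mean.
x_bar = mean(data)
s = stdev(data)
sd_outliers = [x for x in data if abs(x - x_bar) > 2 * s]

print("fences:", lower_fence, upper_fence, "->", iqr_outliers)
print("mean +/- 2 SD:", round(x_bar - 2 * s, 2), round(x_bar + 2 * s, 2), "->", sd_outliers)
```

The two methods agree on this data set, but they do not always agree, because the mean and standard deviation are themselves pulled by the extreme value.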

Discrete Variables

Recall from your algebra class that we called numbers discrete when they were whole numbers.

Relative frequency of a category equation

Relative frequency of a category = Frequency of that category / Sum of all frequencies
Percentage = Relative frequency × 100

What's the Point of Statistics?

Some common tasks in statistics include:
1. Collecting data through surveys, experiments, or other methods
2. Describing and summarizing data using measures such as mean, median, mode, and standard deviation
3. Visualizing data using graphs and plots
4. Testing hypotheses and making inferences about population parameters based on sample data
5. Building statistical models to predict outcomes or understand relationships between variables
Statistics is a powerful tool for understanding and interpreting real-world phenomena, and it is used to inform decision-making, policy-making, and research in a variety of contexts.

Statistics of Spread (Standard Deviation) *When we change units, we are either multiplying or dividing by a conversion factor. When multiplying or dividing, this changes both the center and spread.

Standard Deviation The standard deviation is like lungs in statistics. You cannot breathe without it. You cannot analyze data without it. It shows how far or close the values are dispersed, deviated, or vary from the mean. The process of calculating standard deviation is lengthy and time-consuming, as you definitely know by now. You will mostly rely on your calculator to do it for you, but just in case, here is the formula: s = √[∑(x - x̄)² / (n - 1)]
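
For anyone curious what the calculator is doing, here is a minimal sketch of the sample standard deviation formula, checked against `statistics.stdev`. The heights are made-up values, and the inches-to-centimeters conversion is included only to illustrate how a unit change rescales the spread.

```python
import math
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: sqrt of sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical heights in inches, invented for illustration.
heights_in = [60, 62, 65, 68, 70]
s_in = sample_sd(heights_in)

# Changing units multiplies every value by 2.54, and the SD along with it.
heights_cm = [h * 2.54 for h in heights_in]
s_cm = sample_sd(heights_cm)

print(round(s_in, 3), round(stdev(heights_in), 3))
print(round(s_cm, 3), round(2.54 * s_in, 3))
```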

Summary Statistics for a Quantitative Variable

A statistic is a measure taken from a sample to help us analyze the data. Meanwhile, a parameter is a measure taken from the population. In inferential statistics, we will use statistics to make inferences about parameters. Mean, median, standard deviation, IQR, and range are all summary statistics for a quantitative variable.

What is Statistics all about?

Statistics is all about data. We collect sets of data, analyze our data and ultimately, use our data sets to make inferences about larger sets of individuals in our population.

Stem-and-Leaf Plots (Stemplots) *To create a stem-and-leaf plot, you need to first split each data value into a "stem" (the first digit or digits) and a "leaf" (usually the last digit). For two-digit values, the stems represent the tens digit of each value, and the leaves represent the units digit. Then, you can arrange the stems and leaves in a table, with the stems on the left and the leaves on the right. Whenever you make a stemplot, don't forget to provide a key so the reader knows how to read it with the appropriate context in mind!

Stem-and-leaf plots are a simple graphical representation of the distribution of a quantitative variable. They are similar to histograms, in that they show the distribution of the data, but they differ in that they preserve the individual values of the data (a histogram relies on grouped data, thus losing the individual values in the bins). Here is an example of how to create a stem-and-leaf plot for the data values 23, 28, 35, 40, 45, 65, 68, 69, and 84:
2 | 3 8
3 | 5
4 | 0 5
6 | 5 8 9
8 | 4
In this example, the stem "2" represents the values 20-29, and the leaf "3" next to it represents the value 23. Similarly, the stem "4" represents the values 40-49, and the leaf "5" next to it represents the value 45. YOU NEED A KEY! For instance: 2 | 3 means 23. 💡 TIP: Turn the stem-and-leaf plot on its side to spot any unusual features in the data.
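
The splitting into stems and leaves can be sketched as follows for non-negative whole numbers, with the tens (and higher) digits as the stem and the units digit as the leaf. Like the example above, this sketch lists only the stems that occur in the data; the function name is made up for illustration.

```python
from collections import defaultdict


def stemplot_lines(values):
    """Build the rows of a stemplot: stem = value // 10, leaf = value % 10."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


data = [23, 28, 35, 40, 45, 65, 68, 69, 84]
for line in stemplot_lines(data):
    print(line)
print("Key: 2 | 3 means 23")
```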

Spread *In general, it is a good practice to report both the center and spread of a dataset when describing its statistical properties. In symmetric distributions, it is common to report the mean with the standard deviation, while in skewed distributions, it is common to report the median with the IQR. This allows you to provide a more complete description of the distribution of the data and how it is dispersed around the center.

The center is a good measure, but it's definitely not perfect if we don't report it with the spread. There are several measures that can be used to describe the spread or dispersion of a dataset, including the range, standard deviation, and interquartile range (IQR). 🕸️
1. The range is calculated by subtracting the minimum value in a dataset from the maximum value. While it can be a useful measure in some cases, it has the disadvantage of not taking into account the values of all of the data points, only the maximum and minimum values. As a result, it may not accurately reflect the true variability in the data.
2. The standard deviation measures the dispersion of a dataset around the mean. It is calculated by taking the square root of the variance, which is essentially the average of the squared differences between each value in the dataset and the mean (for a sample, the sum of squared differences is divided by n - 1). The standard deviation is a useful measure for symmetric distributions because it takes into account all of the values in the dataset and reflects the overall pattern of the data.
3. The interquartile range (IQR) is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). The first quartile is the value that divides the bottom 25% of the data from the top 75%, while the third quartile is the value that divides the bottom 75% of the data from the top 25%. The IQR is often used to describe the spread of skewed distributions or datasets with outliers because it is resistant to the influence of these values.

Extension: Relative Frequency Tables

The concept of frequency tables can be extended using relative frequencies and percentages. 1. The relative frequency is found by dividing the frequency for each category by the sum of all frequencies. 2. The percentage is obtained by multiplying the relative frequency of category by 100.
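
The two steps can be sketched as follows, assuming a small frequency table of made-up counts for the three stress categories (the counts are hypothetical, not survey results).

```python
# Hypothetical frequencies, invented for illustration.
frequencies = {"very": 12, "somewhat": 13, "none": 5}

total = sum(frequencies.values())

# Step 1: relative frequency = frequency / sum of all frequencies.
relative = {category: count / total for category, count in frequencies.items()}

# Step 2: percentage = relative frequency * 100.
percentage = {category: rf * 100 for category, rf in relative.items()}

print(f"{'Category':<10}{'Freq':>6}{'Rel. freq':>11}{'Percent':>9}")
for category, count in frequencies.items():
    print(f"{category:<10}{count:>6}{relative[category]:>11.3f}{percentage[category]:>8.1f}%")
```

A quick check on any relative frequency table is that the relative frequencies add up to 1 (and the percentages to 100), apart from rounding.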

(5) How *It is important to carefully consider the how of data collection when designing a study or analysis, in order to ensure that the data is of sufficient quality and reliability to support the research question and conclusions.

The how of data collection refers to the methods or techniques that are used to collect the data, and it can have a significant impact on the quality and reliability of the data. There are many different methods for collecting data, including surveys, experiments, observations, and secondary data sources. Each method has its own strengths and limitations, and it is important to choose the most appropriate method for the research question being addressed. For example, Internet surveys can be a convenient and cost-effective way to collect data from a large number of respondents, but they may also be unreliable due to biases, such as nonresponse bias (where certain groups are more or less likely to respond to the survey) or response bias (where the responses are not accurate or honest).

How do we organize and display quantitative data?

The main displays we will discuss are histograms, frequency polygons, ogives, stem-and-leaf plots, and dotplots.

Resistant and Nonresistant Measures

The mean, standard deviation, and range are considered nonresistant (or non-robust) because they are influenced by outliers. The median and IQR are considered resistant (or robust), because outliers do not greatly (if at all) affect their value. For these reasons, the median and IQR are often preferred to the mean, standard deviation, and range when working with data sets that may contain outliers. They are more robust and provide a more accurate representation of the center and spread of the data, even in the presence of extreme values.
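To see resistance in action, the sketch below summarizes the same made-up dataset twice: once as is, and once with its largest value replaced by an extreme outlier. The data and the helper name `summarize` are hypothetical, and the quartiles use the default convention of `statistics.quantiles`.

```python
import statistics

def summarize(values):
    """Return common measures of center and spread for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: identical except for the last value.
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 180]

before = summarize(clean)
after = summarize(with_outlier)

for name in before:
    print(f"{name:<6} {before[name]:>8.2f} {after[name]:>8.2f}")
```

In this example the median and IQR are identical in both versions, while the mean, standard deviation, and range all change substantially.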

(3) When and Where *Both the when and where of data can be important considerations when interpreting the results of a study or analysis. It is important to carefully consider the context in which the data was collected, as it can help to better understand the meaning and implications of the results.

The more we know about the context, the more we'll understand about the data we have! This is where the when and where of our data come in. 1. The when refers to the time at which the data was collected, which can have an impact on the values that are recorded. For example, values recorded at different points in time may reflect different trends or patterns. 2. The where of data refers to the location where the data was collected, which can also have an impact on the values that are recorded. For example, values recorded in different geographical locations may reflect different social, cultural, or economic factors.

The variables can be measured at different levels: nominal, ordinal, interval, and ratio

The qualitative levels are nominal and ordinal. The difference between the two is that ordinal data have a meaningful order, but nominal data do not. For example, the type of industry is a nominal variable, because its values (such as transportation) are simply names. The satisfaction level of customers, by contrast, is ordinal, because it can be ranked from most to least satisfied. The difference between interval and ratio is that interval-level measurement ranks data on a numerical scale but has no meaningful 0, whereas ratio-level measurement has a meaningful 0.

(4) Why *Answering these types of questions can help us to better understand the data and draw meaningful conclusions. It is important to carefully consider the questions that we want to answer when designing a study or analysis, in order to ensure that the appropriate data is collected and analyzed.

The questions that we ask of a variable, or the why of our analysis, shape how we think about and approach the variable. The questions we ask can influence the way we define and measure the variable, as well as the type of statistical analysis that we use to analyze the data. For example, if we are interested in understanding the relationship between two variables (say, amount of sleep and test scores), we might ask questions such as: 1. Is there a relationship between the two variables? 2. If there is a relationship, what is the nature of the relationship (e.g. positive, negative, or no relationship)? 3. Is the relationship statistically significant, or could it have occurred by chance?

Standard Normal Model *The standard normal model, as well as other normal distributions, are based on the assumption that the data follows a symmetrical, bell-shaped curve. In order for the standard normal model or other normal distributions to be a good model for the data, the data must be approximately symmetric and unimodal.

The standard normal model has a symmetrical bell-shaped curve, with the mean at 0 and the standard deviation at 1. Z-scores are based on the standard normal model. When working with data that is normally distributed, it is often helpful to standardize the data by converting the values to z-scores, which allows for easier comparison and analysis. This standard model can be written as N(0,1), and to standardize, we subtract the mean from each value and rescale by the standard deviation: z = (x - x̄) / s. If the data is not symmetric or is multimodal (i.e. has multiple peaks), then the standard normal model or other normal distributions may not be a good fit for the data. In such cases, it may be necessary to use a different statistical model or transform the data in some way to make it more suitable for analysis. To check whether the data is approximately symmetric and unimodal, it is common to look at the histogram of the data or create a normal probability plot. A histogram should show a roughly symmetrical distribution, with a single peak in the middle. Don't model data with a Normal model without checking the "Nearly Normal" Condition.
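A minimal sketch of standardizing with z = (x - x̄) / s is shown below. The scores are hypothetical, `z_scores` is an illustrative helper name, and s is taken to be the sample standard deviation.

```python
import statistics

def z_scores(values):
    """Standardize each value: z = (x - mean) / sample standard deviation."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - x_bar) / s for x in values]

# Hypothetical test scores.
scores = [70, 75, 80, 85, 90]
zs = z_scores(scores)

for x, z in zip(scores, zs):
    print(f"score {x}: z = {z:+.2f}")
```

After standardizing, the z-scores have mean 0 and standard deviation 1. Standardizing only shifts and rescales the data; it does not change the shape of the distribution, so the "Nearly Normal" check still has to be done on the data itself.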

Elements & Observation

Statisticians call the students elements and the score of each student an observation. These observations were part of the teacher's assessment, and she needs to use these data to analyze the content she taught.

Suppose the statistics class just had a test. The teacher checked and recorded the test scores of students.

The test scores represent numbers that, in statistical terms, are called data, and the whole set of numbers of the students is called a data set. But these numbers are meaningless if we don't know what they measure and whom they were measured on. Since we know that these are the test scores for the students enrolled in statistics class, these numbers may convey important information about class performance, test difficulty, students' abilities, content knowledge, and even testing environment if placed in context.

Data

The values that variables assume are called data. Since the variables can be categorical or quantitative, data can also be divided into categorical and quantitative.

Center *In a symmetric distribution, the mean, median, and mode are often close to each other or even equal, depending on the exact shape of the distribution. However, in skewed distributions or datasets with outliers, the mean, median, and mode can be significantly different from each other. It's important to consider which measure of central tendency is most appropriate for a given dataset, taking into account the symmetry or skewness of the data as well as the presence of outliers.

There are three commonly used measures of the "center" of a distribution: mean, median, and mode. 1. The mean, also known as the average, is calculated by summing all of the values in a dataset and dividing by the number of values. It is often considered the best measure of central tendency for symmetric distributions because it takes into account all of the values in the dataset and reflects the overall trend in the data. 2. The median is the middle value in a dataset when the values are ordered from least to greatest. It is often considered a better measure of central tendency for skewed distributions because it is resistant to the influence of outliers (values that are significantly different from the majority of the other values in the dataset). 3. The mode is the value that occurs most frequently in a dataset. It is a useful measure of central tendency when there are a few values that occur much more frequently than the others.
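The three measures can be computed with Python's standard `statistics` module, as in the sketch below. The dataset is made up and mildly right-skewed, so the mean comes out above the median.

```python
import statistics

# Hypothetical, mildly right-skewed dataset.
data = [3, 5, 5, 6, 8, 9, 13]

mean = statistics.mean(data)        # sum of values divided by the count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of the most frequent value(s)

print(f"mean = {mean}, median = {median}, mode(s) = {modes}")
```

`multimode` returns a list because a dataset can have more than one mode.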

Describing the Distribution of a Quantitative Variable

There are three things that we should look for when trying to find trends and patterns: shape, center, and spread.

Univariate data (or one-variable data)

This is data that only has one aspect of it that is being measured. Among our sets of univariate data, we will divide our data sets into two different types: quantitative and categorical.

Shape *It's worth noting that a histogram does not have to be perfectly symmetrical to be considered symmetric. Some degree of skewness or asymmetry is often present in real-world data, and a histogram may still be considered symmetric if the degree of asymmetry is relatively small. *Another way to think about skewness is that there are two types of skewness: positive skewness and negative skewness. Positive skewness occurs when the distribution is skewed to the right, with a long tail on the right side and a shorter tail on the left side. On the other hand, negative skewness occurs when the distribution is skewed to the left, with a long tail on the left side and a shorter tail on the right side. *A distribution can have one mode, in which case it is called a unimodal distribution, or it can have two or more modes, in which case it is called a multimodal distribution. A bimodal distribution, for example, is a distribution with two modes. Uniform distributions, on the other hand, do not have a mode because all of the values in the distribution occur with roughly the same frequency. In a uniform distribution, there is no single value that stands out as being more common than any other value. *It's important to be aware of outliers in a dataset because they can skew the results of statistical analyses and cause them to be less representative of the underlying data. For this reason, it is often useful to analyze data both with and without outliers to see how they affect the results. There are a few different ways to identify outliers in a dataset. One way is to use graphical methods such as boxplots, which can help you visualize the distribution of the data and identify any values that are significantly different from the rest of the data. You can also use statistical measures such as the mean and standard deviation to identify outliers.

To describe the shape of the display, check the following: 1. Symmetry. If you fold the histogram in half, do you have roughly equal amounts of data on each side? If yes, then your data are symmetric! Think of the shape of a butterfly and what happens when you "fold" it in half. 2. Skewness. A distribution can be right-skewed or left-skewed: unusually low or high values stretch the distribution toward their side and produce a tail. A skewed distribution has one tail longer than the other, whereas a symmetric distribution has roughly equal tails. If the longer tail is on the left side, the distribution is called left-skewed; if the longer tail is on the right side, it is called right-skewed. 3. Peaks (modes). A mode represents the most frequent value or values in a distribution. In a histogram, stemplot, dotplot, or other graphical representation of data (except for boxplots), the mode is often indicated by the peak or peaks in the distribution. It's worth noting that a distribution can have a mode even if it is not symmetrical or has skewed data. For example, a positively skewed distribution (with a long tail on the right side) can still have a mode if there is a value or values that occur more frequently than any other values in the distribution. It's important to be aware of the number and location of modes in a distribution because they can provide valuable insights into the underlying data and how it is distributed. For example, the presence of two modes in a distribution may indicate the presence of two distinct groups or subpopulations within the data. 4. Outliers. Beware of outliers. Outliers are values in a dataset that are significantly different from the majority of the other values in the dataset. They can be either extremely high or extremely low, and they can have a significant impact on nonresistant statistical measures such as the mean, standard deviation, and range of the data. 5. Gaps. Gaps in data help us detect multiple modes and warn us that the data may come from different groups or sources.
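One common way to flag possible outliers numerically is the 1.5 × IQR rule, which is the convention usually used to draw boxplot whiskers. The text above does not name this rule, so treat the sketch below as one reasonable convention rather than the only definition. The data and the helper name `iqr_outliers` are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Hypothetical data with one extreme value.
data = [10, 11, 12, 13, 14, 15, 16, 17, 180]
flagged = iqr_outliers(data)
print(flagged)
```

A flagged value is only a candidate outlier; whether to keep it, correct it, or report results with and without it depends on the context of the data.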

EXAMPLE: You have taken 5 exams in your math class and you want to know your average score. The scores on the exams are as follows: Exam 1: 80 Exam 2: 90 Exam 3: 70 Exam 4: 85 Exam 5: 75 How do you find the average exam score?

To find the average, you need to add up all of the exam scores and then divide by the total number of exams. In this case, the total score is 80 + 90 + 70 + 85 + 75 = 400, and the total number of exams is 5. Therefore, the average exam score is 400 / 5 = 80; in this example, your average exam score is 80.
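The same arithmetic can be checked with a few lines of Python, using the five exam scores from the example. The variable names are placeholders.

```python
# Exam scores from the example above.
exam_scores = {"Exam 1": 80, "Exam 2": 90, "Exam 3": 70,
               "Exam 4": 85, "Exam 5": 75}

total = sum(exam_scores.values())   # add up all of the scores
average = total / len(exam_scores)  # divide by the number of exams

print(f"total = {total}, average = {average}")
```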

Real-Life Applications: To Trust or Not To Trust a Bar/Pie Chart?

To help inform whether bar/pie charts are reliable or not, here are examples of ways they are commonly misused: 1. Using bar/pie charts to compare variables on different scales: Charts are best used to compare categories or groups that are on the same scale. If you are comparing variables that are on different scales, it can be difficult to accurately compare the sizes of the bars/pie slices. 2. Using bar/pie charts to show continuous data: Charts are best used to show categorical data, not continuous data. If you have continuous data, it is usually better to use a different type of graph, such as a line graph or scatterplot. 3. Using bar/pie charts to show small differences: Charts are not very effective at showing small differences between categories. If the differences between the categories are small, it may be difficult to accurately interpret the graph. 4. Using bar/pie charts to show trends over time: Charts are not well suited for showing trends over time. For this purpose, it is usually better to use a line graph or a time series plot. 5. Using bar/pie charts to show more than two variables: Charts are typically used to compare two variables. If you want to show more than two variables, it is usually better to use a different type of graph. The example below compares A, B, and C; here, you can see that it might make more sense to use a bar chart over a pie chart. 6. Using bar/pie charts to show a false impression of size: Truncated bar graphs (bar graphs that don't start at a y-value of 0) can be misleading if the truncation is not clearly labeled or if the truncation is done in a way that distorts the data. For example, starting the axis at an arbitrary value can make small differences between categories look much larger than they really are.

How can categorical variables be represented?

Using tables and/or graphs: Bar Graphs, Pie Charts, Contingency Table (Two-Way Table).

(2) What (Variables, Dependent Variables, Independent Variables, Controlled Variables) *It is important to carefully consider the variables that will be measured in a study, as they will determine the questions that can be answered and the conclusions that can be drawn. It is also important to ensure that the variables are accurately and consistently measured, in order to ensure the validity and reliability of the study

Variables are characteristics or attributes that are measured or observed for each individual in a study. The variables should have a name that clearly identifies what has been measured, so that the data collected can be easily understood and analyzed. There are different types of variables, including: 1. Dependent variables: These are the variables that are being measured or observed in a study. The value of the dependent variable is thought to depend on the value of one or more independent variables. 2. Independent variables: These are the variables that are being manipulated or controlled in a study. The value of the independent variable is thought to influence the value of the dependent variable. 3. Controlled variables: These are variables that are kept constant or controlled in a study, in order to eliminate their influence on the dependent variable.

Descriptive statistics

We describe a situation by collecting, organizing, summarizing, and presenting the data.

Going Deeper: Categorical vs. Quantitative Variables

We established that variables refer to characteristics that change from one individual to another: age group, dominant hand, height, you name it! In statistics, one of the ways variables can be classified is as categorical or quantitative. 1. Categorical variables are variables that can be placed into categories or groups. Their values are labels rather than numerical measurements. Some categorical variables have no natural order (nominal), while others can be ranked (ordinal). Examples: gender, race, and marital status. 2. Quantitative variables are variables that can be measured or counted and have a numerical value. These variables can be either continuous or discrete. Continuous quantitative variables can take on any value within a given range, such as height or weight. Discrete quantitative variables can only take on certain values, such as the number of children in a household or the number of times a person has been hospitalized.

Inferential Statistics

We try to make an inference from our collected data to populations by generalizing, estimating, testing, and making predictions.

Categorical Variables

When a variable assumes values that are attributes, we call the variable categorical and its data categorical data: for example, the colors of cars or the names of states, districts, and countries. The values for colors of cars may range from white to black, including any color you may see on the street. Here's a list of categorical variables: Gender (male or female); Race (white, black, Hispanic, etc.); Marital status (single, married, etc.); Employment status (employed, unemployed, self-employed, etc.); Education level (high school, associate's degree, bachelor's degree, etc.); Political party (Republican, Democrat, Independent, etc.); Religion (Christian, Muslim, Hindu, etc.); Eye color (blue, brown, green, etc.); Hair color (blonde, brunette, red, etc.); Birthplace (United States, Canada, Mexico, etc.).

Quantitative Variables

When we measure a characteristic that results in numerical values, we are dealing with a quantitative variable and, in turn, with quantitative data: for example, the number of days, the price of a product, or the age of individuals. Quantitative data are divided further into two types: discrete and continuous. Here's a list of some quantitative variables: Age (8, 16, 34, etc.); Height (180 cm, 5'2", 2 meters, etc.); Weight; Income; Body mass index (BMI); Blood pressure; Heart rate; Hours of sleep (a controversial one for teens); Distance traveled; Number of siblings.

Relative frequency table * A relative frequency table is similar to a frequency table, but it shows the relative frequency of each value in the data set. This means that the frequencies are expressed as a proportion of the total number of values in the data set.

With these quantities in mind, a relative frequency table is similar to a frequency table, but it gives the percentage for each category instead of the count. Based on the relative frequency and percentage distributions of stress on the job, we can state that 33.3% of the employees answered that their jobs are very stressful. Another way to interpret the data is to combine the groups "very" and "somewhat" stressed and report that 80% of the employees answered that their jobs are very or somewhat stressful.

Normal Model: More than Just a Hump

You may have learned about "normal" models or bell-shaped curves in your Algebra class and through calculus. Some sets of data may be described as approximately normally distributed. A normal curve is mound-shaped and symmetric.

Cumulative graphs (Ogives)

A cumulative graph, also known as a cumulative frequency plot or an ogive, is a graphical representation of a cumulative distribution. It is used to show the number or proportion of a data set that is less than or equal to a given value. The cumulative frequency is the running total of the frequencies across the classes. Ogives help us determine the position of data to see how many values are below or above a certain value. *To create a cumulative graph, you need to first create a cumulative frequency table that shows the number or proportion of observations that are less than or equal to each value or interval of values in the data. The x-axis of the cumulative graph represents the values or intervals of the data, and the y-axis represents the cumulative frequencies. Then, you can plot the points and connect them with a line to create the cumulative graph.
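The cumulative frequency table behind an ogive can be built with a running total, as in the sketch below. The class boundaries and frequencies are invented for illustration, and plotting is left out so the example needs only the standard library.

```python
from itertools import accumulate

# Hypothetical grouped data: upper boundary of each class and its frequency.
upper_bounds = [60, 70, 80, 90, 100]
frequencies = [2, 5, 8, 4, 1]

# Running total of the frequencies gives the cumulative frequencies.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Cumulative relative frequency: proportion at or below each boundary.
cumulative_rel = [c / total for c in cumulative]

# Points to plot: x = upper class boundary, y = cumulative frequency.
ogive_points = list(zip(upper_bounds, cumulative))

for bound, cum, rel in zip(upper_bounds, cumulative, cumulative_rel):
    print(f"<= {bound:>3}: {cum:>2} values ({rel:.0%})")
```

Connecting `ogive_points` with line segments gives the ogive. The last cumulative frequency always equals the total number of observations, which is a useful check on the table.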

