Lecture and Lab Set on Descriptive Statistics

What is the mean? What is the difference between the sample and population means?

- The mean is the sum of the values divided by the number of values.
- The sample mean is the mean of the observations in a sample. Its symbol is x̄ (x-bar).
- The population mean is the mean of every value in the population. Its symbol is μ (mu).
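
As a quick check of the definition, here is a minimal sketch that computes a mean by hand and with R's built-in mean(). The vector x is made-up illustration data, not from the lecture.

```r
# Hypothetical sample of five observations
x <- c(4, 8, 15, 16, 23)

# Mean "by hand": sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)

print(manual_mean)
print(builtin_mean)
```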

What is an expression in R? What happens with stand-alone expressions?

- An expression is composed of numbers, variables, functions, and other objects connected by mathematical operators. R evaluates it following the usual order of operations (PEMDAS).
- Stand-alone expressions that AREN'T assigned to a variable are displayed but NOT saved.

What is a dataframe, broadly speaking?

- Dataframes, unlike lists and vectors, are TWO DIMENSIONAL. They have rows and columns, so they resemble a table.
- A dataframe is an object made up of other objects, which are typically vectors.

What are the two main ways we want to summarize data? Describe them.

- Measures of central tendency: describe the center of a data distribution.
- Measures of dispersion: describe the spread of a data distribution.
Both are forms of summary. They help us BETTER UNDERSTAND our data from a broad point of view.

How do we see which built-in datasets are available in R?

data()

What is the basic format of functions?

functionname(argument1, argument2,...)

What is the population standard deviation? (σ)

It's the square root of the average of the squared deviations about the population mean MU.
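
A minimal sketch of this definition, assuming the vector x below is the entire population (the numbers are made up). Note that R's built-in sd() divides by n - 1, so it gives the sample version, not this one.

```r
# Hypothetical data, treated here as a whole population
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu <- mean(x)                      # population mean
pop_sd <- sqrt(mean((x - mu)^2))   # square root of the average squared deviation

print(pop_sd)
print(sd(x))   # sample standard deviation (n - 1 divisor), slightly larger
```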

What are vectors?

A vector in R is an ordered collection of numbers, text, or logical values (TRUE/FALSE). MUST ALL BE THE SAME DATA TYPE!! Vectors are one dimensional.

What is weighted mean? What does "weight" mean here?

A weighted mean is a mean in which individual data values contribute unequally to the result. Ex: GPA. w_i is the weight of an observation (let's say a 3-credit class). x_i is the value of that observation (like a 3.5 grade). So, we must ADD the PRODUCTS of w_i and x_i, not just the x_i themselves. Then we DIVIDE by the sum of the w_i: weighted mean = sum(w_i * x_i) / sum(w_i).
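
The GPA example can be sketched as follows. The grades and credit hours are hypothetical. weighted.mean() is a base R function; the manual line shows the same sum-of-products calculation.

```r
# Hypothetical grades (x_i) and credit hours (w_i) for three classes
grades  <- c(3.5, 4.0, 3.0)
credits <- c(3, 4, 1)

# By hand: sum of the products divided by the sum of the weights
gpa_manual <- sum(credits * grades) / sum(credits)

# Built-in equivalent
gpa_builtin <- weighted.mean(grades, credits)

print(gpa_manual)
print(gpa_builtin)
```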

What is the interquartile range?

The difference between the 3rd quartile and the 1st quartile: IQR = Q3 - Q1. It covers the middle 50% of the data.
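
A short sketch with made-up data, computing Q3 minus Q1 directly and with base R's IQR(). One caveat: quantile() supports several quartile rules, and its default may give slightly different quartiles than a hand calculation from class.

```r
# Hypothetical data
x <- c(1, 3, 5, 7, 9, 11, 13, 15)

# First and third quartiles using R's default quantile rule
quartiles <- quantile(x, probs = c(0.25, 0.75))

# IQR by hand: third quartile minus first quartile
iqr_manual <- unname(quartiles[2] - quartiles[1])

# Built-in equivalent
iqr_builtin <- IQR(x)

print(iqr_manual)
print(iqr_builtin)
```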

What is the median? How do we go about determining the median?

The median is the middle value from a set of ordered values - it is the VALUE with an EQUAL NUMBER of data above and below it. How do we determine the median?
1. Order data from smallest to largest FIRST.
2. If the number of values (n) is odd, the median is the middle number in the list of ordered values. Essentially, it should be at position (n+1)/2.
3. BUT, if the number of values (n) is even, then our median is the average of the middle two values, at positions n/2 and n/2 + 1.
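
The steps above can be sketched as a small function. manual_median is a hypothetical helper name written for illustration; base R's median() does the same job.

```r
# Hypothetical helper that follows the odd/even rule step by step
manual_median <- function(values) {
  sorted <- sort(values)          # step 1: order smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]           # odd n: the middle value
  } else {
    mean(sorted[c(n / 2, n / 2 + 1)])  # even n: average of the middle two
  }
}

odd_data  <- c(7, 1, 5, 3, 9)
even_data <- c(7, 1, 5, 3)

print(manual_median(odd_data))
print(manual_median(even_data))
```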

What is the scan() function, and why would we use it?

The scan() function is for when we have a long list of data we want to type in one value at a time. To enter a vector, we type scan() (usually assigning the result, as in x <- scan()) and then type all of our data - after each data point we hit the return key. To finish the vector, we hit the return key on an empty line, which means hitting it twice in a row.

What is absolute frequency?

This is the LITERAL COUNT of how many times data falls in each category in our histogram.

What is relative frequency?

This is the proportion (often shown as a percentage) of the count in a category RELATIVE to the total count across all categories. For instance, if the 11-20 years group has 3 people and there are 20 people across all categories, then the relative frequency would be 3/20 = 0.15, or 15%.
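
A minimal sketch of absolute and relative frequency using base R's table() and prop.table(). The age groups and counts are made up to match the 3-out-of-20 example.

```r
# Hypothetical data: 20 people in four age groups (3 of them aged 11-20)
age_group <- rep(c("0-10", "11-20", "21-30", "31-40"), times = c(5, 3, 8, 4))

abs_freq <- table(age_group)        # absolute frequency: counts per category
rel_freq <- prop.table(abs_freq)    # relative frequency: count / total count

print(abs_freq)
print(rel_freq)
```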

How do we import a csv file into R? What do we need to know before we do this?

To import a .csv file into R, we have to know the location of the file on our computer. To identify the file path on a Mac, we can right-click on the file and go to "Get Info". Once we know the file path, we use the read.csv() function to read the data into R as a new object:
>>> dc <- read.csv("file_path")

How do we run functions in R?

To run a function, we type the name of the function followed by a set of parentheses. Inside parentheses, we put the object or number that we want to run the function on - essentially, our argument. Our arguments are what we put inside parentheses and they're what tell R how we want to run the function. We can also put the result of the function into an object!

How do we view a list in R?

Using the View() function - importantly, this function has a capital V. We can also use this function with vectors, data frames, and other objects.

How do we assign a range of numbers to a new variable?

We create a sequence of numbers with the colon operator and store it with the assignment operator. For instance, if we wanted to assign the whole numbers from 1 to 20 to a new variable, we could simply do:
>>> x <- 1:20
The result is a vector.

How do we access parts of a list?

We can access parts of a list using bracket syntax, much like in Python. The main difference is that R uses indexing that starts at 1. mylist[2] returns a smaller list containing only the second item, while mylist[[2]] (double brackets) returns the second item itself. We can also use the $ operator. In R, $ is a shortcut for referencing a named part of a list or dataset. WE MUST WRITE THE ACTUAL NAME OF THE ELEMENT (OR COLUMN) HERE THOUGH. mylist$hisage would access the value stored under the name hisage.
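
A short sketch of the three access styles, reusing the mylist example from this set. It shows that single brackets give back a list, while double brackets and $ give back the stored value.

```r
mylist <- list(hisname = "Steve", hername = "Sue", hisage = 40, herage = 39)

second_as_list <- mylist[2]     # single brackets: a list of length 1
second_value   <- mylist[[2]]   # double brackets: the value itself
his_age        <- mylist$hisage # $ with the element's name

print(second_as_list)
print(second_value)
print(his_age)
```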

How do we access individual vectors in a data frame?

We can access the individual vectors in a data frame using two key strategies. Let's say we were working with the following data frame:
>>> xy <- data.frame(numbers=x, letters=y)
>>> The index number: xy[[1]] would access the x vector. (xy[1], with single brackets, returns a one-column data frame instead.)
>>> The "$" symbol: xy$numbers would also access the x vector by its column name.
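
A minimal sketch of column access on the xy data frame. The contents of x and y are made up here, since the set does not define them.

```r
# Hypothetical contents for x and y
x <- 1:3
y <- c("a", "b", "c")
xy <- data.frame(numbers = x, letters = y)

col_as_df     <- xy[1]        # single brackets: a one-column data frame
col_by_index  <- xy[[1]]      # double brackets: the vector itself
col_by_name   <- xy$numbers   # $ with the column name: the vector itself

print(col_as_df)
print(col_by_index)
print(col_by_name)
```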

How do we load a dataset called BOD into R, and how do we use the help function to learn more about this dataset?

We can load a dataset called BOD using the data() function again - make sure to put the dataset name in QUOTES:
>>> data("BOD")
Once this function loads the dataset in, we can explore this dataset using the help function WITHOUT QUOTATIONS:
>>> help(BOD)

How do we report back individual elements of a dataframe?

We can report back individual elements of a dataframe by referencing both the row and column number. The syntax looks something like this:
>>> xy <- data.frame(col1 = x, col2 = y)
>>> xy[1,2] would access the value in the first ROW and second COLUMN.
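
A small sketch of row-and-column indexing with made-up numeric vectors for x and y. Leaving the row or the column blank is a related base R idiom that returns the whole column or row.

```r
# Hypothetical contents for x and y
x <- c(10, 20, 30)
y <- c(1.5, 2.5, 3.5)
xy <- data.frame(col1 = x, col2 = y)

value      <- xy[1, 2]   # first row, second column
first_row  <- xy[1, ]    # column left blank: every column of row 1
second_col <- xy[, 2]    # row left blank: every row of column 2

print(value)
print(first_row)
print(second_col)
```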

How do we give the columns of a data frame different names?

We can simply modify the "key=value" pairs that we use when creating/reassigning the data.frame. For instance, if I wanted to give a vector called "x" the name "numbers" and a vector called "y" the name "letters", I could do something like this: xy <- data.frame(numbers=x, letters=y)

What do we have to code in when we're dealing with R notebooks?

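We must code in chunks. In RStudio, we can add a chunk with the Insert button (the green "C" icon with a plus sign) on the editor toolbar, or with the keyboard shortcut Ctrl+Alt+I (Cmd+Option+I on a Mac).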

How do we go about coding a data frame?

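We use data.frame() and pass in vectors. THESE VECTORS SHOULD HAVE THE SAME NUMBER OF ELEMENTS, because every column of a data frame has the same number of rows. The one exception is that R recycles a shorter vector when its length divides evenly into the longest one (for example, a single value repeated down every row). Any other mismatch is an error.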
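
One nuance worth checking for yourself: data.frame() recycles a shorter vector when its length divides evenly into the longest one, and raises an error otherwise. This is a minimal sketch with made-up vectors. tryCatch() is used only so the script keeps running after the error.

```r
# Equal lengths: the normal case
equal <- data.frame(numbers = 1:4, letters = c("a", "b", "c", "d"))

# Length 2 divides evenly into length 4, so the shorter vector is recycled
recycled <- data.frame(numbers = 1:4, group = c("a", "b"))

# Length 3 does not divide evenly into 4, so data.frame() raises an error
mismatch <- tryCatch(
  data.frame(numbers = 1:4, other = 1:3),
  error = function(e) conditionMessage(e)
)

print(recycled)
print(mismatch)   # the error message text
```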

What is descriptive statistics, broadly speaking? Why would we want to use descriptive statistics?

- With descriptive statistics, we have a numerical or quantitative summary of the characteristics of a data set. - We want to work with descriptive statistics because a simple summary is more efficient and effective than a large set of values.

How do we find the Pth quantile?

1. We have to put the data in order, from smallest to largest.
2. We have to compute n*p, where n is the number of data values and p is the quantile written as a proportion (0.25 for the first quartile).
3. If n*p is an integer, then the pth quantile is the AVERAGE of the (n*p)th and the (n*p+1)th number in the list. For example, let's say we have 12 numbers and we want to split them into quartiles. For the first quartile, which is 25%, we do 12 * .25 = 3. Since 3 is an integer, we need to take the VALUE at position 3 and the VALUE at position (3+1) = 4 and then average THOSE VALUES.
4. Then again, if n*p ISN'T an integer, we simply round up and use the number that occurs at that place in the list. For instance, let's say we were working with 13 numbers and wanted to split them into quartiles. 13 * .25 = 3.25. Since 3.25 is NOT an integer, we simply round this up to 4 and use the VALUE at position 4.
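
The procedure above can be sketched as a function. manual_quantile is a hypothetical helper, the data are made up, and the sketch assumes p is strictly between 0 and 1. As far as I can tell, this rule matches type = 2 in base R's quantile(); the default type uses interpolation and can give different answers.

```r
# Hypothetical helper implementing the n*p rule (assumes 0 < p < 1)
manual_quantile <- function(values, p) {
  sorted <- sort(values)              # step 1: order the data
  pos <- length(sorted) * p           # step 2: compute n * p
  if (pos == floor(pos)) {
    mean(sorted[c(pos, pos + 1)])     # step 3: integer, so average two values
  } else {
    sorted[ceiling(pos)]              # step 4: not an integer, so round up
  }
}

twelve   <- seq(10, 120, by = 10)   # 12 made-up values: 10, 20, ..., 120
thirteen <- seq(10, 130, by = 10)   # 13 made-up values: 10, 20, ..., 130

q1_twelve   <- manual_quantile(twelve, 0.25)    # averages positions 3 and 4
q1_thirteen <- manual_quantile(thirteen, 0.25)  # uses position 4

print(q1_twelve)
print(q1_thirteen)
print(quantile(twelve, 0.25, type = 2))   # for comparison
```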

What is a boxplot? What does it span, and what's drawn through it? What do the "whiskers" represent, and what does it summarize?

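A boxplot is a type of dispersion diagram. The box spans from the 1st to the 3rd quartile, and a line is drawn THROUGH the box at the median (a vertical line when the box is drawn horizontally). In the simplest version, the "whiskers" of the boxplot extend out to the minimum and maximum values. Some software, including R's boxplot() by default, stops the whiskers at the most extreme points within 1.5 times the IQR of the box and draws anything beyond that as separate outlier points. The boxplot displays the five number summary: min, Q1, median, Q3, and max.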
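
A minimal sketch of the five number summary behind a boxplot, using made-up data. fivenum() is base R; it uses "hinges" for Q1 and Q3, which can differ slightly from other quartile rules. Calling boxplot() with plot = FALSE returns the numbers without drawing anything.

```r
# Hypothetical data with no extreme outliers
x <- c(2, 4, 4, 5, 6, 7, 9, 10, 12)

# Five number summary: min, Q1, median, Q3, max
five <- fivenum(x)
names(five) <- c("min", "Q1", "median", "Q3", "max")
print(five)

# The values boxplot() would draw (whisker ends, box edges, median line)
box_stats <- boxplot(x, plot = FALSE)$stats
print(box_stats)

# To actually draw it with a vertical median line: boxplot(x, horizontal = TRUE)
```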

What is a histogram, and what is our main challenge with them?

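A histogram is a graph of data that have been classified into bins (or categories), with a bar for each bin showing how many values fall in it. Classification is our main challenge here: choosing how many bins to use and where their boundaries go changes how the distribution looks.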

What is a list in R?

A list in R, in contrast with vectors, can be a combination of data types. Lists are objects that include a collection of anything - numbers, text, or symbols. We create a list using the list() function. With list(), we specify values and the names they go with:
>>> mylist <- list(hisname = "Steve", hername = "Sue", hisage = 40, herage = 39)

What is an assignment operator? What does it do?

An assignment operator in R is "<-"! Essentially, it creates or replaces an object. An object can be legit anything in R - a number, a variable, a data frame, a matrix! In R we can use "<-" or "=". Whatever works best. To the left of the operator, we type the name of the object we want to create/replace. Then, to the right, we go ahead and type an expression, object name, function, etc. To see the results, we simply type the variable name.

What is the coefficient of variation? Why would we want to use it?

The standard deviation is an absolute measure of variability, so it can be hard to use it to compare variability among different locations or different times. For instance, Buffalo's mean annual precipitation across 40 years might be 35.47, with a standard deviation of 4.70. San Diego's mean for the same period might be 9.62, while the standard deviation could be 4.42. Do 4.42 and 4.70 seem that different? NO! But, do 35.47 and 9.62 seem different? YES! That's why we use CV! CV measures variability RELATIVE to the mean. A low CV means that the data is less dispersed relative to its mean. A high CV means that data is more variable.
>>> Sample: CV = (s / x̄) * 100%
>>> Population: CV = (σ / μ) * 100%, where σ is the standard deviation of the POPULATION.
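
Plugging the Buffalo and San Diego numbers from this card into the formula gives a feel for the difference. cv is a hypothetical helper written for this sketch; base R has no built-in CV function that I'm aware of.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(std_dev, avg) {
  std_dev / avg * 100
}

cv_buffalo   <- cv(4.70, 35.47)   # roughly 13%
cv_san_diego <- cv(4.42, 9.62)    # roughly 46%

print(cv_buffalo)
print(cv_san_diego)
```

The standard deviations are nearly the same, but relative to its mean San Diego's precipitation is far more variable.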

What is cumulative relative frequency?

For each category, we count the RELATIVE frequency of that value and ALL PREVIOUS values. IF THE LAST ONE DOESN'T ADD UP TO 1.0, you've done something WRONG!
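
A minimal sketch using base R's cumsum() on relative frequencies. The age groups and counts are made up. The last cumulative value should come out to 1.

```r
# Hypothetical data: 20 people in four age groups
age_group <- rep(c("0-10", "11-20", "21-30", "31-40"), times = c(5, 3, 8, 4))

rel_freq     <- prop.table(table(age_group))  # relative frequency per category
cum_rel_freq <- cumsum(rel_freq)              # running total across categories

print(cum_rel_freq)

# Sanity check: the final cumulative value should be 1
last_value <- as.numeric(cum_rel_freq[length(cum_rel_freq)])
print(last_value)
```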

What are functions in R?

Functions are commands that programmers have written that calculate long expressions in one go.

What is a mode? Describe multimodal, bimodal, etc.

The mode is the value that occurs most frequently in a list of data.
>>> Bimodal: two peaks
>>> Multimodal: multiple peaks
We don't always have a mode! If every value occurs equally often, no value stands out.
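
R's built-in mode() reports how an object is stored, not the statistical mode, so the sketch below counts values with table() instead. find_modes is a hypothetical helper and the data are made up. It returns the most frequent value(s) as text labels.

```r
# Hypothetical helper: return every value tied for the highest count
find_modes <- function(values) {
  counts <- table(values)
  names(counts)[counts == max(counts)]
}

unimodal <- c(1, 2, 2, 3)
bimodal  <- c(1, 2, 2, 3, 3, 4)

print(find_modes(unimodal))
print(find_modes(bimodal))
```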

How do we tell R to include the headers of a data set?

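In the read.csv() function, we can include the header=TRUE argument, which tells R that the first row in the dataset contains the variable names. (header=TRUE is already the default for read.csv(), so writing it out mainly makes the intent explicit.)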

How do we set the working directory in R?

It's a good idea to define a working directory in R, especially when we're dealing with a lot of csv files. We set the working directory in R using setwd("file path")

What is a stem and leaf plot?

It's a graphical data display that can be constructed quickly, with all the data values potentially extracted from the plot with ease.

What is the population variance?

It's the average squared deviation from the population mean.

What is the sample variance?

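It's the sum of the squared deviations from the sample mean divided by n - 1 (one less than the sample size), so it is slightly larger than a plain average of the squared deviations.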

What is the sample standard deviation? (s)

It's the square root of the sample variance: the sum of the squared deviations about the sample mean is divided by n - 1, and then we take the square root.
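
A short sketch comparing a by-hand calculation with base R's var() and sd(), which both use the n - 1 divisor. The data are made up and treated as a sample.

```r
# Hypothetical sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance by hand: squared deviations summed, divided by n - 1
sample_var <- sum((x - mean(x))^2) / (n - 1)

# Sample standard deviation: square root of the sample variance
sample_sd <- sqrt(sample_var)

print(sample_var)
print(var(x))   # built-in sample variance
print(sample_sd)
print(sd(x))    # built-in sample standard deviation
```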

How do we write a dataset from an R object to a csv file?

Let's say we had an R object called dc that was a dataframe. We could use the write.csv() function to export this dataset to a .csv file. Ex:
>>> write.csv(dc, "new_file_path")
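
A minimal round-trip sketch: write a data frame to a .csv file and read it back. The contents of dc are made up, and tempfile() stands in for a real file path so the example runs anywhere. row.names = FALSE is an optional write.csv() argument that leaves the row numbers out of the file.

```r
# Hypothetical data frame
dc <- data.frame(city = c("Buffalo", "San Diego"), precip = c(35.47, 9.62))

# Temporary file path used in place of a real location on disk
csv_path <- tempfile(fileext = ".csv")

# Export to .csv, then import it again as a new object
write.csv(dc, csv_path, row.names = FALSE)
dc_again <- read.csv(csv_path, header = TRUE)

print(dc_again)
```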

What is quantile deviation?

Quantiles are cut points that divide an ordered data set into equal portions. There are several types of quantile:
>>> Median: divides data into 2 EQUAL sets - 50% each
>>> Quartiles: 4 equal sets - 25% each
>>> Quintiles: 5 equal sets - 20% each
>>> Percentiles: 100 equal sets - 1% each
Quantiles give lots of information, and the median is one of them. The median is the data value at the middle position: it has an equal number of values on each side, so 50% of the values are to the left of it and 50% are to the right of it.

What is range?

Range is the difference between the largest and smallest values in an interval/ratio set of data.
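
One thing to watch in R: the built-in range() returns the smallest and largest values rather than their difference, so the statistical range needs one more step. A minimal sketch with made-up data:

```r
# Hypothetical data
x <- c(12, 7, 3, 18, 9)

extremes <- range(x)             # smallest and largest values
data_range <- max(x) - min(x)    # the range as a single number
same_thing <- diff(range(x))     # equivalent shortcut

print(extremes)
print(data_range)
print(same_thing)
```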

