psy420 test two


to compute the standard error, we can just go:

-"hapSD <- sd(hap)" -"n <- length(hap)" -"hapSE <- hapSD/sqrt(n)" And all of these things we've computed our in our Environment.

There are many reasons we may not be able to collect a large (n>50) sample of observations

-each observation might cost a lot of money
-each measurement might be really hard to make (complicated)
-each observation might use up a valuable resource
-etc.

According to Central Limit Theorem,

-most populations are normally distributed
-most samples will probably be normally distributed
-the standard error is the standard deviation divided by the square root of the sample size
-all of the above

criterion value

-not a "critical value" -some reference or cut-off value

CLT tells us that if we did this experiment over and over again an infinite number of times, then...

-this infinite number of means from our two groups would form little normally distributed sampling distributions,
-and they would each be one standard error wide,
-and 95% of each would be enclosed by its respective 95% Confidence Interval.

Probability

A real number between 0 and 1 (inclusive) that describes how likely something is to occur, where 0 means it never occurs and 1 means it is certain to occur.

For p-values how small is "small"?

As a researcher, doctor, data analyst, business person, actuary, etc., that is your call. In the wild, this depends on a lot of factors: •Are there lives at stake? •Is there a lot of money involved? •Is it a health and safety issue? •What is the risk/reward?

attach(insert data set name)

A cool thing about a Data Frame is that you can "attach" it to yourself, so that R automatically knows you want to look for stuff in that Data Frame. This means no pesky dollar signs and less typing.
-"mean(stah$hap)"
vs
-"attach(stah)"
-"mean(hap)"
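A small self-contained sketch of the idea. The data frame and variable names (stah, hap, scrTime) match these notes, but the values are made up so the example runs on its own.

# Made-up stand-in for the stah data frame
stah <- data.frame(hap = c(4, 5, 6, 5), scrTime = c(3, 2, 6, 4))

mean(stah$hap)   # with the dollar sign

attach(stah)     # now R will look inside stah for variable names
mean(hap)        # same result, no dollar sign
detach(stah)     # good practice: detach when you're done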

Small n

We're going to underestimate our SD, often by a lot; the estimate is fuzzy and skewed, and it comes out too small more than half of the time.

View(insert data set name)

Let's try our first command. First, close the view of "stah". Now let's get it back by using the "View" command: -"View(stah)"

plot(x, y)

Select the "Plots" tab in the lower right panel to make sure it's in front. We can plot the data just using the plot command: -"plot(scrTime, hap)" And we get a nice plot of our data over in the Plots area! Notice that plot already cleaner, more legible, and more "sciency" than an Excel plot. If we hit "Export" just above the plot, we can copy it straight to our clipboard:

Now I'll plot histograms of the data, just to check them out, make sure there's nothing too weird, etc...

So I do this:
hist(females, xlab = " # washes", main = "females")
hist(males, xlab = " # washes", main = "males")

Let's plot them both with the same x-axis, and stack them vertically. We tell R what we want the limits of our x-axis to be by putting xlim = ... into the command.
hist(females, xlab = " # washes", main = "females", xlim = c(30,40))
hist(males, xlab = " # washes", main = "males", xlim = c(30,40))
Ha! Now we can easily see that the male distribution is shifted over by about one wash! So females, in this sample, are washing about one more time per week than males overall. That is just a fact.

Another cool way to compare two distributions is with boxplots.
boxplot(females, males, names = c("female", "male"), ylab = "number of washes")
The gray boxes enclose about 50% of the data from each group, and they barely overlap, so, again, we can definitely say that, overall, the females in this sample washed their hands more than the males.

If we compute the means of each group: female mean = 35.64 and male mean = 34.40, or a difference of 1.24. So, on average, females in this sample washed their hands about 1 1/4 more times per week than males. That is just a statement of fact.

But can we trust the result from this one sample and say that females, in general, wash their hands more than males? What if we did the experiment over and over and over again, would we get the same answer every time? How often would the female mean be greater than the male mean? If the female mean was consistently higher, we could generalize our findings; if it wasn't, we couldn't!

But as you all know now, we do not have to do the experiment over and over and over again to figure this out. We can just use the Central Limit Theorem instead! So if we compute our standard errors, we get 0.37 for the females, and 0.24 for the males.

"Wow, so the difference between the means was 1.24 and the standard errors are pretty small compared to that, so I'll bet this difference is solid." But I've been at this game a lot longer than you, so let's make a sketch using our 68/95/almost-all rule. And here we can see that the female mean is so far up relative to the male mean, it's actually outside the 99.9 or "almost all" part of the distribution of male means.

The z-score for the female mean relative to the male sampling distribution is: (fmean - mmean)/mSE = 5.167. And we know that's a BIG z-score. Big enough - sooo rare, in fact - that we should conclude the female mean actually wasn't a rare draw from the male distribution, but a draw from some alternative distribution: a distribution of female hand washes that is different from the male one!

The female standard error was bigger, so maybe the male mean wasn't way out in the tail of it! If we compute the z-score for the male mean using the female standard error (like we did in lab), we get: (mmean - fmean)/fSE = -3.35.
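Here is a compact R sketch of the whole comparison above. The variable names (females, males) match the notes; the data are simulated just so the code runs, so the exact numbers won't match the ones quoted in lecture.

# Sketch only: simulated hand-wash counts (the real hwd data aren't included here)
set.seed(42)
females <- rnorm(30, mean = 35.6, sd = 2.0)
males   <- rnorm(30, mean = 34.4, sd = 1.3)

# Stacked histograms on a shared x-axis
par(mfrow = c(2, 1))
hist(females, xlab = "# washes", main = "females", xlim = c(30, 40))
hist(males,   xlab = "# washes", main = "males",   xlim = c(30, 40))
par(mfrow = c(1, 1))

boxplot(females, males, names = c("female", "male"), ylab = "number of washes")

# Means, standard errors, and the two z-scores described in the notes
fmean <- mean(females); mmean <- mean(males)
fSE <- sd(females) / sqrt(length(females))
mSE <- sd(males)   / sqrt(length(males))
(fmean - mmean) / mSE   # female mean relative to the male sampling distribution
(mmean - fmean) / fSE   # male mean relative to the female sampling distribution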

At some point, when you calculate that something is soooo rare, you should probably consider an explanation other than "rarity".

So now "Am I smart?" Has turned into 2 hypotheses: The default, or "null", hypothesis, that you are part of some reference group (random online test takers, in this case. The alternative hypothesis, that you are not part of that group. It's kind of like you see a black bird that looks like a swan in a pond where you see swans all the time, and you find yourself thinking...Is that a swan, just a super super super rare one? Or is that a black bird that just happens to look like a swan?

Measure of central tendency

Some way to quantify the "center" of the distribution whatever that means

Hypothesis testing Warning

Sometimes we have a particular idea or "hypothesis" about the world before we do an experiment. Like I think I'm (or my kid is) smart. Or I think babies can see faces. These seemingly simple hypotheses are going to turn into weird questions with strange answers. Then, to answer these questions, we're going to play make-believe. This procedure is known as "hypothesis testing" and, well, that's stats.

hypothesis testing steps

Steps:
1) Formulate the hypotheses
2) Figure out the appropriate distribution
3) Compute a p-value with your phone or whatever
4) Think about whether the p-value is small enough to conclude that students got more anxious

What's small?

That depends on the situation, but think of "50-100" as big
-50-100 observations is the range most commonly used in psych and neuroscience
-over 100 observations is big
-20 or 35 observations are small samples

scripts pane

This pane is where you write "scripts" for R to follow. "Scripts" are just a collection of commands for R to obey in order. Using scripts saves us from having to type stuff in the console over and over. (top left)

hist(x values)

histogram

to run the script

hit source button (after opening data set)

big n: correct standard deviation, correct z score, correct p value

small n: underestimated standard deviation, overestimated z score, underestimated p value (more likely to think the alternative hypothesis is correct; we overestimate rareness)

By Central Limit Theorem,

standard error = standard deviation / sqrt(sample size)

RStudio test drive

Download the data file from Canvas. Of course, I suggest being organized and having a folder for this class, making a sub-folder for each module...
Open RStudio. Now we'll load our data:
-Hit "Import Dataset" (top right)
-Select the top option, "From Text (base)"
-Then navigate to the data file and open it.
We see a pop-up dialog. Name: what the Data Frame will be named. We are going to end up typing that some, so let's make it shorter! Let's just do the acronym for the file name. Then hit "Import".
And now we have this:
-Our new data frame (top right)
-A spreadsheet-like view of our data frame (top left)
-The commands we could have typed into the console to do the same thing: load the data and look at them (bottom left)
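If you'd rather type commands than click through the dialog, the console equivalent looks roughly like this. The file name used here is an assumption for illustration; the acronym "stah" is the one used in these notes.

# Rough console equivalent of the Import Dataset dialog (file name is assumed)
stah <- read.csv("screen_time_and_happiness.csv")   # load the data into a data frame
View(stah)                                           # spreadsheet-like view of the data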

how we've used the standard deviations

-to compute a z score
-to compute the standard error (the fatness of the sampling distribution)
-and then, for example, to use that standard error to compute a z score for one mean relative to another; in this case the standard deviation ends up in the denominator of z.
If we do a fraction and our denominator is too small, what happens to our answer? OUR ANSWER IS TOO BIG. This is a problem.

Did students get more anxious?

1) Formulate the hypotheses
-No, students didn't get more anxious - the null hypothesis
-Yes, students did get more anxious - the alternative hypothesis

2) Figure out the appropriate distribution
This is going to be a little weird. Let's pretend we live in the null world - the world in which students did not get more anxious.
Pretending - playing make-believe - that we live in the null world:
What would the mean be? -This one's easy; we already know it's 42.
Would it be fuzzy? -Yes! All experimental means are fuzzy, so if we did the experiment in the null world we wouldn't get exactly the same mean every time!
If so, how fuzzy? -Our "null mean" would have a sampling distribution with a width - a standard deviation - given by the standard error.
Unfortunately, we cannot transport ourselves to an alternate universe in which anxiety didn't change - the null universe - and:
-measure anxiety in n=25 people
-compute a mean (which would be around 42)
-compute a standard deviation (maybe it would be around 10?)
-compute a standard error in order to...
-compute a p-value
Here's what we can do though: we can assume that all that happened to anxiety was that it shifted to higher values overall. If we can assume that, and we've already measured this (mean and sampling distribution), then we can just assume - make believe - that we know the standard deviation and hence the standard error for an n=25 experiment in the pre-pandemic world!
So let's go back and get the exact numbers: n = 25 students, mean = 49.9, std. dev. = 10.1. So the standard error is 10.1/sqrt(25) = 2.02.

3) And now we're ready to compute our p-value!
Computing the p-value in the Probability Distributions app gives p = 4.6 x 10^-5. Which means there is only a 4.6 chance in 100,000, or a 1 chance in over 21,000, that we would do an experiment in a make-believe world where the pandemic did not make students more anxious and see a mean anxiety of 49.9 or greater. So do you think we got so unlucky that our experiment was that 1 in over 21,000, or do you think maybe the whole covid coronavirus pandemic made people more anxious?
We can also calculate a z-score for this mean: z = (our mean - null mean) / width of sampling distribution = (49.9 - 42) / 2.02 = 7.9 / 2.02 = 3.91. And since we all speak z-score now, we know this is really large.

4) See if the p-value is small enough to conclude that we got more anxious
-Why, yes, yes it is.
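Here is a tiny R sketch of step 3, using the numbers above; it reproduces the z-score and the one-tailed p-value that the Probability Distributions app gives.

# One-sample z-style test by hand, using the numbers from the notes
nullMean <- 42        # pre-pandemic mean anxiety
ourMean  <- 49.9      # fall 2020 sample mean
sdHat    <- 10.1      # sample standard deviation
n        <- 25

se <- sdHat / sqrt(n)              # standard error = 2.02
z  <- (ourMean - nullMean) / se    # z = 3.91
p  <- 1 - pnorm(z)                 # one-tailed p-value, about 4.6e-05
c(z = z, p = p)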

So with a skewed distribution we have

3 different measures of central tendency with 3 different values (mean, median, and mode) -there's no right answer, but they convey different info

Why do we do it this way?

Because we can't really compute a sensible probability for things like "Am I smart?" (what would "there's a 70% probability that you're smart" even mean?) But we can compute the probability for things like "Would somebody from this known group outscore you on this test?" We call this probability a "p-value"

Some history, tragedy, and ranting

Before computers and calculators, these probabilities - the areas under the distributions - had to be computed via numerical integration using pencil and paper. This was incredibly tedious, and took a really really long time. Thus, tables were computed that showed z-scores corresponding to pre-determined "round number" areas, such as 0.05 (i.e., the 95th percentile).

Starting around 100 years ago, this procedure was unfortunately called a "test of significance". And the predetermined probabilities became known as "significance levels". The z-scores corresponding to these various probabilities became known as "critical values".

Over the years, lots of people started using these tests, without really understanding them, as an easy way to make decisions without thinking. People are lazy - easy is easier than hard. People strongly favor certainty over uncertainty (even when working with p-values, which are literally numbers that represent uncertainty). And, importantly, when we hear a word, our mind automatically thinks of what that word usually means to us, and all the words related to that concept (this is called "semantic priming").

So the procedure began to look like this:
1) compute a p-value
2) see if it's less than 0.05 (the stupid value that became traditionally used in psychology)
3a) if it is, then the finding is "significant" and therefore real, important, and in no way due to chance producing a rare event
3b) if it isn't, then scrap the study
And this was taught, and handed down from generation to generation... thus leading us to what is now known as... The Replication Crisis.

Happily, things are changing. But when reading older papers, don't pay attention to the "significance", even if the authors go on and on about it. Concentrate on the data in a common-sense way. And, beware, there are still many people around who have been seduced by The Dark Side...

In this class, we will not use the word "significant", because I think it's just too confusing. When we need to talk about some reference or cut-off value, I will call it a "criterion value", not a "critical value". Finally, we will report p-values as probabilities, like p=0.035, not (as you may have seen) as "less thans" like p<0.05.

variableMean <- mean(variable)

Computing means, standard deviations, etc. like this won't do us much good if the values just print out in the console. Happily, we can save these as named values, and we can name them whatever we want! We do it like this:
The "<-" means "assign to". You get it by typing option and minus on the Mac, or alt and minus on Windows.
So reading right to left, the above says: compute the mean of the variable "hap" (in the data frame stah, because we've "attached" it), and assign it the name "hapMean". But you can name your mean happiness whatever you want! You could name it "Phred" if you wanted, but that wouldn't be very descriptive...
(Here's a secret: you can just use "=" instead of "<-" if you want - try it! - some R snobs might hate you but I don't care!)
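A short self-contained sketch of the idea; the names "hap" and "hapMean" follow the notes, and the stand-in values are made up so it runs on its own.

hap <- c(4, 6, 5, 7, 3, 6)     # stand-in happiness scores
hapMean <- mean(hap)           # compute the mean and give the result a name
hapMean                        # typing the name prints the stored value
ls()                           # the Environment/data space now lists "hap" and "hapMean"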

Last time we left off with a fun fact: whenever we collect data and compute a std dev, chances are it is too small. And prior to that we talked about why: this is due to the sampling distribution of the std dev being skewed.

Here's the thing. If we have a lot of observations (a huge sample size), our estimate of the std dev is going to pretty much be the true one. But with a small number of observations (a tiny sample size), our estimate of the std dev is going to get fuzzy, exactly like the mean. But it's also going to get more skewed.

Hypothesis testing of a mean

I wonder if UT students are more anxious in fall 2020 (the coronavirus era) than they were in fall 2019 (the pre-coronavirus era).
Our null hypothesis would be "No, students are not more anxious." Our alternative hypothesis would be "Yes, students are more anxious."
Let's say we know that, by a common anxiety questionnaire, UT students had a mean anxiety of 42 (on a scale of 100) pre-pandemic. So that becomes our "reference" value.
Let's say we sample n = 25 students, and get a mean of 49.9 and a std. dev. of 10.1. If we repeated this experiment on 25 different students, would we get 49.9... exactly, again? No... our mean has uncertainty, it has "fuzziness". The fuzziness - the uncertainty of our mean - is the width of the sampling distribution of the mean. This is given by the standard error of the mean. By Central Limit Theorem, standard error = standard deviation / sqrt(sample size). We get 10.1 / sqrt(25) ~ 10 / 5 = 2.
Now we know both what our mean is, and how fuzzy it is: 49.9 +/- 2.
Now that we have this info, let's 1) make a nice, informative graph and 2) do a hypothesis test and (hopefully) come to a conclusion.
To make a graph, let's first multiply 2 by 1.96 so I can make error bars showing the 95% Confidence Interval... 1.96 * 2 ~ 3.9. And now we start making a graph. Remember that error bars are "shorthand"...
Still, I don't really feel like this graph is answering my question... The missing piece of information is: how does our data compare to the pre-pandemic era? In other words, what would I expect under the null hypothesis that anxiety hasn't changed? So let's add the pre-pandemic anxiety to the graph. We can show that on our graph as a dashed line, perhaps (whatever you think looks good).
So what's the answer to our question? Our question started as "Are students more anxious?" And we graphically sort of turned it into "Did we get what was expected if students didn't get more anxious?" Or "Did we get what we expected under the null hypothesis?" And the answer here is "There's no way. Nope."
Technically, the answer is "If, if, anxiety didn't change, it would be very super highly highly unlikely that we would have gotten a mean at least as high as 50, given the 'fuzz' of our mean!" But how unlikely, exactly? "Exactly how unlikely" is where p-values come in. So let's compute a p-value. Happily, we've already done most of the work.
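A rough base-R sketch of the graph described above: the sample mean with a 95% CI error bar and a dashed line at the pre-pandemic reference value. Only the numbers from the notes are used; the plotting choices (a point plus arrows for the error bar, the axis limits) are just one way to do it.

# Sketch of the graph: mean anxiety with 95% CI, plus the pre-pandemic reference line
ourMean <- 49.9
se      <- 10.1 / sqrt(25)     # about 2
ci      <- 1.96 * se           # about 3.9

plot(1, ourMean, xlim = c(0.5, 1.5), ylim = c(35, 60),
     xaxt = "n", xlab = "", ylab = "anxiety score", pch = 16)
arrows(1, ourMean - ci, 1, ourMean + ci,
       angle = 90, code = 3, length = 0.1)   # 95% CI error bars
abline(h = 42, lty = 2)                      # dashed line at the pre-pandemic mean of 42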

small p value

If this probability is small then, at some point, you have to consider rejecting the idea that you are part of the null group - the group you are using to calculate the probabilities... in favor of the alternative idea, that you are actually a member of some other group - a group that is overall more like you.

dollar sign, $

If we read backwards (for English) from right to left, "stah$scrTime" reads "give me the variable scrTime from the data frame stah"

mean(insert data set name$variable)

If we want to compute the mean of screen time, we would use the "mean" command: -"mean(stah$scrTime)"

data pane

In Excel, we put our data in spreadsheets, usually with data running down a column, with the name for that column at the top.
-where you will see your data
R stores data in very similar things called "Data Frames" (or more comically named but similar things called "tibbles"). We'll talk about these later. (top right)

Notice that we don't have to deal with those cell references (like D6 or $C$5) like we do in Excel.

Instead, our data and only our data stay in the data frame, our calculations are done separately, and the results of calculations get whatever descriptive names we want to give them!

Same with "critical"...

It was not meant to mean: crucial, vital, urgent, key, essential, paramount, all-important, decisive, or pivotal - again, even though these are all synonyms in everyday usage!

In this context, "significance" was meant mean "perhaps worthy of further investigation"

It was not meant to mean: important, noteworthy, real, solid, key, major, consequential, game-changing, or anything like that, even though these are all synonyms in everyday usage!

Back to incomes

Mode: the most commonly earned income, the ~$17k bracket
-we would say "the modal income is ~$17k in the US"
Median: the $ amount that splits the set of incomes in half, ~$49,000. This is a 50:50 split: half of households earn more, half earn less
-we would say "the median income is ~$49k"
Mean: the $ amount where the distribution would balance on your finger, ~$75k
-shifted to the right because of outliers' influence on the mean
-we would say "the mean income is ~$75k"

Alternative Hypothesis

Not the Null Hypothesis. -the opposite of the null - anything else

Notice that our new named value now appears in our environment (our "data space").

Now we can use "hapMean" whenever we want, instead of trying to remember 4.whatever...

Some examples of common skewed distributions

POSITIVE: skinny tail points toward the right
-income
-wealth
-scores on a hard test
-emissions of photons or electrons from a weak source
NEGATIVE: skinny tail points toward the left
-life span in developed countries
-scores on an easy test

Hello RStudio!

RStudio is a super fancy program for dealing with data - just like Excel - but it's designed for scientific data analysis rather than business and administrative uses. Its "engine" is R, a program which is technically separate, but we will never use it directly; we will always use the RStudio interface. I mention this only because I have a bad, but apparently unbreakable, habit of saying "R" as shorthand for "RStudio". So, again, we will never open or use R directly; whenever I say "Make this graph in R" or whatever, I mean RStudio.

A little bit about p-values, history, and tradition.

The basic idea is that you make believe the world is one way (the blue distribution). Then you compute how rare, how bizarre, something (the blue arrow) would be in that make-believe world. We express it as a p-value, which is the probability that a random draw from the blue distribution would produce a value bigger than the one we're considering. If that probability is really small, then the thing we're considering must be really rare in this make-believe world. If the p-value is small enough, we may want to consider that this thing isn't from the world we used to compute the p-value; maybe it's just from a different group altogether (e.g., the pink distribution).

Yes that mean

The exact same number you would get if you added up the incomes of every family and divided it by the number of families -also called the average

plots pane

The lower right pane has 4 tabs we'll use, mainly "Plots". The plots window is where all the awesome graphs you are going to make show up. (bottom right)

Median

The observation that cuts the sample in half 50% of observations above and 50% below

sd(variable)

standard deviation -"sd(hap)"

Here's what we've done:

To answer the question "Did the pandemic make people more anxious?" We took some data. We assumed that all the pandemic would do is make people more anxious overall, -it wouldn't change how spread out or variable anxiety was. -it wouldn't change the distribution of anxiety into some funny non-normal shape We then played make believe - we pretended that the pandemic didn't change anxiety, and calculated how unlikely our data would be. Since our data came out sooooo unlikely, we concluded that we're probably NOT living in the "null" world, we're probably living in a world where anxiety DID increase, and our data are thus typical for this world. Believe me, I know it seems really bizarre at first, but that's stats.To make good decisions, we need to compute relevant probabilities. To compute those probabilities we play make-believe; we ask "What if..." for various possible ways the world might be.

So that's comparing two groups and two group means at a conceptual level.

Various mathematicians and mathematical statisticians over the years have devised ways to handle, for example, those different standard errors we had. But if you have the concept of the sampling distribution and the logic of how we use it down, then everything else will be relatively easy.

summary(insert data set name)

We can use the "summary" command to get a concise summary of our variables. get min, 1st Qu, median, mean, 3rd Qu, max of each variable -"summary(stah)" Notice that we had to tell both the "View" and "summary" commands what we wanted a view or summary of ("stah") by putting it in parentheses. We had to be exact and specific. So to get means and standard deviations of both variables:

There will also be a big menu at the top of your screen

We will hardly ever use any of these except the "File" and "Edit" ones, which are basically the same across all applications (Word, Excel, etc.)

Some good news

We're not really going to learn anything new in this, except that... This whole hypothesis testing business can be applied to means as well as to individual observations. As such, we can answer questions about groups of people in addition to individuals.

So you think you're smart...

You take a random online intelligence test and score a "75". What the heck does "75" mean? Is that good or bad? To know what the 75 means, you need a comparison group or, in other words, a reference group, a context for your 75.
Say you look at the site, and it says the mean score is 60 and the standard deviation is 10. So now we can compute a z-score and/or a percentile: z-score = (75 - 60) / 10 = 15 / 10 = 1.5.
We can get a rough idea of where we stand by the 68/95/almost-all rule. We should be somewhere between the 84th percentile and the 97th or 98th percentile. To figure this out exactly, we'll use software. PROBABILITY DISTRIBUTIONS APP
Notice, though, that we kind of sneakily changed the original, less specific question "Am I smart?" to a more specific one that we could answer in terms of probability: "Am I relatively rare (on the high side) in this group of online test takers?"
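If you want the exact percentile without the app, R's pnorm does the same normal-distribution lookup. This is just a sketch using the numbers above.

z <- (75 - 60) / 10       # z = 1.5
pnorm(z) * 100            # about the 93rd percentile
1 - pnorm(z)              # chance a random test taker outscores you, about 0.067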

terminal and jobs

You'll notice that there are two other tabs in this lower left pane. We won't use these. We'll use only the console.

significant

a (poor) description of a finding that exceeds some arbitrary "significance level" - it does not mean important, real, etc.

significance levels

a completely arbitrary cutoff that was used to decide whether or not to reject the null hypothesis back before we had computers -predetermined probabilities

Script

a file made up of a series of commands for R to follow.

percentile

a number between 0 and 100 that indicates about how much of a distribution is below that score.

commands

a precise statement of exactly what you want R to do.
-They are called "commands" because you are the boss of R. It has to do exactly what you tell it to do. But R is a very literal employee with no common sense, so you have to tell it exactly what to do. You have to give it exactly the right commands - the commands I'll tell you to use - with no typos or anything.

comparison group

a reference group, a context

For a normal distribution, mean median and mode are

all the same

Skew

an asymmetry in a distribution (it pulls the mean median and mode apart from one another) -can be positive or negative

95% confidence interval

an interval around a data point that would contain the True Value 95% of the time.

separate out my two groups, females and males, and give the two groups separate names.

females <- hwd$washes[hwd$sex == 1]
males <- hwd$washes[hwd$sex == 0]
This part (inside the square brackets): hwd$sex == 1 says "find all the rows where the variable sex in the data frame hwd is equal to 1". And this part: hwd$washes[...] says "give me these rows (whatever is in the square brackets) of the variable washes in the data frame hwd".
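A tiny self-contained example of that subsetting pattern. The hwd data frame here is made up (four rows) just so you can see what the brackets return.

# Made-up miniature version of the hwd data frame, just to illustrate the subsetting
hwd <- data.frame(sex = c(1, 0, 1, 0), washes = c(36, 34, 35, 33))

hwd$sex == 1                          # a TRUE/FALSE vector marking the female rows
females <- hwd$washes[hwd$sex == 1]   # washes for rows where sex == 1
males   <- hwd$washes[hwd$sex == 0]   # washes for rows where sex == 0
females
males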

length

gives the number of observations in a variable

Sampling distribution of standard deviations

another common skewed distribution - if we did an experiment, got a sample of data, computed the standard deviation, did that over and over, and then made a histogram of all those standard devs, it would have a positive skew
-the reason is that when you compute a standard dev you sum up squared values. Squared values can't be negative, but they can be arbitrarily large and positive (square rooting doesn't undo this)
-squaring introduces the skew
So, just like with the incomes, the median standard deviation will be to the left of the mean (average) standard deviation
-it's the average or mean std. dev that is the best estimate of the true population std. dev
-what that means is that if you collect data and compute the standard deviation, it's going to be too small more than half of the time
-because the 50/50 split (the median) is to the left of the mean, more than half of the sample std devs land below the best estimate
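A quick simulation sketch of this claim: draw many small samples from a normal population with a known SD, compute each sample's SD, and look at the histogram and at how often the estimate comes out too small. The sample size and number of replications here are arbitrary choices.

# Simulation: sampling distribution of the standard deviation for small n
set.seed(7)
trueSD <- 10
n      <- 5                                   # a small sample
sds <- replicate(10000, sd(rnorm(n, mean = 50, sd = trueSD)))

hist(sds, xlab = "sample std dev", main = "sampling distribution of the SD")
mean(sds)            # the mean of the SDs sits near the true value
median(sds)          # the median sits to the left of the mean (positive skew)
mean(sds < trueSD)   # well over half of the sample SDs come out too small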

student t distribution history

arose about 100 years ago in Dublin, Ireland, at the Guinness Brewery
-you have to test your ingredients and your final product
-but if you test too much, you eat into supply and profits, so you want to test with small samples - as small as you can get away with
William Gosset, chief brewer of Guinness:
-used small samples and plotted all his means as z scores
-his distribution was always slightly different from the normal distribution
-too many big z scores in the tails and not enough small ones near the peak
-too big because of the small samples - the z score follows a t distribution
-there isn't just one t distribution - there is a family of them

error bars

bars extending out from a data point on a graph, usually showing +/- one standard error or, more commonly, a 95% confidence interval.

mean

computes the mean

sd

computes the standard deviation

attach

lets you use variable names in a data frame directly

main = "title"

main title of the graph

Mode

most frequently occurring observation or the center of the bin containing the most observations

The hypothesis used to compute p-values is generally the ____ hypothesis

null hypothesis

In statistics, questions are often framed in the form of a

null hypothesis and alternative hypothesis

plot

plots data

change line in script

save, then source the whole script again

xlim = c(#,#)

sets range of x axis

ylim = c(#,#)

sets range of y axis

summary

shows a brief summary of the variables in a data frame

View

shows a data frame like a spreadsheet over in the script pane (actually, we can just click on the data frame), so...

Whenever we collect data and compute a standard deviation, chances are that it is too

small

solution to the small standard deviation problem

The t distribution! It doesn't change anything conceptually; it is just a correction factor for small samples - a brilliant correction for the standard deviation.
If we're using the std dev from our data, then the normal distribution is the wrong distribution to look up p-values; there has to be another one that allows for this std dev underestimation business. The correct distribution, the one that gives correct p-values for small sample sizes, is called the Student t distribution. So when we have small samples and we are estimating the standard deviation from our data, we use the t distribution.
small n: underestimate SD, overestimate z score, underestimate p value
The t distribution gives you the right p-value even with a small n; it gives us what we would have gotten if we had known the correct standard deviation.
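A small sketch of the correction, borrowing the anxiety numbers but pretending the sample were tiny (n = 5 is an assumption, just for illustration). With a positive test statistic, the normal lookup gives a smaller (overconfident) p-value than the t lookup with n - 1 degrees of freedom.

# Normal vs. t lookup when the SD is estimated from a small sample
nullMean <- 42
ourMean  <- 49.9
sdHat    <- 10.1
n        <- 5                          # pretend we only had 5 students

se   <- sdHat / sqrt(n)
tval <- (ourMean - nullMean) / se      # same formula we used for the z-score

1 - pnorm(tval)                        # normal lookup: too small (overconfident) for small n
1 - pt(tval, df = n - 1)               # t lookup with n - 1 df: the corrected, larger p-value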

If somebody got a score that was so rare, like a score that corresponded to a probability of 0.00001 that somebody else would come along and score better than them, we might conclude that they were not "normal", __________________________

that they belonged to a different group, that they were a "genius" or whatever. For example, maybe random online test takers tend to be part of a group (like 40 yr old trolls living in mom's basement), that you aren't part of. Maybe, alternatively, you're part of a different group.

Data Frame

the containers for R data; they are analogous to an Excel spreadsheet that contains only data and data names (like the ones you have gotten for the labs and projects so far).

by giving you just a mean and standard deviation, they are also telling you implicitly that _____________________________

the distribution of scores is Normal

Null Distribution

the distribution you're using to compute the probabilities.
-the distribution corresponding to the null hypothesis

Null Hypothesis

the hypothesis that something comes from a given distribution, the Null Distribution.
-the default hypothesis used to compute p-values.

console

the place you enter commands that you want R to obey immediately. -The console is where you enter R commands. (bottom left)

p-value

the probability that something from the Null Distribution would be greater than some value; the area under the distribution above that value. -the probability of seeing a measurement at least as extreme as the one being considered

negative skew

this distribution also has a skew. It has a negative skew because most of the mass is to the right and there's a long tail pointing to the left, toward negative scores. This distribution is characteristic of scores from a test on which the students on average did really well, because the test was easy, or the students studied hard, or were just bright. Do you see why? Scores from a really hard test, or from people that didn't study, etc., would be flipped and have a positive skew - do you see why? This distribution is actually from our first test - it made a nice negatively skewed distribution.

positive skew

this distribution has a skew. It has a positive skew because most of the mass is to the left and there's a long tail pointing to the right. BTW, household income is skewed; in this case, the upper tail is so long that whoever made the graph shoved all the really high numbers - the super long tail - into these two high bars.

so many things are normally distributed b/c of CLT, but not everything is. For example, one very non-normal distribution is household incomes in the US

true

xlab = "title"

x axis title

ylab = "title"

y axis title

A p-value is the probability that

you would get a mean or score at least as big as yours, given the null hypothesis.

critical values

z-scores corresponding to these various probabilities

Review

•Means are fuzzy. •Error Bars: bars extending out from a data point on a graph, usually showing +/- one standard error or, more commonly, a 95% confidence interval. •95% confidence interval: an interval around a data point that would contain the True Value 95% of the time. •Normal Distributions are awesome •Hypothesis testing •The Normal Distribution can tell us how rare something is (when you assume that it's coming from some reference or "null" distribution). •Given an observation (like a test score), you can compute how rare it would be to see a score at least that extreme. •This "rarity" is expressed as a probability, and is called a p-value. •If something seems very rare given one hypothesis (a "null" hypothesis), perhaps a different ("alternative") hypothesis is true instead.

