Quantitative research in communication (Allen, M.) Chapter 2, pp. 17-34.
The median is a bit less well known. Based on the definition given in the book, what is the median age of the five people above?
24. (It is the midpoint of a set of values when they are arranged from least to greatest.)
The book is silent about the following case: What is the median of the following six numbers? 17, 23, 19, 30, 25, 25. You see the problem? Try and look up in another source how to find the median in cases like this. What is the answer?
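24. (Sorted, the values are 17, 19, 23, 25, 25, 30. With an even number of values, the median is the mean of the two middle values, 23 and 25.)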
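The rule for an even number of values can be checked with a short Python sketch. The helper name `median_by_hand` is made up for this illustration; `statistics.median` is the standard-library function and follows the same convention of averaging the two middle values.

```python
from statistics import median


def median_by_hand(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: there is a single middle value.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


five_ages = [19, 23, 24, 27, 32]
six_numbers = [17, 23, 19, 30, 25, 25]

print(median_by_hand(five_ages))    # 24
print(median_by_hand(six_numbers))  # 24.0
print(median(six_numbers))          # 24.0
```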
I'm sure you all know how to calculate an average, but just to be on the safe side...: These are the ages (in years) of 5 participants in an experiment: 19, 23, 24, 27, 32. What is their average age?
25
The third measure of central tendency, the mode, is always taught in courses about quantitative analysis, although it is very rarely used in actual practice. What is the mode of the age data in the previous example and why?
25
The simplest measure of dispersion is the range. What is the range of this data set?
29
It is not difficult to see why the mode can be problematic. Go back to your previous analysis on IQ scores, and try to determine the mode of the data. What is the problem?
There is a lot of data to go through to find the mode. It can be done, but it would take a lot of time and energy.
What is the technical term for the deviation illustrated in graphs (b) and (c) above?
standard deviation
Measures of dispersion
Dispersion refers to how far the values are spread apart. Are the values quite similar to one another or very different? In fact, there are quite a few terms that are used synonymously with dispersion, e.g. "spread", "variation", or "variability".
Let's leave these settings alone for the time being. Instead of fiddling around with the bars, let's discover how we can add the professional-looking normal curve on top of the histogram, so that we can create a graph that is similar to the one in the book.
Double-click the graph and the Chart Editor will open. In the toolbar, look for the button that looks like the normal curve. Got it? If you hover your mouse over it, it says "Show Distribution Curve". Click it, and there you go! You can close the Chart Editor and you've got a histogram with the normal curve superimposed.
What is the width of every bar in the histogram SPSS automatically created?
$25
"Role"
(In this course, we will ignore the last column "Role", the default setting "Input" will be suitable for all the analyses we do. This column is only relevant in some highly advanced applications related to artificial intelligence and machine learning.)
What was the value of Pearson's r in this case?
-.14
There is always a small chance of making such errors,
but in good research we always try to minimise the probability that either of these errors is made.
The next step is to calculate how far each of the values lies from the mean. You'll need to subtract the mean from each of the values. The results are called the deviations. Values below the mean have negative deviations, while values above the mean have positive deviations. (If a value is equal to the mean, its deviation is zero.) Calculate the five deviations and list them below.
-11, -5, -4, 2, 18.
The first step is to calculate the mean of the values, which is equal to...
18
What is the standard deviation of the IQ scores?
10.655
Calculate SD for our example.
11.068
In addition to the graph, you get some (now familiar) numerical descriptors as well. What is the mean IQ of these students?
110.22
Now, as the book says, square each of these deviations (i.e. multiply them by themselves). What are the squared deviations? (List the five values.)
121, 25, 16, 4, 324
And thus, the variance is equal to...
122.5
Just to make sure you get it right. What is the median of the following seven numbers: 24, 21, 18, 15, 27, 17, 24?
21
What percentage of respondents belong to the "Primary school" category?
15.0%
The book is wrong in claiming that this is the variance itself. It is not. The value you've just obtained is called the sum of squared deviations (or simply: the sum of squares). The variance is "kind of" an average squared deviation. To calculate the average squared deviation, you would divide this sum by the number of values you have added up, i.e. 5. The variance, however, is different, because instead of dividing by the number of values, we divide by the number of values MINUS ONE. In this particular case, you have to divide the sum of squared deviations by...
4
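The steps of the variance calculation can be traced in a few lines of Python. This is only a sketch using the five observation times from the metro example (7, 13, 14, 20, 36); the variable names are chosen for this illustration.

```python
times = [7, 13, 14, 20, 36]  # seconds, from the metro example

n = len(times)
mean = sum(times) / n

# Deviation of each value from the mean (negative below, positive above).
deviations = [x - mean for x in times]

# Square each deviation, then add them up: the sum of squares.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

# The variance divides by n - 1, not by n.
variance = sum_of_squares / (n - 1)

print(mean)            # 18.0
print(deviations)      # [-11.0, -5.0, -4.0, 2.0, 18.0]
print(squared)         # [121.0, 25.0, 16.0, 4.0, 324.0]
print(sum_of_squares)  # 490.0
print(variance)        # 122.5
```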
The next step - again, as the book says - is to add up these squared deviations. What is the sum of squared deviations?
490
Now go back to "Data view" and restore the data set by entering the values "3" and "4" in rows 7 and 9, respectively. Run the analysis again. You should get the same output as the first time. Now focus on the last column called "Cumulative percent". What is the value you can see in the row "Secondary school"?
55.0%
Suppose you conduct a small-scale questionnaire survey, and one of your items in the questionnaire asks about the respondent's level of education. The respondent must choose one of the following options: (1) No schooling [N], (2) Primary school [PS], (3) Secondary school [SS], (4) Bachelor's degree [BA], (5) Master's degree [MA], or (6) PhD [PhD]. You gather data from 20 people and obtain the following answers:SS, SS, MA, N, PS, SS, SS, BA, BA, MA, BA, MA, BA, SS, SS, BA, SS, PhD, PS, PS What is the frequency of the category "Secondary school"? (Write a number.)
7
You may not understand some parts of the output but if you look at the "Descriptives" table, out of the 13 rows, how many do you understand?
7
Now, starting in the menu bar, go "Analyse -> Descriptive Statistics -> Descriptives..." The "Descriptives" dialogue box will appear.
As in the Frequencies dialogue box, you have to tell SPSS which variable you want to analyse (in a typical data set we have more than one variable), so drag the variable "Time" to the empty box on the right. Click OK. You'll see a table which gives you some important descriptive statistics. (If the results are the same as what you obtained, give yourself a treat!)
So, the mode is rarely used, but the choice between the mean and the median can be very interesting. For example, we often see pieces of research reporting on the mean salary in a country, while the median salary is less often provided. Do you expect the two values to be roughly the same or quite different? Why?
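Quite different. Salary distributions are typically positively skewed: a small number of very high earners pull the mean upwards, so the mean salary tends to be higher than the median. Depending on what we are using the data for, some measures of central tendency are more meaningful than others, and here the median better reflects a typical salary.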
Drag this image over the largest area with the blue text (the so-called "canvas"). We want IQ to be represented on the horizontal ("X") axis, so drag the variable over the "X-Axis?" label. You can see that the Y-Axis has been automatically relabelled as "Histogram".
Believe it or not, it's that simple. You've set up your histogram and so you can click "OK".
This example illustrates the general principle that the type of data we work with determines what procedure will be appropriate for analysis
Calculating frequencies, percentages, and creating pie charts and bar charts are all inappropriate when we work with continuous data, therefore we need other measures and other types of charts.
Do you see how to create a graph? Well, I hope you do: just look for a button with a label that's synonymous with "graphs". What's it called?
Chart builder
The average height of Hungarian adult men is 176 cm, whereas that of adult women is 164 cm. The standard deviation of height is 6 cm. Calculate the value of Cohen's d, a measure of how different the two genders are in this respect.
Cohen's d = (164 - 176) ⁄ 6 = -2. The sign only reflects the order of subtraction; the magnitude is exactly 2, which indicates that the two group means differ by 2 standard deviations.
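The calculation can be written as a small Python function. This is a minimal sketch that assumes, as the exercise does, a single standard deviation shared by both groups; the function name `cohens_d` is chosen for this illustration.

```python
def cohens_d(mean_a, mean_b, sd):
    """Difference between two group means, expressed in standard deviations."""
    return (mean_a - mean_b) / sd


# Women (164 cm) compared with men (176 cm), SD = 6 cm.
d = cohens_d(164, 176, 6)

print(d)       # -2.0
print(abs(d))  # 2.0
```

The sign depends only on which group is entered first; the magnitude is what is compared with the small, medium, and large benchmarks.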
The authors mention yet another important measure, which can indicate how sure we can be that a statistic is an accurate estimate. When for example, I conclude that I can be 95% sure that the true average height of all men is 176 cm ± 2 cm, I provide a c_______ i________. (Write the missing term.)
Confidence interval.
The frequency analyses you've just done work well with all categorical (i.e. nominal or ordinal) variables, but, as the book points out, they are not very useful when you wish to analyse _________ data. What is the missing word?
Continuous data.
Notice the line of reasoning here: "If the null hypothesis were true, it would be very unlikely that we obtain such results. But we HAVE obtained such results. Therefore, we can conclude that the null hypothesis is (very probably) f_____." Again, what's the missing word?
False
The two most common types of graph that you may use to show frequencies/percentages are (1) the pie chart and (2) the bar chart. Try and figure out on your own how to create such charts in SPSS.
First create a pie chart that shows the distribution of the level of education, then create two different bar charts: one that shows the frequencies (i.e. counts) and another (very similar) one in which you can see the percentages of respondents falling in each category.
Try and discover for yourself how you can obtain the numerical indicators for these two types of deviations from the normal distribution. (Hint: there are at least three different ways to obtain these statistics, using the dialogue boxes you already know.) What are the values you find for the social media data? How can you interpret them?
Skewness: .898 (statistic), .172 (Std. Error). Kurtosis: .732 (statistic), .342 (Std. Error). Both statistics are smaller than 2 in absolute value, so the deviation from normality is not extreme. However, both exceed twice their standard error (2 × .172 = .344 and 2 × .342 = .684), so by that rule of thumb the distribution is positively skewed (it has a longer tail towards high values) and somewhat more peaked than a normal distribution.
So, what's this normal distribution, anyway? I wonder why the book does not explain... As you can see in your chart, it is a beautiful bell-shaped distribution, showing that most observations cluster around the mean, and the further you go from the mean, the fewer observations you find. The funny thing is that most continuous variables in the world that we live in are (at least approximately) normally distributed. (There is also a mathematical reason why this should be so, called the Central Limit Theorem.)
For instance, adult people's heights (within each gender) are normally distributed: most people are of an approximately average height, short and tall people are less frequent, and extremely tall and extremely short people are very rare. The same applies to the weight of adults, exam scores, language exam results, IQ test scores, the size of chicken's eggs, the high temperature on a given day of the year, and so on. The normal distribution is so common that in most quantitative research, we simply assume that the continuous variables we work with are all normally distributed.
In the social sciences, by convention, when can we conclude that we have evidence for a difference or a relationship? (I.e. when do we say that the difference or the relationship is significant?)
By convention, when the p-value is below .05, i.e. when results like ours would occur less than 5% of the time if there were no real difference or relationship.
So, we have set up our variable "Level of education" and now we are ready to enter the actual data.
For this, we have to go back to "Data View". Notice an important difference between the layouts of the two views. In "Variable View", each row corresponds to a variable, and the columns represent the different settings. Therefore, "Level of education" corresponds to Row 1. When you switch to "Data view", the variables will correspond to the columns rather than the rows. You can see that "EduLevel" is the header of Column 1.
If you want to find out how many respondents belong to each category, which column of the table will give you answer? (What is the header of the appropriate column?)
Frequency.
Tick these, click Continue and then OK. You should see that the table has been expanded to include these statistics as well. But still, there is something missing. Let me show you how to obtain all the details you would ever want.
Go "Analyse -> Descriptive Statistics -> Explore..." This dialogue box is more complex than the other two you've seen before (and - of course - you can achieve a lot more with it than what we're doing now). For the time being, all you need to do is to drag your variable "Time" over to the box labelled "Dependent List" and click OK.
Now let us see how you can use SPSS to calculate all these values, so that you will never have to do this again (except perhaps at the exam at the end of this course...).
Go "File -> New -> Data" to create a new, empty data set. Switch to "Variable View", and in the first row, create a new variable with the Name "Time". Set the Decimals to "0" and the Measure to "Scale". (In the rest of the columns the default settings are fine.) Go back to "Data View" and in the first column, enter the five values in our example: 7, 13, 14, 20, 36.
There is a button in the toolbar at the top with the number "1" and the letter "A". (If you hover your mouse over it, it says "Value Labels".) Click it once. What do you notice?
I noticed that each number changed to its defined category label. This is actually a toggle button, so you may click it again to go back to the original view. You can choose whichever way you prefer to see your data. This is just a view setting, so it will have no effect on the analyses you do.
The file called "Gaming" contains data about how much time adolescent boys and girls spend playing video games (expressed in hours per week). Create a histogram on the variable "VidGame" and comment on what you see. Can you think of a reason why the distribution looks like this?
I think it looks like this because the number of hours per week spent playing video games is twice as high among males as it is among females.
Now look at the two types of chart you've created (pie chart vs. bar chart). Which type do you prefer? Which chart do you think communicates/visualises the distribution of education levels more efficiently. Think of your audience. In your own words, write about some advantages/disadvantages of one type of chart over the other.
I think the pie chart works better when we have fewer categories; it becomes a bit overwhelming to read and understand when there are many of them. From my perspective, a pie chart also makes it very difficult for the audience to estimate the magnitude of the angles. The bar chart, on the other hand, can be used for a broader range of data types, not just for breaking down a whole into its components.
Remember, that we decided to represent the level of education numerically, and therefore we will have to enter our data in a numerical form (rather than typing strings such as "N", "PS", "SS" etc.).
In Data View, enter the following numbers, one below the other, in the first column in rows 1-20: 3, 3, 5, 1, 2, 3, 3, 4, 4, 5, 4, 5, 4, 3, 3, 4, 3, 6, 2, 2. (Overwrite the "1" in row 1 that you have previously entered for testing the effect of the "Decimals" setting.)
Well, what result would convince you? If the lady gets it exactly right and identifies the 3 cups prepared the right way, nothing could be more convincing than that! But still, even in that case, she might have got it right by accident!
In fact, because the cups can be arranged in 20 different ways, the probability that she gets it right by guessing is 1 in 20, which is 5%. So even if the lady is just guessing, there is a 5% chance that we will be convinced that she can tell the difference. If she's guessing, there is a 5% chance that in the end we will draw the wrong conclusion and make an error.
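The figure of 20 arrangements can be verified by listing every way of picking 3 cups out of 6. The sketch below uses Python's standard `itertools.combinations` and `math.comb`; the cup numbering 0 to 5 is just a labelling chosen for this illustration.

```python
from itertools import combinations
from math import comb

cups = range(6)  # six cups, labelled 0..5 for illustration

# Every possible set of three cups the lady could name as "milk first".
selections = list(combinations(cups, 3))

print(len(selections))      # 20
print(comb(6, 3))           # 20
print(1 / len(selections))  # 0.05
```

Only one of the 20 selections is entirely correct, so a pure guess succeeds with probability 1/20, i.e. 5%.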
Graph (a) as well as the dotted line in graphs (b) and (c) represent the normal distribution. Describe how the distributions in graphs (b) and (c) deviate from normal.
Graph (a) shows a normal distribution. In graph (b) the distribution does not fit the normal curve: there is a sort of "dip" in the values right around where the mean should be. In graph (c) a few cases at the high end of the set of values stand apart from the others.
This is where you can set up all your variables. In this first example, we have only one variable ("Level of education"), so let us set it up. Because we have only one variable, we will enter settings in row 1.
In the "Name" column, you have to provide the name of the variable. Try and type "Level of education" under Name in row 1.
There are also cases when there is no mode. When could that happen?
It can happen when no value appears more often than any other.
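A short Python sketch makes the "no mode" case concrete. The helper `modes` is made up for this illustration, and returning an empty list when all values are equally frequent is a convention assumed here, not something the book prescribes.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when all values are equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    if all(count == highest for count in counts.values()):
        # No value stands out from the rest, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([19, 23, 24, 27, 32]))      # []
print(modes([17, 23, 19, 30, 25, 25]))  # [25]
```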
Now look at the values in this column carefully and try to figure out on your own what this number may mean. What is the practical meaning of your previous answer? This is the proportion of those respondents who... (Finish the sentence.)
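This is the proportion of those respondents who have completed secondary school at most, i.e. the combined percentage of the "Secondary school" category and all the categories below it.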
Why on earth do you get the same information twice? To find out, go back to data view, and delete the "3" in row 7, and "4" in row 9. You should take care how you do this! If you just select a cell and press delete, the whole row (i.e. respondent) will be deleted from the data set, which is not what we want.
Instead, when the given cell is selected (i.e. the background is orange), press F2 (as in Excel) to edit it. Then delete the number and press Enter. Now you should see a dot in the cell which indicates the value is missing. Now run the frequency analysis again. If you look at the output, you can see that the values in the two columns identified above are now different. Look at the values carefully.
At this point, I'd like to show you that you've really learnt something. Open the "IQ" file again. Go "Analyse -> Descriptive Statistics -> Explore...", drag the IQ variable to the "Dependent List", and click "OK". Now look at the "Descriptives" table and count how many of the 13 rows you understand now. List the ones we still haven't covered.
Interquartile Range, 95% Confidence Interval for Mean, 5% Trimmed Mean
The remaining term is an "umbrella" term that SPSS uses for the other two data types you already know. What are those?
Interval and ratio variables. In other words, SPSS does not make a distinction between these latter two data types and uses a single option to cover those two cases.
The book mentions two specific ways in which a distribution may deviate from normal. Open the file "Social media use". This file contains data about 200 teenagers, and the numbers represent the number of hours spent using social media sites/apps over a period of 7 days. Create a histogram for the data and comment on what you see
It looks very nice: the distribution fits along a normal curve, and there are no dips or humps around the mean.
What does the lower-case (usually italicised) letter p represent in significance tests?
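It represents the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. It is compared with the alpha level, i.e. the accepted risk of a Type I error, which is typically set to .05, so results with p < .05 are regarded as significant.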
According to Cohen's definition, would this be regarded as a large, medium, or small effect of gender?
Large.
Measures of central tendency
Measures of central tendency are important numerical descriptors of a continuous variable. The book does not provide a definition, but what is common in the three measures discussed is that each of them provides a numerical value around which the data tend to centre or cluster.
Of all the measures of central tendency and measures of dispersion that we've looked at, which ones do you see here (i.e. in the output generated by SPSS)?
Minimum, Maximum, Mean and standard deviation.
Which of these terms do you recognise from last week's reading?
Nominal and Ordinal
Which are the ones that you (painfully) miss?
None.
Click the "Values" cell in row 1, and then the three dots. This will open the "Value Labels" window. In the text box "Value", type the number "1", and in the text box "Label", type "No schooling". Click the "Add" button (or just press Enter). You can see in the large text box that SPSS now knows that the value of 1 corresponds to "No schooling".
Now type "2" and "Primary school" in the two text boxes at the top and click "Add" again. Continue this way until all the 6 values are listed with their corresponding labels. If you make a mistake, you can click any of the lines in the large text box and use the "Change" or "Remove" buttons to edit the labelling scheme. Once you've checked that everything's correct, click OK.
In the next column ("Decimals"), you can see that the default value is 2. What does this mean? Go back to "Data view" and try to enter the value "1" in the first row (column 1). What do you see in the cell after pressing Enter?
The value is displayed as "1.00", i.e. with two zeros after the decimal point.
Now think of the nature of our variable ("Level of education"). Which option is the appropriate choice in the "Measure" column? (If it is not obvious, go back to last week's reading for clarification.)
Ordinal variable.
You may notice that two columns contain identical information. What are the headers of those two columns?
Percent and Valid Percent
Because in that study the correlation between handshake quality and the interviewer's assessment turned out to be significant, the authors were able to r_________ the null hypothesis. (What's the missing verb?)
Reject the null hypothesis
What are three options that you can choose from?
Scale, Ordinal and Nominal
In the gallery you can see that SPSS can create four different types of histogram: which of the previews looks most similar to the examples in the book? (Hover your mouse over it, and its name will appear.)
Simple histogram
It's more like an aesthetic issue, but because our data will consist of whole numbers (integers), we will not need those extra zeros: they are absolutely redundant.
So, go back to "Variable view" and set the number of decimals to 0. (You may go back to "Data view" to check that your change has taken effect.)
So, we have set up our variable and entered the data. Now we can do a simple analysis.
"Missing"
Sometimes, especially when you work with traditional pen-and-paper questionnaires, respondents skip questions and therefore you have missing data. Some researchers like to use specific codes (such as the number "99" or "999" or "-1") to represent the fact that there is no data available. This is where you can tell SPSS which numerical value (or values) should be ignored in the analyses because they represent missing data. Another (perhaps easier) way to deal with missing data is to simply leave the corresponding cells empty (in Data view). All empty cells are treated as "System missing" by SPSS.
I hope you can see that with ordinal variables (such as this one), this column can often be very informative.
Tables can summarise your results very efficiently, but it is a fact of life that most people do not get very excited about columns of numbers. You can often communicate your results far more efficiently by creating a visual representation, i.e. a graph. Let's see how you can do that in SPSS. Go back to the Frequencies dialogue box.
Start from the menu bar at the top and go: "Analyse -> Descriptive Statistics -> Frequencies..."
The Frequencies dialogue box appears and you can see our variable "Level of education" on the left. Drag it to the empty box (labelled "Variables") on the right. (Alternatively, you may select it and use the arrow button in the middle.) This is the way you tell SPSS "I want an analysis on this variable". Click OK. The results of the analysis are given in a table in a new output window. Let's see if you can make sense of it!
In the study we looked at in the first class (titled "Exploring the Handshake in Employment Interviews"), you could see some examples for Pearson's r, which is a measure of how strongly two variables are correlated. Open the article (you can find the pdf-file on Moodle) and look at the table at the bottom of p. 1142. Of all the components of a good handshake, which one had the strongest effect on the interviewer's evaluation of the candidate?
The applicant's gender had one of the strongest effects.
It is also important to understand that many different histograms could be constructed for the same set of values.
The bars correspond to intervals (of the same width), but the width of the interval and where the lowest interval starts is up to you. SPSS analyses your data and automatically determines the width of the bars and their location in order to create a good-looking histogram (with neither too few, nor too many bars).
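To see how the choice of interval width changes the picture, the sketch below counts the same values with two different bar widths. The scores are hypothetical values invented for this illustration (they are not the contents of the "IQ" file), and `bin_counts` is a helper written here, not an SPSS feature.

```python
def bin_counts(values, start, width):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = {}
    for value in values:
        # Lower edge of the interval that contains this value.
        lower = start + ((value - start) // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical scores, used only to illustrate the idea.
scores = [92, 97, 101, 104, 105, 108, 110, 111, 113, 118, 124, 131]

print(bin_counts(scores, start=90, width=10))
# {90: 2, 100: 4, 110: 4, 120: 1, 130: 1}
print(bin_counts(scores, start=90, width=20))
# {90: 6, 110: 5, 130: 1}
```

Each count corresponds to the height of one bar, so the same data set yields five bars in the first case and three in the second.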
Explain in your own words what is the difference between the two columns.
The "Percent" column shows the percentages of the total sample, while "Valid Percent" is the percentage when missing data are excluded from the calculations.
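The difference between the two columns can be reproduced by hand. The sketch below uses the 20 education codes from this exercise with the values in rows 7 and 9 removed; representing a missing value as `None` is a choice made for this illustration.

```python
# Education codes for the 20 respondents; rows 7 and 9 are missing (None).
codes = [3, 3, 5, 1, 2, 3, None, 4, None, 5, 4, 5, 4, 3, 3, 4, 3, 6, 2, 2]

total = len(codes)                              # all respondents: 20
valid = [c for c in codes if c is not None]     # respondents with an answer: 18

secondary = valid.count(3)                      # code 3 = "Secondary school"

percent = 100 * secondary / total               # share of the whole sample
valid_percent = 100 * secondary / len(valid)    # share of the valid answers only

print(secondary)                # 6
print(round(percent, 1))        # 30.0
print(round(valid_percent, 1))  # 33.3
```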
Try and explain in your own words what the height of each bar represents in a histogram.
The height represents the frequency, i.e. the number of values that fall within the interval covered by the bar.
It is also important to note that although the APA Publication Manual, which the authors mention, was originally created by and for psychologists, it has become a de facto standard for practically all fields of social science, including communication studies.
The manual also provides very detailed guidelines on how to cite sources, and it is the APA format that you will have to use later in your thesis work as well. So, it is a good idea to get used to the APA standard and begin to use it whenever you are required to submit papers.
The best-known measure of central tendency is the average. The book introduces a more technical term for the average. What is it called in research circles?
The mean
Write an example for a positive correlation that is likely to be true. (Hint: Think of something along the lines of "The more..., the more...".)
The more you eat the more weight you gain.
Now write an example for a negative correlation that you expect to be true. (Hint: Think of something along the lines of "The more..., the less/fewer...")
The more you study, the fewer subjects you fail.
Think of the study that tested the relationship between handshake quality and the interviewer's assessment of the candidate. What was the null hypothesis when testing this relationship?
The null hypothesis was that there is no relationship between handshake quality and the interviewer's assessment of the candidate.
Let's suppose a researcher conducts an observational study on the Budapest Metro line.
The researcher observes young people travelling on the metro line and records the time (in seconds) that passes between the moment that the person enters the carriage and the moment he/she first looks at his/her smart phone. The observation of five young people yielded the following times: 7, 13, 14, 20, 36.
What can we do to get back our original unit of measurement? Well, we can take the square root of the variance. And that will give us the most widely used, and most useful measure of dispersion, abbreviated as "SD", which is short for... (Write the term.)
The standard deviation is simply the square root of the variance.
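As a cross-check on the hand calculation, Python's standard `statistics` module provides `variance` and `stdev`, both of which divide by the number of values minus one. The sketch below applies them to the five times from the metro example.

```python
from statistics import stdev, variance

times = [7, 13, 14, 20, 36]  # seconds, from the metro example

sample_variance = variance(times)  # sum of squared deviations / (n - 1)
sd = stdev(times)                  # square root of the variance

print(sample_variance)  # 122.5
print(round(sd, 3))     # 11.068
```

Because the variance is expressed in squared seconds, taking the square root brings the result back to seconds.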
I think it's important for you to see the problem for yourself. Download and open the SPSS file called "IQ" from the Moodle site of the course. The file contains the scores that 50 university students achieved on a standardised IQ test. Do the frequency analysis on these data (exactly as above), creating a frequency table, a pie chart, and a bar chart. Do you see the problem? What is it?
This type of data is too difficult to analyse this way: there are a lot of numbers in the frequency table, and the pie chart has too many categories to tell apart. I would say the bar chart is the only one that looks appropriate. From this we can say that, depending on the type of data we have, we should find an appropriate graph to illustrate it.
According to the rules of thumb given by the authors, would you regard this as a large (strong), medium, or small (weak) correlation?
The value is less than 2, so we can see that the correlation is strong.
In this experiment, our null hypothesis is that the lady CANNOT tell the difference between the two types of tea.
Therefore this 5% is the probability that we reject the null hypothesis when in fact it is true. In other words, this 5% is the chance that we will make a Type I error. And it is this 5% level that is denoted as alpha. (And it follows from this that there is a mistake in Figure 2.2. We are more likely to make a Type I error if alpha is inappropriately HIGH, rather than inappropriately LOW.)
The authors make references to "alpha" without actually defining what it is. Let me explain the concept through Ronald Fisher's famous "Lady Tasting Tea" experiment. Fisher (a big name in the history of statistics) had a colleague, Dr. Muriel Bristol, who said it was very important to put the milk in the cup first and then add the tea, because if you did it the other way round, the tea would taste bad. Fisher wondered whether Dr. Bristol could really tell which ingredient was added first by tasting the contents of a cup.
Therefore, he decided to put her to the test. He took 6 cups, and in 3 of them he added the milk after the tea, and in the other three cups he added the tea after the milk. He presented the 6 cups to Dr. Bristol and told her that 3 of them were prepared the way she preferred, but the other three were prepared the "wrong" way. He asked her to taste each cup of tea and tell him which three were prepared the "right" way.
Writing results
These symbols may look slightly frightening at first because they are unfamiliar to you, but once you've got used to them, they actually make the text easier to read.
(I also feel obliged to correct a mistake made by the authors.
They claim that such statistics "fall along a normal distribution". This is very definitely wrong: all the examples listed by the authors - such as t, F, or r - are statistics following their own distributions which are quite different from the normal distribution.)
"Measure"
This is a VERY important column and very directly related to what you read about the types of data last week. The default setting is "Unknown", but you must not leave the "Measure" undefined.
"Values"
This is where you can "teach" SPSS what your numerical codes (1-6) actually mean. In other words, you can associate each numerical value with a text label. This is very useful because, if you set up these values, SPSS will produce tables and graphs that contain the labels you provided (rather than the raw numbers), so the output will be much easier to interpret.
Looking ahead to test statistics and effect sizes
This section may be a bit difficult to read because of its "looking ahead" nature. Quite a few new terms are used: of course, at this point you are not expected to know what terms like "ANOVA", "Pearson's r", or "t-test" mean. Don't worry, it will all fall into place in time.
"Columns"
This specifies the width of the given column in "Data view". The setting only affects how the column is shown and has no effect on the analyses you conduct. You can leave it at its default setting.
Instead of frequencies, quantitative researchers often report percentages. How do you convert a frequency into a percentage? What percentage of our respondents have completed secondary school?
To do this, divide the frequency by the total number of results and multiply by 100. In this case, the frequency is 7 and the total number of results is 20. The percentage would then be 35%. (7 ÷ 20) x 100= 35
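The same conversion can be applied to every category at once. The sketch below rebuilds the frequency table from the 20 answers listed in the exercise, including a running total for the cumulative percentage; the column layout is only an approximation of what SPSS prints.

```python
from collections import Counter

answers = ["SS", "SS", "MA", "N", "PS", "SS", "SS", "BA", "BA", "MA",
           "BA", "MA", "BA", "SS", "SS", "BA", "SS", "PhD", "PS", "PS"]
order = ["N", "PS", "SS", "BA", "MA", "PhD"]  # lowest to highest level

counts = Counter(answers)
total = len(answers)

table = []
cumulative = 0
for category in order:
    frequency = counts[category]
    percent = 100 * frequency / total         # frequency as a percentage
    cumulative += frequency
    cumulative_percent = 100 * cumulative / total
    table.append((category, frequency, percent, cumulative_percent))
    print(f"{category:<4}{frequency:>3}{percent:>7.1f}{cumulative_percent:>7.1f}")
```

For "Secondary school" this gives a frequency of 7, a percentage of 35.0, and a cumulative percentage of 55.0.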
This logic is at the heart of all hypothesis testing. Now suppose that in fact the null hypothesis is true and there is no relationship between handshakes and evaluations. Even in that case it is possible that the authors obtained these results just by accident. It is unlikely, but still possible. If that happened, the authors made an error, which is called the __________. What's the missing term?
Type I error. It occurs when you reject the null hypothesis although the null hypothesis is in fact true.
Now let's take a look at the opposite case. Another finding of this study was that the authors found no evidence that the interviewer's assessment is influenced by the physical attractiveness of the candidate or the clothes that the candidate wears. In fact, however, it may be the case that these factors play a role, but the authors failed to find sufficient evidence for it. In such cases, when the researcher fails to reject a null hypothesis when in fact it is false, we say that the researcher has committed another type of error, which is called the ___________. What's the missing term?
Type II error
"Label": This is the column that lets you overcome the limitations concerning the name of the variable. If you leave this column empty, SPSS will use the variable name "EduLevel" in all its output (i.e. tables and graphs).
Typically, we want to see something more human-readable in the output, so this is where we can provide a nice, descriptive name of what the variable represents. There are no restrictions regarding the range of characters you can use here; you can use the space and all Unicode characters (even Chinese characters, if you wish). So, let's enter "Level of education" in this cell.
As you might suspect, the range is a pretty crude measure. It's based on the minimum and maximum: you just take these values and throw the rest away. For this reason, more sophisticated measures of dispersion have been developed, which take into account every single value in the data set.
Unfortunately, there is an error in the description of variance in the book: one crucial step is missing. Just to make sure you get everything right, let's do the calculations for the above data set together.
Ooops... You get an error message, which says... (What is the text of the message?)
Variable name contains an illegal character
Let's see how you can get SPSS to display these values, in case you need them. Go back to the Descriptives dialogue box, and click "Options". Which two checkboxes are you happy to see?
Variance and Range.
In the second column ("Type"), you will have to specify the type of data this variable will hold. By default it is set to "Numeric", indicating that the variable will be expressed in terms of numbers. Move to this cell in row 1, and click the three dots (ellipsis) to see what other options there are.
Well, there are many choices, but for the purposes of this course, we will deal with only two of them: Numeric and String.
In this view, what do the rows correspond to? Each row will correspond to one ... (Finish the sentence.)
When you view data in SPSS, each row in the Data View represents a case.
Now let us see how you would obtain these results using SPSS. Once you have installed the software and set up the licence, connect to the VPN server of the university and then start the program. Close the welcome window. Now you can see an empty grid that may remind you of Excel. You will, however, soon learn that the two programs operate very differently.
If you look at the bottom left of the window, you can see that in SPSS you can look at your data set in two different views. By default SPSS starts in "Data View": this is what you can currently see, as indicated by the orange background of the tab. The other view is called "Variable View". When you wish to enter a new data set, that is where you typically start, so click this tab to switch to Variable View.
The problematic character in the name you have just entered is in fact the space. In SPSS, variable names cannot contain spaces. Funny characters such as $, #, @ etc. can also cause problems, so it's safest to stick with the 26 letters of the English alphabet (lower or upper case) and the digits. So, go back to row 1 and enter an abbreviation like "EduLevel" as the variable name.
Nevertheless, it is important that you familiarise yourself with a few key terms. What is the term for the more advanced type of statistics that can help us decide whether we have evidence for relationships between variables or differences between groups?
Inferential statistics (for example correlations, t-tests, and ANOVAs).
Note, however, that this assumption concerns the whole population. In our example, we assume that if we had measured the IQ of all (more than four billion) adults on earth, the histogram would show a beautiful, regular bell-shaped curve. Of course, if we examine only a small sample of 50 adults, our histogram will be a little jagged and irregular. What we need to watch out for are cases when the histogram looks really weird (such as when you see two bumps at the extremes and a dip in the middle). As long as the distribution is shaped like a hill and looks roughly symmetric (as in our case), you can safely go on and assume that the population distribution is normal. Although almost all of the procedures you are going to learn in this course are based on the assumption of normality, research indicates that these procedures are pretty robust: even if this assumption does not hold perfectly, the results of the analyses still remain reliable.
One important advantage of using this approach, i.e. defining each category by a number-label pair, is that if you later decide that you do not like the label that you gave to one of the values, you can always come back here and modify it. You will only have to make the change in one place and from then on, all the output generated by SPSS will use the new label.
The next column ("Width") specifies the maximum length of a "String" variable, i.e. how many characters it can consist of. Because our variable is "Numeric", this is irrelevant, so you can leave it at its default setting of 8.
"String"
This is IT jargon for text. In other words, if we pick "String", we will be able to enter our data as it appeared above: "SS, SS, MA, N, PS ...". This may seem like the best choice at first, but it does have an important drawback: SPSS will not know the logical ordering of our categories, which goes from "No schooling" (lowest) to "PhD" (highest). Instead, in all its output (tables, graphs etc.), it will present these categories in alphabetical order by default (i.e. BA, MA, N, PhD, PS, SS), which - in our case - does not make much sense. Therefore, the cool and professional way of working with SPSS is to "recode" these labels and represent the level of education in a numerical form: 1 = No schooling, 2 = Primary school, 3 = Secondary school, 4 = BA, 5 = MA, 6 = PhD. So, in this example we will stick with the "Numeric" type. (Close the "Variable Type" window.)
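To see why numeric codes are preferable, here is a rough sketch in Python (not SPSS syntax). The names `value_labels`, `abbreviations` and `codes` are invented for this illustration; the number-label pairs and the first five responses are the ones quoted in the text.

```python
# Number-label pairs for the level of education, as described in the text
value_labels = {
    1: "No schooling",
    2: "Primary school",
    3: "Secondary school",
    4: "BA",
    5: "MA",
    6: "PhD",
}

# String coding: sorting the abbreviations gives alphabetical order,
# which does not reflect the level of education
abbreviations = ["N", "PS", "SS", "BA", "MA", "PhD"]
alphabetical = sorted(abbreviations, key=str.lower)
print(alphabetical)

# Numeric coding: sorting the codes keeps the logical order,
# and the labels are only looked up when the output is displayed
codes = [3, 3, 5, 1, 2]  # the responses SS, SS, MA, N, PS as numbers
logical = [value_labels[code] for code in sorted(set(codes))]
print(logical)
```

Because the labels live in one place, changing a label later means editing a single entry rather than every data point.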
"Align"
This is another visual setting. Because in tables numbers are traditionally aligned to the right side of the cell, the default setting "Right" is fine.
What do you think could be the advantage of using percentages rather than raw frequencies?
A percentage incorporates the total number of scores into its calculation. A frequency on its own may not tell us all we want to know about the data. A frequency of 50 in one sample of scores may indicate a large percentage (for example, when the sample size is 60), while in another sample a frequency of 50 may indicate a minute percentage (for example, when the sample size is 1,000,000). Thus, before interpreting the magnitude of a frequency, we should be aware of the total sample size.
The variance is a pretty useful measure of dispersion, but it has a major drawback: it is quite difficult to interpret. The reason is that it is based on squared values. You may recall from secondary school physics/science classes that when you square a value, the unit of measurement will also be squared. So, if the original data are expressed in seconds, the variance will be expressed in "square-seconds", which can give us a bit of a headache. (I can understand square-meters, but square-seconds beat me...)
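To make the unit problem concrete, here is a small Python sketch. It assumes the five ages from the earlier exercise (19, 23, 24, 27, 32) and uses the sample formulas, which divide by n - 1; whether that is the formula the book intends is an assumption here. Taking the square root of the variance, which gives the standard deviation, brings the result back to the original unit.

```python
import statistics

ages = [19, 23, 24, 27, 32]  # ages in years, from the earlier exercise

mean_age = statistics.mean(ages)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
# Its unit is "square-years", which is hard to interpret.
variance = statistics.variance(ages)

# Standard deviation: the square root of the variance, back in years.
std_dev = statistics.stdev(ages)

print(f"mean = {mean_age}, variance = {variance}, sd = {std_dev:.2f}")
```

The standard deviation can be read directly against the data: it is expressed in years, just like the ages themselves.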
What is this type of deviation from normality called?
nonnormality
These statistics will help us decide whether we have enough evidence for our research hypothesis. In other words, they just give us a big YES or NO, and no further information. The authors mention another type of statistic that is used to tell us how big the difference is between certain groups or how strongly certain variables are related. Statistics belonging to this latter category are all measures of... (Write the term. Hint: it consists of two words.)
effect size
Note: the graphs created by SPSS are not static images, but editable objects. If you double-click a chart you have created, you see lots of settings - colours, font size etc. - that you can use to fine-tune the chart to your liking. If you have the time, you can play around with these settings: these features are pretty easy to discover on your own.
the histogram and the normal distribution.
A type of chart that is specifically designed to graphically represent the distribution of continuous variables.