Data Quiz 1
Population
Consists of all of the organisms, objects, or events of a specified type that researchers want to describe or make inferences about. Population membership must be clearly defined and unambiguous
Bimodal data
Has 2 modes. It is not advisable to use the mean or median alone to represent a distribution that is bimodal: - Reporting only the mean would not give a complete picture (a reader could assume the distribution has a single central peak, like the normal distribution)
Class interval
- A specified range of observation values - Established when there are too many possible data values and so a frequency distribution does not summarize sufficiently - Divide the range of measurement into a manageable number of units - The resulting frequency distribution lists the number of observations in each class interval - The conventional practice is to adopt a class interval size that will divide the range of observations into about 10-15 intervals (in reality there are more specific guidelines listed in a different cue card) - The class interval boundaries must be specified with sufficient precision that every observation can be placed unambiguously into one interval (expressed with equal or finer measurement precision than the raw data) - Form a grouped frequency distribution
Ordinal measurement
- Data is ranked (order matters!) - Not quantitative-> does not lend itself readily to the calculation of parameters or to arithmetical manipulation - Consists of rank ordering (the numbers assigned reflect ordinal position (rank) only and in no way measure or reflect the amount or magnitude of the variable) - There can be no presumption of equal variable differences between pairs of adjacent ranks Ex place in a race
Nominal measurement
- Data placed in categories - Not quantitative-> does not lend itself readily to the calculation of parameters or to arithmetical manipulation - Classification (individuals or objects are classified as belonging to one or another of a set of categories)- requires exhaustive set of categories (the range of the categories is sufficient to encompass all individuals) in which the categories are mutually exclusive (any individual can belong to only one category) - Numbers obtained are the frequencies of occurrence of the nominal classes or types - Most often classification by qualitative distinctions, continuous variables are sometimes reduced to nominal distinctions when we lack confidence that we can reliably measure them in quantitative terms - Any individual may be simultaneously subjected to nominal classification on more than one variable dimension - Useful when interested in determining whether different types of individuals differ on some behavioural measure Ex Nationality
2 basic approaches when analyzing data
- Descriptive statistics - Inferential statistics
Interval measurement
- Equal units of measurement assigned to attribute (equal intervals between possible scores) - Quantitative-> permits the calculation of parameter values and is suitable for use in many computational procedures - Does not have a true 0 point (0 value is defined arbitrarily and does not reflect an absence of the variable)-> this means that numerical values in this measurement do not bear a ratio relationship to each other (40 degrees is not twice as hot as 20 degrees), AKA not proportional - Can yield negative scores (-10 degrees) - The numerical values do bear a consistent interval relationship to each other Ex temperature in Celsius
Experimental control- reality
- Impossible to control all variables that could affect the DV - Researchers control the variables they can - Other influences that aren't controlled are assumed to be randomized (we assume the effects are "washed out" if they are "spread out" over the groups)
4 types/levels of measurement
- Nominal - Ordinal - Interval - Ratio More information conveyed as one moves from top to bottom of the list Data may be readily transformed from a more complex to a less complex form (this involves operationally defining the variable in terms of the desired measurement) but not in the opposite direction However since info is lost when data are transformed into simpler forms, this is usually not done unless there is some question as to the validity of the original form of measurement
Ratio measurement
- Quantitative-> permit the calculation of parameter values and are suitable for use in many computational procedures - (same as interval but) has a true 0 point (the number 0 reflects an absence of the variable in question)- essential feature of this type of measurement - The numbers bear both a consistent interval relationship and ratio relationship to each other- direct comparisons can be made - Negative scores are impossible Ex weight
Statistics
- The science of the collection, organization, analysis, and interpretation of data - A methodology for extracting information and meaning from numbers, and for making decisions about them. - The intermediary between what is observed and what can be concluded about how natural events operate - The body of rules and procedures for evaluating and making decisions about the outcome of scientific observation - Using data to make a judgement (or decision) about a situation
Statistical methods are the researcher's tools
- To assist in describing data - In making inferences or generalization from experimental data (sample) to larger groups (population) - In studying causal relationships
Experimental control- ideal
- To imply causation - The experimenter eliminates the influence of all variables that could affect the DV except the one(s) directly manipulated - All conditions are kept the same for all participants except the effect of the IV
Quantitative- measurement
A feature of some types of measurement meaning it yields numbers that reflect the amount of the variable AKA Numerical values are assigned in such a way that the size of the number reflects the amount of the variable being measured
2 categories of computer software that perform statistical calculations rapidly and accurately
1. Dedicated statistical applications (ex SAS) 2. Spreadsheets (ex Excel)
How to calculate the median
1. Determine from raw scores. Order matters: the scores must be put in order first. Location of the median (its position in the ordered scores) = (n + 1) / 2. Once you have the location, find the value in that position; that value is the median. If the location falls halfway between 2 scores (when n is an even number), add the 2 scores and divide by 2 to get the value of the median 2. Can also be calculated using the Excel MEDIAN function
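The location rule for the median can be sketched in Python. This is a minimal illustration rather than part of the original card; the function name `median_from_raw` is made up for the example.

```python
def median_from_raw(scores):
    """Return the median using the (n + 1) / 2 location rule."""
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)            # the scores must be ordered first
    n = len(ordered)
    location = (n + 1) / 2              # 1-based position of the median
    if location.is_integer():           # odd n: the median is an actual score
        return ordered[int(location) - 1]
    lower = ordered[int(location) - 1]  # score just below the halfway position
    upper = ordered[int(location)]      # score just above the halfway position
    return (lower + upper) / 2          # even n: average the two middle scores


print(median_from_raw([7, 3, 9, 5, 1]))  # location 3 in 1, 3, 5, 7, 9
print(median_from_raw([7, 3, 9, 5]))     # location 2.5, between 5 and 7
```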
Why learn statistics
2 main reasons: - In order to understand a scientific discipline it is necessary to know the procedures and rules for evaluating scientific evidence - Many groups and individuals attempt to influence your behaviour with statistical arguments. Knowledge of statistics enables you to evaluate arguments in a responsible manner
Methods to calculate the mean
2 methods to calculate: 1. By hand from raw scores- the sum of the scores divided by the number of scores Mean = Σx / n 2. By computer- statistical programs like Excel (using AVERAGE function)
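As a small sketch of the hand calculation Mean = Σx / n, the Python below sums the scores and divides by their count. The function name `mean_from_raw` is hypothetical and chosen only for this example.

```python
def mean_from_raw(scores):
    """Return the arithmetic mean: the sum of the scores divided by n."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


print(mean_from_raw([2, 4, 6, 8]))      # sum is 20, n is 4
print(mean_from_raw([2, 4, 6, 8, 80]))  # one extreme score pulls the mean upward
```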
Bimodal
2 modes. Technically, when there are 2 most frequently occurring scores. But the term is usually reserved for cases in which the modal scores are separated by some less frequently occurring values
Symmetrical
A distribution is termed this when the data frequencies decrease at equal rates above and below a central point Visually: - The distribution is bisected- one half is the mirror image of the other
Skewed
A non-symmetrical distribution Shifted to 1 side due to extreme value(s)- could indicate something is wrong with your data Visually: - A bunching of the observations at one or the other end of the measurement range Can be positively or negatively skewed
Graph
A pictorial representation of a frequency distribution or data table Composed of (x) abscissa and (y) ordinate Common types: - Bar graphs - Frequency polygons - (Frequency) Histograms
Analysis ToolPak
An Excel add-in program that provides data analysis tools for financial, statistical, and engineering data analysis Found on the Data tab (in the Analysis group)
Inferential statistics
Involves generalizing the findings beyond the immediate observations The body of rules and procedures by which general statements/conclusions are made about populations (people or events) based on/derived from the observation of small samples Causal relationships can be determined (this is one of their uses/advantages- can make causal claims based on these) The effects of systematically changing one or more variables are observed under controlled conditions
Qualitative data
Analyzed by nonparametric techniques
Variable
Any observable/measurable property of organisms, objects, or events, such that individuals may differ in the amount, or kind, of this property Is able to be changed and is measurable Something that will change across the participants in your study (not one of your inclusion characteristics) Types: - Quantitative - Qualitative
Datum
Any particular observation
Sample
Any subgroup or subset of a population that the researcher believes represents the population A group of a specific size (n=) is selected and measured Best samples are those which are selected randomly
Percentiles, deciles, and quartiles
Can be useful ways to describe the relative position of a particular score within its distribution
Categorical variables
Can have: - Binary = two categories (ex yes or no question) - More than 2 categories (ex hair colour)
Qualitative variable
Categorical A distinction of kind, not amount Ex male or female
Measure of central tendency
Centre of the data set A single summary number which indicates where many of the scores lie Includes: - Mean- Arithmetic Average - Median- mdn - Mode
Parameter
Characteristic of a population A numerical term that summarizes or describes a population A purely descriptive term Ex average height calculated from the height of every member of the population
Statistic
Characteristic of a sample A numerical term that summarizes or describes a sample Obtained from samples and are used to estimate population parameters Both a descriptive term (describes sample characteristic), and an estimate of the corresponding population characteristic Ex average height calculated from the heights of a sample of members of a population
Descriptive statistics
Consists of the techniques for organizing, describing, summarizing, and extracting information from numerical data Describe characteristics of the raw data No cause-and-effect relationships found from these types of statistics Part of all research- can be an end in itself or just the first step to inferring the characteristics of a larger population Foundation of exploratory studies (research around things we don't know anything about- new area of investigation) Ex - Polls and survey research - Correlational studies Types of descriptive statistics: - Frequency distributions and graphs - Percentiles, deciles, and quartiles - Averages - Variability
Frequency polygon
Constructed so that the frequency of each data value is plotted as a point and the points are then connected by straight lines - This creates an impression of continuity between the data points and thus frequency polygons are an appropriate way to depict interval/ratio variables - More than 1 frequency polygon may be plotted on the same set of axes- increasing the amount of info presented and providing a direct comparison between the 2 (use different lines to differentiate) - Tends to emphasize continuity between score frequencies
Quantitative data
Continuous or discrete; analyzed by procedures known as parametric statistics. But if there are many tied scores, analysis by nonparametric techniques may be preferable
Dependent variable
DV - The variable of primary interest (it is measured) - A variable whose changes we wish to study- a response variable - The variable designed to measure the effect of the variation of the independent variable
Structure of data
Data is composed of: Observations- (individuals or cases, ex student 1) Variables- (observations' attributes, ex weight)
Data vs information vs knowledge
Data is not information unless it is interpreted (not useful until it is interpreted) The information must be further analyzed, discussed, have inferences made to become knowledge
Cooked data
Data that an experimenter has made up or otherwise tampered with Refers to the deliberate falsification of observations
Outliers
Data values that are far away from other data values Strongly affect your results
2 broad purposes we use statistical methods for
Description and inference
Qualitative measurement
Differentiates individuals in terms of their possession of specified qualities
Quartiles
Divide the distribution into fourths. The 1st quartile has 25% of the distribution below it and corresponds to the 25th percentile. Specific ones can be calculated the same way percentiles can: just convert the quartile of interest into the corresponding percentile value
How to find an exact percentile score
LL = lower exact limit of the score interval that contains the desired percentile score; P = percentile in decimal form; N = total number of data values; Σfb = cumulative frequency of scores (number of raw scores) below the interval that contains the desired percentile score; fw = frequency of scores within the interval that contains the desired percentile score; (i) = size of the interval (number of measurement units) that contains the desired percentile score; (P)(N) = number of raw scores that will be below the percentile rank
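The card defines the symbols, but the formula itself is not shown. The sketch below assumes the standard textbook interpolation formula, percentile score = LL + (((P)(N) - Σfb) / fw) × i. Both that formula and the function and parameter names are assumptions, not taken from the card.

```python
def percentile_score(ll, p, n, sum_fb, fw, i):
    """Score at percentile p, assuming linear interpolation within the interval.

    ll      lower exact limit of the interval containing the percentile score
    p       percentile in decimal form (0.50 for the 50th percentile)
    n       total number of data values
    sum_fb  cumulative frequency below that interval
    fw      frequency within that interval
    i       size of the interval
    """
    scores_needed = p * n - sum_fb           # how far into the interval to go
    return ll + (scores_needed / fw) * i


# Hypothetical data: 40 scores, 15 below the interval 69.5-74.5, 10 within it.
print(percentile_score(ll=69.5, p=0.50, n=40, sum_fb=15, fw=10, i=5))  # 72.0
```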
Which measure of central tendency to use?
For nominal data (can't calculate mean or median)- mode For ordinal data- median Interval or ratio data- mean but use median instead of mean if highly skewed distribution (median is not as affected by skew)- if mean and median differ by 1 SD or more then use the median
Percentile rank
Every given score has one. To determine the percentile rank of a score, ask yourself: what percentage of scores are equal to or fall below it?
Data
Group of any recorded observations Will always take numerical form (in this course)
Abscissa
Horizontal axis Marked in units representing the variable of interest
Spread
How much the data varies Kurtosis
Independent variable
IV - A variable we believe affects the measurements obtained on the dependent variable (it is manipulated) - A variable whose effect on the dependent variable we wish to study - The variable that the researcher changes within a defined range, to study the effect on the dependent variable
Determining the mode of data organized into a frequency distribution using class intervals
Identify the class interval that contains the greatest frequency of scores The mode of the distribution is the midpoint of that class interval
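A minimal Python sketch of this rule follows. The `(lower, upper, frequency)` row layout and the function name `grouped_mode` are assumptions made for the example.

```python
def grouped_mode(rows):
    """Return the midpoint of the class interval with the greatest frequency.

    rows is a list of (lower, upper, frequency) tuples, one per class interval.
    If several intervals tie, the first one listed is used.
    """
    lower, upper, _ = max(rows, key=lambda row: row[2])
    return (lower + upper) / 2


# Hypothetical grouped distribution with interval width 10.
rows = [(10, 19, 2), (20, 29, 3), (30, 39, 4), (40, 49, 1)]
print(grouped_mode(rows))  # midpoint of 30-39
```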
How to tell if there is skew
If the mean is in the middle (same as median) there is no skew
n
Indicates the number of observations (scores) that we have gathered
Random sample
Means the sample is unbiased To achieve this: - Every element of the population must be equally likely to be selected to the sample group - Selection of one element must not affect the possibility of other elements being selected
Precision of measurements
No matter how finely you measure or how precise your ruler is, every measurement value actually stands for a range of values whose exact limits are one-half unit higher and lower Ex for 10.00 it represents 9.995-10.005
Steps in constructing a frequency distribution
Note can be done for a data set of whole numbers or of decimals 1. Count the number of scores in your data set (n=?) 2. Identify highest and lowest score in the data set (aka range, to do so organize the data sets in ascending order) 3. Identify the smallest possible unit of measurement (what is the smallest amount a score can increase from 1 participant to another) 4. Decide on appropriate number of class intervals 5. Decide on the score range of each class interval (i, in other words its width/size) 6. Round off result of step 5 up or down if necessary (pick nice number- 1-5, 20) 7. List class intervals of scores in order (usually the largest interval is put at the top, usually first score is a multiple of i) 8. Identify the real upper limits (ul) and lower limits (ll) of class intervals (boundaries, needed when making graphs to fill in the gaps) 9. Determine the frequencies at each class interval
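The last steps (listing the class intervals and counting the frequencies) can be sketched in Python. This is a simplified illustration that assumes whole-number scores with a smallest unit of 1, a width and starting value already chosen by hand, and a hypothetical function name `grouped_frequency`.

```python
from collections import Counter


def grouped_frequency(scores, width, start):
    """Return (lower, upper, frequency) rows, largest interval first.

    Assumes integer scores, a smallest unit of 1, and start <= min(scores).
    """
    if not scores:
        raise ValueError("at least one score is required")
    if min(scores) < start:
        raise ValueError("start must not exceed the lowest score")
    # Index of the interval each score falls into, counting up from start.
    counts = Counter((score - start) // width for score in scores)
    rows = []
    for k in range(max(counts), -1, -1):     # largest interval listed at the top
        lower = start + k * width
        upper = lower + width - 1            # a width of 10 covers e.g. 10-19
        rows.append((lower, upper, counts[k]))
    return rows


scores = [12, 15, 21, 22, 27, 30, 34, 34, 38, 41]
for lower, upper, freq in grouped_frequency(scores, width=10, start=10):
    print(f"{lower}-{upper}: {freq}")
```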
Quantitative variable
Numerical. One in which the number derived from the measurement reflects the amount of the property in question. Numerical data that you can add, subtract, multiply, and divide Ex length Types: - Continuous - Discrete
Positively skewed
Observations are bunched at the lower score values
Raw data
Observations that are recorded and gathered together The observations/measures just as they were obtained and have not had anything done to them This is what we perform our various statistical procedures on
Constant
Often represented by the letter c Can be any number, but its value remains constant for the operation specified Σ(X+c) = (X1+c) + (X2+c) + (X3+c) + ... + (Xn+c)
Leptokurtic distribution
One in which the scores are bunched together with steeply sloping sides. More peaked than the normal distribution
Platykurtic distribution
One in which the scores are more evenly spread out. A relatively greater proportion of the scores fall towards the ends or tails of the distribution (than in leptokurtic distributions). More flat than the normal distribution
Unimodal
One mode When there is clearly one score value that occurred the most frequently
Discrete variable
One that can only assume certain numerical values (ex whole numbers only) Ex number of children in a family If measurements were taken of several such events, tied scores would be expected
Mesokurtic distribution
One that is neither too peaked nor too flat (just right) Refers to the kurtosis value of a theoretical frequency distribution known as the normal distribution
Objective observation
One that is not in any way affected by the opinions, values, or biases of the observer The test for this is whether any other person viewing the same events would report the same things Only these observations are scientific Note subjective opinions of subjects may be observed objectively (by the experimenter)- the crucial point is that the observer is objectively detached from the observations
Continuous variable
One that may assume any value between maximum and minimum limits (within the given range) Ex height Theoretically these variables do not produce tied scores but this is only true when measurement reaches an infinite degree of precision- since we cannot achieve this all real-life measurement is expressed in discrete units (while some variables may be continuous all actual data are discrete)
Subjective observation
One that reflects the observer's personal point of view These observations are NOT scientific
Empirical event
One which may be perceived by our senses (including any extension of them in the form of detection and/or recording apparatus)
Exact limits of the class intervals
One-half unit to either side of the class interval values Sometimes necessary to be specified when calculating summary statistics from frequency distribution data (ex class interval: 7.0-7.4, exact limits: 6.95-7.45) - When class intervals are constructed in terms of exact limits an arbitrary convention must be adopted to deal with the occurrence of a score equal to an exact limit value (usually to place such a score into the next higher interval)
The discovery of knowledge
Process: 1. Asking the right questions (figure out what is missing from the literature and determining whether this gap is testable) 2. Collecting useful data, which includes deciding how much is needed (ensuring you are sufficiently answering your question) 3. Summarizing and analyzing data, with the goal of answering the question(s) 4. Making decisions and generalization based on the observed data 5. Turning the data and subsequent decisions into new knowledge
X
Represents the variable of interest Individual scores are identified by numerical subscripts (X1 refers to score for 1st subject)
Determining the mode of a set of numbers
Scan them to identify the score that occurred with the greatest frequency. Recommended to order the scores in ascending or descending order (this greatly reduces the possibility of error)
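The scan can be sketched in Python with a frequency count. The function name `modes` is hypothetical, and returning an empty list when every score occurs equally often is a choice made for this example.

```python
from collections import Counter


def modes(scores):
    """Return the most frequently occurring score(s) in ascending order.

    Returns an empty list when several distinct scores all occur equally
    often, since no single score then stands out as a mode.
    """
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(scores)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 7]))     # unimodal
print(modes([1, 1, 2, 2, 3]))     # two modes
print(modes([1, 2, 3]))           # no mode
```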
Negatively skewed
Scores are bunched at the high end of the measurement range
Deciles
Scores that divide the distribution into tenths. 1st decile- has 10% of the distribution below it. 2nd decile- has 20% of the distribution below it. Note the 1st decile corresponds to the 10th percentile. Specific ones can be calculated the same way percentiles can: just convert the decile of interest into the corresponding percentile value
Creating a histogram in Excel
Select the Histogram option from the Data Analysis tool (Analysis ToolPak). Input info as shown in image. Can edit the chart to create a figure that is more correct (no spaces between bars for interval or ratio data) and can format the histogram to add a title and labels
Average
Several different statistics may legitimately be called averages- each provides particular and different information and, depending on characteristic of the score distribution, they can be quite different numerical values This term is not used in statistics
Σ
Summation operator Capital Greek letter sigma Means "the sum of" When it occurs before a letter representing a variable (such as ΣX) it means the sum of all n scores of the type represented by X ΣX= X1+X2+X3+...+Xn Can be used in a variety of different ways to signify more complex operations (in these cases be alert for the proper sequence of operation)
Frequency distributions
Systematic method of ordering scores (simplifies it). An arrangement that lists all possible data values or types, and shows the frequency of occurrence of each one - A way to make sense of one's data by simplifying it - Purely descriptive operation - Enables the viewer to see aspects of the data that are not easily detected by merely scanning the raw scores (whether data is bunched or spread evenly) Types: - Ungrouped - Grouped (class intervals) Purposes: - Simplifies calculations for other statistics - Transition step in constructing a frequency histogram
Real upper and lower limits and class intervals
Take into account the space between the class intervals. When graphing continuous data there will be gaps between the class intervals; these gaps must be filled using the real upper and lower limits (widen the class intervals to fill the gap). To calculate when there are class intervals: a = (lower value of the next class interval - upper value of this class interval) / 2. Exact limits = (lower - a) to (upper + a)
Quantitative measurement
The assignment of numerical quantity to the variable
Variable of interest
The behaviour or property under investigation
Bar graph
The frequency or amount of each type of observation is represented by a vertical bar and the bars themselves are separated by some space Can be made more complex to present more info (look at each category at multiple variables) Advantages: - Useful to depict frequencies of nominal variables because of the distinctness of the bars reinforces the distinctness of the nominal classes of observation - Since the heights are proportional to the amounts or frequencies they represent one can make relative comparisons conveniently and quickly But - Make sure to consider the scale on the ordinate when interpreting to get a true perspective of the data being represented
Six-number summary
The lowest value (min). The cut off points for the 1/4 (25th percentile), 1/2 (50th percentile- median), and 3/4 (75th percentile) of the data. The highest value (max). The mean. This simple summary can tell you interesting info and is easier to understand than large quantities of info
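A minimal sketch of the summary using Python's standard `statistics` module follows. Quartile cut points depend on the interpolation method; the module's default ("exclusive") method is assumed here and may differ slightly from the method taught in the course. The function name `six_number_summary` is hypothetical.

```python
import statistics


def six_number_summary(scores):
    """Return min, quartile cut points, max, and mean as a dictionary."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)  # default "exclusive" method
    return {
        "min": min(scores),
        "q1": q1,            # 25th percentile
        "median": q2,        # 50th percentile
        "q3": q3,            # 75th percentile
        "max": max(scores),
        "mean": statistics.mean(scores),
    }


summary = six_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```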
Mean
The mean of a sample of X scores is symbolized as X̄ (X bar). The mean of a population is symbolized by the Greek letter μ (mu). Highly affected by extreme values (skew). The mean follows the extreme values (a high outlier drastically increases the mean)
ΣXΣY
The product of the sums of the X values and the Y values ΣXΣY = (X1 + X2 + X3 + ... + Xn)(Y1 + Y2 + Y3 + ... + Yn)
Ratio relationship
The proportions of the magnitudes of the variable are accurately reflected by the proportions of the numerical scores Ex 30 inches is twice as long as 15 inches The basis for this is the true zero point present in ratio measurement
Kurtosis
The relative peakedness or flatness of the distribution Reflects whether the scores are more or less evenly distributed throughout the measurement range Can be leptokurtic or platykurtic or mesokurtic Can be measured quantitatively If your distribution is not mesokurtic this is an indicator you didn't select the right interval width (i)- change
Mode
The score (or class interval) that occurs most often On graphs always the peak Can be bimodal- 2 modes Can have no mode- all scores have same occurrence
Percentile
The score value equal to or below which a specified percentage of the distribution falls
Median
The score which divides the total number of scores in half- 50th percentile (midpoint) Organization of data matters more, outliers do not matter as much Not very affected by extreme values (skew)
ΣX^2
The sum of the squared X's (sum of scores obtained when each X value is squared) Occurs this way since squaring has a higher priority than summing ΣX^2 = X1^2 + X2^2 + X3^2 + ... + Xn^2 Note: ΣX^2 does NOT equal (ΣX)^2
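A short numeric check in Python makes the difference concrete. The scores are made up for the illustration.

```python
X = [1, 2, 3]

sum_of_squares = sum(x ** 2 for x in X)   # ΣX^2: square each score, then sum
square_of_sum = sum(X) ** 2               # (ΣX)^2: sum the scores, then square

print(sum_of_squares)  # 1 + 4 + 9 = 14
print(square_of_sum)   # 6 squared = 36
```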
Peaks
The tallest bars (or cluster(s) of bars) Represent the most common values/bulk of data Can use the peaks to compare where the bulks lie for different populations
Consistent interval relationship
The unit or interval of measurement refers to a constant amount of the variable throughout the measurement range Ex 1 centimeter is the same amount of length always
What is the purpose of data
To get necessary information and knowledge (only once interpreted)
Goal of statistics
To take data from a sample and make conclusions about the population
Identifying the real upper and lower limits of class intervals
Upper limit- ul Lower limit- ll AKA exact limits Divide smallest possible unit (from step 3) by 2 Add and subtract the answer you get to/from each class interval Note for the upper and lower limits you go 1 decimal place further than the precision of the data that you're actually dealing with
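The half-unit rule can be sketched in Python. `Decimal` is used so that values such as 6.95 print exactly; the function name `exact_limits` and passing values as strings are choices made for this example.

```python
from decimal import Decimal


def exact_limits(lower, upper, unit):
    """Return the exact (real) limits of a class interval.

    lower, upper  stated class interval values, given as strings
    unit          smallest possible unit of measurement, given as a string
    """
    half_unit = Decimal(unit) / 2            # divide the smallest unit by 2
    return Decimal(lower) - half_unit, Decimal(upper) + half_unit


ll, ul = exact_limits("7.0", "7.4", "0.1")
print(ll, ul)   # one decimal place finer than the data: 6.95 7.45
```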
Ungrouped frequency distribution
Use only if you have a small data set Each value of x in the distribution represents one value in the data (you have a category for each possible value)
Deciding the appropriate number of class intervals
Use the following rule (Modified Sturges' Rule) as a guide only! Number of scores: - 1-100 (5-10 class intervals) - 101-1000 (11-20 class intervals) See picture for even more in-depth guidelines. This is not a hard rule but rather a suggestion- you know your data set the best so follow your instinct with these in mind
Grouped frequency distributions
Use/make class intervals. Used with large data sets- group the scores and look at the frequency distribution of the groups. Several values in the data are classified into one interval
Cumulative frequency distribution
Useful for knowing how many observations fall at or below X. One in which the frequency of observations at each data value is added to the frequencies of preceding values (the frequency values accumulate as one reads up the listing of the data values or types) - Enables a quick determination of the percentage of observations above or below some value
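As a minimal sketch, the running totals can be computed in Python with `itertools.accumulate`. The frequencies are assumed to be listed from the lowest data value to the highest, and both function names are hypothetical.

```python
from itertools import accumulate


def cumulative_frequencies(frequencies):
    """Running total of frequencies, listed from lowest value to highest."""
    return list(accumulate(frequencies))


def percent_at_or_below(frequencies):
    """Percentage of all observations at or below each value."""
    total = sum(frequencies)
    return [100 * c / total for c in cumulative_frequencies(frequencies)]


freqs = [2, 3, 4, 1]                      # hypothetical frequencies, low to high
print(cumulative_frequencies(freqs))      # [2, 5, 9, 10]
print(percent_at_or_below(freqs))         # [20.0, 50.0, 90.0, 100.0]
```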
Histogram
Uses vertical bars to depict frequencies of an interval/ratio variable Differs from a bar graph in not having spaces between the bars - Tends to emphasize the differences between the score frequencies Depicts the distribution of your sample data Key characteristics: - Peaks - Spread - Symmetry
Operational definition
Variables have them. Specifies the manner of measurement of the variable. Essential in science because they are a means by which we can achieve precision in our communication and objectivity in our data (by specifying overt means of measurement). The key to transforming subjective phenomena into observable/measurable variables, allowing us to study subjective phenomena. However, the risk is that these definitions may be oversimplifications that do not completely cover all aspects of the variable (ex annual income as an operational definition of success)
Ordinate
Vertical axis Marked in units indicating the frequency or amount
Smallest possible unit of measurement
What is the smallest division (possible) that was used on the measuring scale when the scores were collected? In other words, by how much can your score increase from 1 participant to another (using the smallest change possible)
Determining mode from data presented graphically
When presented graphically: - The mode is the value of the highest bar in a histogram/bar graph - The mode is the highest point in a frequency polygon
How to find what percentile a given score is
X = the given score whose percentile rank is determined; Σfb = cumulative frequency of scores below the interval containing the given score; LL = the lower exact limit of the interval containing the given score; (i) = size of the interval (number of measurement units) containing the given score; fw = frequency of scores within the interval containing the given score; N = total number of scores
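The card lists the symbols without the formula. The sketch below assumes the standard textbook interpolation formula, percentile rank = ((Σfb + ((X - LL) / i) × fw) / N) × 100. Both that formula and the function and parameter names are assumptions, not taken from the card.

```python
def percentile_rank(x, sum_fb, ll, i, fw, n):
    """Percentile rank of score x, assuming linear interpolation in its interval.

    x       the given score
    sum_fb  cumulative frequency below the interval containing x
    ll      lower exact limit of that interval
    i       size of that interval
    fw      frequency within that interval
    n       total number of scores
    """
    within = ((x - ll) / i) * fw             # estimated scores below x inside the interval
    return (sum_fb + within) / n * 100


# Hypothetical data: 40 scores, 15 below the interval 69.5-74.5, 10 within it.
print(percentile_rank(x=72, sum_fb=15, ll=69.5, i=5, fw=10, n=40))  # 50.0
```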
How to decide the score range of each class interval (i)
i= width of class interval Use: i= (largest score-smallest score)/# of class intervals Remember # of class intervals is value from step 4 The distance between rows should be the desired width, not the range within a row (the ranges contain the upper and lower limits so for a width of 9 the interval is 1-9)
Σ(X+Y)
Σ(X+Y) = (X1+Y1) + (X2+Y2) + (X3+Y3) + ... + (Xn+Yn)
ΣXY
ΣXY = X1Y1 + X2Y2 + X3Y3 + ... + XnYn. ΣXY does NOT equal ΣXΣY
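A quick numeric check in Python shows why the two expressions differ. The paired scores are made up for the illustration.

```python
X = [1, 2, 3]
Y = [4, 5, 6]

sum_of_products = sum(x * y for x, y in zip(X, Y))  # ΣXY: multiply pairs, then sum
product_of_sums = sum(X) * sum(Y)                   # ΣXΣY: sum each list, then multiply

print(sum_of_products)  # 4 + 10 + 18 = 32
print(product_of_sums)  # 6 * 15 = 90
```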
ΣcX
ΣcX = cX1 + cX2 + cX3 + ... + cXn
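Because c is the same in every term, it can be factored out, so ΣcX = cΣX. The card does not state this identity; the short Python check below verifies it on made-up values.

```python
X = [1, 2, 3]
c = 4

sum_of_cX = sum(c * x for x in X)   # ΣcX: multiply each score by c, then sum
c_times_sum = c * sum(X)            # cΣX: sum the scores, then multiply by c

print(sum_of_cX, c_times_sum)       # both are 24
```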