Analytics Test 2
What is the difference between cross-sectional data and time-series data?
Cross-sectional data is collected at a single point in time; time-series data is collected over several time periods
The process of extracting useful information from text data is known as _____. a. text mining b. corpus c. tokenization d. stemming
a. text mining
What is a variable?
characteristic/quantity of interest
terms
each document is composed of these
text mining analysis
tokenization: divide text into terms and remove stop words; stemming: convert each word to its root word; text reduction: combine synonyms; then build a frequency term-document matrix and apply descriptive analytics
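A minimal sketch of these preprocessing steps, assuming a hypothetical two-document corpus and a hand-picked stop-word list (neither comes from the course material). Stemming and synonym reduction are left out to keep it short; the result is a frequency term-document matrix.

```python
from collections import Counter

# Hypothetical corpus and stop-word list, for illustration only.
corpus = {
    "doc1": "The service was slow and the food was cold",
    "doc2": "Great food and great service",
}
STOP_WORDS = {"the", "was", "and"}


def tokenize(text):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Count how often each remaining term appears in each document.
term_counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}

# The vocabulary is every distinct term that survived preprocessing.
vocabulary = sorted(set().union(*term_counts.values()))

# Frequency term-document matrix: one row per document, one column per term.
matrix = {
    name: [counts[term] for term in vocabulary]
    for name, counts in term_counts.items()
}
```

Each row of `matrix` lines up with `vocabulary`, so a 2 in the "great" column of doc2 means that term appears twice in that document.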
How do you calculate the bin width?
(largest data value-smallest data value) / (number of bins)
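A quick sketch of the bin-width formula, using the six-value sample that appears elsewhere in this set. The choice of 5 bins is an assumption made only for the example.

```python
def bin_width(values, number_of_bins):
    """Approximate bin width: (largest value - smallest value) / number of bins."""
    return (max(values) - min(values)) / number_of_bins


sample = [10, 20, 12, 17, 16, 12]
width = bin_width(sample, 5)  # (20 - 10) / 5
```

In practice the result is often rounded to a convenient number before the bins are built.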
What is the difference between bar charts and column charts?
Both display the magnitude of quantitative data; bar charts are horizontal, column charts are vertical
True or false: covariance is a better measure of the linear relationship between 2 variables than the correlation coefficient
False. Covariance indicates only the direction of the relationship, not its strength, because its magnitude depends on the scale of the data and is difficult to interpret. The correlation coefficient is the better measure since it shows both direction and strength on a fixed scale from -1 to 1
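A small sketch of why scale matters, assuming two hypothetical variables x and y. Multiplying x by 100 (for example, switching from dollars to cents) multiplies the covariance by 100 but leaves the correlation coefficient unchanged.

```python
def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """Average product of deviations from the means, using n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(x, y) / (
        sample_covariance(x, x) * sample_covariance(y, y)
    ) ** 0.5


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * a for a in x]  # same variable in smaller units

cov = sample_covariance(x, y)
cov_scaled = sample_covariance(x_scaled, y)
r = correlation(x, y)
r_scaled = correlation(x_scaled, y)
```

Here `cov_scaled` is 100 times `cov`, while `r` and `r_scaled` agree, which is the reason the correlation coefficient is easier to interpret.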
Which of the following is not a measure of central tendency? Mean Median Standard Deviation Mode
Standard deviation (it measures variability). The measures of central tendency are the mean, median, mode, and geometric mean
Describe the relationship between the mean, median, and mode of a symmetrical data set
They will all be the same, at the center
Which of the following is not considered a chart? Scatter Bubble Pie Treemap
Treemap: considered an advanced data visualization. Types of charts: scatter, line, bar (horizontal), column (vertical), pie, bubble, heat maps, sparklines, clustered or stacked column, scatter-chart matrix
A Wall Street Journal subscriber survey asked 46 questions about subscriber characteristics and interests. State whether each of the following questions provides categorical or quantitative data. (a) What is your age? (b) Are you male or female? (c) When did you first start reading the WSJ? High school, college, early career, midcareer, late career, or retirement? (d) How long have you been in your present job or position? (e) What type of vehicle are you considering for your next purchase? Nine response categories include sedan, sports car, SUV, minivan, and so on.
(a) quantitative (b) categorical (c) categorical (d) quantitative (e) categorical
Consider a sample with data values 10, 20, 12, 17, 16, and 12. How would you expect the mean and median for these sample data to compare to the mean and median for part a (higher, lower, or the same)? (i) Both the mean and median will decrease. (ii) The mean will increase while the median will decrease. (iii) The mean will increase while the median will increase. (iv) Both the mean and median will increase. (v) Both the mean and median will stay the same.
(i)
A simple random sample of 5 months of sales data provided the following information: Units Sold: 94, 95, 85, 94, 92. Develop a point estimate of the population mean number of units sold per month. x̄ =
92
What type of data is taken at one point in time?
Cross-sectional data; time-series data is taken over time
What are the facts and figures you need to collect, analyze, manipulate, present, and summarize?
Data
Variance
Measures how spread out the data values are around the mean (based on squared deviations from the mean); population variance: =var.p(...); sample variance: =var.s(...)
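A short sketch of the two variance formulas, reusing the units-sold sample from this set. The function name and the `sample` flag are illustrative choices, not taken from the course; the only difference between the two versions is the divisor (n - 1 for a sample, n for a population).

```python
def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 or n."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


units_sold = [94, 95, 85, 94, 92]
sample_variance = variance(units_sold, sample=True)       # like =var.s(...)
population_variance = variance(units_sold, sample=False)  # like =var.p(...)
```

The sample version is slightly larger because it divides the same sum of squared deviations by a smaller number.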
variation
Difference in a variable measured over observations (time, customer, etc)
What are the two types of quantitative data?
Discrete (only particular values, such as counts); continuous (any value within an interval)
Scatter-chart matrix
Displays many scatter charts together; shows the relationships between multiple variables
What is Jaccard's Coefficient and how does one calculate it?
Does not count matching zero entries and is computed by dividing (number of variables with matching nonzero value for observations u and v) by ((total number of variables)-(number of variables with matching zero values for observations u and v))
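A minimal sketch of this calculation, assuming two hypothetical observations u and v described by five binary (0/1) variables.

```python
def jaccard_coefficient(u, v):
    """Matching nonzero entries / (total variables - matching zero entries)."""
    matching_nonzero = sum(1 for a, b in zip(u, v) if a == b and a != 0)
    matching_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return matching_nonzero / (len(u) - matching_zero)


# Hypothetical observations, for illustration only.
u = [1, 0, 1, 0, 1]
v = [1, 0, 0, 0, 1]
jaccard = jaccard_coefficient(u, v)  # 2 / (5 - 2)
```

The sketch does not guard against two all-zero observations, where the denominator would be zero.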
What are the two dissimilarity measures?
Euclidean distance (affected by the scale of the numbers; smaller = more similar); Manhattan distance (not as influenced by outliers)
What is the mode?
The value that occurs most frequently in a data set; one mode: =mode.sngl(...); multiple modes: =mode.mult(...)
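For comparison with the Excel functions, a sketch using Python's standard `statistics` module (requires Python 3.8 or later for `multimode`). The second list is a hypothetical data set with two modes.

```python
import statistics

sample = [10, 20, 12, 17, 16, 12]
single_mode = statistics.mode(sample)  # 12 occurs twice, every other value once

# Hypothetical data with a tie: both 1 and 2 occur twice.
all_modes = statistics.multimode([1, 1, 2, 2, 3])
```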
Describe the relationship between the mean, median, and mode of a left skewed data set
Mean will be less than the median, and the median less than the mode; the mode will be greatest
Describe the relationship between the mean, median, and mode of a right skewed data set
Mode will be lowest, median in middle, and mean will be the highest (greater than median)
What are the two types of qualitative data?
Nominal: named categories; ordinal: categories with an implied order
What is the difference between numeric and categorical data?
Numeric data consists of numbers, so numeric and arithmetic operations can be performed on it; categorical data does not, and can't be used for calculations unless it is converted to numerical form
Discuss the differences between a population and a sample.
Population: every observation and every individual of interest; collecting all of them is not always possible, so a sample is used to represent the population. Sample: a subset of the population; sample size is important
What are the two main sources of data? Examples
Primary data: collected yourself (surveys, observations, experiments, etc.); secondary data: collected by someone else (what we use), e.g. business insight, statistics, customer satisfaction
Compare quantitative and qualitative data
Quantitative data: numeric; numeric and arithmetic operations can be performed on it. Qualitative data: numeric and arithmetic operations can't be performed on it; it can be summarized by counting the number of observations or computing the proportion of observations in each category
What does a correlation of 0.97 mean?
Very strong positive correlation
Deleting the grid lines in a table and the horizontal lines in a chart _____. a. increases the data-ink ratio b. decreases the data-ink ratio c. increases the non-data-ink ratio d. does not affect the data-ink ratio
a. increases the data-ink ratio
Complete linkage can be used to measure the distance between clusters that are the _____ in cluster analysis. a. most different b. farthest apart c. closest d. most similar
a. most different
A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size _____. a. n has the same probability of being selected b. N and n have the same probability of being selected c. n has a probability of 0.5 of being selected d. n has a probability of 0.05 of being selected
a. n has the same probability of being selected
In the text mining process, the text is first preprocessed by deriving a smaller set of _____ from the larger set of words contained in a collection of documents. a. tokens b. terms c. stack d. stems
a. tokens
The random numbers generated using Excel's RAND function follow a _____ probability distribution between 0 and 1. a. uniform b. normal c. random d. binomial
a. uniform
The goal of _____ is to use the variable values to identify relationships between observations. a. unsupervised learning b. Ward's method c. data mining d. McQuitty's method
a. unsupervised learning
How are z-scores interpreted?
as the number of standard deviations a particular value lies from the mean
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a _____. a. scatter chart b. dendrogram c. cumulative lift tree d. decile-wise lift chart
b. dendrogram
Standard deviation
square root of the variance; preferred as a measure of variability because it has the same scale as the original data
Average linkage is a measure of calculating dissimilarity between two clusters by _____. a. finding the distance between the two closest observations in the two clusters b. computing the distance between the cluster centroids c. computing the average distance between every pair of observations between two clusters d. finding the distance between the two most dissimilar observations in the two clusters
c. computing the average distance between every pair of observations between two clusters
The data collected from the customers in restaurants about the quality of food is an example of a(n) _____. a. cross-sectional study b. variable study c. observational study d. experimental study
c. observational study
The value of the _____ is used to estimate the value of the population parameter. a. population estimate b. sample parameter c. sample statistic d. population statistic
c. sample statistic
What is the geometric mean?
used to measure average growth rates; the nth root of the product of n growth factors; =geomean(...)
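A small sketch of the geometric mean applied to growth, assuming two hypothetical growth factors: a 25% gain (1.25) followed by a 20% loss (0.80). The product of the factors is 1, so the value ends where it started, and the geometric mean reflects that while the arithmetic mean does not.

```python
import math


def geometric_mean(growth_factors):
    """nth root of the product of n growth factors."""
    return math.prod(growth_factors) ** (1 / len(growth_factors))


# Hypothetical growth factors, for illustration only.
growth_factors = [1.25, 0.80]
average_factor = geometric_mean(growth_factors)
arithmetic_mean = sum(growth_factors) / len(growth_factors)
```

`average_factor` comes out at 1.0 (no net growth), whereas `arithmetic_mean` is 1.025 and would wrongly suggest 2.5% growth per period.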
corpus
collection of documents to be analyzed
centroid linkage
distance between the centroids (averages) of the two clusters
document
contiguous piece of text
What are the two types of variables?
controllable: decision variables, whose values you know because you choose them; uncontrollable: random variables, whose values you don't know in advance
What are the measures of association?
covariance and correlation coefficient
Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance. a. 66.21 b. 72.28 c. 88.57 d. 75.39
d. 75.39
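The answer can be checked with a short sketch that uses the coordinates u = (25, 350) and v = (53, 420) from the question. The Manhattan distance is included only for comparison; the function names are illustrative.

```python
import math


def euclidean_distance(u, v):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


u = (25, 350)  # age, dollars spent
v = (53, 420)

euclidean = euclidean_distance(u, v)  # sqrt(28**2 + 70**2) = sqrt(5684)
manhattan = manhattan_distance(u, v)  # 28 + 70
```

The dollar difference (70) dominates the age difference (28), which illustrates why Euclidean distance is affected by the scale of the variables.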
data dashboards
data-visualization tool that illustrates multiple metrics and automatically updates them as new data becomes available; should provide a timely summary of KPIs and inform but not overwhelm; summarizes the firm's operations in visual form; calls attention to unusual measures
covariance
descriptive measure of the linear association between two variables; shows only the direction of the relationship, not its strength; magnitude is very difficult to interpret because it is affected by scale
Range
difference between max and min; highly influenced by extreme values; =max(...)-min(...)
Single linkage
distance between the most similar pair of observations, one from each cluster; tends to produce elongated clusters
complete linkage
distance between the pair of observations that are most different, one from each cluster; can be distorted by outliers
What is an observation?
the entity you are observing; a set of values corresponding to a set of variables
What is data?
facts and figures that you collect, analyze, manipulate, and present in order to make sense of them; a collection of observations
What is a histogram?
graphical presentation of a frequency, relative frequency, or percent frequency distribution of quantitative data; constructed by placing bin intervals on the horizontal axis and frequency, relative frequency, or percent frequency on the vertical axis; like a column chart without the gaps
Parallel-coordinates plots
includes a different vertical axis for each variable; each observation is represented by a line connecting its values on each vertical axis; color can be used to identify common traits across multiple dimensions
k-means clustering
iteratively assigns each observation to one of a predetermined number (k) of clusters
hierarchical vs. k-means
k-means: numeric data, large sample sizes, uses averages; hierarchical: small sample sizes, categorical data, highly sensitive to outliers, wide range, may change dramatically if an observation is eliminated or added
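A bare-bones sketch of the iterative assignment idea, assuming six hypothetical 2-D points and two hand-picked starting centroids. It is a simplified illustration with a fixed number of iterations, not the routine any particular software package uses. Requires Python 3.8 or later for `math.dist`.

```python
import math


def k_means(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing cluster averages."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the average of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical observations and starting centroids, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (9, 8)])
```

With these points the assignments settle after the first pass: the three points near the origin form one cluster and the three points near (8, 8) form the other.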
What are the 2 similarity measures?
matching coefficient; Jaccard's coefficient
Compute the mean and median for the sample data 10, 20, 12, 17, 16, and 12. If required, round your answers to one decimal place.
mean=14.5 median=14
Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the mean and median.
mean=15 median=16
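Both samples can be checked with Python's standard `statistics` module; the variable names are illustrative. Adding the value 12 to the five-value sample lowers both the mean and the median.

```python
import statistics

five_values = [10, 20, 12, 17, 16]
six_values = five_values + [12]

mean_five = statistics.mean(five_values)      # 75 / 5
median_five = statistics.median(five_values)  # middle of 10, 12, 16, 17, 20

mean_six = statistics.mean(six_values)        # 87 / 6
median_six = statistics.median(six_values)    # average of 12 and 16
```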
Z-score
measures the relative location of a value in a data set: how far the value is from the mean relative to the data set's standard deviation; interpreted as a number of standard deviations; =standardize(...)
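A sketch of the z-score calculation, reusing the units-sold sample from this set. The `standardize` function mirrors the argument order of the Excel function (value, mean, standard deviation); using the sample standard deviation here is an assumption.

```python
import statistics


def standardize(x, mean, standard_deviation):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / standard_deviation


units_sold = [94, 95, 85, 94, 92]
mean = statistics.mean(units_sold)  # 92
sd = statistics.stdev(units_sold)   # sample standard deviation
z = standardize(85, mean, sd)       # negative: 85 is below the mean

# A common rule of thumb flags values more than 3 standard deviations away.
is_outlier = abs(z) > 3
```

The month with 85 units sits roughly 1.7 standard deviations below the mean, so it is low but not an outlier under the plus-or-minus 3 rule.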
correlation coefficient
measures the linear relationship between two variables; not affected by units of measurement; can only take values between -1 and 1; shows direction and strength; =correl(...); Spearman's rho: relationship between ranked (ordinal) data
What is the median?
the middle value when the data are sorted; look at it together with the mean; =median(...)
group average linkage
most commonly used; the average distance computed over all pairs of observations between the 2 clusters
What is the mean?
one of the most common measures; =average(...); don't use it alone when making decisions, since it is influenced by extreme outliers; population mean: mu; sample mean: x bar
Forms of advanced data visualization
parallel-coordinates plots; treemaps; geographic information systems (GIS)
What are the measures of distribution/relative position?
percentiles; quartiles; z-scores; outliers
What type of chart is used as a last resort?
pie chart
How many standard deviations away from the mean does a value need to be to be considered an outlier?
more than 3 standard deviations above or below the mean (z-score greater than +3 or less than -3)
Text mining
process of extracting useful information from text data; requires more preprocessing than numerical data
Data-ink ratio
proportion of "data ink" to the total amount of ink used in a table or chart; data ink: ink that is necessary to convey the meaning of the data to the audience; remove unnecessary lines to increase white space; a high ratio means most ink is spent on data, but the highest ratio may not be the best
What are the measures of variability?
range; variance; standard deviation; coefficient of variation
What is a frequency distribution?
tabular summary of data showing the number of observations in each bin; relative frequency distribution: tabular summary of data showing the relative frequency for each bin; percent frequency distribution: percent frequency of data for each bin; these help provide an estimate of the relative likelihood of different values for a random variable
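A minimal sketch of the three distributions, using the six-value sample from this set and three hypothetical bins (the bin limits are an assumption for the example).

```python
values = [10, 20, 12, 17, 16, 12]
bins = [(10, 14), (15, 19), (20, 24)]  # hypothetical inclusive bin limits

# Frequency: number of observations that fall in each bin.
frequency = {b: sum(1 for v in values if b[0] <= v <= b[1]) for b in bins}

# Relative frequency: share of all observations in each bin.
relative_frequency = {b: count / len(values) for b, count in frequency.items()}

# Percent frequency: relative frequency expressed as a percentage.
percent_frequency = {b: 100 * share for b, share in relative_frequency.items()}
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check that every observation landed in exactly one bin.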
hierarchical clustering
sequentially merges similar clusters to create nested clusters
What is the matching coefficient and how do you calculate it?
simplest overlap measure and is computed by dividing (number of variables with matching value for observations u and v) by (total number of variables)
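A minimal sketch of the matching coefficient, assuming two hypothetical observations u and v described by five binary (0/1) variables.

```python
def matching_coefficient(u, v):
    """Number of variables with matching values / total number of variables."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


# Hypothetical observations, for illustration only.
u = [1, 0, 1, 0, 1]
v = [1, 0, 0, 0, 1]
matching = matching_coefficient(u, v)  # 4 matching variables out of 5
```

Unlike Jaccard's coefficient, this measure counts the two positions where both observations are 0 as matches.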
What are the 4 popular agglomeration methods?
single linkage; complete linkage; group average linkage; centroid linkage
What are the measures of central tendency?
summarize the data by locating its middle: mean, median, mode, geometric mean
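A short sketch comparing the four methods on two small hypothetical clusters of 2-D points, using Euclidean distance between observations. Requires Python 3.8 or later for `math.dist`.

```python
import math
from itertools import product

# Hypothetical clusters, for illustration only.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (3, 4)]

# Distance between every pair with one observation from each cluster.
pair_distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def centroid(cluster):
    """Average of the observations in a cluster, coordinate by coordinate."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


single = min(pair_distances)        # most similar pair
complete = max(pair_distances)      # most different pair
group_average = sum(pair_distances) / len(pair_distances)
centroid_distance = math.dist(centroid(cluster_a), centroid(cluster_b))
```

For these points single linkage gives 3, complete linkage gives 5, and the two averaging methods fall in between.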
Pivot tables
tabular summary for 2 variables at the same time; the variables can be numeric, qualitative, or both; crosstabulation: type of table for describing data of 2 variables; pivot charts: paired with pivot tables, shown as clustered column charts, can be filtered
What is descriptive data mining?
unsupervised learning: descriptive data-mining method with no variable to predict; high-dimensional analysis: multiple variables and multiple observations; includes association, clustering, and summarization
GIS (geographic information system)
use of data by geographical area or some other form of spatial referencing
cluster analysis
used in market segmentation and talent recruitment; groups things based on similarities to then analyze them; 2 types: (1) hierarchical clustering, (2) k-means clustering; requires similarity measures and dissimilarity measures
What is the empirical rule?
used to determine the percentage of data values that are within a specified number of standard deviations of the mean; applies to bell-shaped (normal) distributions: about 68% within 1, 95% within 2, and 99.7% within 3 standard deviations
What is a dendrogram?
tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering; uses dissimilarity distance; a higher level of agglomeration means the clusters are less similar
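The percentages behind the rule can be checked against the standard normal distribution with Python's `statistics.NormalDist` (Python 3.8 or later). The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: round(share_within(k), 3) for k in (1, 2, 3)}
```

The rounded shares are 0.683, 0.954, and 0.997, which is where the familiar 68%, 95%, and 99.7% figures come from.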
Coefficient of variation
used to compare data sets with different scales or variability; expresses variation from the mean in percent form; = standard deviation / mean * 100; shows how large the standard deviation is relative to the mean
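A sketch of the formula, reusing the units-sold sample from this set and assuming the sample standard deviation is the one intended.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


units_sold = [94, 95, 85, 94, 92]
cv = coefficient_of_variation(units_sold)  # roughly 4.4 (percent)
```

Because the result is a percentage, it can be compared across data sets measured in different units.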