Analytics Test 2

What is the difference between cross-sectional data and time-series data?

Cross-sectional data is taken at one point in time. Time-series data is taken over time.

The process of extracting useful information from text data is known as _____. a. text mining b. corpus c. tokenization d. stemming

a. text mining

What is a variable?

characteristic/quantity of interest

terms

each document comprised of these

text mining analysis

Tokenization: divide text into terms and remove stop words. Stemming: convert each word to its root word. Text reduction: combine synonyms. The result is a frequency term-document matrix, which is then analyzed with descriptive analytics.
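
The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration: the two documents and the stop-word list are made up, and stemming and synonym merging are left out to keep it short.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "was", "and"}

def tokenize(text):
    """Split text into lowercase terms and drop stop words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

# Hypothetical corpus of two documents.
corpus = [
    "The food was great.",
    "The service was slow and the food was cold.",
]

# Count how often each term appears in each document.
term_counts = [Counter(tokenize(doc)) for doc in corpus]

# Columns of the matrix: every distinct term in the corpus, sorted.
terms = sorted(set().union(*term_counts))

# Frequency term-document matrix: one row per document, one column per term.
matrix = [[counts[t] for t in terms] for counts in term_counts]

print(terms)
for row in matrix:
    print(row)
```

Each row describes one document and each column one term, which is the form that later clustering or descriptive steps would work on.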

How do you calculate the bin width?

(largest data value-smallest data value) / (number of bins)
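
As a quick check of the formula, here is a minimal Python sketch. The choice of 5 bins is an assumption made for illustration; the data values are the sample used elsewhere in this set.

```python
def bin_width(values, num_bins):
    """Approximate bin width: (largest value - smallest value) / number of bins."""
    return (max(values) - min(values)) / num_bins

sample = [10, 20, 12, 17, 16, 12]
width = bin_width(sample, 5)  # 5 bins is an arbitrary choice for this example
print(width)  # (20 - 10) / 5 = 2.0
```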

What is the difference between bar charts and column charts?

Both display the magnitude of quantitative data. Bar charts are horizontal; column charts are vertical.

True or false: covariance is a better measure of a linear relationship between 2 variables

False. Covariance only indicates the direction of the relationship, not its strength: it is affected by the scale of the variables, so its magnitude is difficult to interpret. The correlation coefficient is the better measure since it shows both direction and strength and always falls between -1 and 1.
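
The scale problem is easy to see numerically. The sketch below uses made-up data and hand-written helper functions (not any particular library's API): multiplying one variable by 100 multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_scaled = [value * 100 for value in y]  # same data in different units

cov, cov_scaled = sample_covariance(x, y), sample_covariance(x, y_scaled)
r, r_scaled = correlation(x, y), correlation(x, y_scaled)

print(cov, cov_scaled)                    # covariance grows with the units
print(round(r, 4), round(r_scaled, 4))    # correlation stays the same
```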

Which of the following is not a measure of central tendencies? Mean Median Standard Deviation Mode

Standard deviation. Measures of central tendency are the mean, median, mode, and geometric mean.

Describe the relationship between the mean, median, and mode of a symmetrical data set

They will all be the same, at the center

Which of the following is not considered a chart? Scatter Bubble Pie Treemap

Treemap: considered an advanced data visualization. Types of charts: scatter, line, bar (horizontal), column (vertical), pie, bubble, heat maps, sparklines, clustered or stacked column, scatter-chart matrix.

A Wall Street Journal subscriber survey asked 46 questions about subscriber characteristics and interests. State whether each of the following questions provides categorical or quantitative data. (a)What is your age? (b) Are you male or female? (c) When did you first start reading the WSJ? High school, college, early career, midcareer, late career, or retirement? (d)How long have you been in your present job or position? (e)What type of vehicle are you considering for your next purchase? Nine response categories include sedan, sports car, SUV, minivan, and so on.

(a) quantitative (b) categorical (c) categorical (d) quantitative (e) categorical

Consider a sample with data values 10, 20, 12, 17, 16, and 12. How would you expect the mean and median for these sample data to compare to the mean and median for part a (higher, lower, or the same)? (i)Both the mean and median will decrease. (ii)The mean will increase while the median will decrease. (iii)The mean will increase while the median will increase. (iv)Both the mean and median will increase. (v)Both the mean and median will stay the same.

(i)

A simple random sample of 5 months of sales data provided the following information: Units sold: 94, 95, 85, 94, 92. Develop a point estimate of the population mean number of units sold per month. x̄ =

92

What type of data is taken at one point in time?

Cross-sectional data. Time-series data is taken over time.

What are the facts and figures you need to collect, analyze, manipulate, present, and summarize?

Data

Variance

Measures how spread out the data values are around the mean. Population variance: =VAR.P(...). Sample variance: =VAR.S(...).

variation

Difference in a variable measured over observations (time, customer, etc)

What are the two types of quantitative data?

Discrete (takes only particular values, such as counts) and continuous (can take any value within a range).

Scatter-chart matrix

Displays many scatter charts together to show the relationships between multiple variables.

What is Jaccard's Coefficient and how does one calculate it?

A similarity measure that does not count matching zero entries. It is computed by dividing (number of variables with matching nonzero values for observations u and v) by ((total number of variables) - (number of variables with matching zero values for observations u and v)).
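
A minimal Python sketch of this calculation, assuming the observations are lists of 0/1 values. The two observations are invented for illustration.

```python
def jaccard_coefficient(u, v):
    """Matching nonzero entries / (total variables - matching zero entries)."""
    matching_nonzero = sum(1 for a, b in zip(u, v) if a == b and a != 0)
    matching_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - matching_zero
    # If every entry is a matching zero there is nothing to compare.
    return matching_nonzero / denominator if denominator else 0.0

# Hypothetical observations over five binary variables.
u = [1, 0, 1, 0, 1]
v = [1, 0, 0, 0, 1]

jaccard = jaccard_coefficient(u, v)
print(jaccard)  # 2 matching ones / (5 variables - 2 matching zeros) = 2/3
```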

What are the two dissimilarity measures?

Euclidean distance (affected by the scale of the numbers; smaller distance = more similar) and Manhattan distance (not as influenced by outliers).

What is the mode?

The value that occurs most frequently in a data set. One mode: =MODE.SNGL(...). Multiple modes: =MODE.MULT(...).

Describe the relationship between the mean, median, and mode of a left skewed data set

Mean will be less than median and mode. Mode will be greatest

Describe the relationship between the mean, median, and mode of a right skewed data set

Mode will be lowest, median in middle, and mean will be the highest (greater than median)

What are the two types of qualitative data?

Nominal: named categories. Ordinal: categories with an implied order.

What is the difference between numeric and categorical data?

Numeric data consists of numbers, so numeric and arithmetic operations can be performed on it. Categorical data does not, and it can't be used for calculations unless you convert it to numerical form.

Discuss the differences between a population and a sample.

Population: every observation and every individual of interest. Collecting data on the whole population is not always possible, so a sample is used to represent it. Sample: a subset of the population. Sample size is important.

What are the two main sources of data? Examples

Primary data: collected yourself (surveys, observations, experiments, etc.). Secondary data: collected by someone else (what we use); business insight, statistics, customer satisfaction.

Compare quantitative and qualitative data

Quantitative data: numeric; numeric and arithmetic operations can be performed on it. Qualitative data: numeric and arithmetic operations can't be performed on it; it can be summarized by counting the number of observations in each category or computing the proportion of observations in each category.

What does a correlation of 0.97 mean?

Very strong positive correlation

Deleting the grid lines in a table and the horizontal lines in a chart _____. a. increases the data-ink ratio b. decreases the data-ink ratio c. increases the non-data-ink ratio d. does not affect the data-ink ratio

a. increases the data-ink ratio

Complete linkage can be used to measure the distance between clusters that are the _____ in cluster analysis. a. most different b. farthest apart c. closest d. most similar

a. most different

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size _____. a. n has the same probability of being selected b. N and n have the same probability of being selected c. n has a probability of 0.5 of being selected d. n has a probability of 0.05 of being selected

a. n has the same probability of being selected

In the text mining process, the text is first preprocessed by deriving a smaller set of _____ from the larger set of words contained in a collection of documents. a. tokens b. terms c. stack d. stems

a. tokens

The random numbers generated using Excel's RAND function follow a _____ probability distribution between 0 and 1. a. uniform b. normal c. random d. binomial

a. uniform

The goal of _____ is to use the variable values to identify relationships between observations. a. unsupervised learning b. Ward's method c. data mining d. McQuitty's method

a. unsupervised learning

How are z-scores interpreted?

As the number of standard deviations a particular value lies from the mean.

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a _____. a. scatter chart b. dendrogram c. cumulative lift tree d. decile-wise lift chart

b. dendrogram

Standard deviation

The square root of the variance. Usually preferred over the variance as a measure of spread because it is on the same scale as the original data.

Average linkage is a measure of calculating dissimilarity between two clusters by _____. a. finding the distance between the two closest observations in the two clusters b. computing the distance between the cluster centroids c. computing the average distance between every pair of observations between two clusters d. finding the distance between the two most dissimilar observations in the two clusters

c. computing the average distance between every pair of observations between two clusters

The data collected from the customers in restaurants about the quality of food is an example of a(n) _____. a. cross-sectional study b. variable study c. observational study d. experimental study

c. observational study

The value of the _____ is used to estimate the value of the population parameter. a. population estimate b. sample parameter c. sample statistic d. population statistic

c. sample statistic

What is the geometric mean?

A measure of central tendency used to find average growth rates. =GEOMEAN(...)
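
A small Python sketch of the idea, using invented growth factors. The geometric mean is applied to growth factors (1 + rate), not to the rates themselves: it is the nth root of the product of n values.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical yearly growth factors: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]

mean_factor = geometric_mean(growth_factors)
print(round(mean_factor, 4))             # average growth factor per year
print(round((mean_factor - 1) * 100, 2)) # average growth rate in percent
```

Compounding the average factor over the three years gives back the same total growth as the original three factors, which is why it suits growth rates better than the arithmetic mean.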

corpus

collection of documents to be analyzed

centroid linkage

distance between the centroids (averages) of the two clusters

document

contiguous piece of text

What are the two types of variables?

Controllable: decision variables, whose values are known because you choose them. Uncontrollable: random variables, whose values are not known in advance.

What are the measures of association?

covariance and correlation coefficient

Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance. a. 66.21 b. 72.28 c. 88.57 d. 75.39

d. 75.39
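
The answer can be checked with a short Python sketch using the coordinates u = (25, 350) and v = (53, 420). The Manhattan distance is included only for comparison.

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

u = (25, 350)  # age, dollars spent
v = (53, 420)

distance = euclidean_distance(u, v)
print(round(distance, 2))         # sqrt(28^2 + 70^2) = sqrt(5684), about 75.39
print(manhattan_distance(u, v))   # 28 + 70 = 98
```

The dollar difference (70) dominates the age difference (28), which illustrates why Euclidean distance is sensitive to the scale of each variable.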

data dashboards

Data-visualization tool that illustrates multiple metrics and automatically updates them as new data becomes available. It should provide a timely summary of KPIs, inform but not overwhelm, summarize the firm's operations in visual form, and call attention to unusual measures.

covariance

Descriptive measure of the linear association between two variables. Its magnitude is very difficult to interpret: it shows only the direction of the relationship, not the strength (it is affected by scale).

Range

Difference between the maximum and the minimum. Highly influenced by extreme values. =MAX(...)-MIN(...)

Single linkage

Distance between the most similar pair of observations, one from each cluster. Tends to produce elongated clusters.

complete linkage

Distance between the pair of observations, one from each cluster, that are most different. Can be distorted by outliers.

What is an observation?

The entity you are observing the variables for; a set of values corresponding to a set of variables.

What is data?

Facts and figures you need to collect, analyze, manipulate, and present to make sense of something; a collection of observations.

What is a histogram?

Graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data. Constructed by placing the bin intervals on the horizontal axis and the frequency, relative frequency, or percent frequency on the vertical axis. Like a column chart without the gaps.

Parallel-coordinates plots

Includes a different vertical axis for each variable. Each observation is represented by a line connecting its values on each vertical axis. Color can be used to identify common traits across multiple dimensions.

k-means clustering

Iteratively assigns each observation to one of k clusters, where k is chosen in advance.

hierarchical vs. k-means

k-means: numeric data, large sample sizes, uses averages. Hierarchical: small sample sizes, categorical data, highly sensitive to outliers, wide range of possible clusterings, results may change dramatically if an observation is eliminated or added.

What are the 2 similarity measures?

Matching coefficient and Jaccard's coefficient.

Compute the mean and median for the sample data 10, 20, 12, 17, 16, and 12. If required, round your answers to one decimal place.

mean = 14.5, median = 14

Consider a sample with data values of 10, 20, 12, 17, and 16. Compute the mean and median.

mean = 15, median = 16
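
Both mean/median cards can be verified with Python's standard `statistics` module. This is just a check of the arithmetic, not part of the original exercise.

```python
import statistics

five_values = [10, 20, 12, 17, 16]
six_values = [10, 20, 12, 17, 16, 12]

# Five values: sorted they are 10, 12, 16, 17, 20, so the median is the middle one.
mean_five = statistics.mean(five_values)
median_five = statistics.median(five_values)

# Six values: sorted they are 10, 12, 12, 16, 17, 20, so the median
# is the average of the two middle values, (12 + 16) / 2.
mean_six = statistics.mean(six_values)
median_six = statistics.median(six_values)

print(mean_five, median_five)  # 15 16
print(mean_six, median_six)    # 14.5 14.0
```

Adding the extra 12 pulls both the mean and the median down, which matches answer (i) on the comparison card.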

Z-score

Measures the relative location of a value in a data set: how far a particular value is from the mean relative to the data set's standard deviation. Interpreted as a number of standard deviations. =STANDARDIZE(...)
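
A minimal Python sketch of the calculation, z = (value - mean) / standard deviation. It reuses the units-sold sample from this set and assumes the sample standard deviation is the one wanted.

```python
import statistics

def z_score(value, data):
    """Number of standard deviations that value lies from the mean of data."""
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return (value - mean) / std_dev

units_sold = [94, 95, 85, 94, 92]  # mean is 92

z = z_score(85, units_sold)
print(round(z, 2))  # about -1.72: 85 is roughly 1.7 standard deviations below the mean
```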

correlation coefficient

Measures the linear relationship between two variables. Not affected by units of measurement. Can only take values between -1 and 1. Shows direction and strength. =CORREL(...). Spearman's rho: relationship between ranked (ordinal) data.

What is the median?

The middle value when the data are sorted. Look at it together with the mean. =MEDIAN(...)

group average linkage

Most commonly used. The average distance computed over all pairs of observations between the 2 clusters.

What is the mean?

One of the most common measures. =AVERAGE(...). Don't use it alone when making decisions, since it is influenced by extreme outliers. Population mean: mu. Sample mean: x bar.

Forms of advanced data visualization

Parallel-coordinates plots, treemaps, geographic information systems.

What are the measures of distribution/relative position?

Percentiles, quartiles, z-scores, outliers.

What type of chart is used as a last resort?

pie chart

How many standard deviations away from the mean does a value need to be to be considered an outlier?

More than 3 standard deviations above or below the mean (a z-score greater than +3 or less than -3).

Text mining

Process of extracting useful information from text data. Requires more processing than numerical data.

Data-ink ratio

Proportion of "data ink" to the total amount of ink used in a table or chart. Data ink: ink used in a table or chart that is necessary to convey the meaning of the data to the audience. Remove unnecessary lines to increase white space. A high ratio means most ink is spent on data, but the highest ratio may not be the best.

What are the measures of variability?

Range, variance, standard deviation, coefficient of variation.

What is a frequency distribution?

Relative frequency distribution: tabular summary of data showing the relative frequency for each bin. Percent frequency distribution: percent frequency of data for each bin; helps provide an estimate of the relative likelihood of different values for a random variable.

hierarchical clustering

sequentially merges similar clusters to create nested clusters

What is the matching coefficient and how do you calculate it?

The simplest overlap measure. It is computed by dividing (number of variables with matching values for observations u and v) by (total number of variables).
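
A minimal Python sketch of the matching coefficient, using two invented observations over five binary variables.

```python
def matching_coefficient(u, v):
    """Number of variables with matching values / total number of variables."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

# Hypothetical observations over five binary variables.
u = [1, 0, 1, 0, 1]
v = [1, 0, 0, 0, 1]

matching = matching_coefficient(u, v)
print(matching)  # 4 of the 5 variables match, so 0.8
```

Unlike Jaccard's coefficient, matching zeros count as agreement here, so two observations that share many absent traits look similar.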

What are the 4 popular agglomeration methods?

Single linkage, complete linkage, group average linkage, centroid linkage.
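
The four methods differ only in how they turn pairwise distances into one cluster-to-cluster distance. The sketch below uses two small invented clusters of 2-D points and Euclidean distance; it is an illustration, not a full clustering routine.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one point from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def centroid(cluster):
    """Average of the points in a cluster, coordinate by coordinate."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

# Hypothetical clusters.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]

distances = pairwise_distances(cluster_a, cluster_b)

single = min(distances)                      # most similar (closest) pair
complete = max(distances)                    # most different (farthest) pair
average = sum(distances) / len(distances)    # mean over all pairs
centroid_link = math.dist(centroid(cluster_a), centroid(cluster_b))

print(single, round(complete, 3), round(average, 3), round(centroid_link, 3))
```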

What are the measures of central tendencies?

They summarize the data by describing its middle: mean, median, mode, geometric mean.

Pivot tables

Tabular summary for 2 variables at the same time; the variables can be numeric, qualitative, or both. Crosstabulation: type of table for describing data of 2 variables. Pivot charts: paired with pivot tables; clustered column charts that can be filtered.

What is descriptive data mining?

Unsupervised learning: a descriptive data-mining method with no variable to predict. High-dimensional analysis: multiple variables and multiple observations. Includes association, clustering, and summarization.

GIS (geographic information system)

use of data by geographical area or some other form of spatial referencing

cluster analysis

Used in market segmentation and talent recruitment. Groups things based on similarities so they can then be analyzed. 2 types: 1. hierarchical clustering 2. k-means clustering. Requires similarity measures and dissimilarity measures.

What is the empirical rule?

Used to determine the percentage of data values that fall within a specified number of standard deviations of the mean. Applies to data with a bell-shaped (normal) distribution.
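
The familiar percentages (about 68%, 95%, and 99.7% within 1, 2, and 3 standard deviations) can be reproduced from the standard normal distribution. A short sketch using Python's standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Percentage of values within 1, 2, and 3 standard deviations.
shares = {k: round(share_within(k) * 100, 1) for k in (1, 2, 3)}
print(shares)  # {1: 68.3, 2: 95.4, 3: 99.7}
```

Only about 0.3% of values fall more than 3 standard deviations from the mean, which is why that cutoff is used to flag outliers.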

What is a dendrogram?

Tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering. It is based on dissimilarity (distance): a higher level of agglomeration means the merged clusters are less similar.

Coefficient of variation

Used to compare data sets with different scales or variability. Expresses variation from the mean in percent form: = standard deviation / mean * 100. Shows how large the standard deviation is relative to the mean.
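
A minimal Python sketch of the formula, reusing the units-sold sample from this set and assuming the sample standard deviation is used.

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

units_sold = [94, 95, 85, 94, 92]  # mean 92, sample standard deviation about 4.06

cv = coefficient_of_variation(units_sold)
print(round(cv, 2))  # about 4.42 (percent)
```

Because the result is a percentage of the mean, it can be compared across data sets measured in different units.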

