Data Analytics Quiz #2
What is the default probability cutoff?
0.5 or 50%
volume
A component of 4Vs of Big Data that is characterized by the amount of information a component can hold (byte, kilabyte, etc.)
velocity
A component of 4Vs of Big Data that is characterized by the speed of data processing
veracity
A component of 4Vs of Big Data that is characterized by the trustworthiness of data
variety
A component of 4Vs of Big Data that is characterized by the two subcategories structured and unstructured data
regression
A method of supervised learning characterized by an attempt to predict a continuous attribute
classification
A method of supervised learning characterized by learning a method for predicting the instance class from pre-labeled (classified) instances
clustering
A method of unsupervised learning characterized by finding "natural" grouping of instances given un-labeled data
association rules
A method of unsupervised learning that is a method for discovering interesting relations between variables in large DBs
structured data
A subcategory of variety; data containing a defined data type, format, and structure (e.g. transaction data, spread sheets)
unstructured data
A subcategory of variety; data that has no inherent structure, which may include text documents (tweets, Facebook posts, blog entries), sensor data, images, audio, video log files
univariate
A type of statistical analysis that involves the analysis of a single variable
multivariate
A type of statistical analysis that involves the examines two or more variables; typically involves a dependent variable and multiple independent variables
ordinal (ordered)
A type of variable that is a subcategory of categorical arranges data by low, medium high
nominal (unordered)
A type of variable that is a subcategory of categorical data that would organize data without ranking ex. male, female
continuous (numeric)
A type of variable that is defined by continuous or integer
Multiple Linear Regression
Attempts to model the relationship between one continuous dependent variable and two or more independent variables Build a regression equation by looking at parameter estimates - what is statistically significant Anything less than .05
unsupervised learning
Data has no target attribute; want to explore the data to find some intrinsic structures or similarity in them
supervised learning
Discover patterns in the data that relate data attributes (variables) with a target attribute; patterns are the utilized to predict the values of the target attribute in future data instances
Hierarchical divisive
Looked at from the top down Start with clusters and recursively split
Hierarchical agglomerative
Not normally used with large data sets Looked at from the bottom up Initially each point is a cluster Repeat combining two nearest until you find a good pattern
prescriptive analytics
On the Analytic Value Escalator, answers the question: how can we make it happen? Foresight
descriptive analytics
On the Analytic Value Escalator, answers the question: what happened? Hindsight
predictive analytics
On the Analytic Value Escalator, answers the question: what will happen? Insight
diagnostic analytics
On the Analytic Value Escalator, answers the question: why did it happen? Insight
terabyte
On the comparative scale of bytes, this is equivalent to 6 million books or 1,000 GB
megabyte
On the comparative scale of bytes, this is equivalent to a piece of music or 1,000 KB
petabyte
On the comparative scale of bytes, this is equivalent to a stack of DVDs as tall as a 55- story building or 1,000 TB
gigabyte
On the comparative scale of bytes, this is equivalent to a two-hour film or 1,000 MB
zettabyte
On the comparative scale of bytes, this is equivalent to all the data recorded in 2011 or 1,000 EB
exabyte
On the comparative scale of bytes, this is equivalent to all the information generated up to 2003 or 1,000 PB
kilobyte
On the comparative scale of bytes, this is equivalent to one page of text or 1,000 Bytes
yottabyte
On the comparative scale of bytes, this is equivalent to the storage capacity of the NSA datacenter or 1,000 ZB
byte
On the comparative scale of bytes, this is the basic unit of measurement and is equivalent to 1B
Clustering:
Partitioning data into subclasses Group similar objects Partitioning the data based on similarity Ex. library - genres of books Unsupervised learning Natural grouping of similar objects based on input parameters Homogeneous within and heterogeneous across One cluster is dissimilar from another If you add more parameters, variables, the grouping may change
Odds
Probability of occurring / Probability of not occurring = P/ 1-P
Clusters
There is no objective function here No dependent variable Called subjective segmentation The segmentation develops on its own based on values of the input variables
volume, velocity, veracity, variety
What are the 4Vs of Big Data?
boxplot
What visual graph should be used for comparing subgroups?
scatterplot
What visual graph should be used to display a relationship between two numerical variables?
bar chart
What visual graph should be used to display categorical data?
histogram
What visual graph should be used to display the distribution of the outcome variable (ex. median house value)
Odds ratio is important because
an ODDS ratio for an independent variable in logistic regression represents how the odds change with a (1) unit increase in that variable - holding all other variables constant
decision classification is for what kind of variables
categorical variables
What does R square measure?
measures how much variability in the response there is higher values indicate a better fit
What does logisitic regression seek to do?
model, estimate, predict, classify
For multiple linear regressions, do we want a strong correlation between variables?
no
probability
number of outcomes of interest / all possible outcomes
regression tree is for what kind of variables
numeric or continuous variables
multicollinearity
other variables have too much weight on a variable or not enough
a finished tree can be used to:
predict value of a data set, understand how the data is partitioned, and determine variable importance
simple linear regression
seeks to minimize the overall error by fitting a line that minimizes the squared distance from the line of the data points
Logistic regression
used to describe data and to explain the relationship between one dependent binary variable and or more independent variables. The goal is to find the probability of a particular outcome