Data Analytics Quiz #2

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

What is the default probability cutoff?

0.5 or 50%

volume

A component of 4Vs of Big Data that is characterized by the amount of information a component can hold (byte, kilabyte, etc.)

velocity

A component of 4Vs of Big Data that is characterized by the speed of data processing

veracity

A component of 4Vs of Big Data that is characterized by the trustworthiness of data

variety

A component of 4Vs of Big Data that is characterized by the two subcategories structured and unstructured data

regression

A method of supervised learning characterized by an attempt to predict a continuous attribute

classification

A method of supervised learning characterized by learning a method for predicting the instance class from pre-labeled (classified) instances

clustering

A method of unsupervised learning characterized by finding "natural" grouping of instances given un-labeled data

association rules

A method of unsupervised learning that is a method for discovering interesting relations between variables in large DBs

structured data

A subcategory of variety; data containing a defined data type, format, and structure (e.g. transaction data, spread sheets)

unstructured data

A subcategory of variety; data that has no inherent structure, which may include text documents (tweets, Facebook posts, blog entries), sensor data, images, audio, video log files

univariate

A type of statistical analysis that involves the analysis of a single variable

multivariate

A type of statistical analysis that involves the examines two or more variables; typically involves a dependent variable and multiple independent variables

ordinal (ordered)

A type of variable that is a subcategory of categorical arranges data by low, medium high

nominal (unordered)

A type of variable that is a subcategory of categorical data that would organize data without ranking ex. male, female

continuous (numeric)

A type of variable that is defined by continuous or integer

Multiple Linear Regression

Attempts to model the relationship between one continuous dependent variable and two or more independent variables Build a regression equation by looking at parameter estimates - what is statistically significant Anything less than .05

unsupervised learning

Data has no target attribute; want to explore the data to find some intrinsic structures or similarity in them

supervised learning

Discover patterns in the data that relate data attributes (variables) with a target attribute; patterns are the utilized to predict the values of the target attribute in future data instances

Hierarchical divisive

Looked at from the top down Start with clusters and recursively split

Hierarchical agglomerative

Not normally used with large data sets Looked at from the bottom up Initially each point is a cluster Repeat combining two nearest until you find a good pattern

prescriptive analytics

On the Analytic Value Escalator, answers the question: how can we make it happen? Foresight

descriptive analytics

On the Analytic Value Escalator, answers the question: what happened? Hindsight

predictive analytics

On the Analytic Value Escalator, answers the question: what will happen? Insight

diagnostic analytics

On the Analytic Value Escalator, answers the question: why did it happen? Insight

terabyte

On the comparative scale of bytes, this is equivalent to 6 million books or 1,000 GB

megabyte

On the comparative scale of bytes, this is equivalent to a piece of music or 1,000 KB

petabyte

On the comparative scale of bytes, this is equivalent to a stack of DVDs as tall as a 55- story building or 1,000 TB

gigabyte

On the comparative scale of bytes, this is equivalent to a two-hour film or 1,000 MB

zettabyte

On the comparative scale of bytes, this is equivalent to all the data recorded in 2011 or 1,000 EB

exabyte

On the comparative scale of bytes, this is equivalent to all the information generated up to 2003 or 1,000 PB

kilobyte

On the comparative scale of bytes, this is equivalent to one page of text or 1,000 Bytes

yottabyte

On the comparative scale of bytes, this is equivalent to the storage capacity of the NSA datacenter or 1,000 ZB

byte

On the comparative scale of bytes, this is the basic unit of measurement and is equivalent to 1B

Clustering:

Partitioning data into subclasses Group similar objects Partitioning the data based on similarity Ex. library - genres of books Unsupervised learning Natural grouping of similar objects based on input parameters Homogeneous within and heterogeneous across One cluster is dissimilar from another If you add more parameters, variables, the grouping may change

Odds

Probability of occurring / Probability of not occurring = P/ 1-P

Clusters

There is no objective function here No dependent variable Called subjective segmentation The segmentation develops on its own based on values of the input variables

volume, velocity, veracity, variety

What are the 4Vs of Big Data?

boxplot

What visual graph should be used for comparing subgroups?

scatterplot

What visual graph should be used to display a relationship between two numerical variables?

bar chart

What visual graph should be used to display categorical data?

histogram

What visual graph should be used to display the distribution of the outcome variable (ex. median house value)

Odds ratio is important because

an ODDS ratio for an independent variable in logistic regression represents how the odds change with a (1) unit increase in that variable - holding all other variables constant

decision classification is for what kind of variables

categorical variables

What does R square measure?

measures how much variability in the response there is higher values indicate a better fit

What does logisitic regression seek to do?

model, estimate, predict, classify

For multiple linear regressions, do we want a strong correlation between variables?

no

probability

number of outcomes of interest / all possible outcomes

regression tree is for what kind of variables

numeric or continuous variables

multicollinearity

other variables have too much weight on a variable or not enough

a finished tree can be used to:

predict value of a data set, understand how the data is partitioned, and determine variable importance

simple linear regression

seeks to minimize the overall error by fitting a line that minimizes the squared distance from the line of the data points

Logistic regression

used to describe data and to explain the relationship between one dependent binary variable and or more independent variables. The goal is to find the probability of a particular outcome


Ensembles d'études connexes

Chapter 23: Gene Pools Evolve of Populations

View Set

Ch. 6 Evolution - Biol3620-ECU-Summers

View Set

AP World History China and Persia Test

View Set

GOVT2305 American Government Ch 4

View Set

May 3 2020. Anthony. Probability, Counting Principle, Combinations, and Permutations.

View Set

Ch. 50 Disorders of Musculoskeletal Function: Rheumatic Disorders

View Set

Chapter 7: asepsis and infection control

View Set

Chapter 8: The Supervisor as Leaders

View Set

Pharmacology, Chapter 8 Administration by the Gastrointestinal Route

View Set