Big Data Final

association learning

- take a set of training examples - discover associations among attributes - example: products that people tend to purchase together (market basket analysis) - doesn't single out a single attribute for special treatment

classification

- understanding/predicting a discrete class - supervised learning

confidence equation

confidence = (# examples containing the values in both the precondition and the conclusion) / (# examples containing the values in just the precondition)
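As a sketch, the ratio above can be computed over a toy set of market baskets. The basket data and the function name are made-up illustrations, not from the course:

```python
# Hypothetical sketch: support and confidence of the rule
# "IF milk THEN bread" over made-up market baskets.

def support_and_confidence(baskets, antecedent, consequent):
    """support = # baskets matching precondition AND conclusion;
    confidence = support / # baskets matching just the precondition."""
    matches_ante = [b for b in baskets if antecedent <= b]   # set inclusion
    matches_both = [b for b in matches_ante if consequent <= b]
    support = len(matches_both)
    confidence = support / len(matches_ante) if matches_ante else 0.0
    return support, confidence

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"milk"}, {"bread"}, {"milk", "bread"}]
print(support_and_confidence(baskets, {"milk"}, {"bread"}))  # (3, 0.75)
```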

(T or F) "national champion Virginia" is an example of a unigram

false, it is a trigram; a single word such as "jesus" would be a unigram

(T or F) identifying the factors that determine whether or not a student will return to Furman after a major surgery is an example of hidden knowledge in a dataset

true

(T or F) PRISM is an example of a covering algorithm

true

(T or F) PRISM is an example of an instance based classification learning algorithm

false, PRISM is a covering (rule-generating) algorithm

(T or F) a linear regression equation may be improved by removing particular attributes from the data set

true

(T or F) according to michel and aiden, famous people in the 19th century gained fame at a younger age than people did today, but their fame grew more slowly

false, their findings were the opposite: the age of initial fame has declined over time, and fame now rises faster (but also fades faster)

(T or F) according to the research of aiden and michel, the most frequently used irregular verbs in english will never convert to regular verbs.

true, they found that irregular verbs regularize at a rate inversely proportional to the square root of their usage frequency, so the most frequent ones (such as "be" and "have") are effectively never expected to regularize

(T or F) association learning is also called market basket analysis because datasets for market basket analysis have products as attributes, and the algorithm finds associations among those attributes.

false, market basket analysis is an example (application) of association learning, not another name for it

(T or F) classification learning and numerical estimation are both examples of unsupervised learning techniques

false, both are supervised learning techniques

(T or F) clustering and association learning are both examples of unsupervised learning

true

(T or F) data visualization tools used prior to running a learning algorithm are helpful because they assign bright, meaningful colors to each attribute

false, their value lies in revealing outliers and possible correlations, not in assigning colors

(T or F) discretization can be either supervised or unsupervised

true

(T or F) if one attribute in a dataset is legitimately predictive of the output class, it is perfectly fine to use 1R for real data mining, not just baseline.

true

(T or F) if the zeroR algorithm gives you an accurate classification 95% of the time, your conclusion should be to use the algorithm faithfully.

false, zeroR ignores every input attribute, so 95% accuracy merely reflects the class distribution; it should serve as a baseline for comparison, not as the final model

(T or F) in some cases, the fact that a particular value in a data set is missing could help the learned model perform better, rather than worse

true

(T or F) in the decision tree algorithm, both numeric and nominal attributes may appear in several tree nodes.

false, a numeric attribute may be tested at several nodes (with different thresholds), but a nominal attribute appears at most once along any path

(T or F) kaggle.com is a web site for pro sports analytics

false, it is a web site for data mining projects and competitions

(T or F) linear models may be used for classification as well as numerical estimation.

true

(T or F) normalization is the reason why a difference of 1 in GPA and a difference of 1 in Age are not to be considered equivalent differences in nearest neighbor

true

(T or F) simple histograms provided by WEKA can sometimes suggest correlations between input attributes and the class attribute that might be worth exploring

true

(T or F) successful data mining usually involves trying a number of approaches in a series of experiments.

true

(T or F) support and confidence are two important factors determining the strength of an association rule

true

(T or F) ten fold cross validation gives a more reliable estimation of a model's accuracy than does testing on the training data

true

(T or F) the PRISM algorithm generates an individual rule by adding tests that maximize accuracy while reducing coverage

true

(T or F) the answer to the question "How many furman computer science majors play a varsity sport and have a GPA greater than 3.5?" is an example of hidden knowledge

false, that answer is the result of a database query (shallow knowledge), not hidden knowledge

(T or F) the best approach to replacing missing values in a training dataset is to select a random value for the attribute and use that.

false, a better approach is to replace a missing value with the most commonly used value for that attribute in the dataset

(T or F) the google books dataset and the ngram viewer are the same thing

false, the ngram viewer is a tool for querying the google books dataset

(T or F) the leaves of a model tree give a direct numeric result for the class attribute of a new instance of numeric estimation.

false, the leaves of a model tree contain regression equations; it is a regression tree whose leaves give a direct numeric value

(T or F) the more attributes the better with the nearest neighbor algorithm

false, irrelevant or redundant attributes distort the distance function

(T or F) the more nodes and branches that a decision tree has, the better its predictive ability.

false, an overly large tree overfits the training data and predicts poorly on new instances

(T or F) the naive bayes algorithm easily accommodates missing data in the training set

true, naive bayes simply omits missing values from its probability calculations

(T or F) the naive bayes algorithm easily accommodates numerical input attributes in the training set

false, it does not work well with numeric attributes without extra machinery such as a probability density function (typically assuming a normal distribution)

(T or F) the nearest neighbor algorithm is called lazy because it usually predicts the class attribute incorrectly

false, it is called lazy because it doesn't build a set of rules or a tree; no work is done until a classification is needed

(T or F) the oneR algorithm only considers one attribute in its analysis of a dataset

false, it considers all of the attributes in its analysis, then bases its rules on the single most predictive one

(T or F) the oneR algorithm wouldn't run in weka during lab with Age as the class attribute because of a bug in the program.

false, it wouldn't run because oneR requires a nominal class attribute and Age is numeric, not because of a bug

(T or F) the process of discretization converts numeric values to nominal values

true

(T or F) the purpose of a shadow dataset is to hide data from the original dataset that would violate privacy or intellectual property rights or corporate secrets while still being interesting for data mining purposes

true

(T or F) the simple linear regression algorithm studied in lab selects only the most predictive attribute for the output formula

true

(T or F) there are circumstances for which the oneR algorithm is appropriate to use for real prediction, not just for establishing a baseline

true

(T or F) valid approaches to handle missing values in a training dataset include replacing them, removing the instances that have them, and ignoring them

true

(T or F) with each recursive step of the decision tree algorithm, numeric attributes must be re-discretized.

true

(T or F) data mining is best described as the process of identifying patterns in data

true

consequent

(conclusion): classes, set of classes, or probability distribution assigned by rule

antecedent

(pre-condition): a series of tests (just like the tests at the nodes of a decision tree)

Unsupervised Learning

- "Any learning technique that has as its purpose to group or cluster items, objects, or individuals" - clustering is an example of ___

OneR

- 1 Rule - it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute - 1 because the rules are based on only 1 input attribute
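The 1R procedure above can be sketched in a few lines. This is a minimal illustration under assumed conventions (instances as dicts, nominal attributes only); the function name and toy weather data are made up:

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R sketch: for each attribute, build one rule per value (predict the
    majority class for that value), then keep the attribute whose rule set
    makes the fewest errors on the training data."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)          # value -> class counts
        for inst in instances:
            by_value[inst[attr]][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, error count)

data = [{"outlook": "sunny",    "windy": "false", "play": "no"},
        {"outlook": "sunny",    "windy": "true",  "play": "no"},
        {"outlook": "overcast", "windy": "false", "play": "yes"},
        {"outlook": "rainy",    "windy": "false", "play": "yes"},
        {"outlook": "rainy",    "windy": "true",  "play": "yes"}]
attr, rules, errors = one_r(data, ["outlook", "windy"], "play")
print(attr, errors)  # outlook 0
```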

decision list

- A set of rules that are intended to be interpreted in Sequence

association

- Can be applied if no class is specified and any kind of structure is considered "interesting" - unsupervised learning - Can predict any attribute's value, not just the class, and more than one attribute's value at a time

stratification

- Ensures that each class is represented with approximately equal proportions in both subsets

linear regression

- The classic approach to numeric estimation is this. - It produces a model that is a linear function (i.e., a weighted sum) of the input attributes.

confidence

- The confidence of a rule provides a measure of a rule's accuracy - of how well it predicts the values in the conclusion. - It answers the question: if the precondition of the rule holds, how likely is it that the conclusion also holds?

support

- The number of training examples containing the attribute values found in both the rule's antecedent and its consequent - i.e., the number of examples that the rule gets right - This metric can also be expressed as a percentage of the total number of training examples

simple linear regression

- This algorithm in Weka creates a regression equation that uses only one of the input attributes. - even when there are multiple inputs - can serve as a baseline. - compare the models from more complex algorithms to the model it produces - It also gives insight into which of the inputs has the largest impact on the output.

baseline

- When performing classification learning, 1R can serve as a useful baseline. - The 0R algorithm learns a model that considers none of the input attributes!

euclidean distance

- the standard distance function for nearest neighbor: the square root of the sum of the squared differences between corresponding attribute values

cross-validation

- avoids overlapping test sets - First step: split data into k subsets of equal size - Second step: use each subset in turn for testing, the remainder for training - Often the subsets are stratified
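The two steps above can be sketched as a plain (unstratified) k-fold split; the function name is a made-up illustration:

```python
def k_fold_splits(data, k):
    """Plain k-fold cross-validation split (stratification omitted):
    split data into k roughly equal subsets; each subset serves as the
    test set exactly once, with the remainder used for training."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    # test sets never overlap with their training sets
    assert len(test) == 2 and not set(train) & set(test)
```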

resubstitution

- error rate obtained from training data - Resubstitution error is (hopelessly) optimistic!

ZeroR

- learns a model that considers none of the input attributes - it predicts the majority class in the training data - example: the credit card data set has 9 examples with yes output and 6 examples with no output, so zeroR always predicts yes (and is correct 9/15 of the time)

numeric estimation

- like classification learning - input attributes -> model -> output attribute/class - the model is learned from a set of training examples that include the output attribute - output attribute is numeric - want to be able to estimate the value

converse

- obtained by swapping the antecedent and consequent - not always true

a priori

- probability of an event before evidence is seen

best fit line

- straight line that represents the data on a scatter plot - may pass through some of the points, none of the points, or all of the points

multivariate model

- multiple input variables (X's) and multiple output variables (Y's) - the most general case; every other model is a special case of it - examples: the general linear model (GLM), canonical correlation

sampling bias

-a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. -Ex. Writers of books write more about other writers of books than other famous people

causation

A cause and effect relationship in which one variable controls the changes in another variable.

ngram

A contiguous sequence of n items from text or speech (unigram, bigram / digram, trigram, four-gram, five-gram, and so on), often representing a concept
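Extracting n-grams from a token list is a one-liner; this small sketch (made-up function name) also shows why "national champion Virginia" is a trigram rather than a unigram:

```python
def ngrams(tokens, n):
    """All contiguous length-n sequences of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "national champion Virginia".split()
print(ngrams(words, 1))  # three unigrams
print(ngrams(words, 3))  # one trigram: ('national', 'champion', 'Virginia')
```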

correlation

A measure of the relationship between two variables

model

A pattern, plan, representation, or description designed to show the structure or workings of an object, system, or concept

database query

An inquiry the user poses to a database to extract a meaningful subset of data.

repeated holdout/ holdout method

- can be made more reliable by repeating the process with different subsamples - in each iteration, a certain proportion is randomly selected for training (possibly with stratification) - the error rates from the different iterations are averaged to yield an overall error rate

Supervised Learning

Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

normalization

- different attributes are measured on different scales, so they need to be normalized - nominal attributes: distance is either 0 or 1 - common policy for missing values: assume they are maximally distant (given normalized attributes)
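A minimal min-max normalization sketch (made-up function name), rescaling one attribute's values to [0, 1] so attributes on different scales contribute comparably to a distance function:

```python
def min_max(values):
    """Min-max normalization: rescale numeric values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                 # constant attribute: no spread to rescale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([18, 20, 22]))  # ages -> [0.0, 0.5, 1.0]
```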

goodness

ID3 - uses a different _____ score based on a field of study known as information theory

Naïve Bayes

A fairly simple classification algorithm based on the relative probabilities of the different values of each attribute, given the value of the class attribute. It is "naive" because it assumes the attributes are independent of one another.

shadow dataset

Masked or de-identified extrapolated data

covering

the approach used by the PRISM rule-generating algorithm: each rule covers a subset of the training examples, which are removed before the next rule is learned

nearest neighbor

- simplest way of finding the nearest neighbor: a linear scan of the data - classification takes time proportional to the product of the number of instances in the training and test sets - the search can be done more efficiently using appropriate data structures (e.g., a KD-tree) - often very accurate, but produces no structural pattern; it follows "precedent" (as in law or diagnosis) - assumes all attributes are equally important
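The linear-scan approach can be sketched in a few lines; the function name and toy data are made up, and the features are assumed to be already normalized:

```python
import math

def nearest_neighbor(training, query):
    """Linear-scan 1-nearest-neighbor: return the label of the closest
    training instance. training is a list of (features, label) tuples
    whose numeric features have already been normalized."""
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    closest = min(training, key=lambda pair: euclidean(pair[0], query))
    return closest[1]

training = [((0.0, 0.0), "no"), ((1.0, 1.0), "yes")]
print(nearest_neighbor(training, (0.9, 0.8)))  # yes
```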

class attribute, nominal

The ___ is always ___ in classification learning

overfitting

The process of fitting a model too closely to the training data for the model to be effective on other data.

bucket

To handle numeric attributes, we need to discretize (turn numeric into nominal) the range of possible values into subranges called ____

shallow knowledge

information that can be retrieved from a dataset with a simple query, without deeper analysis; contrast with hidden knowledge, which must be discovered by data mining

outliers

___ are often easy to detect when plotting a dataset in visualization.

denormalization

___ is the process of creating a flat file from multiple database tables for the purpose of data mining.

1R

_____ doesn't work well if many of the input attributes have fewer possible values than the class/output attribute does.

KD-tree

_____ is used to make the nearest neighbor algorithm more efficient

ensemble learning

______ can involve meta learning - where an algorithm learns the strengths of various models in selecting which to use for a new instance

redundant attributes

______ cause accuracy problems with both naive bayes and nearest neighbor, and should be eliminated if possible.

sentiment analysis

________ is a form of data mining often applied to social media and other online content such as blog posts and comments about products

normal distribution

a bell-shaped curve, describing the spread of a characteristic throughout a population

Big Data

a broad term for datasets so large or complex that traditional data processing applications are inadequate.

item set

a collection of attribute values that appears in one or more training examples

instance-based

a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.

parallel computing

a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time.

laplace estimator

add 1 to the count of every attribute value so that no probability estimate is ever exactly 0 (e.g., 0/8) or exactly 1 (e.g., 5/5)
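The smoothed estimate above can be written as a tiny formula; the function name is a made-up illustration:

```python
def laplace_probability(count, total, num_values):
    """Laplace smoothing: pretend each of the attribute's num_values
    possible values was seen once more, so no estimate is exactly 0 or 1."""
    return (count + 1) / (total + num_values)

# a value never seen with this class (0/8) over a 3-valued attribute:
print(laplace_probability(0, 8, 3))  # 1/11 instead of 0
```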

document classification

algorithms to determine the author of an article, and to decide if a tweet is bullying are examples of _______

market basket analysis

analyzes such items as websites and checkout scanner information to detect customers' buying behavior and predict future behavior by identifying affinities among customers' choices of products and services

Ordinal

being of a specified position or order in a numbered series

corpus

collections of text such as manuscripts or social media streams or speeches or other texts

decision tree algorithm

- constructed in a top-down, recursive, divide-and-conquer manner, starting at the root - all attributes are categorical (continuous attributes are discretized in advance) - examples are partitioned recursively based on selected attributes - test attributes are selected on the basis of a statistical measure (information gain)

class attribute

defines the characteristics of a set of objects

lazy method

describes nearest neighbor, because no work (no rules or tree) is done until a classification is needed

regression equation

used to estimate the output attribute for previously unseen instances

regression tree

each leaf gives a numeric value that is the average of the class values of the training examples in that subgroup

missing values

in the nearest neighbor algorithm, maximal distance is assumed in cases of ______. (naive bayes just ignores them)

natural language processing

interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

confusion matrix

is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one

training set

is an initial set of data used to help a program understand how to apply technologies like neural networks to learn and produce sophisticated results

noise

meaningless data

correlation coefficient

measures how well the linear model fits the data, i.e., the strength of the linear relationship between the predicted and actual values

(T or F) naive bayes is an example of a numeric estimation learning algorithm

false, it is a classification learning algorithm

normalization

numerical values in a dataset must be _____ before the nearest neighbor algorithm can work correctly

crowdsourcing

obtaining services, ideas, or content by soliciting contributions from a large group of people, especially the online community

sparse data

occurs when a vast majority of values in a dataset are either missing or irrelevant

normal distribution

one of the things that makes naive bayes "naive" is the assumption that the numerical data follows a _____, when it actually may not.

Clustering

grouping the instances of a dataset into related groups (clusters) based on similarity, without a predefined class attribute; an unsupervised learning technique

a posteriori

probability of event after evidence is seen

Zipf's law

states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table.

micro-risk

the ability to detect _____ is one of the things that makes data mining so powerful

sum of squares

the quantity that is minimized to find the best fit line (the least squares method)
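The least-squares fit can be written out directly; this sketch (made-up function name) computes the slope and intercept that minimize the sum of squared vertical distances from the points to the line:

```python
def best_fit_line(xs, ys):
    """Least squares: slope and intercept minimizing the sum of
    squared errors between the line and the y values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx    # (w, c) in y = w*x + c

print(best_fit_line([1, 2, 3], [2, 4, 6]))  # (2.0, 0.0)
```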

constant

the c in the regression equation

Machine Learning

the extraction of knowledge from data based on algorithms created from training data

true negative

the model correctly predicts no

true positive

the model correctly predicts yes

false negative

the model incorrectly predicts no

false positive

the model incorrectly predicts yes
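The four outcomes above (true/false positives and negatives) can be tallied into a confusion matrix; a minimal two-class sketch with made-up names and data:

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Tally the four prediction outcomes for a two-class problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

actual    = ["yes", "yes", "no", "no"]
predicted = ["yes", "no",  "no", "yes"]
print(confusion_matrix(actual, predicted))  # {'TP': 1, 'TN': 1, 'FP': 1, 'FN': 1}
```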

Discretization

the process of converting numeric attributes to nominal attributes

univariate model

the simplest form of analyzing data; it doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes it, and finds patterns

weights

the w in the regression equation

fame curve

timing of debut, exponential growth, timing of peak, slow decline

model tree

has a separate regression equation at each leaf of the tree, based on the training examples in that subgroup

Discretization

turning numeric attributes into nominal attributes Ex. Binary split, bucket split
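The bucket split mentioned above can be sketched as unsupervised equal-width discretization; the function name is a made-up illustration, and the values are assumed not to be all identical:

```python
def equal_width_buckets(values, k):
    """Unsupervised bucket split: divide the numeric range into k
    equal-width subranges and map each value to its bucket index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_buckets([1, 2, 9, 10], 2))  # [0, 0, 1, 1]
```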

numeric estimation

understanding/predicting a numeric quantity

probability density function

used by naive bayes for numeric attributes, so that their probabilities can be estimated (typically assuming a normal distribution)

rules

used to explain decisions

