Big Data Final
association learning
- take a set of training examples - discover associations among attributes - example: products that people tend to purchase together (market basket analysis) - doesn't single out a single attribute for special treatment
classification
- understanding/predicting a discrete class - supervised learning
confidence equation
(# examples with the values in both the precondition and the conclusion) / (# examples with the values in the precondition)
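The confidence formula above can be sketched in code; the tiny dataset and the attribute names (milk, bread) are invented purely for illustration:

```python
# Sketch: computing the confidence of an association rule.
# Each example is a dict of attribute values; the precondition and
# conclusion of the rule are dicts of required values.

def matches(example, tests):
    return all(example.get(attr) == val for attr, val in tests.items())

def confidence(examples, precondition, conclusion):
    pre = [e for e in examples if matches(e, precondition)]
    both = [e for e in pre if matches(e, conclusion)]
    return len(both) / len(pre) if pre else 0.0

examples = [
    {"milk": "yes", "bread": "yes"},
    {"milk": "yes", "bread": "yes"},
    {"milk": "yes", "bread": "no"},
    {"milk": "no",  "bread": "yes"},
]
# rule milk=yes -> bread=yes holds in 2 of the 3 milk=yes examples
print(confidence(examples, {"milk": "yes"}, {"bread": "yes"}))  # 0.666...
```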
(T or F) "national champion Virginia" is an example of a unigram
false; it is a trigram ("Jesus" would be an example of a unigram)
(T or F) identifying the factors that determine whether or not a student will return to Furman after a major surgery is an example of hidden knowledge in a dataset
true
(T or F) PRISM is an example of a covering algorithm
true
(T or F) PRISM is an example of an instance-based classification learning algorithm
false; it is a covering algorithm
(T or F) a linear regression equation may be improved by removing particular attributes from the data set
true
(T or F) according to Michel and Aiden, famous people in the 19th century gained fame at a younger age than people do today, but their fame grew more slowly
false; they found that people today become famous at a younger age than in the 19th century, although 19th-century fame did grow more slowly
(T or F) according to the research of Aiden and Michel, the most frequently used irregular verbs in English will never convert to regular verbs
true; the more frequently an irregular verb is used, the more slowly it regularizes, so the most common ones are effectively permanent
(T or F) association learning is also called market basket analysis because datasets for market basket analysis have products as attributes, and the algorithm finds associations among those attributes
false; market basket analysis is an example (one application) of association learning, not another name for it
(T or F) classification learning and numerical estimation are both examples of unsupervised learning techniques
false; both are supervised learning techniques
(T or F) clustering and association learning are both examples of unsupervised learning
true
(T or F) data visualization tools used prior to running a learning algorithm are helpful because they assign bright, meaningful colors to each attribute
false; they are helpful because they can reveal patterns, correlations, and outliers worth exploring before mining
(T or F) discretization can be either supervised or unsupervised
true
(T or F) if one attribute in a dataset is legitimately predictive of the output class, it is perfectly fine to use 1R for real data mining, not just as a baseline
true
(T or F) if the zeroR algorithm gives you an accurate classification 95% of the time, your conclusion should be to use the algorithm faithfully
false; 0R considers none of the input attributes, so 95% accuracy only means that one class dominates the training data
(T or F) in some cases, the fact that a particular value in a dataset is missing could help the learned model perform better, rather than worse
true; the absence of a value can itself be predictive
(T or F) in the decision tree algorithm, both numeric and nominal attributes may appear in several tree nodes
false; a numeric attribute may be tested at several nodes along a path, but a nominal attribute is tested at most once
(T or F) kaggle.com is a web site for pro sports analytics
false; kaggle is a web site for data mining projects and competitions
(T or F) linear models may be used for classification as well as numerical estimation
true
(T or F) normalization is the reason why a difference of 1 in GPA and a difference of 1 in Age are not considered equivalent differences in nearest neighbor
true
(T or F) simple histograms provided by WEKA can sometimes suggest correlations between input attributes and the class attribute that might be worth exploring
true
(T or F) successful data mining usually involves trying a number of approaches in a series of experiments
true
(T or F) support and confidence are two important factors determining the strength of an association rule
true
(T or F) ten fold cross validation gives a more reliable estimation of a model's accuracy than does testing on the training data
true; resubstitution error is hopelessly optimistic
(T or F) the PRISM algorithm generates an individual rule by adding tests that maximize accuracy while reducing coverage
true
(T or F) the answer to the question "How many Furman computer science majors play a varsity sport and have a GPA greater than 3.5?" is an example of hidden knowledge
false; it is shallow knowledge, retrievable with a simple database query
(T or F) the best approach to replacing missing values in a training dataset is to select a random value for the attribute and use that
false; a better approach is to use the most frequently occurring value for that attribute in the dataset
(T or F) the google books data set and the corpus used by the ngram viewer are the same thing
true
(T or F) the leaves of a model tree give a direct numeric result for the class attribute of a new instance in numeric estimation
false; the leaves of a model tree hold regression equations (the leaves of a regression tree give a direct numeric value)
(T or F) the more attributes the better with the nearest neighbor algorithm
false; irrelevant or redundant attributes distort the distance function and hurt accuracy
(T or F) the more nodes and branches that a decision tree has, the better its predictive ability
false; an overly large tree is likely overfitted to the training data
(T or F) the naive bayes algorithm easily accommodates missing data in the training set
true; missing values are simply ignored
(T or F) the naive bayes algorithm easily accommodates numerical input attributes in the training set
false; it does not work well with numeric attributes unless they are discretized first or assumed to follow a distribution (such as the normal distribution)
(T or F) the nearest neighbor algorithm is called lazy because it usually predicts the class attribute incorrectly
false; it is called lazy because it doesn't build a set of rules or a tree; no work is done until a classification is needed
(T or F) the oneR algorithm only considers one attribute in its analysis of a dataset
false; it considers all of the attributes, then bases its rules on the single most predictive one
(T or F) the oneR algorithm wouldn't run in weka during lab with Age as the class attribute because of a bug in the program
false; it wouldn't run because Age is numeric, and oneR requires a nominal class attribute
(T or F) the process of discretization converts numeric values to nominal values
true
(T or F) the purpose of a shadow dataset is to hide data from the original dataset that would violate privacy or intellectual property rights or corporate secrets while still being interesting for data mining purposes
true
(T or F) the simple linear regression algorithm studied in lab selects only the most predictive attribute for the output formula
true
(T or F) there are circumstances for which the oneR algorithm is appropriate to use for real prediction, not just for establishing a baseline
true
(T or F) valid approaches to handle missing values in a training dataset include replacing them, removing the instances that have them, and ignoring them
true
(T or F) with each recursion of the decision tree algorithm, numeric attributes must be re-discretized
true
(T or F) data mining is best described as the process of identifying patterns in data
true
consequent
(conclusion): classes, set of classes, or probability distribution assigned by rule
Antecedent
(pre-condition): a series of tests (just like the tests at the nodes of a decision tree)
Unsupervised Learning
- "Any learning technique that has as its purpose to group or cluster items, objects, or individuals" - clustering is an example of ___
OneR
- 1 Rule - it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute - 1 because the rules are based on only 1 input attribute
decision list
- A set of rules that are intended to be interpreted in Sequence
association
- Can be applied if no class is specified and any kind of structure is considered "interesting" - unsupervised learning - Can predict any attribute's value, not just the class, and more than one attribute's value at a time
stratification
- Ensures that each class is represented with approximately equal proportions in both subsets
linear regression
- The classic approach to numeric estimation is this. - It produces a model that is a linear function (i.e., a weighted sum) of the input attributes.
confidence
- The confidence of a rule provides a measure of a rule's accuracy - of how well it predicts the values in the conclusion. - It answers the question: if the precondition of the rule holds, how likely is it that the conclusion also holds?
support
- The number of training examples containing the attribute values found in both the rule's antecedent and its consequent - i.e., the number of examples that the rule gets right - This metric can also be expressed as a percentage of the total number of training examples
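Both metrics can be computed side by side for a toy market basket; the baskets and the example rule (bread implies milk) are invented for illustration:

```python
# Support = number (or percentage) of examples matching both the antecedent
# and the consequent; confidence = that count divided by the number of
# examples matching the antecedent alone.

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

antecedent, consequent = {"bread"}, {"milk"}

covered = [b for b in baskets if antecedent <= b]   # precondition holds
correct = [b for b in covered if consequent <= b]   # rule gets it right

support = len(correct)                              # count form
support_pct = len(correct) / len(baskets)           # percentage form
rule_confidence = len(correct) / len(covered)

print(support, support_pct, rule_confidence)  # 3 0.6 0.75
```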
simple linear regression
- This algorithm in Weka creates a regression equation that uses only one of the input attributes. - even when there are multiple inputs - can serve as a baseline. - compare the models from more complex algorithms to the model it produces - It also gives insight into which of the inputs has the largest impact on the output.
baseline
- When performing classification learning, 1R, can serve as a useful baseline. - The 0R algorithm learns a model that considers none of the input attributes!
euclidean distance
- aka the distance function
cross-validation
- avoids overlapping test sets - First step: split data into k subsets of equal size - Second step: use each subset in turn for testing, the remainder for training - Often the subsets are stratified
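The two steps above can be sketched with a ZeroR-style majority classifier as the model under test; the label counts are invented (the interleaved split here happens to keep class proportions similar, as stratification would):

```python
# Sketch of k-fold cross-validation: split into k subsets, use each in turn
# as the test set, train on the rest, and average the accuracies.
from collections import Counter

def cross_validate(labels, k):
    folds = [labels[i::k] for i in range(k)]        # step 1: k subsets
    accuracies = []
    for i in range(k):                              # step 2: rotate test fold
        test = folds[i]
        train = [y for j, f in enumerate(folds) if j != i for y in f]
        majority = Counter(train).most_common(1)[0][0]
        accuracies.append(sum(y == majority for y in test) / len(test))
    return sum(accuracies) / k

labels = ["yes"] * 9 + ["no"] * 6
print(cross_validate(labels, 3))  # 0.6
```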
resubstitution
- error rate obtained from training data - Resubstitution error is (hopelessly) optimistic!
ZeroR
- learns a model that considers none of the input attributes - it predicts the majority class in the training data - Example: in the credit card data set there are 9 examples with yes output and 6 with no, so zeroR predicts yes (9/15)
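ZeroR is short enough to write out directly; the 9-yes/6-no counts mirror the credit card example above:

```python
# ZeroR: ignore every input attribute and always predict the majority class
# seen in training.
from collections import Counter

def zero_r(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance=None: majority  # prediction ignores the instance

predict = zero_r(["yes"] * 9 + ["no"] * 6)
print(predict({"income": "high"}))  # yes, no matter what the instance is
```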
numeric estimation
- like classification learning - input attributes -> model -> output attribute/class - the model is learned from a set of training examples that include the output attribute - output attribute is numeric - want to be able to estimate the value
converse
- obtained by swapping the antecedent and consequent - not always true
a priori
- probability of an event before evidence is seen
best fit line
- straight line that represents the data on a scatter plot - may pass through some of the points, none of the points, or all of the points
multivariate model
- multiple X's, multiple Y's - the most general model; every other model is a special case of it - examples: the GLM, canonical correlation
sampling bias
-a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. -Ex. Writers of books write more about other writers of books than other famous people
causation
A cause and effect relationship in which one variable controls the changes in another variable.
ngram
A contiguous sequence of n items from text or speech (unigram, bigram / digram, trigram, four-gram, five-gram, and so on), often representing a concept
correlation
A measure of the relationship between two variables
model
A pattern, plan, representation, or description designed to show the structure or workings of an object, system, or concept
database query
An inquiry the user poses to a database to extract a meaningful subset of data.
repeated holdout/ holdout method
- can be made more reliable by repeating the process with different subsamples - in each iteration, a certain proportion is randomly selected for training (possibly with stratification) - the error rates on the different iterations are averaged to yield an overall error rate
Supervised Learning
Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
normalization
- attributes measured on different scales need to be normalized before distances are computed - nominal attributes: distance is either 0 or 1 - common policy for missing values: assumed to be maximally distant (given normalized attributes)
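A minimal min-max normalization sketch, using invented Age and GPA values, shows how attributes on different scales are mapped to a common 0-to-1 range before a distance function compares them:

```python
# Min-max normalization: rescale each attribute to [0, 1] so a difference of
# 1 year of age and 1 point of GPA no longer dominate each other unfairly.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

ages = [18, 22, 19, 40]        # spread of 22 years
gpas = [2.0, 3.5, 3.0, 4.0]    # spread of 2 points

print(min_max_normalize(ages))  # [0.0, 0.18..., 0.045..., 1.0]
print(min_max_normalize(gpas))  # [0.0, 0.75, 0.5, 1.0]
```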
goodness
ID3 - uses a different _____ score based on a field of study known as information theory
Naïve Bayes
a fairly simple classification algorithm based on the relative probabilities of the different values of each attribute, given the value of the class attribute. It is "naive" because it assumes the attributes are independent of one another.
shadow dataset
Masked or de-identified extrapolated data
covering
PRISM algorithm for rule generating
nearest neighbor
- simplest way of finding the nearest neighbor: a linear scan of the data - classification then takes time proportional to the product of the number of instances in the training and test sets - the search can be done more efficiently using appropriate data structures (e.g., a KD-tree) - often very accurate, but yields no structural pattern; it follows "precedent" (e.g., law, diagnosis) - assumes all attributes are equally important
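The linear-scan version is only a few lines; the (already normalized) training points below are invented for illustration:

```python
# Linear-scan nearest neighbor: no model is built in advance; each query is
# compared against every stored training instance.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(train, query):
    # train: list of (feature_vector, class_label) pairs
    return min(train, key=lambda pair: euclidean(pair[0], query))[1]

train = [((0.1, 0.2), "no"), ((0.9, 0.8), "yes"), ((0.8, 0.9), "yes")]
print(nearest_neighbor(train, (0.85, 0.85)))  # yes
```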
class attribute, nominal
The ___ is always ___ in classification learning
overfitting
The process of fitting a model too closely to the training data for the model to be effective on other data.
bucket
To handle numeric attributes, we need to discretize (turn numeric into nominal) the range of possible values into subranges called ____
shallow knowledge
information that can be retrieved from a dataset with a simple database query. Contrast with hidden knowledge, which must be discovered through data mining.
outliers
___ are often easy to detect when plotting a dataset in visualization.
denormalization
___ is the process of creating a flat file from multiple database tables for the purpose of data mining.
1R
_____ doesn't work well if many of the input attributes have fewer possible values than the class/output attribute does.
KD-tree
_____ is used to make the nearest neighbor algorithm more efficient
ensemble learning
______ can involve meta learning - where an algorithm learns the strengths of various models in selecting which to use for a new instance
redundant attributes
______ cause accuracy problems with both naive bayes and nearest neighbor, and should be eliminated if possible.
sentiment analysis
________ is a form of data mining often applied to social media and other online content such as blog posts and comments about products
normal distribution
a bell-shaped curve, describing the spread of a characteristic throughout a population
Big Data
a broad term for datasets so large or complex that traditional data processing applications are inadequate.
item set
a collection of attribute values that appears in one or more training examples
instance-based
a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.
parallel computing
a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time.
laplace estimator
add 1 to every attribute-value count so that no probability estimate comes out as 0 (e.g., 0/5) or as a false certainty (e.g., 5/5 = 100%)
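A sketch of the adjustment, with invented counts (an attribute with 3 possible values observed 0, 2, and 3 times out of 5 examples):

```python
# Laplace estimator: add 1 per possible attribute value to the numerator,
# and the number of possible values to the denominator, so no estimate is 0.

def laplace_probability(count, total, num_values):
    return (count + 1) / (total + num_values)

probs = [laplace_probability(c, 5, 3) for c in (0, 2, 3)]
print(probs)  # [0.125, 0.375, 0.5]; none is 0, and they still sum to 1
```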
document classification
algorithms to determine the author of an article, and to decide if a tweet is bullying are examples of _______
market basket analysis
analyzes such items as websites and checkout scanner information to detect customers' buying behavior and predict future behavior by identifying affinities among customers' choices of products and services
Ordinal
being of a specified position or order in a numbered series
corpus
collections of text such as manuscripts or social media streams or speeches or other texts
decision tree algorithm
- constructed in a top-down, recursive, divide-and-conquer manner, starting at the root - all attributes are categorical (if continuous, they are discretized in advance) - examples are partitioned recursively based on selected attributes - test attributes are selected on the basis of a statistical measure (information gain)
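The information-gain measure used to select test attributes can be sketched directly; the tiny dataset is invented, and a full tree builder would apply this recursively at every node:

```python
# Information gain = entropy of the class labels before the split, minus the
# weighted entropy of the labels in each branch after the split.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr, class_attr):
    base = entropy([e[class_attr] for e in examples])
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[class_attr] for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]
print(information_gain(data, "outlook", "play"))  # 1.0: a perfect split
```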
class attribute
defines the characteristics of a set of objects
lazy method
nearest neighbor is an example of one; called lazy because no work is done (no rules or tree are built) until a classification is needed
regression equation
estimate the output attribute for previously unseen instances
regression tree
have each classification be a numeric value that is the average of the values for the training examples in that subgroup
missing values
in the nearest neighbor algorithm, maximal distance is assumed in cases of ______. (naive bayes just ignores them)
natural language processing
interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
confusion matrix
is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one
training set
the initial set of examples from which a model is learned; the resulting model should then be evaluated on separate test data
noise
meaningless data
correlation coefficient
measures how well the linear model fits the data; values near 1 (or -1) indicate a strong linear relationship, and values near 0 a weak one
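A sketch of the (Pearson) correlation coefficient with invented values; a perfectly linear relationship scores 1 (or -1 when decreasing):

```python
# Correlation coefficient: covariance of x and y divided by the product of
# their spreads, giving a value between -1 and 1.
import math

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # about 1.0 (perfectly linear)
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # about -1.0
```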
(T or F) naive bayes is an example of a numeric estimation learning algorithm
false; it is a classification learning algorithm
Normalization
numerical values in a dataset must be _____ before the nearest neighbor algorithm can work correctly
crowdsourcing
obtaining services, ideas, or content by soliciting contributions from a large group of people, especially the online community
sparse data
occurs when a vast majority of values in a dataset are either missing or irrelevant
normal distribution
one of the things that makes naive bayes "naive" is the assumption that the numerical data follows a _____, when it actually may not.
Clustering
organizing the instances of a dataset into groups (clusters) of similar items; an unsupervised learning technique
a posteriori
probability of event after evidence is seen
Zipf's law
states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table.
micro-risk
the ability to detect _____ is one of the things that makes data mining so powerful
sum of squares
minimizing the sum of squared errors (least squares) is the method used to find the best fit line
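The least-squares best-fit line can be computed in closed form; the points below are invented so that the answer is exactly y = 2x + 1:

```python
# Least squares: the slope is the covariance of x and y divided by the
# variance of x; the constant makes the line pass through the means.

def best_fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = my - w * mx
    return w, c

w, c = best_fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(w, c)  # 2.0 1.0
```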
constant
the c in the regression equation
Machine Learning
the extraction of knowledge from data based on algorithms created from training data
true negative
the model correctly predicts no
true positive
the model correctly predicts yes
false negative
the model incorrectly predicts no
false positive
the model incorrectly predicts yes
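The four outcomes above are exactly the cells of a confusion matrix; the actual and predicted labels here are invented for illustration:

```python
# Tally true/false positives and negatives from paired actual/predicted labels.

def confusion_matrix(actual, predicted, positive="yes"):
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == positive and p == positive for a, p in pairs),
        "TN": sum(a != positive and p != positive for a, p in pairs),
        "FP": sum(a != positive and p == positive for a, p in pairs),
        "FN": sum(a == positive and p != positive for a, p in pairs),
    }

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```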
Discretization
the process of converting numeric attributes to nominal attributes
univariate model
the simplest form of analyzing data: a single variable. It doesn't deal with causes or relationships (unlike regression); its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
weights
the w in the regression equation
fame curve
timing of debut, exponential growth, timing of peak, slow decline
model tree
to have a separate regression equation for each classification in the tree - based on the training examples in that subgroup.
Discretization
turning numeric attributes into nominal attributes Ex. Binary split, bucket split
numeric estimation
understanding/predicting a numeric quantity
probability density function
used by naive bayes to handle numeric attributes: assuming the values follow a distribution (typically normal), the function's value at a given attribute value stands in for its probability
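A sketch of the normal density naive bayes can plug in for a numeric attribute; the mean and standard deviation here are illustrative per-class statistics:

```python
# Normal probability density function: given a class's mean and standard
# deviation for a numeric attribute, return the relative likelihood of a value.
import math

def normal_pdf(x, mean, std):
    coeff = 1.0 / (std * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# e.g., density of temperature 66 given mean 73 and std 6.2 for one class
print(normal_pdf(66, 73, 6.2))  # roughly 0.034
```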
rules
used to explain decisions