BI Final
Which of the following steps for building a data mining application is the most time-consuming?
preparing the data, 80% of effort
Ordinal variables
similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories (for example, grades A, B, C, D, F)
Interval variables
similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000. Each person makes exactly $5,000 more or $5,000 less than the next person, so the gaps between adjacent values are equal
S-O Polarity Classification
-Comes right after the retrieval and preparation of the text documents -It is also called detection of objectivity -Fact [= objectivity] versus Opinion [= subjectivity]
Apriori Algorithm (Association Mining)
-Finds subsets that are common to at least a minimum number of the itemsets -uses bottom up approach
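The bottom-up approach can be illustrated with a minimal sketch. It assumes baskets are given as Python sets and that the minimum is an absolute basket count; the function name `apriori` and the sample baskets are made up for illustration and are not taken from any particular tool.

```python
from itertools import combinations

def apriori(baskets, min_count):
    """Return {itemset: count} for every itemset found in at least min_count baskets."""
    baskets = [frozenset(b) for b in baskets]
    # Level 1: every single item is a candidate itemset
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Bottom-up step: join frequent k-itemsets into (k+1)-item candidates
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune any candidate that has an infrequent (k-1)-item subset
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical sample baskets
baskets = [
    {"bike", "helmet"},
    {"bike", "helmet", "gloves"},
    {"bike"},
    {"helmet", "gloves"},
]
frequent_itemsets = apriori(baskets, min_count=2)
```

Here the pair bike + gloves appears in only one basket, so it is dropped, and the three-item set is never counted because one of its subsets is already infrequent.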
N-P Polarity Classification
-Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities -N [= negative] versus P [= positive]
Set
-Ordered collection of zero, one or more tuples (each with the same order of Dimensions or Measure Groups). This means each tuple in the set must have the same structure of components in them -(Syntax: use {})
Tokenizing
-Sentences are made up of words. Tokenizing splits up the sentence into its individual words. -At its simplest, tokenizing uses spaces as delimiters between words
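A minimal sketch of the simplest case, using whitespace as the delimiter. The function name `tokenize` is illustrative; real tokenizers also handle punctuation, contractions, and similar cases.

```python
def tokenize(sentence):
    # str.split() with no argument splits on any run of whitespace
    return sentence.split()

tokens = tokenize("Tokenizing splits the sentence into words")
print(tokens)
```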
Topic Modeling
-Statistical approach that generates a specified number of "topics" -Usually based on this algorithm: Latent Dirichlet Allocation -This is very much like clustering (a "topic" is like a cluster)
Abduction
-The conclusion may be wrong, but it is a plausible explanation of the fact, given the rule. Useful for diagnostic tasks. -we are given the rule
How Artificial Neural Networks work
-The input layer receives the data -The internal or hidden layer processes the data. -The output layer relays the final result of the net.
Cluster Analysis for Data mining
-Used for automatic identification of natural groupings of things -unsupervised learning -Learns the clusters of things from past data, then assigns new instances
Lift % in data mining
-X => Y -Confidence divided by how often Y happens in general (with or without X)
Support % in data mining
-X => Y how often X and Y go together compared to all baskets
Confidence % in data mining
-X => Y how often Y goes together with the X compared to other baskets with X
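Support, confidence, and lift for a rule X => Y can be computed directly from basket counts. The sketch below assumes each basket is a Python set; the function name `rule_metrics` and the sample baskets are hypothetical.

```python
def rule_metrics(baskets, x, y):
    """Return (support, confidence, lift) for the rule x => y."""
    n = len(baskets)
    n_x = sum(1 for b in baskets if x in b)
    n_y = sum(1 for b in baskets if y in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    support = n_xy / n              # X and Y together, out of all baskets
    confidence = n_xy / n_x         # X and Y together, out of baskets with X
    lift = confidence / (n_y / n)   # confidence relative to how common Y is overall
    return support, confidence, lift

# Hypothetical sample baskets
baskets = [{"bike", "helmet"}, {"bike", "helmet"}, {"bike"}, {"gloves"}]
support, confidence, lift = rule_metrics(baskets, "bike", "helmet")
```

A lift above 1 suggests that Y shows up with X more often than it shows up in general.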
Data consolidation
-collect data -select data -integrate data
Data preparation involves:
-data consolidation -data cleaning -data transformation -data reduction
Association Rule mining
-finds relationships between variables -unsupervised -known as market basket analysis -X => Y (Support%, Confidence %)
Deduction
-if p then q -If the rule is correct, and the fact is correct, then you know that the conclusion will be correct (classical logic)
Rule based systems (expert systems)
-implement both deductive and abductive logic -not machine learning (knowledge representation)
Data cleaning
-imputing missing values -reduce noise in data -eliminate inconsistencies
Data transformation
-normalize data -aggregate data -construct new attributes
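One common way to normalize is min-max scaling to the range 0 to 1. This is only one of several normalization schemes and is chosen here as an assumption; the function name is illustrative.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max_normalize([10000, 15000, 20000])
```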
Induction
-observe p and q together n times -we create the rule -stereotypical thinking, highly error prone
Sentiment Analysis
-also called opinion mining, subjectivity analysis, and appraisal extraction -Finds out how people feel about a subject
Data reduction
-reduce number of variables -reduce number of cases -balance skewed data
k-Fold cross Validation
-split data into k mutually exclusive subsets -use each subset once as the test set while training on the remaining k-1 subsets, repeating k times -aggregate the k test results to estimate prediction accuracy
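A minimal sketch of how the folds are formed, assuming records are identified by index and no shuffling is done. In k-fold cross validation each fold serves once as the test set while the other k-1 folds are used for training. The function name `k_fold_splits` is illustrative.

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_records))
    # k mutually exclusive subsets (no shuffling in this sketch)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(len(train), "training records,", len(test), "test records")
```

A model would be trained and tested once per split, and the k test results averaged.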
Simple Split
-way of training data models -split the data into 2 mutually exclusive sets (67% for training and 33% for testing)
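A minimal sketch of the split, assuming the records are shuffled first with a fixed seed so the result is repeatable. The function name and the `seed` parameter are illustrative.

```python
import random

def simple_split(records, train_fraction=0.67, seed=0):
    """Shuffle the records, then cut them into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = simple_split(range(100))
print(len(train), len(test))
```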
Consider the following set of probabilities related to outputs in an output variable used in a decision tree algorithm. Which of the following probabilities will result in the lowest entropy value?
.95
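The reason can be checked with the two-outcome entropy formula: the closer a probability is to 0 or 1, the lower the entropy. The candidate probabilities below other than .95 are assumptions for illustration, since the original answer choices are not listed.

```python
import math

def binary_entropy(p):
    """Entropy in bits of an output with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# 0.5 and 0.7 are hypothetical comparison values
for p in (0.5, 0.7, 0.95):
    print(p, round(binary_entropy(p), 3))
```

Entropy peaks at 1 bit when p = 0.5 and falls toward 0 as p approaches 0 or 1.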
SELECT ([Dim Product].[Category].&[Bikes], [Dim Product].[SubCategory].&[Road Bikes], [Measures].[Order Quantity]) ON COLUMNS, { [Order Date].[Calendar Year].&[2005], [Order Date].[Calendar Year].&[2008]} ON ROWS FROM [Adventure Works DW2012]; how many rows are created
2
SELECT ([Dim Product].[Category].&[Bikes], [Dim Product].[SubCategory].&[Road Bikes], [Measures].[Order Quantity]) ON COLUMNS, { [Order Date].[Calendar Year].&[2005], [Order Date].[Calendar Year].&[2008]} ON ROWS FROM [Adventure Works DW2012]; how many rows are in this query
2
If you are training and testing the accuracy of a data mining model, this implies that the model must be doing
supervised learning
A confusion (or classification) matrix is created as a result of ___________ the data mining model.
testing
Supervised Learning
the ANN gets to compare its guess to feedback containing the desired results
Machine Learning
Algorithms that use mathematical or logical techniques for finding patterns in data and discovering or creating new knowledge
Unsupervised learning
the ANN receives input data but not any feedback about desired results. It develops clusters of the training records based on data similarities
The simple split methodology splits the data into two mutually exclusive subsets called a ________ set and a ________ set
training; test
Knowledge Representation Systems
Capture existing expert knowledge and use it to consult end-users and provide decision support
Which of the following is a correct correspondence between a data mining task and an algorithm?
Clustering, K-means
An association rule model done with Adventure Works sales orders finds that about 20% of the total orders made by customers include both a bike and a helmet. A total of 30% of orders involve a bike (with or without a helmet). But only 25% of orders involve a helmet (with or without a bike). This means:
Confidence(bike=>helmet) is less than Confidence(helmet => bike)
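The arithmetic behind this answer, using only the percentages given in the question (variable names are illustrative):

```python
p_both = 0.20    # orders with both a bike and a helmet
p_bike = 0.30    # orders with a bike
p_helmet = 0.25  # orders with a helmet

conf_bike_to_helmet = p_both / p_bike    # about 0.667
conf_helmet_to_bike = p_both / p_helmet  # about 0.8

print(round(conf_bike_to_helmet, 3), round(conf_helmet_to_bike, 3))
```

Both rules share the same support (20%), but helmets are rarer than bikes, so a helmet order is more likely to include a bike than the other way around.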
Which of the following tasks is accomplished using pivot table operations?
Cube slicing and dicing
How were the weights adjusted in the neural network algorithm that made bike buyer predictions in our Adventure Works exercise?
via back propagation
In big data terms, a data ___________ is a repository of internal and external data with no predefined metadata schema.
Data lake
Classification supervised data mining tools
Decision tree, ANN
Which of the following terms best describes Hadoop?
Distributed
Which of the following is a wide-column store database?
Hbase
Apache's ___________ is a platform that provides a very SQL-like language for creating MapReduce jobs that can manipulate and query data in a Hadoop cluster.
Hive
Which algorithm involves a recursive process of selecting the "best" attribute, splitting up the data based on the selected attribute, then selecting the next best and splitting again, etc. until the data sets are reasonably homogeneous in terms of their output values?
ID3
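The "best" attribute in ID3 is usually taken to be the one with the highest information gain. The sketch below shows only that selection step, not the full recursive tree builder; the row format (list of dicts), the function names, and the sample data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of output labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training rows
rows = [
    {"commute": "near", "gender": "m", "buyer": "yes"},
    {"commute": "near", "gender": "f", "buyer": "yes"},
    {"commute": "far", "gender": "m", "buyer": "no"},
    {"commute": "far", "gender": "f", "buyer": "no"},
]
chosen = best_attribute(rows, ["commute", "gender"], "buyer")
```

ID3 would split on the chosen attribute and then repeat the same selection inside each resulting subset.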
Turing Test
If the interrogator cannot tell whether the human or the computer is answering, then the computer is "intelligent"
The ________________ algorithm works by splitting a large problem into smaller subproblems and letting multiple Hadoop nodes operate on them in parallel.
MapReduce
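The split-then-combine idea can be sketched with a word count run in a single Python process. This only imitates the map, shuffle, and reduce phases; it does not use Hadoop, and the function names and sample text are illustrative.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each node would emit a (word, 1) pair for every word in its chunk
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Group the emitted values by key so each key can be reduced independently
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

# Two chunks stand in for data held on two different nodes
chunks = ["big data big", "data lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```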
Hadoop is Java-based, and MapReduce algorithms are often written in Java. However, there is an execution environment called ________, which involves a simplified scripting language for performing many MapReduce tasks.
Pig
MDX
Query language for OLAP cubes
A relational database involves predefined metadata. This is different from much of the big-data world in which metadata is defined as needed, typically when querying the data. This is called _______________ (three word phrase).
Schema on Read
Multidimensionality
The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions)
A dependency relationship's structure involves:
a predicate, a governor, and a dependent.
Dimensional Modeling
a retrieval-based system that supports high-volume query access
I know that people who shop at a shopping mall generally carry out bags of what they bought from the mall. I see a person walking out of the mall carrying a bag, and conclude that this person has probably bought something from the mall. What type of reasoning am I using?
abduction
Adventure Works wishes to know which additional products are most likely to be bought along with each of their types of bikes (i.e. in the same order). What kind of data mining task is this?
association
The feature that distinguishes neural networks that do supervised learning from neural networks that do unsupervised learning is the use of ____________.
back propagation
Given a set of customer data, including various demographics (age, income, education, number of cars owned, etc.) and whether or not a bike was purchased by the customer, Adventure Works wants to be able to predict which types of customers are most likely to buy bikes. What type of data mining task is this?
classification
Dimension Table Contains
classification and aggregation information about the values in the fact table
Confusion Matrix
for classification problems, the primary source for accuracy estimation
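A minimal sketch of building the matrix from test results and reading accuracy from it. The labels (1 = buyer, 0 = non-buyer), function names, and sample predictions are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair seen during testing."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical test-set labels and model predictions
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

The diagonal cells, (1, 1) and (0, 0), hold the correct predictions; everything else is a misclassification.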
Adventure Works wishes to segment its customers into a set of five or six "natural" groupings, without any preconceived notion of what these groupings should be. In other words, they want to group customers together based on how similar they are to each other. What type of data mining task is this?
clustering
Consider the following sentences: "My friend talked with my sister. Then he asked her out on a date." What feature allows us to know that "he" refers to my friend and "her" refers to my sister?
coreference resolution
Another term for multidimensional database is:
cube
If you want to build a prediction data mining model where (a) you have training data with known categories for classification, and (b) the input attributes are discrete text values, the best data mining approach would probably be:
decision tree, based on inductive reasoning
Three types of logical structures commonly used in AI systems:
deduction, abduction, induction
Fact Table Contains
descriptive attributes (numerical values) needed to perform decision analysis
Which of the following formulas helps the k-means algorithm to form the best clusters of data points?
euclidean distance
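A minimal sketch of the distance formula and of the k-means assignment step that uses it (each point goes to its nearest centroid). Only the assignment step is shown, not the full algorithm; function names are illustrative.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

print(euclidean((0, 0), (3, 4)))
print(nearest_centroid((1, 1), [(0, 0), (10, 10)]))
```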
Cube Slicing
filter by one or more dimensions
Co-reference resolution
finding all expressions that refer to the same entity in a text (e.g. finding connections between nouns and their associated pronouns)
Forward Chaining
from facts to conclusion
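A minimal sketch of forward chaining, assuming each rule is a pair of (premises, conclusion). The rules and facts are made up for illustration.

```python
def forward_chain(facts, rules):
    """Keep firing rules whose premises are all known until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rules: (premises, conclusion)
rules = [
    (["has_bag", "left_mall"], "bought_something"),
    (["bought_something"], "is_customer"),
]
derived = forward_chain(["has_bag", "left_mall"], rules)
```

Starting from the two facts, the first rule fires, and its conclusion then lets the second rule fire.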
Backward Chaining
from hypothesis (potential conclusions) backward to facts
Drill-Down
going from summary to more detailed views
The NoSQL model that is specifically designed to maintain information regarding the relationships (often real-world instances of entities) between data items is called a(n) __________-oriented database. This is used for ontologies, and used to be called a "semantic network".
graph-oriented database
Which feature of cubes facilitates drill-down?
hierarchical dimension attributes
Which of the following DOES NOT involve applying a set of known rules to a set of given facts in order to reach a conclusion?
induction
Backpropagation
involves changing input weights to neurodes based on errors in outputs, until those errors approach zero.
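The error-driven weight update can be sketched for a single sigmoid neuron. This is a deliberate simplification: with no hidden layer, no error is propagated back through layers, so this is the delta rule that backpropagation generalizes. The learning rate, epoch count, and OR training data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=5000, rate=0.5):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Error in the output, scaled by the slope of the sigmoid
            delta = (target - output) * output * (1.0 - output)
            # Nudge each input weight in the direction that shrinks the error
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

# Logical OR as a tiny training set: ((input1, input2), desired output)
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_samples)
```

In a multi-layer network the same kind of error signal is passed backward from the output layer to adjust the hidden-layer weights as well.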
Clustering unsupervised data mining tools
k-means, ANN/SOM
The _________________ data structure is common in the NoSQL world, involving operations like put, get, and delete. In this sense, it is very much like a HashMap.
key-value store
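A minimal in-memory sketch of the put/get/delete interface, backed by a Python dict (Python's counterpart to a HashMap). The class name is illustrative and is not the API of any particular NoSQL product.

```python
class KeyValueStore:
    """Toy key-value store backed by a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is treated as a no-op
        self._data.pop(key, None)

store = KeyValueStore()
store.put("customer:42", {"name": "Ana"})
found = store.get("customer:42")
store.delete("customer:42")
missing = store.get("customer:42")
```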
The ID3 decision tree algorithm works best for:
making predictions based on categorical data.
In an HDFS cluster, the _____________ is responsible for managing the file system, which is distributed among many data nodes.
name node
Categorizing Adventure Works products into product types like "Bikes", "Accessories", "Clothing", and "Components" involves ____________________ data.
nominal categorical
Giving a grade to a student of "A", "B", "C", "D", or "F" involves ____________________ data.
ordinal categorical
Nominal variables
a variable that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories
Categorical data can be sub-classified as ________ and ________
ordinal; nominal
One of the challenges for natural language processing (NLP) systems is to recognize when a word is a noun or a verb based on its context in the sentence. For example "I went for a run" vs. "Let's run to the store". In one case, "run" is a noun, and in the other, "run" is a verb. The NLP task that tackles this problem is called ______________ tagging
part-of-speech
Query Axis
(in the SELECT clause) - rows, columns, and in principle can have many axes (many dimensions)
Slicer Axis
(in the WHERE clause) - this is how you "slice" the cube in the query
Tuple
-A collection of one or more members (from different Dimension hierarchies or Measure Groups) -Only one member from each dimension attribute or hierarchy is allowed in a tuple -(Syntax: use ())
Member
-An item in a dimension or measure group. Distinct value for a dimension's attribute or distinct measure. -Lowest level of reference for a cube
Stanford's CoreNLP
-Breaking a text document into individual sentences -Tokenizing a sentence -Identifying parts of speech (POS) within a sentence (nouns, verbs, adjectives, adverbs, etc.) -Named entity recognition