BI Final

Which of the following steps for building a data mining application is the most time-consuming?

preparing the data, 80% of effort

Ordinal variables

similar to a categorical variable. The difference between the two is that the categories have a clear ordering (for example, grades A, B, C, D, F)

Interval variables

similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000. The second person makes $5,000 more than the first and $5,000 less than the third, so the gaps between adjacent values are equal

S-O Polarity Classification

-Comes right after the retrieval and preparation of the text documents -It is also called detection of objectivity -Fact [= objectivity] versus Opinion [= subjectivity]

Apriori Algorithm (Association Mining)

-Finds subsets that are common to at least a minimum number of the itemsets -uses bottom up approach
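
The bottom-up idea can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the `min_count` parameter, and the sample baskets are assumptions made for the example.

```python
from itertools import combinations

def apriori(baskets, min_count):
    """Return {itemset: count} for every itemset found in at least min_count baskets."""
    baskets = [frozenset(b) for b in baskets]
    # Level 1: the candidates are the single items.
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Bottom-up step: join frequent k-itemsets into (k+1)-item candidates,
        # keeping only those whose every k-item subset is itself frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical sample data
baskets = [
    {"bike", "helmet", "gloves"},
    {"bike", "helmet"},
    {"bike", "gloves"},
    {"helmet"},
]
result = apriori(baskets, min_count=2)
```

Here `{helmet, gloves}` appears in only one basket, so it is dropped at level 2, and the three-item set is pruned without ever being counted.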

N-P Polarity Classification

-Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities -N [= negative] versus P [= positive]

Set

-Ordered collection of zero, one or more tuples (each with the same order of Dimensions or Measure Groups). This means each tuple in the set must have the same structure of components in them -(Syntax: use {})

Tokenizing

-Sentences are made up of words. Tokenizing splits up the sentence into its individual words. -At its simplest, tokenizing uses spaces as delimiters between words
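
A minimal sketch of the simplest case, where whitespace is the only delimiter. The function name `tokenize` is an assumption for the example; real tokenizers also handle punctuation and contractions.

```python
def tokenize(sentence):
    # str.split() with no argument splits on any run of whitespace.
    return sentence.split()

tokens = tokenize("Let's run to the store")
# tokens == ["Let's", "run", "to", "the", "store"]
```

Note that punctuation stays attached to its word in this version ("store." would remain one token), which is one reason practical tokenizers go beyond splitting on spaces.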

Topic Modeling

-Statistical approach that generates a specified number of "topics" -Usually based on this algorithm: Latent Dirichlet Allocation -This is very much like clustering (a "topic" is like a cluster)

Abduction

-The conclusion may be wrong, but it is a plausible explanation of the fact, given the rule. Useful for diagnostic tasks. -we are given the rule

How Artificial Neural Networks work

-The input layer receives the data -The internal or hidden layer processes the data. -The output layer relays the final result of the net.

Cluster Analysis for Data mining

-Used for automatic identification of natural groupings of things -unsupervised learning -Learns the clusters of things from past data, then assigns new instances

Lift % in data mining

-X => Y -Confidence divided by how often Y happens in general (with or without X)

Support % in data mining

-X => Y how often X and Y go together compared to all baskets

Confidence % in data mining

-X => Y how often Y goes together with X compared to all baskets that contain X
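
The three measures for a rule X => Y can be computed directly from a list of baskets. The sketch below is illustrative only; `rule_metrics` and the sample baskets are made-up names and data.

```python
def rule_metrics(baskets, x, y):
    """Return (support, confidence, lift) for the rule x => y."""
    n = len(baskets)
    n_x = sum(1 for b in baskets if x in b)
    n_y = sum(1 for b in baskets if y in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    support = n_xy / n             # X and Y together, out of all baskets
    confidence = n_xy / n_x        # X and Y together, out of baskets with X
    lift = confidence / (n_y / n)  # confidence relative to how common Y is overall
    return support, confidence, lift

# Hypothetical sample data
baskets = [
    {"bike", "helmet"},
    {"bike", "helmet"},
    {"bike"},
    {"helmet"},
    {"gloves"},
]
support, confidence, lift = rule_metrics(baskets, "bike", "helmet")
```

A lift above 1 suggests Y shows up with X more often than it does in general; a lift near 1 suggests the two are unrelated.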

Data consolidation

-collect data -select data -integrate data

Data preparation involves:

-data consolidation -data cleaning -data transformation -data reduction

Association Rule mining

-finds relationships between variables -unsupervised -known as market basket analysis -X => Y (Support%, Confidence %)

Deduction

-if p then q -If the rule is correct, and the fact is correct, then you know that the conclusion will be correct (classical logic)

Rule based systems (expert systems)

-implement both deductive and abductive logic -not machine learning (knowledge representation)

Data cleaning

-imputing missing values -reduce noise in data -eliminate inconsistencies

Data transformation

-normalize data -aggregate data -construct new attributes

Induction

-observe p and q together n amount of times -we create rule -stereotypical thinking....highly error prone

Sentiment Analysis

-opinion mining, subjectivity analysis, and appraisal extraction -Finds out how people feel about a subject

Data reduction

-reduce number of variables -reduce number of cases -balance skewed data

k-Fold cross Validation

-split data into k mutually exclusive subsets -use each subset in turn as the test set while training on the remaining k-1 subsets, so the process repeats k times -aggregate the k test results to estimate prediction accuracy
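
A sketch of k-fold cross validation in its usual form, where each subset serves exactly once as the test set and the remaining subsets are used for training. The helper name `k_fold_indices` is an assumption for the example, and it works on record indices rather than real data.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n))
    # k mutually exclusive subsets that together cover every index.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

In practice a model would be trained and scored inside the loop, and the k scores averaged.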

Simple Split

-way of training data models -split the data into 2 mutually exclusive sets (67% for training and 33% for testing)
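
A minimal sketch of a simple split, assuming the records are shuffled first so that both sets are random samples. The function name, the `seed` parameter, and the use of `range(100)` as stand-in data are assumptions for the example.

```python
import random

def simple_split(records, train_fraction=0.67, seed=0):
    """Shuffle the records, then cut them into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = simple_split(range(100))
```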

Consider the following set of probabilities related to outputs in an output variable used in a decision tree algorithm. Which of the following probabilities will result in the lowest entropy value?

.95
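
To see why a probability close to 1 gives low entropy, the two-outcome entropy formula can be evaluated directly. The sketch assumes a binary output variable, and the probabilities in the loop are illustrative values, not the original answer choices.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Illustrative probabilities (assumed, not from the quiz question)
for p in (0.5, 0.7, 0.95):
    print(p, round(binary_entropy(p), 3))
```

Entropy peaks at 1 bit when p = 0.5 and falls toward 0 as p approaches 0 or 1, so the most lopsided probability yields the lowest entropy.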

SELECT ([Dim Product].[Category].&[Bikes], [Dim Product].[SubCategory].&[Road Bikes], [Measures].[Order Quantity]) ON COLUMNS, { [Order Date].[Calendar Year].&[2005], [Order Date].[Calendar Year].&[2008]} ON ROWS FROM [Adventure Works DW2012]; how many rows are created

2

If you are training and testing the accuracy of a data mining model, this implies that the model must be doing

supervised learning

A confusion (or classification) matrix is created as a result of ___________ the data mining model.

testing

Supervised Learning

the ANN gets to compare its guess to feedback containing the desired results

Machine Learning

Algorithms that use mathematical or logical techniques for finding patterns in data and discovering or creating new knowledge

Unsupervised learning

the ANN receives input data but not any feedback about desired results. It develops clusters of the training records based on data similarities

The simple split methodology splits the data into two mutually exclusive subsets called a ________ set and a ________ set

training; test

Knowledge Representation Systems

Capture existing expert knowledge and use it to consult end-users and provide decision support

Which of the following is a correct correspondence between a data mining task and an algorithm?

Clustering, K-means

An association rule model done with Adventure Works sales orders finds that about 20% of the total orders made by customers include both a bike and a helmet. A total of 30% of orders involve a bike (with or without a helmet). But only 25% of orders involve a helmet (with or without a bike). This means:

Confidence(bike=>helmet) is less than Confidence(helmet => bike)
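
The arithmetic behind this answer, using only the three percentages given in the question:

```latex
\[
\mathrm{Confidence}(\text{bike} \Rightarrow \text{helmet})
  = \frac{P(\text{bike and helmet})}{P(\text{bike})}
  = \frac{0.20}{0.30} \approx 0.67
\]
\[
\mathrm{Confidence}(\text{helmet} \Rightarrow \text{bike})
  = \frac{P(\text{bike and helmet})}{P(\text{helmet})}
  = \frac{0.20}{0.25} = 0.80
\]
\[
\mathrm{Lift}
  = \frac{P(\text{bike and helmet})}{P(\text{bike})\,P(\text{helmet})}
  = \frac{0.20}{0.30 \times 0.25} \approx 2.67
\]
```

Support is the same in both directions (20%), and so is lift; only confidence depends on which item is on the left side of the rule.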

Which of the following tasks is accomplished using pivot table operations?

Cube slicing and dicing

How were the weights adjusted in the neural network algorithm that made bike buyer predictions in our Adventure Works exercise?

via back propagation

In big data terms, a data ___________ is a repository of internal and external data with no predefined metadata schema.

Data lake

Classification supervised data mining tools

Decision tree, ANN

Which of the following terms best describes Hadoop?

Distributed

Which of the following is a wide-column store database?

HBase

Apache's ___________ is a platform that provides a very SQL-like language for creating MapReduce jobs that can manipulate and query data in a Hadoop cluster.

Hive

Which algorithm involves a recursive process of selecting the "best" attribute, splitting up the data based on the selected attribute, then selecting the next best and splitting again, etc. until the data sets are reasonably homogeneous in terms of their output values?

ID3
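
ID3 picks the "best" attribute by information gain, that is, the drop in entropy of the output after splitting on that attribute. The sketch below shows only the attribute-selection step, not the full recursive tree builder; the function names and the tiny customer table are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training data
rows = [
    {"commute": "short", "owns_car": "no",  "buys_bike": "yes"},
    {"commute": "short", "owns_car": "yes", "buys_bike": "yes"},
    {"commute": "long",  "owns_car": "no",  "buys_bike": "no"},
    {"commute": "long",  "owns_car": "yes", "buys_bike": "no"},
]
choice = best_attribute(rows, ["commute", "owns_car"], "buys_bike")
```

In this made-up table, splitting on `commute` leaves two perfectly homogeneous subsets, while splitting on `owns_car` leaves the output as mixed as before, so `commute` is selected.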

Turing Test

If the interrogator cannot tell whether the human or the computer is answering, then the computer is "intelligent"

The ________________ algorithm works by splitting a large problem into smaller subproblems and letting multiple Hadoop nodes operate on them in parallel.

MapReduce
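
The split-then-combine pattern can be imitated in a single Python process with the classic word-count example. This is only a sketch of the idea: it does not use Hadoop, and the function names and input chunks are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Each chunk stands in for the portion of data handled by one node.
chunks = ["big data big", "data lake"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {"big": 2, "data": 2, "lake": 1}
```

On a real cluster the map calls run in parallel on different nodes, and the framework handles the shuffle step between map and reduce.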

Hadoop is Java-based, and MapReduce algorithms are often written in Java. However, there is an execution environment called ________, which involves a simplified scripting language for performing many MapReduce tasks.

Pig

MDX

Query language for OLAP cubes

A relational database involves predefined metadata. This is different from much of the big-data world in which metadata is defined as needed, typically when querying the data. This is called _______________ (three word phrase).

Schema on Read

Multidimensionality

The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions)

A dependency relationship's structure involves:

a predicate, a governor, and a dependent.

Dimensional Modeling

a retrieval-based system that supports high-volume query access

I know that people who shop at a shopping mall generally carry out bags of what they bought from the mall. I see a person walking out of the mall carrying a bag, and conclude that this person has probably bought something from the mall. What type of reasoning am I using?

abduction

Adventure Works wishes to know which additional products are most likely to be bought along with each of their types of bikes (i.e. in the same order). What kind of data mining task is this?

association

The feature that distinguishes neural networks that do supervised learning from neural networks that do unsupervised learning is the use of ____________.

back propagation

Given a set of customer data, including various demographics (age, income, education, number of cars owned, etc.) and whether or not a bike was purchased by the customer, Adventure Works wants to be able to predict which types of customers are most likely to buy bikes. What type of data mining task is this?

classification

Dimension Table Contains

classification and aggregation information about the values in the fact table

Confusion Matrix

classification problems, the primary source for accuracy estimation
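
A minimal sketch of how a confusion matrix is tallied from test results and turned into an accuracy estimate. The function name and the actual/predicted labels are made up for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair observed during testing."""
    return Counter(zip(actual, predicted))

# Hypothetical test results for a bike-buyer classifier
actual    = ["yes", "yes", "no", "no",  "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]

matrix = confusion_matrix(actual, predicted)
# Accuracy is the share of cases on the diagonal (actual == predicted).
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)
# accuracy == 0.6
```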

Adventure Works wishes to segment its customers into a set of five or six "natural" groupings, without any preconceived notion of what these groupings should be. In other words, they want to group customers together based on how similar they are to each other. What type of data mining task is this?

clustering

Consider the following sentences: "My friend talked with my sister. Then he asked her out on a date." What feature allows us to know that "he" refers to my friend and "her" refers to my sister?

coreference resolution

Another term for multidimensional database is:

cube

If you want to build a prediction data mining model where (a) you have training data with known categories for classification, and (b) the input attributes are discrete text values, the best data mining approach would probably be:

decision tree, based on inductive reasoning

Three types of logical structures commonly used in AI systems:

deduction, abduction, induction

Fact Table Contains

descriptive attributes (numerical values) needed to perform decision analysis

Which of the following formulas helps the k-means algorithm to form the best clusters of data points?

euclidean distance
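
A sketch of how Euclidean distance drives the assignment step of k-means, where each point joins the cluster whose centroid is nearest. Only that one step is shown (the centroid-update step is omitted), and the function names, centroids, and points are assumptions for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

# Hypothetical centroids and data points
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1, 2), (9, 8), (4, 4)]
labels = [nearest_centroid(p, centroids) for p in points]
# labels == [0, 1, 0]
```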

Cube Slicing

filter by one or more dimensions

Co-reference resolution

finding all expressions that refer to the same entity in a text (e.g. finding connections between nouns and their associated pronouns)

Forward Chaining

from facts to conclusion
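
Forward chaining can be sketched as a loop that keeps firing rules until no new facts appear. This is a toy illustration, not an expert-system shell: the rule format (a set of premises plus one conclusion) and the fact names are assumptions made for the example.

```python
def forward_chain(facts, rules):
    """Apply (premises, conclusion) rules to known facts until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical rules and starting facts
rules = [
    ({"is_customer", "bought_bike"}, "bike_owner"),
    ({"bike_owner"}, "offer_helmet"),
]
derived = forward_chain({"is_customer", "bought_bike"}, rules)
```

Backward chaining would run the other way: start from a goal such as `offer_helmet` and look for rules and facts that support it.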

Backward Chaining

from hypothesis (potential conclusions) backward to facts

Drill-Down

going from summary to more detailed views

The NoSQL model that is specifically designed to maintain information regarding the relationships (often real-world instances of entities) between data items is called a(n) __________-oriented database. This is used for ontologies, and used to be called a "semantic network".

graph-oriented database

Which feature of cubes facilitates drill-down?

hierarchical dimension attributes

Which of the following DOES NOT involve applying a set of known rules to a set of given facts in order to reach a conclusion?

induction

Backpropagation

involves changing input weights to neurodes based on errors in outputs, until those errors approach zero.

Clustering unsupervised data mining tools

k-means, ANN/SOM

The _________________ data structure is common in the NoSQL world, involving operations like put, get, and delete. In this sense, it is very much like a HashMap.

key-value store
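
A minimal in-memory sketch of the put/get/delete interface, backed by a Python dict (the Python counterpart of a HashMap). The class name and the sample key are assumptions for the example; a real NoSQL store adds persistence and distribution on top of this.

```python
class KeyValueStore:
    """Toy key-value store exposing put, get, and delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # pop with a default so deleting a missing key is not an error
        self._data.pop(key, None)

store = KeyValueStore()
store.put("order:1001", {"product": "Road Bike", "qty": 2})
found = store.get("order:1001")
store.delete("order:1001")
missing = store.get("order:1001")
```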

The ID3 decision tree algorithm works best for:

making predictions based on categorical data.

In an HDFS cluster, the _____________ is responsible for managing the file system, which is distributed among many data nodes.

name node

Categorizing Adventure Works products into product types like "Bikes", "Accessories", "Clothing", and "Components" involves ____________________ data.

nominal categorical

Giving a grade to a student of "A", "B", "C", "D", or "F" involves ____________________ data.

ordinal categorical

Nominal variables

one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories

Categorical data can be sub-classified as ________ and ________

ordinal; nominal

One of the challenges for natural language processing (NLP) systems is to recognize whether a word is a noun or a verb based on its context in the sentence. For example, "I went for a run" vs. "Let's run to the store". In one case, "run" is a noun, and in the other, "run" is a verb. The NLP task that tackles this problem is called ______________ tagging

part-of-speech

Query Axis

(in the SELECT clause) - rows, columns, and in principle can have many axes (many dimensions)

Slicer Axis

(in the WHERE clause) - this is how you "slice" the cube in the query

Tuple

-A collection of one or more members (from different Dimension hierarchies or Measure Groups) -Only one member from each dimension attribute or hierarchy is allowed in a tuple -(Syntax: use ())

Member

-An item in a dimension or measure group. Distinct value for a dimension's attribute or distinct measure. -Lowest level of reference for a cube

Stanford's CoreNLP

-Breaking a text document into individual sentences -Tokenizing a sentence -Identifying parts of speech (POS) within a sentence (nouns, verbs, adjectives, adverbs, etc.) -Named entity recognition

