Big Data Test 2
A hidden layer is _________. A) A layer in a neural network between the inputs and outputs. B) A layer in social media networks that is hidden to the user. C) A layer in streaming stacks that processes data while remaining hidden to users. D) None of the above.
A) A layer in a neural network between the inputs and outputs
Convolutional neural networks are most notably used for __________. A) image classification B) social network analytics C) forecasting the stock market D) none of these
A) Image Classification
What are the major data mining tasks? A) Prediction B) Association C) Cluster D) All of them
All of them
Unsupervised Learning
An algorithm explores input data without being given an explicit output variable.
Supervised Learning
An algorithm uses training data and feedback from humans to learn the relationship of given inputs to a given output.
Activation Function (Transfer Function)
In an artificial neural network, the activation function of a neuron defines the output of the neuron given a set of inputs; it defines how to pass the value from the inputs through the neuron to produce the output.
Step 3: Data Preparation
Data Consolidation, Cleaning, and Transformation
Stages of CRISP-DM
1. Business Understanding 2. Data Understanding 3. Data Preparation 4. Model Building 5. Testing and Evaluation 6. Deployment
Most common DNNs
1. Convolutional Neural Network 2. Recurrent Neural Network
Which one defines how to pass the value from inputs through the neuron and make the output? A) Activation function B) Value function C) Control function D) Transform function
Activation Function
Clustering
An unsupervised learning method to segment data into groups that are NOT previously defined
Convolutional neural networks (CNNs)
used for image classification
Classification
used for predicting categorical variables
Regression
used for predicting continuous variables
Assume you want to perform supervised learning to predict the number of newborns from the size of the stork population. This is an example of _______ A) Classification B) Regression C) Clustering D) Structural equation modeling
B) Regression
Prediction
Two major types of prediction are classification and regression
Association Analysis is a/an _________. A) Supervised learning B) Unsupervised learning C) Reinforcement learning
Unsupervised Learning
Support
Refers to the percentage of baskets where the rule was true; the occurring frequency of the rule; the probability of simultaneously observing both item sets in a database
Apriori (Association rules)
Rules of Form Condition (antecedent) -> Result (Consequent)
The goal of which task is to predict categories? _______ A) Regression B) Clustering C) Classification D) Association analysis
C) Classification
Select the incorrect statement about prediction, one of the data mining tasks. A) Two major types of predictions are classification and regression. B) Supervised learning methods can be used for classification and regression. C) Classification is used for predicting continuous outcome variables. D) Regression is used for predicting continuous outcome variables.
C) Classification is used for predicting continuous outcome variables
Which of the following is not true regarding data mining? A) It focuses on discovering useful information. B) It can use machine learning techniques. C) It is one task within a process. D) It is a process from start to end.
C) It is one task within a process
(_______) is an industry standard data mining process that is iterative in nature and has 6 steps. A) CRISP-DM B) SEMMA C) KDD
CRISP-DM
Generally looking for
Support as high as possible Confidence close to 1.0 Lift higher than 1.0
Step 1: Business Understanding
- Specific goals tied to potential action are critical - These allow development of a project plan - The project plan specifies the people responsible for collecting the data, analyzing data, and reporting findings.
Logistic Regression
- The appropriate regression analysis to conduct when the dependent variable is dichotomous (binary) - While a linear regression model outputs a real number, a logistic regression model outputs a probability value.
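The contrast between the two outputs can be sketched in a few lines of Python. This is only an illustration: the weight `w`, the intercept `b`, and the 0.5 decision threshold are made-up values, not anything taken from a fitted model.

```python
import math

def linear_output(x, w, b):
    """Linear regression: the output is an unbounded real number."""
    return w * x + b

def logistic_output(x, w, b):
    """Logistic regression: the same linear score is passed through the
    logistic (sigmoid) function, so the output is a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def predict_class(x, w, b, threshold=0.5):
    """Turn the probability into a binary (dichotomous) label.
    The 0.5 threshold is an assumption chosen for illustration."""
    return 1 if logistic_output(x, w, b) >= threshold else 0

# Hypothetical parameters, for illustration only.
w, b = 2.0, -1.0
print(linear_output(3.0, w, b))    # a real number
print(logistic_output(3.0, w, b))  # a probability between 0 and 1
print(predict_class(3.0, w, b))    # a binary label
```

The same linear score feeds both models; only the final transformation differs, which is why logistic regression suits a binary dependent variable.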
Support vector classifier
- The generalization of maximal margin classifier to the non-separable case - In this case, we might be willing to consider a classifier based on a line that does not perfectly separate the two classes.
Classification vs clustering
- Classification is a supervised predictive model that segments data by assigning them to groups that are already defined. - It examines already classified data and develops a predictive pattern
Review Questions
.
Confidence (Probability)
A measure of a rule's predictive power
Artificial Narrow Intelligence
A system for a particular task or job (ex. smart speaker)
Artificial General Intelligence
A system for doing everything humans can do
The task of inferring a model from labeled training data is called ________ A) Supervised learning B) Unsupervised learning
A) Supervised Learning
Data mining is defined as a process of identifying valid, novel, potentially useful, and understandable patterns in data. What is the meaning of 'valid'? A) The pattern should hold true on new data. B) It can use machine learning techniques. C) The pattern is easy to understand. D) The discovered patterns should lead to benefits.
A) The pattern should hold true on new data
What is the confidence level of the association rule [Cereal -> Milk] in this data? _______ A) 0.65 B) 0.75 C) 1.0 D) 0.57 Data (Customer: Items): 1: Cereal, Milk, Bread; 2: Eggs, Cereal, Beer, Water; 3: Milk, Water, Cereal; 4: Eggs, Beer, Bread; 5: Cereal, Milk, Bread
B) 0.75
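The answer can be checked by counting baskets. The sketch below encodes the five customers from the question as Python sets; the helper names (`count_containing`, `support`, `confidence`) are made up for this illustration and do not come from any library.

```python
# The five baskets from the question, one set of items per customer.
baskets = [
    {"Cereal", "Milk", "Bread"},
    {"Eggs", "Cereal", "Beer", "Water"},
    {"Milk", "Water", "Cereal"},
    {"Eggs", "Beer", "Bread"},
    {"Cereal", "Milk", "Bread"},
]

def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def support(condition, result, baskets):
    """Share of all baskets in which condition and result appear together."""
    return count_containing(condition | result, baskets) / len(baskets)

def confidence(condition, result, baskets):
    """Share of baskets containing the condition that also contain the result."""
    both = count_containing(condition | result, baskets)
    return both / count_containing(condition, baskets)

# Cereal appears in 4 baskets; Cereal and Milk together appear in 3.
print(confidence({"Cereal"}, {"Milk"}, baskets))  # 3 / 4 = 0.75
print(support({"Cereal"}, {"Milk"}, baskets))     # 3 / 5 = 0.6
```

Because support only counts baskets containing both item sets, it is the same for [Cereal -> Milk] and [Milk -> Cereal]; confidence is not, since its denominator depends on which side is the condition.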
Which of the following is an example of unsupervised learning? _______ A) Classification B) Clustering C) Regression D) None of those
B) Clustering
(_______) is one of the three numeric measures that must be considered for an association rule; it measures its predictive power. A) Support B) Confidence C) Lift D) Precision
B) Confidence
Which of the following is used for clustering? A) Logistic regression B) K-means C) Apriori D) Neural networks
B) K-Means
Regression works by: A) Maximizing the distance between each data point in the dataset and the regression model. B) Minimizing the distance between each data point in the dataset and the regression model.
B) Minimizing the distance between each data point in the dataset and the regression model.
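One common way to make "minimizing the distance" concrete is ordinary least squares, which minimizes the sum of squared vertical distances between the points and a line. The sketch below assumes that interpretation and a single predictor; the data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares
    (closed-form solution for one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Invented data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)                             # 2.0 1.0
print(sum_squared_error(xs, ys, slope, intercept))  # 0.0
print(sum_squared_error(xs, ys, 1.0, 1.0))          # larger: a worse line
```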
Let's suppose that a retailer found an association rule (Peanut butter -> Bread) in their database. This rule's confidence is 0.7, support is 1.0, and lift is 0.85. Select the incorrect statement: _______ A) The right-hand side of this association rule is called the result. B) The retailer can think that those two items are truly associated. C) A purchase involving peanut butter is accompanied by a purchase of bread 70% of the time. D) Every transaction in the database includes the two items.
B) The retailer can think that those two items are truly associated
What is the problem of finding hidden structure in data without being given an explicit output variable (unlabeled data)? A) Supervised learning B) Unsupervised learning
B) Unsupervised Learning
What is the first phase of the CRISP-DM process? A) Data understanding B) Data collection C) Business understanding D) Model building
C) Business Understanding
(_______) is a task to segment data into groups that are not previously defined. A) Clustering B) Classification
Clustering
Summation Function
Computing the weighted sums of all input elements entering each processing element.
CRISP-DM
Cross Industry Standard Process for Data Mining
(_______) is one of CRISP-DM phases. This phase includes several tasks, such as data cleansing and transforming. A) Data consolidation B) Data preparation C) Model evaluation D) Data collection
Data Preparation
( ) is a subset of machine learning that uses multi-layered artificial neural networks.
Deep learning
deep neural network (DNN)
Deep learning is a subset of machine learning that uses multi-layered artificial neural networks (DNN) to deliver state-of-the-art accuracy in tasks such as object detection, speech recognition, and language translation. - Refers to a neural network with more than one hidden layer
artificial neural network
Each neuron 1) calculates a weighted sum of incoming values, 2) transforms this input using the activation function, and 3) passes on the value to the subsequent neurons
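The three steps can be sketched as a single artificial neuron in Python. The sigmoid is used here only as one example of an activation function, and the `bias` term, the input values, and the weights are assumptions added for illustration.

```python
import math

def weighted_sum(inputs, weights, bias=0.0):
    """Step 1 (summation function): weighted sum of the incoming values.
    The bias term is an assumption added for illustration."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def sigmoid(z):
    """One possible activation function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Steps 2 and 3: transform the weighted sum with the activation
    function and return the value that is passed on to the next neurons."""
    return activation(weighted_sum(inputs, weights, bias))

# Hypothetical inputs and weights: 1.0 * 0.5 + 2.0 * (-0.25) = 0.0
print(weighted_sum([1.0, 2.0], [0.5, -0.25]))   # 0.0
print(neuron_output([1.0, 2.0], [0.5, -0.25]))  # sigmoid(0.0) = 0.5
```

Swapping `activation` for another function changes how the summed input is mapped to the neuron's output without touching the summation step.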
The goal of k-means algorithms is to maximize the within-cluster-variation. (True / False)
False
The support levels of the two association rules, [A -> B] and [B -> A], are different. (True / False)
False
If lift (Milk->Bread) > 1
It implies that the two items are found together more often than one would expect by chance. - a large lift value is therefore a strong indication that a rule is important, and reflects a true connection between the items.
( ) is the scientific study of statistical models that computer systems use to perform a specific task without using explicit instructions.
Machine Learning
Machine Learning
Machine learning is the scientific study of statistical models that computer systems use to perform a specific task WITHOUT USING EXPLICIT INSTRUCTIONS
Valid
Means that the discovered patterns should hold true on new data with sufficient degree of certainty
Potentially useful
Means that the discovered patterns should lead to some benefits to the user or task
Understandable
Means that the discovered patterns should make business sense, leading the user to say "it makes sense!"
Novel
Means that the patterns are not previously known to the user within the context of the system being analyzed
Lift
Measures how much more often the condition and result products occur together than would be expected if they were unrelated (the rule's confidence divided by the support of the result alone) - Lift values > 1.0 indicate that transactions containing the condition tend to contain the result more often than transactions that do not contain the condition
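As a rough illustration, lift can be computed directly from basket counts. The sketch below reuses the five baskets from the [Cereal -> Milk] confidence question in this set; the function name `lift` and its count-based formulation are assumptions made for this example, not a library API.

```python
# The five baskets from the [Cereal -> Milk] question in this set.
baskets = [
    {"Cereal", "Milk", "Bread"},
    {"Eggs", "Cereal", "Beer", "Water"},
    {"Milk", "Water", "Cereal"},
    {"Eggs", "Beer", "Bread"},
    {"Cereal", "Milk", "Bread"},
]

def lift(condition, result, baskets):
    """Lift = confidence(condition -> result) / support(result).
    Written with integer counts to avoid floating-point rounding:
    (n_both / n_cond) / (n_res / n) == (n_both * n) / (n_cond * n_res)."""
    n = len(baskets)
    n_cond = sum(1 for b in baskets if condition <= b)
    n_res = sum(1 for b in baskets if result <= b)
    n_both = sum(1 for b in baskets if (condition | result) <= b)
    return (n_both * n) / (n_cond * n_res)

print(lift({"Cereal"}, {"Milk"}, baskets))   # (3 * 5) / (4 * 3) = 1.25
print(lift({"Eggs"}, {"Beer"}, baskets))     # (2 * 5) / (2 * 2) = 2.5
print(lift({"Bread"}, {"Cereal"}, baskets))  # below 1.0
```

In this small data set, Cereal and Milk appear together more often than chance would suggest (lift above 1.0), while Bread and Cereal appear together less often (lift below 1.0).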
Support Vector Machine
A generalization of a simple and intuitive classifier called the maximal margin classifier. People often refer to 1) the maximal margin classifier, 2) the support vector classifier, and 3) the support vector machine collectively as "support vector machines".
Measuring Rules
For the effective use of a rule, three numeric measures about the rule must be considered: support, confidence, and lift
Data Mining
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases. - Identify useful information and knowledge from large data sets
Hidden layer
The second layer of a three-layer network: it receives signals from the input layer, performs intermediary processing, and sends the results to the output layer
Discriminating between spam and non-spam emails is a classification task. (True / False)
True
The maximal margin classifier seeks the largest possible margin so that every observation is on the correct side of the separating line. (True / False)
True
The support vector classifier allows some observations to be on the incorrect side of the separating line. (True / False)
True
Recurrent Neural Networks (RNN)
Used for natural language processing and for sequential data.
K-Means
Creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It's a popular cluster analysis technique for exploring a dataset. - A good clustering is one for which the within-cluster variation is as small as possible
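A minimal sketch of the idea, assuming one-dimensional data, squared distance as the measure of variation, and hand-picked starting centers (real implementations usually choose starting centers randomly and work in several dimensions). All function names and data values here are illustrative.

```python
def assign(points, centers):
    """Assign each point to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of its members (keep it if it has none)."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        new_centers.append(sum(members) / len(members) if members else old)
    return new_centers

def within_cluster_variation(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum((p - centers[label]) ** 2 for p, label in zip(points, labels))

def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Invented data with two obvious groups; starting centers chosen by hand.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels, centers = k_means(points, centers=[1.0, 2.0])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [2.0, 11.0]
print(within_cluster_variation(points, labels, centers))  # 4.0
```

Each assignment and update step can only keep or reduce the within-cluster variation, which is why the algorithm is described as minimizing it rather than maximizing it.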
Process
implies that data mining comprises many iterative steps
Artificial Intelligence
the science and engineering of making intelligent machines, especially intelligent computer programs