Data Mining Basics

What are two characteristics of Classification?

Discrete and nominal

This is a collection of decision trees. To classify a new object based on its attributes, each tree produces a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Random Forest
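
The voting step can be sketched in a few lines of Python. This is a minimal illustration, not a full random forest: the three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the attribute names are made up for the example.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees: each maps an
# object's attributes to a class label.
def tree_a(obj):
    return "cat" if obj["weight_kg"] < 6 else "dog"

def tree_b(obj):
    return "cat" if obj["has_retractable_claws"] else "dog"

def tree_c(obj):
    return "dog" if obj["barks"] else "cat"

def forest_predict(trees, obj):
    """Return the class with the most votes across all trees."""
    votes = Counter(tree(obj) for tree in trees)
    return votes.most_common(1)[0][0]

animal = {"weight_kg": 8, "has_retractable_claws": True, "barks": False}
prediction = forest_predict([tree_a, tree_b, tree_c], animal)
print(prediction)  # tree_a votes "dog"; tree_b and tree_c vote "cat"
```

In a real forest each tree is trained on a different random sample of the data, which is what makes the individual votes disagree in useful ways.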

What is the proportion of cats the ML system found correctly among all the cats hidden in all the images?

Recall = TP/(TP + FN)
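
As a worked example of the formula, the sketch below counts true positives (TP) and false negatives (FN) from two label lists and computes recall. The labels are invented for illustration, and the function assumes at least one actual positive exists (otherwise the denominator is zero).

```python
def recall(actual, predicted, positive="cat"):
    """Recall = TP / (TP + FN) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fn)

actual    = ["cat", "cat", "cat", "cat", "dog", "dog"]
predicted = ["cat", "cat", "dog", "cat", "cat", "dog"]
cat_recall = recall(actual, predicted)
print(cat_recall)  # 3 of the 4 real cats were found -> 0.75
```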

This supervised learning is related to continuous data (value functions). The predicted output values are real numbers. It deals with problems such as predicting the price of a house or the trend in the stock price at a given time, etc.

Regression

This is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are.

Similarity learning

What machine learning approach is defined by its use of labeled datasets to train algorithms to classify data and predict outcomes? The labeled dataset has outputs tagged to the corresponding input data so the machine understands what to search for in unseen data.

Supervised Learning

This is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.

TensorFlow

_________________ is the most straightforward approach to creating test and validation datasets. In this approach, you set aside portions of the original dataset for validation and testing purposes at the beginning of the model development process.

The Holdout Method
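
A minimal sketch of a holdout split, using only the standard library. The function name, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example, not values prescribed by the method.

```python
import random

def holdout_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the rows once, then carve off validation and test portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test

train, validation, test = holdout_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

The split happens once, before any training, so the validation and test rows never influence how the model is fit.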

What is the formula for the false positive rate?

We can compute this rate by dividing the number of false positives (FP) by the sum of the number of false positives and the number of true negatives (TN), or, as a formula: FPR = FP/(FP + TN)
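
A short worked example of the formula, with invented counts:

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn)

# Hypothetical counts: 5 negatives wrongly flagged, 95 correctly rejected.
fpr = false_positive_rate(fp=5, tn=95)
print(fpr)  # 0.05
```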

Name three types of Regression Algorithms?

1. Random Forest 2. Decision Tree 3. Neural Networks

What are the four basic types of algorithms?

1. Regression 2. Classification 3. Clustering 4. Association

What are the three types of Machine Learning prediction errors when building or choosing a model?

1. Bias. 2. Variance 3. Irreducible error

What are the three logistic regression models? 1. Which model has a response variable that can only belong to one of two categories? 2. Which model has a response variable that can belong to one of three or more categories with no natural ordering among them? 3. Which model has a response variable that can belong to one of three or more categories with a natural ordering among them?

1. Binary Logistic Regression 2. Multinomial or Nominal logistic regression 3. Ordinal logistic regression

1. These techniques train models that allow us to predict membership in a category. 2. These techniques allow us to predict a numeric result. 3. These techniques help us discover the ways that observations in our dataset resemble and differ from each other.

1. Classification, 2. Regression (Linear not Logistic Regression) 3. Similarity learning

What are the key steps of a Data Science Project?

1. Collect Data 2. Analyze Data 3. Suggest Hypothesis

What are the three workflow steps of a Machine Learning Project?

1. Collect data 2. Train model 3. Deploy model

When reducing the number of features in a dataset, ___________ is simply selecting and excluding given features without changing them, and ______________ transforms features into a lower dimension.

1. Feature selection 2. Dimensionality reduction

Supervised Learning has: 1. Input data is? 2. Does it have a feedback mechanism? 3. Data is classified on? 4. Divided into what two types of Algorithms? 5. What is it used for (Prediction or Analysis)? 6. Does it have a known or unknown number of classes?

1. Input data is labeled 2. Yes, it has a feedback mechanism 3. Data is classified on training dataset 4. Divided into Regression and Classification 5. Used for prediction 6. It has a known number of classes

Unsupervised Learning has: 1. Input data is? 2. Does it have a feedback mechanism? 3. Data is classified on? 4. Divided into what two types of Algorithms? 5. What is it used for (Prediction or Analysis)? 6. Does it have a known or unknown number of classes?

1. Input data is unlabeled 2. It has no feedback mechanism 3. It assigns properties of given data to classify it 4. It is divided into Clustering and Association 5. It is used for analysis 6. It has an unknown number of classes

What is the difference between an instance and a feature? A single row of data is called ______________. It is an observation from the domain. A single column of data is called __________. It is a component of an observation and is also called an attribute of a data instance.

1. Instance 2. Feature or attribute of the Data Instance.

Name four types of Clustering Algorithms?

1. K-means 2. Hierarchical Clustering 3. DBSCAN 4. Neural Networks

Name five types of Classification Algorithms?

1. Logistic Regression 2. K-nearest neighbors (KNN) 3. Decision Tree 4. SVM (Support Vector Machines) 5. Random Forest

What are six techniques that feature selection uses in machine learning?

1. Remove features with missing values 2. Remove features with low variance 3. Remove highly correlated features 4. Univariate feature selection 5. Recursive feature elimination 6. Feature selection using SelectFromModel
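
As an illustration of one item from this list (removing features with low variance), here is a minimal standard-library sketch. The column names, the values, and the threshold of 0.0 are assumptions made up for the example.

```python
from statistics import pvariance

def drop_low_variance(columns, threshold=0.0):
    """Keep only the features whose variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

columns = {
    "age": [23, 35, 41, 52],
    "country_code": [1, 1, 1, 1],   # constant column: variance 0
    "income": [40, 55, 61, 80],
}
kept = drop_low_variance(columns)
print(list(kept))  # ['age', 'income']
```

The surviving features are unchanged, which is what distinguishes feature selection from dimensionality reduction.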

Name three types of Association Algorithms?

1. Apriori 2. Eclat 3. FP-growth

A _____________ assesses the performance of the model. The test dataset is set aside at the beginning of the model development process specifically for the purpose of model assessment. It is not used in the training process, so it is not possible for the model to overfit the test dataset.

A test dataset

What is Artificial Intelligence?

AI aims to make a smart computer system work just like humans to solve complex problems

For the two sources of error, bias and variance, what is their relationship to each other?

An algorithm that exhibits high variance will have low bias, while a low-variance algorithm will have higher bias. Bias and variance are intrinsic characteristics of our models and coexist. Modifying a model to improve one comes at the expense of the other. The goal is to find an optimal balance between the two.

This is a kind of Unsupervised Learning where we can find the relationship of one data item to another data item. We can then use those dependencies and map them in a way that benefits us - e.g., understanding consumers' habits regarding our products can help us develop better cross-selling strategies. These techniques are often utilized in customer behavior analysis in e-commerce websites and OTT platforms.

Association
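
A very small sketch of the underlying idea: count which items tend to appear together in the same transaction. This is only the counting step that association algorithms build on, not an implementation of any particular algorithm, and the shopping baskets are invented.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how often each pair of items appears in the same transaction."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
counts = pair_counts(transactions)
print(counts.most_common(1))  # [(('bread', 'butter'), 3)]
```

A pair that co-occurs often, like bread and butter here, is a candidate for a cross-selling rule.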

These two are both examples of unsupervised uses of similarity learning techniques. It's also possible to use similarity learning in a supervised manner.

Association rules and clustering

What is the type of error that occurs due to our choice of a machine learning model, when the model type that we choose is unable to fit our dataset well?

Bias

What type of logistic regression is used when your output is binary (i.e., it has two possible outcomes)? The cracking example given above would utilize this logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure.

Binary Logistic Regression
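
A minimal sketch of how a fitted binary logistic model turns an input into one of two outcomes, using the standard logistic (sigmoid) function. The coefficients and the "hours studied" framing are hypothetical; in practice they would be estimated from labeled data.

```python
import math

def predict_probability(x, intercept, slope):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def predict_class(x, intercept, slope, threshold=0.5):
    """Map the probability to one of the two categories (1 or 0)."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0

# Hypothetical coefficients for "pass the test" vs. hours studied.
intercept, slope = -4.0, 1.0
print(predict_probability(4.0, intercept, slope))  # 0.5 at the decision boundary
print(predict_class(6.0, intercept, slope))        # 1 (pass)
print(predict_class(2.0, intercept, slope))        # 0 (fail)
```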

These techniques use supervised machine learning to help us predict a categorical response. That means that the output of our model is a non-numeric label or, more formally, a categorical variable. This simply means that the variable takes on discrete, non-numeric values, rather than numeric values.

Classification

This type of supervised learning takes an input value and maps it to a discrete value. In these problems, the output typically consists of classes or categories, such as predicting which object is present in an image (a cat or a dog) or whether it is going to rain today.

Classification

If a machine predicts whether an employee will get a salary raise or not, that is _____________, but if it predicts how much the salary raise will be, that is _______________.

1. Classification 2. Regression

This is a type of Unsupervised Learning where we find hidden patterns in the data based on their similarities or differences. These patterns can relate to the shape, size, or color and are used to group data items or create groups.

Clustering

What are two characteristics of Regression?

Continuous and numeric

There are also a variety of more advanced methods for creating validation datasets that perform repeated sampling of the data during an iterative approach to model development. These approaches, known as ______________ techniques, are particularly useful for smaller datasets where it is undesirable to reserve a portion of the dataset for validation purposes.

Cross-validation
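
One common cross-validation technique is k-fold, where every row serves as validation data exactly once. The sketch below only generates the row indices for each fold; the function name is an assumption, and real projects usually shuffle the rows first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx)  # [0, 1], then [2, 3], ... up to [8, 9]
```

Because each fold reuses the remaining rows for training, no portion of a small dataset has to be permanently set aside for validation.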

What is Data Science?

Data Science helps with creating insights from data that deals with real world complexities.

What are some characteristics of Association?

Find rules and patterns

What are some characteristics of Clustering?

Group in subsets/classes

This approach to classification uses a table of probabilities to predict the likelihood that an instance belongs to a particular class. This classifier is a probabilistic machine learning model that is used for classification tasks. It is a simple and effective machine learning algorithm for solving multi-class problems.

Naïve Bayes

Python and R are ______________ languages, which means the source code is executed by an interpreter at run time rather than compiled ahead of time into machine code. (Python, for example, converts source code into bytecode that is then executed by the Python virtual machine.) They are not ____________ languages such as C and C++.

1. Interpreted 2. Compiled

This error or noise occurs independently of the machine learning algorithm and training dataset that we use. It is an error inherent in the problem that we are trying to solve.

Irreducible error. We cannot do much to address it, because it is inherent in the problem rather than in the model.

What is the performance measure used for Regression errors, which takes the square of each residual value and then adds those squared residuals together?

It is called the residual sum of squares.
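
A short worked example with invented values, where each residual is the actual value minus the predicted value:

```python
def residual_sum_of_squares(actual, predicted):
    """Square each residual (actual - predicted) and add the squares together."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
rss = residual_sum_of_squares(actual, predicted)
print(rss)  # residuals 1, 0, -2 -> 1 + 0 + 4 = 5.0
```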

This is an unsupervised learning algorithm that solves clustering problems. Data sets are partitioned into a particular number of clusters (let's call that number K) in such a way that the data points within a cluster are homogeneous with one another and heterogeneous relative to the data in other clusters.

K-Means
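
The usual procedure alternates two steps: assign each point to its nearest cluster center, then move each center to the mean of its points. Below is a minimal sketch on one-dimensional data with K = 2. The starting centroids and the fixed iteration count are assumptions for the example; real implementations choose starting points more carefully and stop when assignments no longer change.

```python
def k_means_1d(points, centroids, iterations=10):
    """Basic k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```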

This algorithm is a non-parametric supervised learning method used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. As a classifier, it estimates the likelihood that a data point belongs to one group or another based on the groups that the data points nearest to it belong to.

KNN (K-Nearest Neighbors) Algorithm
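
A minimal classification sketch: find the k training examples closest to the query point and take a majority vote over their labels. Euclidean distance, k = 3, and the sample points are assumptions chosen for the illustration.

```python
from collections import Counter
import math

def knn_classify(training, query, k=3):
    """training is a list of (features, label) pairs; features are tuples."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]
label = knn_classify(training, query=(2.0, 2.0))
print(label)  # the three nearest points are all labeled "A"
```

There is no training phase to speak of: the "model" is the stored dataset, which is why the method is called non-parametric.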

Supervised Data is?

Labeled

These techniques use supervised machine learning techniques to help us predict a continuous response. Simply put, this means that the output of our model is a numeric value. Instead of predicting membership in a discrete set of categories, we are predicting the value of a numeric variable.

Regression

What is the subset of regression that assumes that the relationship between the predictor variables X and the response variable Y is linear?

Linear regression
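
For a single predictor, the best-fitting line can be computed directly with ordinary least squares. The sketch below is one way to do that with the standard library; the data points are invented and lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # generated from y = 1 + 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 1.0 2.0
```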

_____________ refers to any regression model in which the response variable is categorical.

Logistic Regression

What is Machine Learning?

Machine Learning helps in accurately predicting or classifying outcomes for new data points by learning patterns from historical data. ML allows machines to learn from data so they can provide accurate output.

What type of logistic regression is used when there are three or more response categories with no natural ordering to the levels? Examples of these responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).

Nominal Logistic Regression

What type of logistic regression is used when there are three or more response categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal? Examples of these responses could be how students rate the effectiveness of a college course (e.g., good, medium, poor), levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).

Ordinal Logistic Regression

What occurs when we have a model with low bias but high variance? In this case, the model fits the training dataset too well.

Overfitting

What fraction of the cats the ML system found are actually cats?

Precision = TP/(TP + FP)

What does Dimensionality Reduction use for reducing the features in a dataset?

Principal Component Analysis, or PCA, is a dimensionality-reduction method that finds a lower-dimensional space while preserving as much as possible of the variance measured in the high-dimensional input space. It is an unsupervised method for dimensionality reduction.

For regression errors, the red lines between the predicted and actual values show the magnitude of the error. What is this called?

The residual value. The longer the line, the worse the algorithm performed on that dataset.

In Statistics, false positive errors are also known as what type of errors?

Type I errors.

In Statistics, false negative errors are also known as what type of errors?

Type II errors

In cases where we have high bias and low variance, we describe the model as _____________ the data.

Underfitting

Unsupervised Data is?

Unlabeled

What is the type of machine learning in which the algorithms are provided with data that does not contain any labels or explicit instructions on what to do with it? The goal is for the learning algorithm to find structure in the input data on its own. This is a kind of self-learning where the algorithm can find previously hidden patterns in the unlabeled datasets and give the required output without any human intervention.

Unsupervised Learning

What is the type of error that occurs when the dataset that we use to train our machine learning model is not representative of the entire universe of possible data?

Variance

What is the formula for the false negative rate?

We can compute this rate by dividing the number of false negatives (FN) by the sum of the number of false negatives and the number of true positives (TP), or, as a formula: FNR = FN/(FN + TP)
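
A short worked example with invented counts. It also shows that the false negative rate and recall, TP/(TP + FN), share a denominator and therefore always add up to 1.

```python
def false_negative_rate(fn, tp):
    """FNR = FN / (FN + TP)."""
    return fn / (fn + tp)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical counts: 30 positives found, 10 positives missed.
tp, fn = 30, 10
fnr = false_negative_rate(fn, tp)
print(fnr, recall(tp, fn))  # 0.25 0.75
```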

A ____________ is used to train your model. A _______________ is used to evaluate your model.

1. Loss function 2. Metric

A ________________ is used during the learning process. A ______________ is used after the learning process

1. Loss function 2. Metric

