Machine Learning with Python

Entropy

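A measure of disorder or randomness. For class proportions p_i, the Shannon entropy is H = -sum_i p_i log2(p_i): it is 0 for a pure set and largest when all classes are equally likely.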

Job Tracker

A central control program used to accept, distribute, monitor, and report on MapReduce processing jobs in a Hadoop environment.

Naive Bayes Classifier

A family of algorithms that consider every feature as independent of any other feature based on applying Bayes' theorem with strong independence assumptions between the features.

Hadoop Distributed File System (HDFS)

A highly distributed, fault-tolerant file storage system designed to manage large amounts of data with high throughput. Data is stored in blocks, 128 MB each by default, and each block is replicated three times by default.

Bayes' Theorem

Calculates a probability from a prior probability and new information. Given P(B|A), together with P(A) and P(B), you can use it to find P(A|B).
P(A|B) = P(B|A) P(A) / P(B)
posterior = (likelihood x prior) / evidence
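
A small worked example may make the formula concrete. The sketch below assumes a hypothetical screening test; the prevalence, sensitivity, and false-positive rate are made-up numbers, and `posterior` is an illustrative helper name. The evidence P(B) is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Evidence P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the posterior stays low here because the prior is small.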

MapReduce

A two-phase technique for harnessing the power of thousands of computers working in parallel. During the first phase, the Map phase, computers work on a task in parallel; during the second phase, the Reduce phase, the work of the separate computers is combined, eventually producing a single result. Used to distribute a task across a distributed data set.
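
The two phases can be imitated on a single machine with a word count, the usual introductory example. This is only a sketch of the idea, not Hadoop's API: `map_phase`, `shuffle`, and `reduce_phase` are illustrative names, and the `chunks` list stands in for data spread across workers.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
# Each chunk could be mapped on a different machine; here they run in sequence.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts["the"], word_counts["fox"])  # 3 2
```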

Tree method

Nodes: split on the value of a certain attribute
Edges: outcome of a split, leading to the next node
Root: the node that performs the first split
Leaves: terminal nodes that predict the outcome
Intuition: try to choose the features that best split the data into two categories

Deep Learning

A subfield of machine learning concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain. It uses neural networks to identify relationships in data by modelling processes of the human brain. The network has multiple layers, and each layer learns to detect features at a specific level of the stimulus, comparable to how the visual system processes information step by step.

EC2

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers, and it reduces the time required to obtain and boot new server instances to minutes.

Misclassification rate

Error rate: (FP + FN) / total

Machine Learning

A method of data analysis that automates analytical model building. It leverages massive amounts of data so that computers can iteratively learn from the data and find hidden insights on their own, without being explicitly programmed for each task.

Cross-validation

A process in which a model (for example, a regression equation) developed on a first sample is tested on a second, held-out sample to determine whether it still fits well. In personnel selection it is usually carried out with an incumbent sample, and the cross-validated results are used to weight the predictor scores of an applicant sample.

task tracker

Service used to run MapReduce tasks on the worker nodes. It allocates CPU and memory for a task, runs and monitors the task, and sends progress reports to the job tracker.

Support Vector Machine

A supervised learning method that seeks a dividing hyperplane in any number of dimensions.
Can be used for classification or regression.
Can classify linear and nonlinear data by transforming the original data (input space) into a higher-dimensional feature space, where it finds a separating hyperplane using the essential training tuples (the support vectors).
Parameters that can be tuned with a grid search:
C: controls the penalty for misclassification. Large C = low bias, high variance (low bias because misclassified points are penalised much more heavily than with a small C). Small C = high bias, low variance.
gamma: small gamma = Gaussian kernel with a large variance = high bias, low variance. Large gamma = Gaussian kernel with a small variance = low bias, high variance.

TensorFlow

TensorFlow is an open source C++/Python software library for numerical computation using data flow graphs, used particularly for deep neural networks. It was created by Google. In terms of design, it is most similar to Theano, and lower-level than Caffe or Keras.
DNN models: a deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN learns the mathematical transformation that turns the input into the output, whether the relationship is linear or non-linear.
Batch size: the number of samples fed to the network in one step.

Centroid

In geometry, the point of concurrency (intersection) of the medians of a triangle. In clustering, the centroid of a cluster is the mean of the points assigned to it.

KNN preprocessing

KNN is distance based, so features with larger numeric ranges would dominate the distance. To avoid this, standardize all features before fitting, for example with StandardScaler, which centers each feature and scales it to unit variance.
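
To show what centering and scaling does, here is a minimal pure-Python sketch. It is not scikit-learn's StandardScaler itself; `standardize` is an illustrative helper, and the income and age values are made up. It uses the population standard deviation, which is also what StandardScaler uses by default.

```python
from statistics import mean, pstdev

def standardize(column):
    """Center a feature on 0 and scale it to unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Two hypothetical features on very different scales.
incomes = [30000, 50000, 70000]
ages = [25, 35, 45]

z_incomes = standardize(incomes)
z_ages = standardize(ages)
print([round(z, 3) for z in z_incomes])  # [-1.225, 0.0, 1.225]
print([round(z, 3) for z in z_ages])     # [-1.225, 0.0, 1.225]
```

After standardization both features contribute on the same scale, so neither dominates the distance.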

Types of Errors

Type I error: a true null hypothesis is rejected (false positive, FP).
Type II error: a false null hypothesis is not rejected (false negative, FN).

Principal Component Analysis

a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables which are called principal components.

reinforcement learning

A type of ML, and thereby also a branch of AI, that allows machines and software agents to automatically determine the ideal behaviour within a specific context. The algorithm learns to perform actions from experience: by trial and error it learns which actions yield the greatest rewards.
Agent: the learner. Environment: everything that the agent interacts with. Actions: what the agent can do.
Used in: robotics, gaming, and navigation.

Directed Acyclic Graph

a directed graph with no cycles. Every tree is a DAG, but a DAG may be more general.

linear regression

A linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables, denoted X. It fits a linear model to a given data set by minimising the sum of squared vertical distances between the data points and the fitted line.
Evaluation methods: mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE).
Interpreting the coefficients, e.g.: holding all other features fixed, a 1 unit increase in Avg. Session Length is associated with an increase of 25.98 total dollars spent.
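
The fit and the three error measures can be sketched for a single feature using the closed-form least-squares solution. The data points below are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# Residuals are the vertical distances between the data and the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
mse = sum(r ** 2 for r in residuals) / len(residuals)
mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = sqrt(mse)
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```

Here the slope reads as: a 1 unit increase in x is associated with an increase of about 1.99 in y.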

Kernel density estimation

A local interpolation method that associates each known point with a kernel function in the form of a probability density function (bivariate for 2-D data). It is a technique that lets you create a smooth curve from a set of data. This can be useful if you want to visualize just the "shape" of some data, as a kind of continuous replacement for the discrete histogram. It can also be used to generate points that look like they came from a certain dataset; this can power simple simulations in which simulated objects are modelled on real data.
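
A one-dimensional sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth (both are choices made here for illustration; the data values are made up). The estimated density is the average of one kernel centred on each data point.

```python
from math import exp, pi, sqrt

def gaussian_kde(data, bandwidth):
    """Return a density function that averages one Gaussian kernel per data point."""
    def density(x):
        total = sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / (len(data) * bandwidth * sqrt(2 * pi))
    return density

data = [1.0, 1.5, 2.0, 2.5, 6.0]
density = gaussian_kde(data, bandwidth=0.5)

# The curve is high where points cluster and low where there are none.
print(density(2.0) > density(4.0))  # True
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual points more closely.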

information gain

A measure of the predictive power of one or more attributes: the reduction in entropy obtained by splitting the data on that attribute.
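
As a sketch, information gain can be computed as the entropy of the parent set minus the size-weighted entropy of the groups produced by a split. The function names and the toy labels below are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each group is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # groups are as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```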

Radial Basis Functions

A real-valued function whose value depends only on the distance between the input and some fixed point: either the origin, so that φ(x) = φ(‖x‖), or some other fixed point c, called a center, so that φ(x) = φ(‖x − c‖). Any function that satisfies this property is a radial function.
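
The Gaussian is a common example of such a function. The sketch below assumes the form exp(-gamma * distance^2), with `gamma` as a width parameter; the function name and test points are illustrative.

```python
from math import dist, exp

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian RBF: depends on x only through its distance to the center."""
    return exp(-gamma * dist(x, center) ** 2)

origin = (0.0, 0.0)
# Two different points at the same distance from the center give the same value.
print(gaussian_rbf((1.0, 0.0), origin) == gaussian_rbf((0.0, 1.0), origin))  # True
# A smaller gamma gives a wider bump, so the value decays more slowly with distance.
print(gaussian_rbf((2.0, 0.0), origin, gamma=0.1) > gaussian_rbf((2.0, 0.0), origin, gamma=1.0))  # True
```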

Bias-variance tradeoff

As model complexity (flexibility) increases, the training error keeps going down, but beyond a certain point the test error starts to go up: past that point the model begins to overfit. If the model is too simple and has very few parameters, it may have high bias and low variance (underfitting). If it has a large number of parameters, it tends to have high variance and low bias (overfitting). A model cannot be more complex and less complex at the same time, which is why there is a tradeoff between bias and variance. To build a good model, we need to find a balance between bias and variance that minimises the total error.
Total Error = Bias^2 + Variance + Irreducible Error

spark and RDD

Spark lets you build up a 'recipe' of transformations on an RDD; the transformations are not executed until an action is called (lazy evaluation).

spark

An alternative to MapReduce that can be up to 100x faster, because it keeps most of the data in memory after each transformation instead of writing it to disk. It allows developing complex multistep pipelines.

K-means

an algorithm in which "k" indicates the number of clusters and "means" represents the clusters' centroids

Confusion Matrix

A table that compares recorded classes (the observations or predictions) with classes obtained by some more accurate process or from a more accurate source (the reference). It makes it easy to see if the system is commonly mislabelling one class as another.
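
For a binary problem the matrix has four cells: TP, FP, FN, and TN. The sketch below counts them from two made-up label lists and derives the accuracy and error rate from the counts; `confusion_counts` is an illustrative helper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)
total = tp + fp + fn + tn

accuracy = (tp + tn) / total      # share of correct predictions
error_rate = (fp + fn) / total    # misclassification rate
print(tp, fp, fn, tn)             # 3 1 1 3
print(accuracy, error_rate)       # 0.75 0.25
```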

Bootstrapping

Doing more with less. Bootstrap sampling: sampling from the training set with replacement.
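
A minimal sketch of bootstrap sampling with the standard library. The seed, the toy training set, and the `bootstrap_sample` helper are assumptions made for illustration; the sample has the same size as the original set, which is the usual convention.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_set = list(range(10))
sample = bootstrap_sample(training_set, rng)

# Because sampling is with replacement, some items repeat and others are left out.
out_of_bag = sorted(set(training_set) - set(sample))
print(len(sample), len(set(sample)) <= len(training_set))
```

The items that were never drawn (`out_of_bag`) are often used as a held-out set for evaluation.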

Accuracy

How close a measurement is to the true value. For a classifier: (TP + TN) / total.

kernel

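In general usage: the innermost, essential part; a seed or grain, often in a shell. In machine learning: a function that measures the similarity between two points and corresponds to an inner product in a (possibly higher-dimensional) feature space.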

Resilient Distributed Dataset

The fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Transformation: a Spark operation (a set of instructions) that produces an RDD.
Action: a Spark operation that produces a local object.

Logistic Regression

A method for binary classification: a nonlinear regression model (a sigmoid curve between 0 and 1) that relates a set of explanatory variables to a dichotomous dependent variable. Effects are often reported as odds ratios (OR). We can take the linear regression output and pass it through the sigmoid function to turn it into a logistic regression, then define a cutoff of 0.5: anything below goes to class 0, anything above goes to class 1.
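
The sigmoid-plus-cutoff step can be sketched as follows. The weights and bias are made-up values rather than fitted ones, the helper names are illustrative, and sending a probability of exactly 0.5 to class 1 is an assumption of this sketch.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_class(features, weights, bias, cutoff=0.5):
    """Linear combination -> sigmoid -> class label via the cutoff."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= cutoff else 0

weights, bias = [1.0], -1.0  # hypothetical, not fitted to data
print(sigmoid(0))                             # 0.5
print(predict_class([2.0], weights, bias))    # 1
print(predict_class([0.0], weights, bias))    # 0
```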

Elbow Method

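On a plot of an error measure against a parameter (for example, within-cluster error against the number of clusters k), the point where the curve bends like an elbow indicates the best value to choose on the x-axis: beyond it, further increases bring little improvement.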

k-means clustering

Organising observations into one of k groups based on a measure of similarity. The goal is to find groups of points that are close to each other but far from points in other groups. Each cluster is defined entirely and only by its centre, or mean value µk.
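
One common way to find the clusters is Lloyd's algorithm, which alternates between assigning points to the nearest centre and moving each centre to the mean of its points. The sketch below is a minimal version under stated assumptions: fixed starting centroids, a fixed number of iterations, and made-up 2-D points.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids[1])  # (8.5, 8.5)
```

In practice the starting centroids are usually chosen at random, and the result can depend on that choice.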

Application Programming Interface

Programming hooks, or guidelines, published by firms that tell other programs how to get a service to perform a task, such as sending or receiving data. More generally, a set of routines, protocols, and tools for building software applications.

tf-idf

Short for term frequency-inverse document frequency: a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modelling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
Typically, the tf-idf weight is the product of two terms:
Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document.
Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.
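
The two terms can be computed directly from their definitions. This sketch assumes the natural logarithm (the base is a free choice), whitespace tokenisation, and a term that appears in at least one document; the tiny corpus is made up.

```python
from math import log

def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / containing)  # assumes containing > 0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(tf_idf("the", corpus[0], corpus))            # 0.0
print(round(tf_idf("mat", corpus[0], corpus), 3))  # 0.183
```

"the" appears in every document, so its weight is zero, while "mat" is specific to one document and scores highest there.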

KNN

k-nearest neighbours: a simple and efficient application of distance-based classification. Given training data and some similarity function, the algorithm finds the k nearest neighbours to the current example and classifies it by the majority decision of those neighbours. (In remote sensing, for instance, it examines each pixel to be classified and identifies the k nearest training samples as measured in multispectral data space.)
All instances become points in n-dimensional space, where n is the number of features of each instance.
If the target is continuous rather than categorical, the algorithm can output a prediction based on the mean value of the k nearest neighbours.
Theoretical bound: with enough training data, the error rate of the nearest-neighbour rule is at most twice the lowest error rate achievable by any rule.
Training algorithm: store all the data.
Prediction algorithm:
1. Calculate the distance from x to all points in your data.
2. Sort the points in your data by increasing distance from x.
3. Predict the majority label of the k closest points.
Choosing k: small k = picks up a lot of noise; large k = introduces a lot of bias.
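
The three prediction steps above can be sketched in a few lines. This version assumes Euclidean distance as the similarity function and uses a made-up training set; `knn_predict` is an illustrative name.

```python
from collections import Counter
from math import dist

def knn_predict(x, training_data, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # 1. Calculate the distance from x to all points in the data.
    distances = [(dist(x, point), label) for point, label in training_data]
    # 2. Sort the points by increasing distance from x.
    distances.sort(key=lambda pair: pair[0])
    # 3. Predict the majority label of the k closest points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# "Training" is just storing the labelled points.
training_data = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
print(knn_predict((2, 2), training_data))  # a
print(knn_predict((7, 7), training_data))  # b
```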

Bias

the difference between the average prediction of our model and the correct value which we are trying to predict.

Supervised Machine Learning

The system is told the "right answer": it requires humans to provide inputs and desired outputs, as well as feedback about prediction accuracy during the early stages of the system. Training a model from input data and its corresponding labels is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic.
Methods: classification, regression, prediction.
Used when: historical data predicts likely future events.

Variance

the variability of model prediction for a given data point or a value which tells us the spread of our data.

Precision and Recall

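Two measures of information retrieval effectiveness. Precision is a measure of result relevancy: the fraction of returned results that are relevant, TP / (TP + FP). Recall is a measure of how many truly relevant results are returned: the fraction of relevant results that are returned, TP / (TP + FN).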
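
In retrieval terms, both measures come from the overlap between what was returned and what is truly relevant. The document IDs below are made up, and `precision_recall` is an illustrative helper.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a set of retrieved items against the truly relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)  # how much of what was returned is relevant
    recall = hits / len(relevant)      # how much of what is relevant was returned
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, round(recall, 3))  # 0.5 0.667
```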

natural language processing

uses AI techniques to enable computers to generate and understand natural human languages, such as English

Unsupervised Machine Learning

We are not trying to predict any outcome; instead, we are trying to find patterns in the data. The system is not told the "right answer": the algorithm does not need labelled outputs to learn from. Training a model to find patterns in a data set, typically an unlabelled data set. The most common use is to cluster data into groups of similar examples, for example clustering songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. As another example, applying PCA to data on millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.
Methods: nearest-neighbour, k-means, self-organizing maps, singular value decomposition.
Used to: segment text topics, recommend items, identify data outliers (fraudulent credit card transactions, house prices, etc.).

