Data Science
Decision trees
Uses a tree structure to represent a number of possible decision paths and an outcome for each path
Dimension reduction
Using PCA or a similar method to find the smallest subset of dimensions that captures the most variation
Addressing overfitting and underfitting with validation curves
...
Building feature vectors from text
...
CRF (conditional random fields)
...
Categorical data
...
Categorical features
...
Convolutional net
...
Diagnose the common problems
...
Distributed Systems.
...
Dummy features
...
Elitism
...
Encoding class labels
...
Entropy
...
Estimator API
...
Evaluate predictive models
...
Exploratory Data Analysis (EDA)
...
Feature extraction
...
Feature scaling
...
Feature selection
...
Fine-tune machine learning models
...
Free parameter
...
From Mud to Structure.
...
Generation
...
Gini index
...
Global minimum
...
Grouping and grading
...
Holdout cross-validation
...
Interactive Systems
...
LabelEncoder
...
Learning curves
...
Leave-one-out (LOO) cross-validation
...
Likelihood
...
Linear Discriminant Analysis
...
Local minimum
...
Log-likelihood
...
MajorityVotingClassifier
...
Mean imputation
...
Median or most_frequent
...
Mutation
...
Node impurities
...
Nominal features
...
Normalization
...
Numerical feature
...
One-hot encoding
...
One-vs-Rest (OvR)
...
Ordinal features
...
OvR technique
...
Partial derivative of the log-likelihood function
...
Preprocessing techniques
...
Quantizer
...
Random-restart hill climbing
...
Raw term frequencies
...
Recursive backward elimination
...
SVM - Strengths and Weaknesses
...
Scatterplot matrix
...
Sequential Backward Selection (SBS)
...
Sequential feature selection
...
Sigmoid
...
Slack variable
...
Soft-margin classification
...
Standardization
...
Stochastic gradient descent
...
TBD: Cost function becomes differentiable
...
TBD: Crossover or breeding
...
The RM4Es (Research Methods Four Elements) is a good framework to summarize Machine Learning components and processes. The RM4Es include: Equation: Equations are used to represent the models for our research. Estimation: Estimation is the link between equations (models) and the data used for our research. Evaluation: Evaluation needs to be performed to assess the fit between models and the data. Explanation: Explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying.
...
The mean of each feature is centered at value 0 and the feature column has a standard deviation of 1
...
Vector
...
Vector space
...
What makes big datasets impractical
...
decreasing the degree of regularization
...
impute missing data
...
imputing categorical feature values
...
increase the number of parameters
...
increasing the regularization parameter; for unregularized models
...
k-fold cross-validation
...
stratified k-fold cross-validation
...
structure prediction
...
structured SVMs
...
structured perceptron
...
subgroup discovery
...
term frequency
...
term frequency-inverse document frequency
...
test set is not to be used for model selection; its only purpose is to report an unbiased estimate of the generalization performance of a classifier system
...
unbiased estimates of a model's performance
...
validation curves
...
value means more similar."
...
variance measures
...
z
...
zTBD: disadvantage of the holdout method is that the performance estimate is sensitive to how we partition the training set into the training and validation subsets
...
Naive Bayes classifier
A family of algorithms that consider every feature as independent of any other feature
Histogram
A graphical representation of the distribution of a set of numeric data, usually a vertical bar graph
What are kernels?
A kernel is a similarity function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and returns how similar they are. Kernels offer an alternative to hand-crafted features: instead of defining a slew of features, you define a single kernel function to compute similarity between, for example, images. You provide this kernel, together with the images and labels, to the learning algorithm, and out comes a classifier.
Feature
A machine learning expression for a piece of measurable information
Covariance
A measure of the relationship between two variables whose values are observed at the same time
Overfitting
A model that is too tied to a training set and will not perform well on test data
Deep learning
A multi-level algorithm that gradually identifies things at higher levels of abstraction
Standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1
Coefficient
A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (slope in line equation)
Data structure
A particular arrangement of units of data such as an array or a tree
S curve
A pattern in which something is adopted slowly at first, then gains popularity quickly, before leveling off
Serial correlation
A pattern in which values in a series are correlated with earlier values of the same series. To measure it, shift the time series by an interval called a lag and then compute the correlation of the shifted and original series
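The shift-and-correlate procedure can be sketched in a few lines of standard-library Python. The function name `autocorrelation` and the sample series are illustrative assumptions, not part of the original card.

```python
from statistics import mean

def autocorrelation(series, lag):
    """Pearson correlation between a series and a copy of itself shifted by `lag`."""
    if lag <= 0 or lag >= len(series) - 1:
        raise ValueError("lag must be between 1 and len(series) - 2")
    original = series[:-lag]   # values at time t
    shifted = series[lag:]     # values at time t + lag
    mean_o, mean_s = mean(original), mean(shifted)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(original, shifted))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_s = sum((s - mean_s) ** 2 for s in shifted)
    return cov / (var_o * var_s) ** 0.5

# Hypothetical series that repeats every 4 steps
seasonal = [1, 3, 5, 3] * 6
lag4 = autocorrelation(seasonal, 4)  # shifted copy lines up with the original
lag2 = autocorrelation(seasonal, 2)  # shifted copy is the mirror image
```

A correlation close to 1 at a given lag suggests the series repeats with that period; a value close to -1 suggests it is half a period out of phase.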
Gaussian distribution
A probability distribution that when graphed is a symmetrical bell curve with the mean at the center
Pandas
A python library for data manipulation
Logarithm
A quantity representing the power to which a fixed number (the base) must be raised to produce a given number
Confidence interval
A range specified around an estimate to indicate margin of error, combined with a probability that a value will fall in that range
Neural network
A robust function that takes an arbitrary set of inputs and fits it to an arbitrary set of outputs that are binary; unique because of hidden layer of weighted functions
Ruby
A scripting language that can be used for data science; not as popular as Python
Time series data
A sequence of measurements of some quantity taken at different times, often at equally spaced intervals
Algorithm
A series of repeatable steps for carrying out a certain type of task with data
Data engineer
A specialist in data wrangling who builds the infrastructure for real, tangible analysis and runs ETL
Supervised learning
A type of machine learning algorithm in which a system is taught to classify input into specific known classes
Discrete variable
A variable whose potential value must be one of a specific number of values
Continuous variable
A variable whose value can be any of infinite values
Data wrangling
AKA data munging; the conversion of data, using scripting languages, to make it easy to work with
DataFrame['A'].sum()
Adds up all values in column A
Computational linguistics
Also called natural language processing (NLP); converting text of spoken languages into structured data to extract valuable information
Backpropagation
An algorithm for iteratively adjusting the weights used in a neural network system. Often used to implement gradient descent.
Random forest
An algorithm used for regression or classification that uses a collection of tree data structures; the trees "vote" on the best model
Bayes' Theorem
An equation for calculating the probability that something is true if something potentially related is true. P(A|B) = P(B|A) * P(A) / P(B)
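A short worked example of the formula, as a sketch. The scenario (a test with a 95% true-positive rate, a 5% false-positive rate, and a 1% base rate) and the function name `bayes_posterior` are made up for illustration; P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive"
posterior = bayes_posterior(prior_a=0.01,
                            likelihood_b_given_a=0.95,
                            likelihood_b_given_not_a=0.05)
print(f"P(A|B) = {posterior:.3f}")
```

Even with a fairly accurate test, the posterior stays well below 50% here because the prior P(A) is so small.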
Perl
An older scripting language with roots in pre-Linux UNIX systems. Popular for text processing tasks such as data cleanup and enhancement
Angular JS
An open-source JavaScript library maintained by Google and the community. Lets you create single-page web applications to display results
R
An open-source programming language and environment for statistical computing and graph generation
Lift
Compares the frequency of an observed pattern with how often you'd expect to see that pattern by chance; a lift near 1 suggests the pattern is due to chance
DataFrame['A'].count()
Counts the non-null values in column A; equals the number of rows when the column has no missing values
Accuracy
Execution time, memory usage, throughput, tuning, and adaptability
Unstructured Information Management Architecture (UIMA)
Framework developed at IBM to analyze unstructured information, especially natural language
GATE
General Architecture for Text Engineering; open-source Java-based framework for natural language processing tasks
Variance
How much a list of numbers varies from the average; computed as the average of the squared difference of each number from the mean
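The definition can be checked by hand against the standard library. This sketch uses the population variance (dividing by the number of values), which is what the card describes; the sample list is an arbitrary illustration.

```python
from statistics import mean, pvariance, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data

avg = mean(values)
# Average of the squared differences from the mean
manual_variance = sum((v - avg) ** 2 for v in values) / len(values)

# The standard library gives the same population variance,
# and the standard deviation is its square root.
library_variance = pvariance(values)
library_stdev = pstdev(values)
```

Note that `statistics.variance` (without the `p`) divides by n - 1 instead, which is the usual choice when the values are a sample rather than the whole population.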
Instance-based learning
KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.
K-nearest neighbors
Machine learning algorithm that classifies things based on their similarity to nearby neighbors. You pick the number of neighbors, K
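A minimal sketch of the idea, assuming Euclidean distance as the similarity measure and a simple majority vote. The function name `knn_predict` and the toy training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]

near_a = knn_predict(train, (2, 2), k=3)
near_b = knn_predict(train, (7, 9), k=3)
```

Nothing is learned ahead of time: all the work happens at prediction time, when the stored training points are searched.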
Gradient boosting
Machine learning technique for regression and classification. Produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees, built in a stage-wise fashion
DataFrame.hist()
Make a histogram using matplotlib
Linear algebra
Math that deals with vector spaces and operations on them such as addition and subtraction
Correlation coefficient
Measure of how closely two variables correlate. Ranges from -1 to 1
Logistic regression
Model where the dependent variable is categorical. Estimates the probability of a relationship between a categorical variable and one or more independent variables
Prior distribution
Models the many plausible values of the unknown quantity to be estimated in Bayesian inference
Nonparametric
Nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.
Gradient descent
Optimization algorithm for finding the input to a function that produces the optimal value; iterative
Optimization algorithm
Optimization algorithm such as gradient ascent
R
Python (or Perl), Excel, SQL, graphics (visualization), FTP, basic UNIX commands (sort, grep, head, tail, the pipe and redirect operators, cat, cron jobs, and so on) , ...
import matplotlib.pyplot as plt
Python module useful for graphing data
Scalar
Quantity that has magnitude but no direction in space, such as volume or temperature
Pivot table
Quickly summarizes long lists of data without requiring the writing of formulas or copying cells. Can be arranged dynamically or pivoted
DataFrame['A'].mean()
Returns average of values in column A
DataFrame.head(n = 5)
Returns first n rows of a dataframe
Cross-validation
Set of techniques that divide data into training sets and test sets, often in an 80-20 split. The model is fit on the training set, which includes the correct categorization, and evaluated on the held-out test set
Least squares
Smallest sum of the squared distances to the data from the line
DataFrame.groupby()
Splits data into different groups depending on the variable you choose
Root Mean Square Error (RMSE)
Square root of mean squared error. More popular because it gives a number that is easier to understand in the units of the original observations
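A small sketch of the calculation with made-up observed and predicted values; the function names are illustrative.

```python
import math

def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def root_mean_squared_error(observed, predicted):
    """Square root of the MSE, expressed in the units of the observations."""
    return math.sqrt(mean_squared_error(observed, predicted))

observed = [3.0, 5.0, 2.0, 7.0]    # hypothetical measurements
predicted = [2.0, 5.0, 4.0, 8.0]   # hypothetical model output

mse = mean_squared_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
```

If the observations were in metres, the MSE would be in square metres while the RMSE is back in metres, which is why it is easier to interpret.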
PCA Principal Component Analysis
Statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
Chi-square test
Statistical test of whether two categorical variables are independent
Decision tree classifiers - Strengths and Weaknesses
Strengths: 1) feature scaling is not a requirement for decision tree algorithms; 2) the decision tree can be visualized (using GraphViz). Weakness: 1) the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. Note: a random forest combines weak learners to build a more robust, strong learner
Hyperplane
Subspace of one dimension less than its ambient space; for 3-D space, a hyperplane is a 2-D plane
Support vector machine
Supervised learning tool that seeks a dividing hyperplane in any number of dimensions; can be used for regression or classification
Linear regression
Technique that looks for a linear relationship between two variables by fitting the line that minimizes the sum of squared errors (least squares)
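For two variables, the least-squares line has a closed-form solution. The sketch below assumes that simple one-predictor case; `fit_line` and the sample points are illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x, mean_y = mean(xs), mean(ys)
    # slope = covariance(x, y) / variance(x)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```

With noisy data the points no longer fall on the line, and the same formulas return the line with the smallest sum of squared vertical distances.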
SQL
The ISO standard query language for relational databases
Data science
The ability to extract knowledge and insights from large and complex data sets
Artificial intelligence
The ability to have machines act with apparent intelligence. Can be through symbolic logic or statistical analysis
Standard deviation
The square root of the variance; a common way to indicate how different a particular measurement is from the mean
Collaborative filtering
The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called 'Using collaborative filtering to weave an information tapestry.' He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people. There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.
Data Mining
The use of computers to analyze large data sets to look for patterns that let people make business decisions
Machine learning
The use of data-driven algorithms that perform better as they have more data to work with; generally uses cross-validation
Econometrics
The use of mathematical and statistical methods in the field of economics to verify and develop economic theories
Monte Carlo method
The use of randomly generated numbers as part of an algorithm
Dependent variable
The value depends on the value of the independent variable
Classification error
This is a useful criterion for pruning but not recommended for growing a decision tree, since it is less sensitive to changes in the class probabilities of the nodes.
Spatiotemporal data
Time series data that also includes geographic identifiers such as latitude-longitude pairs
Standardized score
Transformed raw score into units of standard deviation above or below the mean
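As a sketch, the transformation is z = (x - mean) / standard deviation. The helper name `standardize` and the sample values are assumptions for illustration, and the population standard deviation is used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw scores to z-scores: units of standard deviation from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative raw scores
scores = standardize(raw)
```

After the transformation the scores have a mean of 0 and a standard deviation of 1, so a score of 2.0 reads directly as "two standard deviations above the mean".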
Clustering
Unsupervised learning technique for dividing data into groups based on an algorithm
Gaussian Kernel
Used for Kernel Trick in SVMs
Radial Basis Function kernel (RBF kernel)
Used for Kernel Trick in SVMs
Objective function
Used to find the optimal result of an objective; used to solve an optimization problem
Feature engineering
Using features to come up with a good model through iteration
Parametric
Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM.
Kernel trick
To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional feature space via a mapping function $\phi(\cdot)$ and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function $\phi(\cdot)$ to transform new, unseen data to classify it using the linear SVM model. However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play. Although we didn't go into much detail about how to solve the quadratic programming task to train an SVM, in practice all we need is to replace the dot product $x^{(i)T} x^{(j)}$ by $\phi(x^{(i)})^T \phi(x^{(j)})$. In order to save the expensive step of calculating this dot product between two points explicitly, we define a so-called kernel function: $k(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$. One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel: $k(x^{(i)}, x^{(j)}) = \exp(-\gamma \lVert x^{(i)} - x^{(j)} \rVert^2)$, where $\gamma = 1 / (2\sigma^2)$. The trick is to choose a transformation so that the kernel can be computed without actually computing the transformation, replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed to a higher dimensional space. This is usually done using the radial basis function.
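The claim that a kernel returns "what the dot product would have been" can be checked numerically. This sketch swaps in a degree-2 polynomial kernel, because its feature map is small enough to write out explicitly (the RBF kernel's map is infinite-dimensional). The names `phi`, `poly_kernel`, and `rbf_kernel`, and the default `gamma`, are illustrative assumptions.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: equals dot(phi(x), phi(y)) without building phi."""
    return dot(x, y) ** 2

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * squared_distance)

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))   # transform first, then take the dot product
implicit = poly_kernel(x, y)     # same value, computed in the original space
similarity = rbf_kernel(x, y)    # between 0 and 1; identical points give 1
```

The explicit route builds three features per point before multiplying; the kernel gets the same number from a single dot product in the original two dimensions, and the saving grows quickly with the input dimension and polynomial degree.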
T-distribution
Variation of the normal distribution that accounts for the fact that you're only using a sample of values, not all of them
Three Classes Of Metrics: Centrality
Volatility, Bumpiness. , ...
Diagnosing bias and variance problems with learning curves
High bias: the model has both low training and cross-validation accuracy, which indicates that it underfits the training data. Common ways to address this issue are to 1) increase the number of parameters of the model, for example, by collecting or constructing additional features, or 2) decrease the degree of regularization, for example, in SVM or logistic regression classifiers. High variance: there is a large gap between the training and cross-validation accuracy, which indicates overfitting. To address this problem, we can 1) collect more training data, 2) reduce the complexity of the model, for example, by increasing the regularization parameter, or 3) for unregularized models, decrease the number of features via feature selection or feature extraction (see Compressing Data via Dimensionality Reduction). Collecting more training data decreases the chance of overfitting; however, it may not always help, for example, when the training data is extremely noisy or the model is already very close to optimal.
Big data
Working with large datasets that usually require distributed storage
Six Sigma
approximate solutions, the 80/20 rule, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects , ...
Collinearity
collinearity (high correlation among features), filter out noise from data, and eventually prevent overfitting
geometric
probabilistic, and logical , ...
Random variables
probability, mean, variance, percentiles, experimental design, cross-validation, goodness of fit, and robust statistics , ...
Machine learning projects can be divided into five distinct activities
shown as follows: Defining the object and specification; Preparing and exploring the data; Model building; Implementation; Testing; Deployment , ...
Correlograms
trends, change point, normalization and periodicity , ...
If we stop at this point and feed the array to our classifier
we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green , ...
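The problem and its usual fix, one-hot encoding, can be sketched without any library. The color list and variable names are illustrative; the integer mapping mirrors the ordering described above (blue < green < red).

```python
colors = ["green", "red", "blue", "green"]  # a nominal (unordered) feature

# Integer encoding: compact, but it implies an order that does not exist
categories = sorted(set(colors))                       # ['blue', 'green', 'red']
label_index = {c: i for i, c in enumerate(categories)}
integer_encoded = [label_index[c] for c in colors]

# One-hot encoding: one binary "dummy" column per category, no implied order
one_hot = [
    [1 if label_index[c] == i else 0 for i in range(len(categories))]
    for c in colors
]
```

Each one-hot row contains exactly one 1, so no color is numerically "larger" than another.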
High variance
which is indicated by the large gap between the training and cross-validation accuracy , ...
Load data
with packages like RODBC or RMySQL; Manipulate data, with packages like stringr or lubridate; Visualize data, with packages like ggplot2 or leaflet; Model data, with packages like Random Forest or survival; Report results, with packages like shiny or markdown , ...
ICLR
which stands for the International Conference on Learning Representations , ...
Adaptable Systems
...
Validation curves
...
PageRank
An algorithm that determines the importance of something, typically to rank it in a list of search results
Reinforcement learning
A class of machine learning algorithms that are not given specific goals but continuously monitor whether they are doing well or not
SAS
A commercial statistical software suite that includes a programming language
Poisson distribution
A distribution of independent events used to predict the probability of an event occurring in a set time or place
Binomial distribution
A distribution of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success. Discrete probability distribution. Graphed using histograms.
PCA
PCA attempts to find the orthogonal component axes of maximum variance in a dataset. Kernel principal component analysis
Centroid
Center of a cluster
Unsupervised learning
Class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be
Regularization
Collect more training data; Introduce a penalty for complexity via regularization; Choose a simpler model with fewer parameters; Reduce the dimensionality of the data. L1 regularization can be understood as a technique for feature selection.
TBD: Pipeline
Combining transformers and estimators in a pipeline
CSV
Comma-separated values; a common data file type
Cost function
Cost function is the key to solving any problem using optimization
DataFrame['A'].max()
Returns largest value in column A
D3
Data-Driven Documents; a JavaScript library that eases the creation of interactive visualizations embedded in web pages
K means clustering
Data mining algorithm that clusters (groups) data points into K groups
Quartile
Data set divided into 4 groups, with 25% of the data in each
ETL
Extract, Transform, Load
Feature scaling
Feature scaling such as standardization
Types of Kernels for Kernel Trick
Fisher kernel, Graph kernels, Kernel smoother, Polynomial kernel, RBF kernel, String kernels
Regression
Fitting a model to data
Bayesian network
Graphs that compactly represent the relationship between random variables for a given problem
Bias
In machine learning when a learner consistently learns the same thing wrong
Information gain (IG)
Information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities—the lower the impurity of the child nodes, the larger the information gain
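A minimal sketch of the calculation. It assumes the common formulation in which each child's impurity is weighted by its share of the parent's samples, and it offers both entropy and the Gini index as impurity measures; the function names and toy labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy impurity: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum(p^2) over the class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the sample-weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["a"] * 4 + ["b"] * 4

# A split that separates the classes perfectly vs. one that changes nothing
perfect = information_gain(parent, [["a"] * 4, ["b"] * 4])
useless = information_gain(parent, [["a", "a", "b", "b"], ["a", "a", "b", "b"]])
perfect_gini = information_gain(parent, [["a"] * 4, ["b"] * 4], impurity=gini)
```

A tree-growing algorithm would evaluate many candidate splits this way and keep the one with the largest gain.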
IDF
Inverse document frequency
L2 regularization
L2 regularization (sometimes also called L2 shrinkage or weight decay)
Probability distribution
Listing of all possible distinct outcomes and their probabilities of occurring; the probabilities sum to 1
Parametric versus nonparametric models
Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM. KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.
DataFrame.boxplot()
Make a boxplot using matplotlib
Layers
Pipes and Filters, Blackboard, Broker, Model-View-Controller, Presentation-Abstraction-Control, Microkernel, and Reflection , ...
Stratified sampling
Population is divided into homogeneous groups called strata
Python
Programming language that is used in data science. Easy to use and powerful for advanced users by using specialized libraries
Predictive analytics
The analysis of data to predict future events, usually to aid in business planning
N-gram
The analysis of sequences of N items; usually words in natural language
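Extracting the sequences themselves is a one-liner over a sliding window. The function name `ngrams` and the whitespace tokenization are simplifying assumptions.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
```

Counting how often each tuple occurs in a corpus is the usual next step, for example to build features from text.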
Mean Absolute Error
The average error of all predicted values when compared with observed values
Mean Squared Error
The average of the squares of all the errors when comparing predicted values with observed values
Correlation
The degree of relative correspondence between two variables
Predictive modeling
The development of statistical models to predict future events
Classification
The identification of two or more discrete categories for items; a classic machine learning task. Examples: spam or ham, movie genres. A form of supervised learning.
Moving average
The mean of time series data from several consecutive periods; continually updated
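A sketch of a simple (unweighted) moving average over a fixed window; the function name and sample series are illustrative.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values in the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

monthly_sales = [10, 20, 30, 40, 50]   # hypothetical time series
smoothed = moving_average(monthly_sales, window=3)
```

Each new observation drops the oldest value from the window and adds the newest, which is what "continually updated" refers to.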
P-value
The probability, under the assumption of no difference (the null hypothesis), of obtaining a result at least as extreme as the one observed
Perceptron
The simplest neural network; it approximates a single neuron with N binary inputs. It computes a weighted sum of the inputs and "fires" if that weighted sum is zero or greater
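The firing rule can be sketched directly. The bias term is an assumption added here (the card mentions only the weighted sum), and the hand-picked weights that make the neuron behave like AND and OR gates are illustrative rather than learned.

```python
def perceptron_output(weights, bias, inputs):
    """Return 1 ("fires") if the weighted sum plus bias is zero or greater, else 0."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= 0 else 0

# Hand-picked (not learned) parameters for two binary inputs
and_gate = {"weights": [2, 2], "bias": -3}   # fires only when both inputs are 1
or_gate = {"weights": [2, 2], "bias": -1}    # fires when at least one input is 1

and_results = [perceptron_output(and_gate["weights"], and_gate["bias"], pair)
               for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
or_results = [perceptron_output(or_gate["weights"], or_gate["bias"], pair)
              for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Training a perceptron means finding such weights automatically from labeled examples instead of choosing them by hand.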
Matrix
Two dimensional array of values arranged in rows and columns
Latent variable
Variables that are not directly observed but inferred from other variables that are observed
Three V's
Volume, Velocity, Variety
TBD: Convert categorical data
such as text or words, into a numerical form , ...
AI or machine learning
the main conferences are NIPS and ICML, and also conferences like AI Stats, UAI, and KDD, which is more data science-oriented , ...