Data Science

Decision trees

Uses a tree structure to represent a number of possible decision paths and an outcome for each path

Dimension reduction

Using PCA or a similar method to find the smallest subset of dimensions that captures the most variation

Addressing overfitting and underfitting with validation curves

...

Building feature vectors from text

...

CRF (conditional random fields)

...

Categorical data

...

Categorical features

...

Convolutional net

...

Diagnose the common problems

...

Distributed Systems

...

Dummy features

...

Elitism

...

Encoding class labels

...

Entropy

...

Estimator API

...

Evaluate predictive models

...

Exploratory Data Analysis (EDA)

...

Feature extraction

...

Feature scaling

...

Feature selection

...

Fine-tune machine learning models

...

Free parameter

...

From Mud to Structure

...

Generation

...

Gini index

...

Global minimum

...

Grouping and grading

...

Holdout cross-validation

...

Interactive Systems

...

LabelEncoder

...

Learning curves

...

Leave-one-out (LOO) cross-validation

...

Likelihood

...

Linear Discriminant Analysis

...

Local minimum

...

Log-likelihood

...

MajorityVotingClassifier

...

Mean imputation

...

Median or most_frequent

...

Mutation

...

Node impurities

...

Nominal features

...

Normalization

...

Numerical feature

...

One-hot encoding

...

One-vs-Rest (OvR)

...

Ordinal features

...

OvR technique

...

Partial derivative of the log-likelihood function

...

Preprocessing techniques

...

Quantizer

...

Random-restart hill climbing

...

Raw term frequencies

...

Recursive backward elimination

...

SVM - Strengths and Weaknesses

...

Scatterplot matrix

...

Sequential Backward Selection (SBS)

...

Sequential feature selection

...

Sigmoid

...

Slack variable

...

Soft-margin classification

...

Standardization

...

Stochastic gradient descent

...

TBD: Cost function becomes differentiable

...

TBD: Crossover or breeding

...

The RM4Es (Research Methods Four Elements) is a good framework for summarizing machine learning components and processes. The RM4Es are:
Equation: Equations are used to represent the models for our research
Estimation: Estimation is the link between equations (models) and the data used for our research
Evaluation: Evaluation needs to be performed to assess the fit between models and the data
Explanation: Explanation is the link between equations (models) and our research purposes. How we explain our research results often depends on our research purposes and also on the subject we are studying

...

The mean of each feature is centered at value 0 and the feature column has a standard deviation of 1

...

Vector

...

Vector space

...

What makes big datasets impractical

...

decreasing the degree of regularization

...

impute missing data

...

imputing categorical feature values

...

increase the number of parameters

...

increasing the regularization parameter; for unregularized models

...

k-fold cross-validation

...

stratified k-fold cross-validation

...

structure prediction

...

structured SVMs

...

structured perceptron

...

subgroup discovery

...

term frequency

...

term frequency-inverse document frequency

...

test set is not to be used for model selection; its only purpose is to report an unbiased estimate of the generalization performance of a classifier system

...

unbiased estimates of a model's performance

...

validation curves

...

variance measures

...

TBD: A disadvantage of the holdout method is that the performance estimate is sensitive to how we partition the training set into the training and validation subsets

...

Naive Bayes classifier

A family of algorithms that consider every feature as independent of any other feature

Histogram

A graphical representation of the distribution of a set of numeric data, usually as a vertical bar graph

What are kernels?

A kernel is a similarity function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are. Kernels offer an alternative to hand-crafting features: instead of defining a slew of features, you define a single kernel function to compute similarity between, say, images. You provide this kernel, together with the images and labels, to the learning algorithm, and out comes a classifier.
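
A minimal sketch of a kernel as a similarity function in Python (assuming NumPy), using the Gaussian (RBF) kernel; identical inputs score 1.0 and dissimilar inputs approach 0:

    import numpy as np

    def rbf_kernel(x1, x2, gamma=1.0):
        # Similarity in (0, 1]: 1.0 for identical inputs, near 0 for distant ones
        return np.exp(-gamma * np.sum((x1 - x2) ** 2))

    a = np.array([1.0, 2.0])
    b = np.array([1.5, 1.8])
    print(rbf_kernel(a, a))  # 1.0
    print(rbf_kernel(a, b))  # less than 1.0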

Feature

A machine learning expression for a piece of measurable information

Covariance

A measure of the relationship between two variables whose values are observed at the same time

Overfitting

A model that is too tied to a training set and will not perform well on test data

Deep learning

A multi-level algorithm that gradually identifies things at higher levels of abstraction

Standard normal distribution

A normal distribution with a mean of 0 and a standard deviation of 1

Coefficient

A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (e.g., the slope in a line equation)

Data structure

A particular arrangement of units of data such as an array or a tree

S curve

A pattern in which something is adopted slowly at first and then gains popularity quickly

Serial correlation

A pattern where values in a series are correlated with earlier values; you can shift a time series by an interval called a lag and then compute the correlation of the shifted and original series

Gaussian distribution

A probability distribution that when graphed is a symmetrical bell curve with the mean at the center

Pandas

A Python library for data manipulation

Logarithm

A quantity representing the power to which a fixed number (the base) must be raised to produce a given number

Confidence interval

A range specified around an estimate to indicate margin of error, combined with a probability that a value will fall in that range

Neural network

A robust function that takes an arbitrary set of inputs and fits it to an arbitrary set of outputs that are binary; unique because of hidden layer of weighted functions

Ruby

A scripting language that can be used for data science, though not as popular as Python

Time series data

A sequence of measurements of some quantity taken at different times often at equally spaced intervals

Algorithm

A series of repeatable steps for carrying out a certain type of task with data

Data engineer

A specialist in data wrangling; data engineers build infrastructure for real, tangible analysis and run ETL processes

Supervised learning

A type of machine learning algorithm in which a system is taught to classify input into specific known classes

Discrete variable

A variable whose potential value must be one of a specific number of values

Continuous variable

A variable whose value can be any of an infinite number of values

Data wrangling

AKA data munging; the conversion of data, using scripting languages, into a form that is easy to work with

DataFrame['A'].sum()

Adds up all values in column A

Computational linguistics

Also called natural language processing (NLP); converting the text of spoken languages into structured data in order to extract valuable information

Backpropagation

An algorithm for iteratively adjusting the weights used in a neural network system. Often used to implement gradient descent.

Random forest

An algorithm used for regression or classification that uses a collection of tree data structures; the trees "vote" on the best model

Bayes' Theorem

An equation for calculating the probability that something is true if something potentially related is true. P(A|B) = P(B|A) * P(A) / P(B)
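
A quick worked example of the formula; the numbers (P(A)=0.01, P(B|A)=0.99, P(B)=0.05) are hypothetical:

    p_a = 0.01          # P(A): prior probability of A
    p_b_given_a = 0.99  # P(B|A)
    p_b = 0.05          # P(B)
    p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' Theorem
    print(p_a_given_b)  # 0.198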

Perl

An older scripting language with roots in Unix systems; popular for text processing such as data cleanup and enhancement

Angular JS

An open-source JavaScript framework maintained by Google and the community; lets you create single-page web applications to display results

R

An open-source programming language and environment for statistical computing and graph generation

Lift

Compares the frequency of an observed pattern with how often you'd expect to see that pattern by chance; a lift near 1 means the pattern occurs about as often as chance would predict

DataFrame['A'].count()

Counts the number of non-null values in column A, which gives the number of rows with data

Accuracy

Execution time, memory usage, throughput, tuning, and adaptability

Unstructured Information Management Architecture (UIMA)

Framework developed at IBM to analyze unstructured information especially natural language

GATE

General Architecture for Text Engineering; open source java-based framework for natural language processing tasks

Variance

How much a list of numbers varies from the average; computed as the average of the squared differences of each number from the mean

Instance-based learning

KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.

K-nearest neighbors

Machine learning algorithm that classifies things based on their similarity to nearby neighbors. Pick the number of neighbors K
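
A minimal usage sketch with scikit-learn (the Iris dataset and the choice K=5 are arbitrary illustrations):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))  # accuracy on held-out data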

Gradient boosting

Machine learning technique for regression and classification. Produces a prediction model in the form of an ensemble of weak prediction models typically decision trees; stage-wise

DataFrame.hist()

Make a histogram using matplotlib

Linear algebra

Math that deals with vector spaces and operations on them such as addition and subtraction

Correlation coefficient

Measure of how closely two variables correlate. Ranges from -1 to 1

Logistic regression

Model where the dependent variable is categorical. Estimates the probability of a relationship between a categorical variable and one or more independent variables

Prior distribution

Models the many plausible values of the unknown quantity to be estimated in Bayesian inference

Nonparametric

Nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM.

Gradient descent

Optimization algorithm for finding the input to a function that produces the optimal value; iterative
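
A toy illustration of the iterative idea: minimizing f(w) = (w - 3)^2, whose gradient is 2(w - 3); the function and step size are arbitrary choices for this sketch:

    w = 0.0
    learning_rate = 0.1
    for _ in range(100):
        gradient = 2 * (w - 3)         # derivative of (w - 3)^2
        w -= learning_rate * gradient  # step opposite the gradient
    print(w)  # converges toward 3.0, the minimizer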

Optimization algorithm

Optimization algorithm such as gradient ascent

R

Python (or Perl), Excel, SQL, graphics (visualization), FTP, basic UNIX commands (sort, grep, head, tail, the pipe and redirect operators, cat, cron jobs, and so on), ...

import matplotlib.pyplot as plt

Python module useful for graphing data

Scalar

Quantity that has magnitude but no direction in space, such as volume or temperature

Pivot table

Quickly summarizes long lists of data without requiring the writing of formulas or copying cells. Can be arranged dynamically or pivoted

DataFrame['A'].mean()

Returns average of values in column A

DataFrame.head(n = 5)

Returns first n rows of a dataframe

Cross-validation

Set of techniques that divide data into training and test sets, usually 80-20. Training sets are given the correct categorization, and an algorithm is trained on them

Least squares

The smallest sum of the squared distances from the data points to the line

DataFrame.groupby()

Splits data into different groups depending on the variable you choose
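
The pandas calls from these cards, gathered into one runnable sketch (the toy DataFrame is invented for illustration):

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3, 4], 'group': ['x', 'x', 'y', 'y']})
    print(df['A'].sum())    # 10
    print(df['A'].mean())   # 2.5
    print(df['A'].max())    # 4
    print(df['A'].count())  # 4 non-null values
    print(df.groupby('group')['A'].mean())  # mean of A within each group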

Root Mean Square Error (RMSE)

Square root of mean squared error. More popular because it gives a number that is easier to understand in the units of the original observations
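
A small sketch of MSE and RMSE side by side (assuming NumPy; the numbers are made up):

    import numpy as np

    observed = np.array([3.0, 5.0, 2.0])
    predicted = np.array([2.5, 5.5, 2.0])
    mse = np.mean((predicted - observed) ** 2)
    rmse = np.sqrt(mse)  # back in the units of the original observations
    print(mse, rmse)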

PCA Principal Component Analysis

Statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

Chi-square test

Statistical test of whether two categorical variables are independent

Decision tree classifiers - Strengths and Weaknesses

Strengths: 1) feature scaling is not a requirement for decision tree algorithms; 2) the tree can be visualized (using GraphViz). Weaknesses: 1) we have to be careful, since the deeper the decision tree, the more complex the decision boundary becomes, which can easily result in overfitting. Note: using a random forest combines many weak learners into a stronger learner.
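
A minimal scikit-learn sketch of both points: no feature scaling is applied, depth is capped to limit overfitting, and the fitted tree is visualized (plot_tree is used here in place of GraphViz; the Iris dataset is a stand-in):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)  # raw, unscaled features
    tree = DecisionTreeClassifier(criterion='gini', max_depth=3).fit(X, y)
    plot_tree(tree)  # visualize the learned decision paths
    plt.show()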

Hyperplane

A subspace of one dimension less than its ambient space; for 3-D space, a hyperplane is a 2-D plane

Support vector machine

Supervised learning classification tool that seeks a dividing hyperplane for any number of dimensions can be used for regression or classification

Linear regression

Technique that looks for a linear relationship between two variables using the line with the least squares

SQL

The ISO standard query language for relational databases

Data science

The ability to extract knowledge and insights from large and complex data sets

Artificial intelligence

The ability to have machines act with apparent intelligence. Can be through symbolic logic or statistical analysis

Standard deviation

The square root of the variance; a common way to indicate how different a particular measurement is from the mean

Collaborative filtering

The term collaborative filtering was first used by David Goldberg at Xerox PARC in 1992 in a paper called 'Using collaborative filtering to weave an information tapestry.' He designed a system called Tapestry that allowed people to annotate documents as either interesting or uninteresting and used this information to filter documents for other people. There are now hundreds of web sites that employ some sort of collaborative filtering algorithm for movies, music, books, dating, shopping, other web sites, podcasts, articles, and even jokes.

Data Mining

The use of computers to analyze large data sets to look for patterns that let people make business decisions

Machine learning

The use of data-driven algorithms that perform better as they have more data to work with; generally uses cross-validation

Econometrics

The use of mathematical and statistical methods in the field of economics to verify and develop economic theories

Monte Carlo method

The use of randomly generated numbers as part of an algorithm
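
A classic illustration: estimating pi from randomly generated points (the sample size is arbitrary):

    import random

    inside = 0
    n = 100_000
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    print(4 * inside / n)  # roughly 3.14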

Dependent variable

A variable whose value depends on the value of the independent variable

Classification error

This is a useful criterion for pruning but not recommended for growing a decision tree, since it is less sensitive to changes in the class probabilities of the nodes.

Spatiotemporal data

Time series data that also includes geographic identifiers such as latitude-longitude pairs

Standardized score

A raw score transformed into units of standard deviations above or below the mean

Clustering

Unsupervised learning technique for dividing data into groups based on an algorithm

Gaussian Kernel

Used for Kernel Trick in SVMs

Radial Basis Function kernel (RBF kernel)

Used for Kernel Trick in SVMs

Objective function

Used to find the optimal result of an objective; used to solve an optimization problem

Feature engineering

Iteratively creating and refining features to come up with a good model

Parametric

Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM.

Kernel trick

To solve a nonlinear problem using an SVM, we transform the training data onto a higher-dimensional feature space via a mapping function φ and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function φ to transform new, unseen data and classify it with the linear SVM model. However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play. Although we didn't go into much detail about how to solve the quadratic programming task to train an SVM, in practice all we need is to replace the dot product x(i)ᵀx(j) with φ(x(i))ᵀφ(x(j)). In order to save the expensive step of calculating this dot product between two points explicitly, we define a so-called kernel function: K(x(i), x(j)) = φ(x(i))ᵀφ(x(j)). One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian kernel: K(x(i), x(j)) = exp(−γ‖x(i) − x(j)‖²). The trick is to choose a transformation so that the kernel can be computed without actually computing the transformation: we replace the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed to the higher-dimensional space. This is usually done using the radial basis function.
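
A minimal scikit-learn sketch: fitting an SVM with the RBF kernel on a nonlinearly separable toy dataset (make_moons and the gamma/C values are illustrative choices):

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    clf = SVC(kernel='rbf', gamma=1.0, C=1.0)  # Gaussian kernel via the kernel trick
    clf.fit(X, y)
    print(clf.score(X, y))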

T-distribution

Variation of normal distribution that accounts for the fact that you're only using a sample of values not all of them

Three Classes Of Metrics: Centrality

Volatility, Bumpiness, ...

Diagnosing bias and variance problems with learning curves

When a model has both low training and low cross-validation accuracy, it underfits the training data. Common ways to address this are to 1) increase the number of parameters of the model, for example, by collecting or constructing additional features, or 2) decrease the degree of regularization, for example, in SVM or logistic regression classifiers. When a model suffers from high variance, indicated by a large gap between the training and cross-validation accuracy, it overfits the training data. To address overfitting, 1) we can collect more training data or reduce the complexity of the model, for example, by increasing the regularization parameter; 2) for unregularized models, it can also help to decrease the number of features via feature selection or feature extraction (compressing data via dimensionality reduction); 3) collecting more training data decreases the chance of overfitting, although it may not always help, for example, when the training data is extremely noisy or the model is already very close to optimal.
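
A sketch of reading these diagnostics off a learning curve with scikit-learn (the dataset and estimator are stand-ins):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)
    sizes, train_scores, valid_scores = learning_curve(
        LogisticRegression(max_iter=5000), X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5))
    # Both means low -> underfitting; large gap between them -> overfitting
    print(train_scores.mean(axis=1))
    print(valid_scores.mean(axis=1))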

Big data

Working with large datasets that usually require distributed storage

Six Sigma

approximate solutions, the 80/20 rule, cross-validation, design of experiments, modern pattern recognition, lift metrics, third-party data, Monte Carlo simulations, or the life cycle of data science projects, ...

Collinearity

High correlation among features. Dimensionality reduction helps reduce collinearity, filter out noise from the data, and eventually prevent overfitting

geometric

probabilistic, and logical, ...

Random variables

probability, mean, variance, percentiles, experimental design, cross-validation, goodness of fit, and robust statistics, ...

Machine learning projects can be divided into five distinct activities

shown as follows: 1) defining the objective and specification; 2) preparing and exploring the data; 3) model building; 4) implementation and testing; 5) deployment, ...

Correlograms

trends, change points, normalization, and periodicity, ...

If we stop at this point and feed the array to our classifier

we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green, ...

High variance

which is indicated by the large gap between the training and cross-validation accuracy, ...

Load data

with packages like RODBC or RMySQL; manipulate data, with packages like stringr or lubridate; visualize data, with packages like ggplot2 or leaflet; model data, with packages like randomForest or survival; report results, with packages like shiny or markdown

ICLR

which stands for the International Conference on Learning Representations, ...

Adaptable Systems

...

Validation curves

...

PageRank

An algorithm that determines the importance of something typically to rank it in a list of search results

Reinforcement learning

A class of machine learning algorithms in which a system has no fixed training targets but continuously monitors feedback on whether it is doing well or not

SAS

A commercial statistical software suite that includes a programming language

Poisson distribution

A distribution of independent events used to predict the probability of an event occurring in a set time or place

Binomial distribution

A distribution of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success; a discrete probability distribution, graphed using histograms

PCA

PCA attempts to find the orthogonal component axes of maximum variance in a dataset. Kernel principal component analysis (KPCA) extends this to nonlinear mappings via the kernel trick.

Centroid

Center of a cluster

Unsupervised learning

Class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be

Regularization

Ways to address overfitting: 1) collect more training data; 2) introduce a penalty for complexity via regularization; 3) choose a simpler model with fewer parameters; 4) reduce the dimensionality of the data. L1 regularization can be understood as a technique for feature selection.
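
A short sketch of point 2 and of L1-driven feature selection in scikit-learn (the dataset and C value are illustrative; smaller C means stronger regularization):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)
    print((clf.coef_ == 0).sum(), 'feature weights driven to zero by L1')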

TBD: Pipeline

Combining transformers and estimators in a pipeline
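
A minimal sketch of the idea with scikit-learn's Pipeline (the particular transformers and estimator are arbitrary examples):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    pipe = Pipeline([('scale', StandardScaler()),     # transformer
                     ('pca', PCA(n_components=2)),    # transformer
                     ('clf', LogisticRegression())])  # final estimator
    pipe.fit(X, y)
    print(pipe.score(X, y))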

CSV

Comma-separated values; a common data file type

Cost function

The cost function is the key to solving any problem using optimization

DataFrame['A'].max()

Returns largest value in column A

D3

Data Driven Documents a JavaScript library that eases the creation of interactive visualizations embedded in web pages

K means clustering

Data mining algorithm that clusters data into K groups; an unsupervised way to classify observations

Quartile

A data set divided into four groups, with 25% of the data in each

ETL

Extract, Transform, Load

Feature scaling

Feature scaling such as standardization

Types of Kernels for Kernel Trick

Fisher kernel, graph kernels, kernel smoother, polynomial kernel, RBF kernel, string kernels

Regression

Fitting a model to data

Bayesian network

Graphs that compactly represent the relationship between random variables for a given problem

Bias

In machine learning, when a learner consistently learns the same thing wrong

Information gain (IG)

Information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities: the lower the impurity of the child nodes, the larger the information gain
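
A worked example using entropy as the impurity measure (the 40/40 parent split and the child splits are invented numbers):

    import numpy as np

    def entropy(probs):
        # Impurity of a node given its class-probability vector
        probs = np.array([p for p in probs if p > 0])
        return -np.sum(probs * np.log2(probs))

    parent = entropy([0.5, 0.5])    # 80 samples: 40 vs 40
    left = entropy([0.75, 0.25])    # 40 samples: 30 vs 10
    right = entropy([0.25, 0.75])   # 40 samples: 10 vs 30
    ig = parent - (40 / 80) * left - (40 / 80) * right
    print(ig)  # positive: the split reduced impurity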

IDF

Inverse document frequency

L2 regularization

L2 regularization (sometimes also called L2 shrinkage or weight decay)

Probability distribution

Listing of all possible distinct outcomes and their probabilities of occurring; the probabilities sum to 1

Parametric versus nonparametric models

Machine learning algorithms can be grouped into parametric and nonparametric models. Using parametric models, we estimate parameters from the training dataset to learn a function that can classify new data points without requiring the original training dataset anymore. Typical examples of parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, nonparametric models can't be characterized by a fixed set of parameters, and the number of parameters grows with the training data. Two examples of nonparametric models that we have seen so far are the decision tree classifier/random forest and the kernel SVM. KNN belongs to a subcategory of nonparametric models that is described as instance-based learning. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero) cost during the learning process.

DataFrame.boxplot()

Make a boxplot using matplotlib

Layers

Pipes and Filters, Blackboard, Broker, Model-View-Controller, Presentation-Abstraction-Control, Microkernel, and Reflection, ...

Stratified sampling

Population is divided into homogeneous groups called strata

Python

Programming language widely used in data science. Easy to use, and powerful for advanced users thanks to specialized libraries

Predictive analytics

The analysis of data to predict future events usually to aid in business planning

N-gram

The analysis of sequences of N items; usually words in natural language

Mean Absolute Error

The average of the absolute errors between predicted values and observed values

Mean Squared Error

The average of the squares of all the errors when comparing predicted values with observed values

Correlation

The degree of relative correspondence between two variables

Predictive modeling

The development of statistical models to predict future events

Classification

The identification of two or more discrete categories for items; a classic machine learning task. Examples: spam or ham, movie genres. A form of supervised learning.

Moving average

The mean of time series data from several consecutive periods; continually updated

P-value

The probability, under the assumption of no difference (the null hypothesis), of obtaining a result at least as extreme as the one observed

Perceptron

The simplest neural network approximates a single neuron with N binary inputs. It computes a weighted sum of the inputs and "fires" if that weighted sum is zero or greater
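
A one-function sketch of that rule (the weights and inputs are made up):

    import numpy as np

    def perceptron_fires(inputs, weights, threshold=0.0):
        # Weighted sum of binary inputs; "fires" if the sum is >= threshold
        return 1 if np.dot(inputs, weights) >= threshold else 0

    print(perceptron_fires(np.array([1, 0, 1]), np.array([0.5, -0.4, 0.3])))  # 1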

Matrix

Two dimensional array of values arranged in rows and columns

Latent variable

Variables that are not directly observed but inferred from other variables that are observed

Three V's

Volume, Velocity, and Variety

TBD: Convert categorical data

such as text or words, into a numerical form, ...

AI or machine learning

the main conferences are NIPS and ICML, and also conferences like AI Stats, UAI, and KDD, which is more data science-oriented, ...

