BUSI 488 Final Exam Review

Grid Search

(essentially a brute-force search for optimal hyperparameters)
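A minimal sketch with sklearn's GridSearchCV (dataset and candidate values chosen for illustration); it tries every candidate value of n_neighbors and keeps the one with the best cross-validated score:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Brute-force: evaluate every candidate hyperparameter value via cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
```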

Target variable

= dependent variable = response variable = y

Features

= predictor variables = independent variables = X's

Series

A one-dimensional labeled array capable of holding any data type

Random Variables

A random variable is a variable whose possible values are numerical outcomes of some repeatable experiment (a random phenomenon).

DataFrame

A two-dimensional labeled data structure with columns of potentially different types

Label Encoding

Another approach is to encode categorical values with a technique called "label encoding". It allows you to convert each value in a column to a number. Numerical labels are always between 0 and n_categories - 1.
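One way to label-encode with pandas (sklearn's LabelEncoder is another); the toy column is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
# Categories are ordered alphabetically: large=0, medium=1, small=2
df["size_encoded"] = df["size"].astype("category").cat.codes
print(df["size_encoded"].tolist())  # [2, 0, 1, 2]
```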

Changing K changes things. Class example:

As k gets larger, the boundaries become smoother. As k approaches the number of customers, the whole graph will take on a single color. Larger k = smoother decision boundary = less complex model. Smaller k = more complex model = can lead to overfitting.

AIDA-model

Awareness, Interest, Desire, Action

KNN Machine Learning Algorithm

Basic idea: predict the label of a data point by looking at the 'k' closest labeled data points and taking a majority vote.
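A sketch of the majority-vote idea with sklearn's KNeighborsClassifier (toy 1-D data made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: points on a line, labeled 0 (left cluster) or 1 (right cluster)
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# The 3 nearest neighbors of 2.5 are 1, 2, 3 -> majority vote: class 0
print(knn.predict([[2.5]]))  # [0]
```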

Impute

Calculate the missing value based on other observations through regression, statistical values, or hot decking

LASSO Regression

Can be used to select important features of a dataset. Shrinks the coefficients of less important features to exactly 0. Lasso adds a penalty equal to α times the sum of the absolute values of the coefficients.
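A sketch with sklearn's Lasso on synthetic data (the data and alpha value are made up for illustration); the irrelevant feature's coefficient is driven to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Only the first two features matter; the third is pure noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)  # third coefficient shrunk to exactly 0
```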

Ordinal

Can order, so median makes sense.

Supervised Machine Learning Tasks/Models

Classification: should we target a consumer? Regression: how much revenue can we expect from a consumer? Today, firms largely use (are biased towards) classification models. The reason behind this bias is that most analytical problems involve making a decision that requires a simple Yes/No answer: Will a customer churn or not? Will a customer respond to an ad campaign or not? Will the firm default or not? Such analyses are insightful and can be directly linked to an implementation roadmap.

Cross-validation

Combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance. Sometimes called rotation estimation or out-of-sample testing; the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. k folds = k-fold CV. More folds = more computationally expensive.
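A minimal k-fold CV sketch with sklearn's cross_val_score (dataset and k chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 folds, test on the held-out fold, rotate, then average
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean())
```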

1. Missing Completely at Random (MCAR) 2. Missing at Random (MAR) 3. Missing Not at Random (MNAR)

Missing completely at random is the best case; missing not at random is the worst, because there is a relationship between the missingness and the values themselves. Example: customers refused to answer a question.

One-Hot encoding

Convert each category value into a new column and assign a 1 or 0 (True/False) to each value. Does not imply a weight by its value. There are many libraries that support one-hot encoding. A simple one is pandas' .get_dummies() method.
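A sketch with pandas' get_dummies (toy column made up for illustration); each category becomes its own 0/1 column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
dummies = pd.get_dummies(df["color"])  # one new column per category
print(dummies.astype(int))
```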

Hot-deck:

Copy values from other similar observations, by sequence or by group/clusters (only useful if there is sufficient data, and you may need to make strong assumptions).

What are data?

Data (singular datum) are individual units of information • A datum describes a single quality or quantity of some object or phenomenon • In analytical processes, data are represented by variables • Data are measured, collected, reported, and analyzed • Data is not equivalent to insight or knowledge • Data is the least abstract concept; Knowledge the most abstract

Preprocessing steps

Dealing with missing values, where we identify which, if any, missing data we have and how to deal with them. For example, we may replace missing values with the mean value for that feature, or with the average of the neighbouring values; pandas has a number of options for filling in missing data that are worth exploring. We can also use k-nearest neighbours to help us predict what the missing values should be, or sklearn's imputer (amongst other ways). Dealing with categorical values by converting them into a numerical representation that can be modelled; there are a number of different ways to do this in sklearn and pandas, such as pandas.get_dummies. Scaling the data to ensure the data is, for example: all on the same scale (such as within two defined values), normally distributed, has a zero mean, etc.

What to do with missing data

Drop, impute, flag
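All three options in pandas on a toy series (values made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

dropped = s.dropna()          # drop the missing row
imputed = s.fillna(s.mean())  # impute with the mean (here 2.0)
flag = s.isna()               # flag which values are missing
print(dropped.tolist(), imputed.tolist(), flag.tolist())
```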

Reinforcement learning applications

Economics, genetics, game playing. AlphaGo: first computer program to defeat the world champion in Go.

Replacing Values

Essentially, you replace the categories with the desired numbers. This can be achieved with the help of the replace() function in pandas. The idea is that you have the liberty to choose whatever numbers you want to assign to the categories according to the business use case.
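A sketch with pandas' replace() (the mapping is a made-up business choice):

```python
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium"]})
# You choose the numbers according to the business use case
df["risk_num"] = df["risk"].replace({"low": 1, "medium": 2, "high": 3})
print(df["risk_num"].tolist())  # [1, 3, 2]
```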

Ratio

Even spacing, well-defined zero

Interval

Evenly spaced, so mean makes sense. No zero.

Feature Scaling

Feature scaling (sometimes called data normalization) is used to standardize the range of features of data.

Same as column

Features, Variables, Fields

Same as dataset

File, Table, Sheet

Imputing Missing Data in Time Series

Forward fill, backward fill, linear interpolation, quadratic interpolation, nearest neighbor interpolation. Backward fill can be problematic because time moves forward: it fills values using information from the future.
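Forward fill and linear interpolation in pandas on a toy series (values made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0]  carries the last value forward
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0]  linear interpolation
```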

4 types of ensemble coding

Identify, summarize, segment, estimation of structure (trends)

Logistic Regression

If the probability 'p' is greater than 0.5: The datum is labeled 1 If the probability 'p' is less than 0.5: The datum is labeled 0
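The 0.5 threshold in action with sklearn's LogisticRegression (toy separable data made up for illustration):

```python
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [2], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

p = clf.predict_proba([[9]])[0, 1]     # probability 'p' of class 1
print(p > 0.5, clf.predict([[9]]))     # predict() applies the 0.5 threshold
```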

When Not To Use pickle

If you want to use data across different programming languages, pickle is not recommended. In contrast, JSON is standardized and language-independent. This is a serious advantage over pickle. It's also much faster than pickle.

Data Cleaning Workflow

Inspect, clean, verify, report

Same as row

Instances, Cases, Examples, Observations, Tuples

Ridge Regression

Instead of forcing them to be exactly zero, let's penalize them if they are too far from zero, thus enforcing them to be small in a continuous way. This way, we decrease model complexity while keeping all variables in the model. Loss function = OLS loss function + α × Σ aᵢ² (the sum of the squared coefficients). Alpha (α): parameter we need to choose. Picking α here is similar to picking k in k-NN.
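A sketch contrasting Ridge with plain OLS on synthetic data (data and alpha made up for illustration); Ridge shrinks the coefficients but keeps them all nonzero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge coefficients are smaller overall, but none are exactly zero
print(ols.coef_)
print(ridge.coef_)
```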

Data sources

Internal vs. External

Time

Interval with (daily or seasonal) patterns

JSON (JavaScript Object Notation)

JSON is a lightweight data-interchange format • JSON is "self-describing" and easy to understand • JSON is language independent • JSON is text only: it can easily be sent to and from a server, and used as a data format by any programming language

Machine learning examples

Learning to predict whether an email is spam or not Clustering wikipedia entries into different categories

Spatial

Multidimensional ratio coordinates

Anscombe's Quartet

Never rely solely on statistical abstractions—always inspect your data carefully!

Outlier detection: clustering

Noise Points are identified as not belonging to any cluster

Multiple Rounds (i.e., train-test splits)

One round of cross-validation involves: partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the testing set). Perform multiple rounds (do not seed!) and average performance.
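The multiple-rounds idea sketched with sklearn's train_test_split (no fixed random_state, so each round gets a fresh split; dataset and round count chosen for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for _ in range(10):  # each round: a new random train-test partition
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
    scores.append(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))

print(np.mean(scores))  # average performance across rounds
```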

Supervised ML

Predictions from data with labels and features. Examples: recommendation systems, email subject optimization, churn prediction. Objective of supervised (machine) learning: automate time-consuming or expensive manual tasks (example: a doctor's diagnosis) and make predictions about the future (example: will a customer click on an ad or not?).

Regularization

Reduce variance at the cost of introducing some bias. This approach is called regularization and is almost always beneficial for the predictive performance of the model.

Binomial Distribution

Repeating the same Bernoulli experiment n times and counting the successes gives a Binomial Distribution.

Encoding Strategies

Replacing values Encoding labels One-Hot encoding

Querying

SELECT displays field values from a table

Basic Database Operations

SELECT query (search) the data INSERT add new records to a table(s) UPDATE modify existing record(s) DELETE delete record(s) from a table

SK Learn Pipeline Def

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must implement fit and transform methods; the final estimator only needs to implement fit.
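A minimal sketch (transform step and estimator chosen for illustration): a StandardScaler transform followed by a KNN estimator:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler implements fit/transform; the final estimator only needs fit
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X, y)
print(pipe.score(X, y))
```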

Strings

Single quote marks

Special case: Reinforcement learning

Software agents interact with an environment Learn how to optimize their behavior Given a system of rewards and punishments Draws inspiration from behavioral psychology

Data structure

Structured vs Unstructured

Machine learning

The ART and SCIENCE of: Predictions from data Giving computers the ability to learn to make decisions from data ... without being explicitly programmed!

Relationship between Poisson and Binomial Distributions

The Poisson Distribution is a special case of the Binomial Distribution as n goes to infinity while the expected number of successes remains fixed.

Pickling

The Python Pickle Module implements binary protocols for serializing and de-serializing a Python object structure. "Pickling" is the process whereby a Python object hierarchy is converted into a byte stream, and "unpickling" is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as "serialization", "marshalling," or "flattening"; however, to avoid confusion, the terms used in this course are "pickling" and "unpickling".
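A round-trip sketch with the standard-library pickle module (the object is made up for illustration):

```python
import pickle

model = {"weights": [0.1, 0.2], "intercept": 0.5}

blob = pickle.dumps(model)     # pickling: object hierarchy -> byte stream
restored = pickle.loads(blob)  # unpickling: byte stream -> object hierarchy

print(restored == model)  # True
```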

Accuracy

The degree to which the data is close to the true values

Consistency

The degree to which the data is consistent, within the same data set or across multiple data sets

Uniformity

The degree to which the data is specified using the same unit of measure

Normalization

The goal of normalization is to change your observations (their values) so that they can be described as a normal distribution.

Min-Max Scaling

Transforms the data such that the features are within a specific range (e.g. from 0 to 1)

Standardization

Transforms your data such that the resulting distribution has a mean μ of 0 and a standard deviation σ of 1.
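Both scalers sketched with sklearn (toy column made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

mm = MinMaxScaler().fit_transform(X)      # rescaled into the range [0, 1]
std = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1

print(mm.ravel())
print(std.mean(), std.std())
```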

Unsupervised ML

Uncovering hidden patterns from unlabeled data Grouping customers into distinct categories (Clustering)

Pipelines

Usually you need to take several steps to transform the data you received. Pipelines are a way to simplify and streamline the entire process. Pipelines make your workflow much easier to read and understand. Pipelines enforce the implementation and order of steps in your project. Your work becomes (more easily) reproducible.

Data quality

Validity, accuracy, completeness, uniformity, consistency

Cell =

Value

Every field has:

a name • a data type and length

Categorical

category names may be unrelated, so no median or mean

Purchase funnel

A consumer-focused marketing model which illustrates the theoretical customer journey towards the purchase of a good or service.

Continuous random Variables

A continuous random variable is one which takes an infinite number of possible values.

Model performance

dependent on the way the data is split!

Poisson Distribution

described in terms of the rate at which the events happen. An event can occur 0, 1, 2, ... times in an interval. The average number of events in an interval is designated λ (lambda). Lambda is the event rate, also called the rate parameter.

Exponential Distribution

describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. Example: suppose we own a website which our content delivery network (CDN) tells us goes down on average once per 60 days, but one failure doesn't affect the probability of the next. All we know is the average time between failures.

Anomaly

deviation from what is normal

Overfitting

happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

Bernoulli Distribution

has only two possible outcomes, namely 1 (success) and 0 (failure), and a single (Bernoulli) trial; for example, a coin toss, or whether a newborn is a boy or a girl.

Outlier detection

identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

Precision

is intuitively the ability of the classifier not to label as positive a sample that is negative. High Precision: Not many Tarheels predicted as Devils

Recall

is intuitively the ability of the classifier to find all the positive samples. High Recall: Predicted most Devils correctly

F1 Score

is the harmonic mean of the precision and recall: where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
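A worked example with sklearn's metrics (toy labels made up for illustration); here precision = recall = 2/3, so the harmonic mean is also 2/3:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are correct -> 2/3
r = recall_score(y_true, y_pred)     # 2 of 3 actual positives are found -> 2/3
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
print(p, r, f1)
```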

Averages

macro average (averaging the unweighted mean per label) weighted average (averaging the support-weighted mean per label) sample average (only for multilabel classification)

Discrete Random Variables

may take on only a countable number of distinct values.

Variance

measures the spread, or uncertainty, in the estimates. Bias and variance are both desired to be low, as large values result in poor predictions from the model.

Novelty detection

mechanism by which an intelligent organism is able to identify an incoming sensory pattern as being hitherto unknown.

Hyperparameters

Parameters that need to be set before fitting the model. Examples: Linear regression: choosing parameters. Ridge/lasso regression: choosing alpha. k-Nearest Neighbors: choosing n_neighbors. Parameters like alpha and k are hyperparameters. Hyperparameters cannot be learned by fitting the model.

Receiver Operating Characteristic (ROC) Curves

provide a way to visually evaluate models ROC visually compares signal (True Positive Rate) against noise (False Positive Rate) Model performance is determined by inspecting the area under the ROC curve (or AUC) Best possible AUC is 1 Worst possible AUC is 0.5 (the 45 degree random line)
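A sketch of the two AUC extremes with sklearn's roc_auc_score (toy scores made up for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores_good = [0.1, 0.2, 0.8, 0.9]  # perfectly ranks positives above negatives
scores_rand = [0.5, 0.5, 0.5, 0.5]  # no signal at all

print(roc_auc_score(y_true, scores_good))  # best possible: 1.0
print(roc_auc_score(y_true, scores_rand))  # random line: 0.5
```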

Underfitting

refers to a model that can neither model the training data nor generalize to new data.

Normal Distribution, also known as Gaussian distribution

roughly an equal number of observations fall above and below the mean; the mean and the median are the same; there are more observations closer to the mean.

Tidy Data

sets that are arranged such that each variable is a column and each observation (or case) is a row

Support Vector Machines

supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. They represent the examples (observations) as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

Bias

the difference between the true population parameter and the expected estimator It measures the accuracy of the estimates Bias and the variance are desired to be low, as large values result in poor predictions from the model.

Support

the number of occurrences of each class in the "ground truth" (i.e., correct) target values.

Gamma distribution

a two-parameter family of continuous probability distributions. Used in insurance claims and loan defaults, where the variables are always positive and the results are skewed (unbalanced).

Structure of Databases

• A database system may contain many databases. • Each database is composed of schema and tables.

Key field for Identifying Rows

• A table contains a primary key that uniquely identifies a row of data (ID) • Each record must have a distinct value of primary key • The primary key is used to relate (join) tables.

Contents of a Table

• A table contains the actual data in records (rows) • A record is composed of fields (columns) • Each record contains one set of data values

SQL Lite

• SQLite is a relational database management system contained in a C library • SQLite is not a client-server database engine. • SQLite is embedded into the end program

Structured Query Language (SQL)

• Structured Query Language (SQL) is the standard language for accessing information in a database • SQL is case-insensitive and free format • Enter commands interactively or in a script file • SQL statements can use multiple lines • End each statement with a semi-colon
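A minimal sketch of the basic operations using Python's built-in sqlite3 module (table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is embedded: no client-server engine
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);")
cur.execute("INSERT INTO customers (name) VALUES ('Ada');")          # INSERT
cur.execute("UPDATE customers SET name = 'Ada Lovelace' WHERE id = 1;")  # UPDATE

rows = cur.execute("SELECT id, name FROM customers;").fetchall()     # SELECT
print(rows)  # [(1, 'Ada Lovelace')]
```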

