MIS Final
cons of decision trees
Cons:
-Can never go back to revisit a split in a tree
-Incorporate only one variable for each split
-Weak learners - small changes in the data can produce significant changes in how the tree looks and behaves
-Can run out of data before finding good models
-A single tree is often not as accurate as other algorithms in predictive accuracy on average => consider building ensembles of trees to increase accuracy
Data Mining Process: CRISP-DM (Cross-Industry Standard Process for Data Mining)
1. business understanding 2. data understanding 3. data preparation 4. model building 5. testing and evaluation 6. deployment
what is a confusion matrix?
A confusion matrix is an N by N matrix, where N is the number of classes being predicted.
•Accuracy - the proportion of the total number of predictions that were correct.
•Positive Predictive Value or Precision - the proportion of predicted positive cases that were correct (when it predicts yes, how often is it correct?).
•Negative Predictive Value - the proportion of predicted negative cases that were correct.
•Sensitivity or Recall - the proportion of actual positive cases that are correctly identified.
•Specificity - the proportion of actual negative cases that are correctly identified (when it's actually no, how often does it predict no?).
Look at Model Assessment slide 6
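A minimal sketch of these measures in Python (scikit-learn and the example labels are my own illustration, not from the slides):

from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted class labels (1 = positive, 0 = negative)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels sorted as [0, 1], ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # overall proportion correct
precision   = tp / (tp + fp)                    # positive predictive value
npv         = tn / (tn + fn)                    # negative predictive value
sensitivity = tp / (tp + fn)                    # recall
specificity = tn / (tn + fp)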
taxonomy of data mining tasks, methods, and algorithms
BI&A Ch 4 Data Mining.pptx -- Slide 13
what are the three prediction types?
Decision, Ranking, Estimate
common classification techniques
-Decision trees (machine-learning based) -Neural networks (machine-learning based) -Logistic regression (statistics-based) -Discriminant analysis (statistics-based) -Other newer techniques such as Support vector machines, Bayesian classifiers, Genetic algorithms, Rough sets
be familiar with subtree assessment measures and when to use
-Splitting continues until a boundary associated with a stopping rule is reached. When assessing the model with the misclassification rate, check where the misclassification occurs and whether you are overfitting or underfitting; make sure you assess both the training and validation data. Common subtree assessment measures are Average Squared Error and Misclassification Rate. The decision problem you are solving determines which measure is best.
cluster analysis common techniques
-Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on. -Neural networks (adaptive resonance theory [ART], self-organizing map [SOM]) -Fuzzy logic (e.g., fuzzy c-means algorithm) -Genetic algorithms
outer join
A join that includes all of the rows from one table and only those rows from the other table that match the join field in the first table.
what is a maximal tree?
A maximal tree is the most complex model in the sequence; it keeps splitting the data until it cannot be split any further.
how does a neural network work?
Each path (connection) between nodes has an associated weight. The model combines the weighted inputs through the nodes in the hidden layer, and it continually readjusts the weights as it trains itself until the model fits (the error can no longer be reduced).
how are decision rules identified?
Calculate the logworth of every candidate partition on an input (e.g., x1): logworth = -log(p-value of the chi-square test). Logworth measures how different the child nodes are from each other and from the parent node. The partition with the highest logworth decides where to split.
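A rough Python illustration of the logworth formula (the 2x2 split table is made up; SAS EM computes this internally for every candidate split):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical split: rows = left/right child node, columns = target 0/1 counts
counts = np.array([[180, 20],
                   [120, 80]])

chi2, p_value, dof, expected = chi2_contingency(counts)
logworth = -np.log10(p_value)   # larger logworth = child nodes differ more; the best split wins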
split criteria when growing a tree for categorical vs interval target variable
Categorical target = a case either is or is not in each category; splits are typically binary but can also be multiway; candidate splits are compared with a chi-square-based logworth. Interval target = find the split on an input that creates groupings whose target values are similar within each group and different between groups (e.g., separating good from bad credit scores); the split with the maximum logworth is chosen.
compare contrast neural networks, decision trees, and regressions
Decision trees - simple to understand and easy to walk through; better for categorical or nonlinear relationships.
NN - can get complex quickly and are hard to interpret, but handle complex, nonlinear relationships well; need a lot of data; work best with binary rather than multiclass targets.
Regression - assumes a specific (linear) relationship between inputs and target; cannot handle missing data and does not capture relationships among inputs unless they are explicitly defined, whereas neural networks can.
Tableau - dimensions
Dimensions contain qualitative values (such as names, dates, or geographical data). You can use dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of detail in the view.
dashboard building best practices
Easy to see key metrics, simple color scheme, potential to be static or interactive, overview and details are clear, small details where needed
live vs extracted data connections in tableau
Extract - the data is pulled to your machine and stored as an optimized local copy, so queries run against the extract. Live - queries run directly against the database, so performance depends on the database.
Data mining process: SEMMA (Sample, Explore, Modify, Model, Assess)
Feedback loops connect the steps - the process is iterative rather than strictly linear.
what is gain?
Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire data set. Interpretation: the % of targets (events) covered at a given decile level. For example, 80% of targets covered in the top 20% of data based on the model.
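A sketch of the gain calculation in Python with pandas (the scored validation data below is simulated, just to show the mechanics):

import numpy as np
import pandas as pd

# Simulated scored validation data: predicted probability p_hat and actual target (0/1)
rng = np.random.default_rng(0)
scored = pd.DataFrame({'p_hat': rng.random(1000)})
scored['target'] = (rng.random(1000) < scored['p_hat']).astype(int)

# Sort high to low, cut into 10 deciles, accumulate the share of targets captured
scored = scored.sort_values('p_hat', ascending=False).reset_index(drop=True)
scored['decile'] = scored.index // 100 + 1
cum_gain = scored.groupby('decile')['target'].sum().cumsum() / scored['target'].sum()
# cum_gain.loc[2] is the fraction of all targets found in the top 20% of the data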
how to use gain chart or lift chart for model assessment?
Gain chart/response chart: prefer models with a larger percentage of responses for the lowest percentage of data. Lift chart: prefer models with the highest curves that stay higher across more of the chart.
supervised learning model and examples
In machine learning and artificial intelligence, supervised learning refers to a class of systems and algorithms that determine a predictive model using data points with known outcomes. The model is learned by training through an appropriate learning algorithm (such as linear regression, random forests, or neural networks) that typically works through some optimization routine to minimize a loss or error function. Put another way, supervised learning is the process of teaching a model by feeding it input data as well as correct output data. This input/output pair is usually referred to as "labeled data." Think of a teacher who, knowing the correct answer, will either award marks to or take marks from a student based on the correctness of her response to a question. Supervised learning is often used to create machine learning models for two types of problems: regression and classification. Examples: linear regression for regression problems; decision trees and logistic regression for classification problems; random forests for classification and regression problems; support vector machines for classification problems.
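A minimal supervised-learning example in Python (scikit-learn and this dataset are stand-ins; the course itself uses SAS Enterprise Miner):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Labeled data: inputs X paired with known outcomes y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

# The learning algorithm fits the model to the labeled training pairs
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))   # accuracy on held-out labeled data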
unsupervised learning model and examples
In unsupervised learning, a dataset is provided without labels, and the model learns useful properties of the structure of the dataset. We do not tell the model what it must learn, but allow it to find patterns and draw conclusions from the unlabeled data. Unsupervised learning can be more difficult than supervised learning, since we have little or no information about the correct output. Unsupervised learning tasks typically involve grouping similar examples together (clustering), dimensionality reduction, and density estimation. Examples: k-means clustering, hierarchical clustering, anomaly detection, self-organizing maps and other neural-network methods, principal component analysis, independent component analysis, the Apriori algorithm.
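A minimal unsupervised example (k-means on simulated, unlabeled data; purely illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: no target is given to the algorithm
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])   # cluster assignments discovered from the data's structure alone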
what is lift?
Lift measures how much better one can expect to do with the predictive model compared to not using a model. It is the ratio of the gain % to the random expectation % at a given decile level; the random expectation at the xth decile level is x%. Interpretation: a cumulative lift of 4.03 for the top two deciles means that when selecting 20% of the file based on the model, one can expect 4.03 times the number of targets (events) found by randomly selecting 20% of the file without a model.
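A small numeric sketch of cumulative lift (the gain values are hypothetical, chosen so the arithmetic reproduces the 4.03 example above):

import numpy as np

# Hypothetical cumulative gain (fraction of all targets captured) at deciles 1..10
cum_gain = np.array([0.45, 0.806, 0.90, 0.95, 0.97, 0.98, 0.99, 0.995, 1.0, 1.0])
depth    = np.arange(1, 11) / 10        # fraction of the file selected: 10%, 20%, ...

cum_lift = cum_gain / depth             # cum_lift[1] = 0.806 / 0.20 = 4.03
# Selecting the top 20% of the file with the model finds 4.03x the targets of a random 20%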
what is linear regression?
Linear regressions are usually deployed for targets with an interval measurement scale
what is logistic regression?
Logistic regressions are usually deployed for binary targets
given a confusion matrix, calculate various measures such as accuracy, precision, specificity
Look at Model Assessment slide 6
Accuracy = (a+d)/(a+b+c+d)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
Positive Predictive Value = a/(a+b)
Negative Predictive Value = d/(c+d)
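For instance, with hypothetical counts a = 40 true positives, b = 10 false positives, c = 20 false negatives, and d = 30 true negatives: Accuracy = (40+30)/100 = 0.70, Sensitivity = 40/60 ≈ 0.67, Specificity = 30/40 = 0.75, Positive Predictive Value = 40/50 = 0.80, Negative Predictive Value = 30/50 = 0.60.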
Tableau - measures
Measures contain numeric, quantitative values that you can measure.
data pre-processing prior to building a regression model - missing data
Missing data always has to be addressed before performing a regression. Consequence: missing values can significantly reduce your amount of training data for regression modeling. Cases with missing values are ignored (dropped) by most programs - you may think you have 5k cases, but in reality you only have 2k usable ones. The prediction formula cannot handle missing inputs.
Examples of ROC chart, gain chart, and lift chart
Model assessment slide 22, can't input the photo :/
inner join
Most common type of join; includes rows in the query only when the joined field matches records in both tables.
cons of artificial neural network models
Neural networks are black boxes, meaning we cannot know how much each independent variable is influencing the dependent variables. They are computationally very expensive and time consuming to train with traditional CPUs. Neural networks depend a lot on training data, which leads to problems with overfitting and generalization: the model relies too heavily on the training data and may be over-tuned to it.
pros of artificial neural network models
Neural networks are flexible and can be used for both regression and classification problems. Any data that can be made numeric can be used in the model, since a neural network is a mathematical model with approximation functions. Neural networks are good for modeling nonlinear data with a large number of inputs (for example, images) and are reliable for tasks involving many features. They work by splitting the problem of classification into a layered network of simpler elements. Once trained, predictions are pretty fast. Neural networks can be trained with any number of inputs and layers, and they work best with more data points.
reasons for data partitioning
Partition available data into training and validation sets •Training data set is used for fitting the model •Validation data set is used for monitoring and tuning the model to improve its generalization. •Tuning process usually involves selecting among models of different types and complexities. •Tuning process optimizes the selected model on the validation data.
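A sketch of the partitioning step in Python (the data and split ratio are illustrative; SAS EM's Data Partition node plays the same role):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical modeling data: 1,000 cases, 5 inputs, binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 60% training / 40% validation (the ratio is just an example)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.6, random_state=42)
# Fit candidate models on the training set; tune and select among them on the validation set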
essential tasks in predictive modeling
Predict new cases - provide a rule to transform inputs into a prediction. Select useful inputs - choose useful inputs from a potentially large set of candidates. Optimize complexity - adjust model complexity to compensate for noisy training data.
pros of decision trees
Pros -Easy to understand -Easy to build. Efficient learning algorithms and scale well in large data sets with a large number of inputs -Handle both nominal and continuous inputs -Have built-in variable selection. Excellent for data exploration with hundreds of variables in predicting the target variable -Non-parametric, making no assumption about distributions for inputs or the target variable. Can be used without transformations of inputs or target variable -Can handle missing data automatically
how is pruning or model selection done?
Pruning is removing nodes from the maximal tree until you find the best fitting model against the validation data. You want to select the least complex model that's the best fitting
how to use ROC chart for model assessment?
ROC curves are frequently used to show, in a graphical way, the connection/trade-off between sensitivity and specificity for every possible cutoff for a test or a combination of tests. In addition, the area under the ROC curve gives an idea of the benefit of using the test(s) in question.
how is regression complexity optimized? be familiar with various model selection criteria
Regression complexity is optimized by choosing the optimal model in the sequential selection sequence. The process involves two steps. First, fit statistics are calculated for the models that are generated in each step of the selection process. Both the training and validation data sets are used. Second, the simplest model (that is, the one with the fewest inputs) with the optimal fit statistic on validation data is selected. Evaluate each sequence step and choose simplest optimal model.
how does SAS EM optimize neural network complexity? what is stopped training?
SAS Enterprise Miner treats each iteration in the optimization process as a separate model. The iteration with the smallest value of the selected fit statistic is chosen as the final model. This method of model optimization is called stopped training.
Compare and contrast CRISP-DM and SEMMA
SEMMA vs. CRISP-DM Main difference -CRISP-DM takes a more comprehensive approach - including understanding of the business and the relevant data -SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood
how are useful inputs selected? - compare and contrast
Forward selection starts from a baseline model and sequentially adds the input that most improves the fit. Backward selection starts from the saturated model and sequentially removes the least significant input. Stepwise works the same way as forward, sequentially adding inputs, but after each addition it reevaluates the included inputs and can remove any that are no longer significant.
missing value replacement strategies
Synthetic distribution methods - use a one-size-fits-all approach to handle missing values. In any case with a missing input measurement, the missing value is replaced with a fixed number. The net effect is to modify an input's distribution to include a point mass at the selected fixed number, which should be chosen to have minimal impact on the magnitude of an input's association with the target. With many modeling methods, this can be achieved by locating the point mass at the input's mean value.
Estimation methods - provide customized imputations for each case with missing values. This is done by viewing the missing value problem as a prediction problem: train a model to predict an input's value from the other inputs. Then, when an input's value is unknown, use this model to predict or estimate the unknown missing value. This approach is most suitable for missing values that result from a lack of knowledge (no-match or nondisclosure), but it is not appropriate for not-applicable missing values.
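Both strategies, sketched with scikit-learn imputers (the tiny array is hypothetical; in the course this is the Impute node's job):

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical inputs (age, income) with missing values
X = np.array([[25.0, 50000.0],
              [40.0, np.nan],
              [np.nan, 82000.0],
              [35.0, 61000.0]])

# Synthetic distribution method: a point mass at the input's mean
X_mean_filled = SimpleImputer(strategy='mean').fit_transform(X)

# Estimation method: predict the missing value from the other inputs (nearest neighbors here)
X_estimated = KNNImputer(n_neighbors=2).fit_transform(X)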
What is Tableau?
Tableau is a leading data visualization tool used for data analysis and business intelligence.
What is a tableau packaged workbook?
Tableau packaged workbooks have the .twbx file extension. A packaged workbook is a single zip file that contains a workbook along with any supporting local file data and background images. This format is the best way to package your work for sharing with others who don't have access to the original data.
data types supported by tableau
Text (string) values, Date values, Date and time values, Numerical values, Boolean values (relational only), Geographical values (used with maps)
LEFT JOIN
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL on the right side when there is no match.
SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name;
RIGHT JOIN
The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL on the left side when there is no match.
SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;
what is ROC chart?
The ROC chart illustrates a tradeoff between a captured response fraction and a false positive fraction.
What is polynomial regression?
The Regression tool assumes (by default) a linear and additive association between the inputs and the logit of the target. If the true association is more complicated, this assumption might result in biased predictions. For estimates, this bias is more influential and may need to be reduced. When minimizing prediction bias is important, you can increase the flexibility of a regression model by adding polynomial combinations of the model inputs. Pro: enables predictions to better match the true input/target association. Con: increases the chances of overfitting while reducing the interpretability of the predictions. -> Approach polynomial regression with care. In SAS Enterprise Miner, polynomial terms can be added selectively or autonomously. TLDR: if the relationship isn't very linear, polynomial terms can be used, but the pros and cons need to be considered.
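A hedged sketch of adding polynomial terms in Python (scikit-learn's PolynomialFeatures stands in for the Regression node's polynomial option; degree 2 and the X_train/y_train names are my own placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds squared terms and pairwise interactions of the original inputs
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LogisticRegression(max_iter=5000))
# poly_model.fit(X_train, y_train)  # more flexible fit, but a higher risk of overfitting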
what is an ROC index or AUC?
The area under the ROC (Receiver Operating Characteristics) curve
why is a sigmoid function often used before building a neural network?
The main reason why we use sigmoid function is because it exists between (0 to 1). Therefore, it is especially used for models where we have to predict the probability as an output. Since probability of anything exists only between the range of 0 and 1, sigmoid is the right choice.
Two principal strategies for input reduction
The two principal reasons for eliminating a variable are redundancy and irrelevancy Redundancy - Input x2 has the same information as input x1. Example: x1 is household income and x2 is home value. Irrelevancy - Predictions change with input x4 but much less with input x3. Example: Target is the response to direct mail solicitation, x3 is religious affiliation, and x4 is the response to previous solicitations.
types of actions in a tableau dashboard
Interactivity in a dashboard is accomplished in three primary ways: Quick Filters, Parameters, and Dashboard Actions. Of these three, Dashboard Actions are the most flexible and offer the most immersive experience.
data pre-processing required before building a neural network
Transform variables, replace missing values, impute. All attribute values must be numerical; categorical variables with k classes may be translated into k-1 dummy variables.
what factors determine the size of the maximal tree?
Within SAS EM, settings such as the maximum number of levels (tree depth) and the maximum number of splits/branches, together with the overall number of input variables and the total number of categories those variables take, determine the size of the maximal tree.
prediction formula for linear regression
^y = ^w0 + ^w1 x1 + ^w2 x2
^y = prediction estimate; ^w0 = intercept estimate; ^w1, ^w2 = parameter estimates; x1, x2 = input measurements (estimates wear the hats).
In standard linear regression, a prediction estimate for the target variable is formed from a simple linear combination of the inputs. The intercept centers the range of predictions. The remaining parameter estimates determine the trend strength (or slope) between each input and the target.
Squared error function: ∑(yi - ^yi)^2
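A tiny worked example of this prediction formula (all numbers are made up):

# y_hat = w0 + w1*x1 + w2*x2 with hypothetical estimates and one case's inputs
w0, w1, w2 = 2.0, 0.5, -1.2     # intercept and parameter estimates
x1, x2 = 3.0, 1.5               # input measurements
y_hat = w0 + w1 * x1 + w2 * x2  # 2.0 + 1.5 - 1.8 = 1.7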
what is classification?
Classification is perhaps the most frequently used data mining method. Classification learns patterns from past data in order to place new instances into their respective groups or classes. It learns a function between the characteristics of things and their class membership through a supervised learning process in which both input and output variables are presented to the algorithm.
Given a table of actual outcomes and predicted probabilities for validation data, draw gain chart, lift chart, and ROC chart
don't think we can make a flash card out of this
model assessment statistic and when to use which statistic
Format: prediction type -> validation fit statistic - direction (the direction can probably be cut from the cheat sheet if needed)
Decisions: misclassification rate - smallest; average profit - largest / average loss - smallest; Kolmogorov-Smirnov statistic - largest
Rankings: ROC index (concordance) - largest; Gini coefficient - largest
Estimates: average squared error - smallest; Schwarz's Bayesian criterion - smallest; log-likelihood - largest
Videos on confusion matrix and ROC chart
http://www2.cs.uregina.ca/~dbd/cs831/notes/lift_chart/lift_chart.html Machine Learning Fundamentals: The Confusion Matrix https://www.youtube.com/watch?v=Kdsp6soqA7o Understanding Confusion Matrix https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 ROC and AUC Clearly Explained (00:00 - 14:30) https://www.youtube.com/watch?v=4jRBRDbJemM
types of charts in tableau and when to use them
https://www.edureka.co/blog/tableau-charts/
parameters of decision trees that the user can modify in SAS EM
leaf size, number of rules, number of surrogate rules, split size, use decisions, use priors, exhaustive, node sample, method, assessment measure, perform cross validation
prediction formula for logistic regression
log(^p/(1-^p)) = ^w0 + ^w1 x1 + ^w2 x2
The predicted/expected value of the target is transformed by a link function to restrict its value to the unit interval. A linear combination of the inputs generates a logit score: the log of the odds of the primary outcome. The logit score can be used to produce ranking predictions.
Logit link function
log(^p/(1-^p)) = ^w0 + ^w1 x1 + ^w2 x2 = logit(^p)
^p = 1 / (1 + e^(-logit(^p)))
When the predictions are decisions or estimates, obtain a prediction estimate by applying the logistic function (the inverse of the logit function) to the logit score.
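A small numeric sketch of the logit link and its inverse (weights and inputs are hypothetical):

import numpy as np

w0, w1, w2 = -1.0, 0.8, 0.3       # hypothetical parameter estimates
x1, x2 = 2.0, 1.0                 # hypothetical input measurements

logit = w0 + w1 * x1 + w2 * x2    # logit score (log-odds) = 0.9
p_hat = 1 / (1 + np.exp(-logit))  # logistic (inverse logit) maps it into (0, 1): about 0.711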
cons of polynomial regression
Need to choose the right polynomial degree for a good bias/variance tradeoff.
curse of dimensionality
the more attributes (input variables) there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor •The dimension of a problem: the number of input variables (more accurately, degrees of freedom) that are available for creating a prediction. •Data mining problems are often massive in dimension. •The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. •Limits the ability to fit a flexible model to noisy data when there are a large number of input variables. •A densely populated input space is required to fit highly complex models. Very hard to generalize, as it'd be hard to get enough data. "more inputs more problems"
data mining definition and related disciplines?
The process of finding anomalies, patterns, and correlations within large data sets to predict outcomes. Data mining brings together methods from a variety of disciplines, including data visualization, machine learning, database management, statistics, and others. These techniques can be made to work together to tackle complex problems.
pros of polynomial regression
works well on any size of data, works very well on nonlinear problems
should missing data be managed prior to building a regression model? why or why not?
yes - What should be done when one of the input values used in the prediction formula is missing? You might be tempted to simply treat the missing value as zero and skip the term involving the missing value. Although this approach can generate a prediction, this prediction is usually biased beyond reason.
components of a decision tree
•The hierarchy is called a tree
•Each segment or sub-segment is called a node
•Final segments that are not partitioned further are called terminal nodes, leaf nodes, or leaves
•No overlap between leaf nodes
what is a decision tree?
•A decision tree represents a hierarchical collection of rules that describes how to divide a large set of data into successively smaller sets of data. With each successive division, the members of the resulting segments become more and more similar to one another with respect to the target.
•The original segment is the entire data set (the root node), which is partitioned into two or more segments by applying a series of simple rules
•Each resulting segment is further partitioned into more sub-segments
•The process continues until no more partitioning is possible
characteristics of a neural network?
•A neural network consists of a layered, feedforward, completely connected network of artificial neurons or nodes •Feedforward: a single direction of flow and not allowing looping or cycling •Composed of two or more layers. Most networks consist of three layers •An input layer •A hidden layer •An output layer •Completely connected: every node in a given layer is connected to every node in the next layer, but not to other nodes in the same layer •Each connection between nodes has a weight associated with it
data pre-processing prior to building a regression model - nonnumeric inputs
•A single categorical input can vastly increase a model's degrees of freedom and increases the chances of a model overfitting •To represent these nonnumeric inputs in a model, you must convert them to some sort of numeric values. •Commonly done by creating design variables (or dummy variables), with each design variable representing approximately one level of the categorical input. •The total number of design variables required is one less than the number of levels in the categorical input
neural network - activation function
•The activation function transforms the combined input value into the node's output
•Examples: step, linear, logistic (sigmoid), and hyperbolic tangent functions
•The logistic and hyperbolic tangent functions are the most commonly used
underfitting model complexity
•An insufficiently complex model might not be flexible enough •Underfitting - systematically missing the signal (high bias), which shows up as high misclassification even on the training data
overfitting model complexity
•An overly complex model might be too flexible •Overfitting - accommodating nuances of the random noise in the particular sample; trying to account for every possible trend or structure in the training data (high variance) Good performance on training, but doesn't do well on other data.
what is a neural network?
•Artificial neural networks/neural networks represent an attempt at a very basic level to imitate the type of nonlinear learning that occurs in the networks of neurons found in nature Google says: Definition of neural network: a computer architecture in which a number of processors are interconnected in a manner suggestive of the connections between neurons in a human brain and which is able to learn by a process of trial and error. — called also neural net.
how are useful inputs selected? - backward
•Backward selection creates a sequence of models of decreasing complexity. •The sequence starts with a saturated model, which is a model that contains all available inputs. •Inputs are sequentially removed from the model. •At each step, the input chosen for removal least reduces the overall model fit statistic. This is equivalent to removing the input with the highest p-value. •The sequence terminates when all remaining inputs have a p-value that is less than the predetermined stay cutoff.
cluster analysis applications
•Clustering results may be used to -Identify natural groupings of customers -Identify rules for assigning new cases to classes for targeting/diagnostic purposes -Provide characterization, definition, labeling of populations -Identify outliers in a specific domain (e.g., rare-event detection) -Decrease the size and complexity of problems for other data mining methods
neural network - combination function
•Combination function combines inputs into a single value, which is passed to the activation function, which then produces an output •Combination function uses a set of weights assigned to each of the inputs •A typical combination function is the weighted sum (default in most data mining tools)
components of a neural network
•Composed of two or more layers. Most networks consist of three layers •An input layer •A hidden layer (can have multiple) •An output layer
examples of data mining applications
•Customer Relationship Management - maximize return on marketing campaigns; improve customer retention (churn analysis); maximize customer value (cross-, up-selling); identify and treat most valued customers
•Banking & Other Financial - automate the loan application process; detect fraudulent transactions; maximize customer value (cross-, up-selling); optimize cash reserves with forecasting
•Retailing and Logistics - optimize inventory levels at different locations; improve store layout and sales promotions; optimize logistics by predicting seasonal effects; minimize losses due to limited shelf life
•Manufacturing and Maintenance - predict/prevent machinery failures; identify anomalies in production systems to optimize the use of manufacturing capacity; discover novel patterns to improve product quality
•Brokerage and Securities Trading - predict changes in certain bond prices; forecast the direction of stock fluctuations; assess the effect of events on market movements; identify and prevent fraudulent trading activities
•Insurance - forecast claim costs for better business planning; determine optimal rate plans; optimize marketing to specific customers; identify and prevent fraudulent claim activities
•Also: computer hardware and software, science and engineering, government and defense, homeland security and law enforcement, travel and entertainment, healthcare and medicine, sports... virtually everywhere
types of patterns and examples
•DM extracts patterns from data
•Types of patterns:
-Prediction: tells the nature of future occurrences of certain events based on what has happened in the past
-Association (beer and diapers): commonly co-occurring groupings of things
-Cluster (segmentation): natural groupings of things based on their known characteristics
-Sequential (or time series) relationships
Prediction - Decision Example
•Decision - Simplest type of prediction •Also known as classification •Decision prediction examples: handwriting recognition, fraud detection, and direct mail solicitation. •Usually relate to a categorical target variable
Prediction - Estimate Example
•Estimate predictions approximate the expected value of the target, conditioned on the input values. •Prediction estimates are most commonly used when their values are integrated into a mathematical expression. •Prediction estimates are also useful when you are not certain of the ultimate application of the model. •Estimate predictions can be transformed into both decision and ranking predictions. => When in doubt, use this option.
association rule mining (expression of an association rule, support and confidence)
•Finds interesting relationships (affinities) between variables (items or events)
-Part of the machine-learning family
-Employs unsupervised learning - no output variable
-Also known as market basket analysis
•Often used as an example to describe DM to ordinary people, such as the famous "relationship between diapers and beers!"
A generic rule: X -> Y [S%, C%]
X, Y: products and/or services
X: left-hand side (LHS); Y: right-hand side (RHS)
S: support - how often X and Y go together
C: confidence - how often Y goes together with X, among the baskets containing X
Example: {Laptop Computer, Antivirus Software} -> {Extended Service Plan} [30%, 70%]
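A rough sketch of computing support and confidence from a list of baskets (the baskets, and the resulting 40%/67% numbers, are invented; they are not the slide's 30%/70% example):

# Hypothetical market baskets
baskets = [
    {'laptop', 'antivirus', 'service_plan'},
    {'laptop', 'antivirus'},
    {'laptop', 'antivirus', 'service_plan', 'mouse'},
    {'mouse'},
    {'laptop', 'service_plan'},
]

lhs = {'laptop', 'antivirus'}     # X
rhs = {'service_plan'}            # Y

n_lhs  = sum(lhs <= b for b in baskets)          # baskets containing X: 3
n_both = sum((lhs | rhs) <= b for b in baskets)  # baskets containing X and Y: 2

support    = n_both / len(baskets)   # 2/5 = 40% (how often X and Y go together overall)
confidence = n_both / n_lhs          # 2/3 ~ 67% (how often Y appears given X)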
example of simple neural network, calculate predicted value of output node
•First, a combination function (usually summation) produces a linear combination of the node inputs and the connection weights into a single scalar value:
Net_j = ∑_i W_ij x_ij = W_0j x_0j + W_1j x_1j + ... + W_ij x_ij
•For node A in the example:
Net_A = ∑_i W_iA x_iA = 0.5*1.0 + 0.6*0.4 + 0.8*0.2 + 0.6*0.7 = 1.32
THIS ONE SHOULD PROBABLY BE PRINTED. Neural Networks ppt, slides 10 and 11
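The same slide calculation in a few lines of Python, with a logistic activation applied afterwards (the sigmoid step is my assumption about the node's activation function):

import numpy as np

weights = np.array([0.5, 0.6, 0.8, 0.6])   # W_iA from the slide example
inputs  = np.array([1.0, 0.4, 0.2, 0.7])   # x_iA from the slide example

net_a = np.dot(weights, inputs)            # weighted-sum combination function = 1.32
out_a = 1 / (1 + np.exp(-net_a))           # logistic activation of 1.32 ~ 0.789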
how are useful inputs selected? - forward
•Forward selection creates a sequence of models of increasing complexity. •The sequence starts with the baseline model •Search the set of one-input models and selects the model that most improves on the baseline model. •Search the set of two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. •Process continues until no significant improvement can be made. • •Model Improvement based on the change in the fit statistic •Binary target: a chi-squared distribution. A large fit statistic change (corresponding to a large chi-squared value) is unlikely due to chance. Therefore, a small p-value indicates a significant improvement. •When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates
data pre-processing prior to building a regression model - input transformation
•In most real-world applications, the relationship between expected target value and input value exists within a boundary; it typically tapers off to some horizontal asymptote.
•As a point moves farther from the overall mean of a distribution, the point has more influence, or leverage, on model fit.
•Models built on inputs with extreme distributions attempt to optimize fit for the most extreme points at the cost of fit for the bulk of the data, usually near the input mean.
Solution: transform or regularize the offending inputs in order to eliminate extreme values. A standard regression model can then be accurately fit using the transformed input in place of the original input. Input transformation/regularization not only mitigates the influence of extreme cases but also creates the desired association between input and target on the original input scale.
TLDR: remove or transform extreme values so they don't skew your results
inspiration for neural networks
•Inspiration for neural networks •Complex learning systems in animals' brains consisting of closely interconnected sets of neurons •A neuron may be simple in structure •Dense networks of interconnected neurons could perform complex learning tasks such as classification and pattern recognition
what is regression?
•Regression is a parametric model - assuming a specific association structure between inputs and target. •One of several heuristic sequential selection techniques is used to choose from a collection of possible inputs and creates a series of models with increasing complexity. •Fit statistics calculated from validation data select the best model from the sequence
steps in calculating optimum weights for neural network
•Step 1: Finding the error-minimizing weights from the training data set •Using the iterative procedure below •In first iteration, a set of initial weights is used, error function is evaluated. •In second iteration, weights are changed by a small amount in such a way that the error is reduced •The process continues until the error cannot be further reduced, or until the specified maximum number of iterations is reached •Generated N models in the process (N: number of iterations) • •Step 2: finding the optimum weights from the validation data set •Apply user-supplied model selection criterion to select one of the N sets of weights generated in step 1 •Using validation data set TLDR: uses training data set, evaluates model, continues to tweak
how are useful inputs selected? - stepwise
•Stepwise selection combines elements from both the forward and backward selection procedures. •The method begins in the same way as the forward procedure, sequentially adding inputs with the smallest p‑value below the entry cutoff. •After each input is added, the algorithm reevaluates the statistical significance of all included inputs. •If the p-value of any of the included inputs exceeds the stay cutoff, the input is removed from the model and reentered into the pool of inputs that are available for inclusion in a subsequent step. •The process terminates when all inputs available for inclusion in the model have p-values in excess of the entry cutoff and all inputs already included in the model have p-values below the stay cutoff.
how to construct an ROC chart?
•To create an ROC chart, predictions (rankings or estimates) are generated for a set of validation data.
•The validation data is sorted from high to low (either scores or estimates).
•Each point on the ROC chart corresponds to a specific fraction of the sorted cases, ordered by their predicted value.
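A sketch of building the ROC points in Python (the validation outcomes and scores below are hypothetical):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical validation targets and predicted probabilities
y_valid = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
p_valid = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_valid, p_valid)  # each cutoff gives one (FPR, TPR) point
auc = roc_auc_score(y_valid, p_valid)               # ROC index / area under the curve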
Phases of supervised learning
•Training phase: use the training data to train/estimate models •Validation/Test phase: •Validation: select the best performing model from a set of possible models using the validation data •Test: apply the final model to test data and get an honest estimate of the accuracy of the model •Application phase: apply the final model to the real-world data and get the results.
cluster analysis overview
•Used for classifying objects, events, or entities into common groupings called clusters that share similar characteristics -Part of the machine-learning family -Unsupervised learning - no output variable -Also known as segmentation •Learns the clusters of things from past data, then assigns new instances •Unlike classification, the class labels are unknown •Goal: create clusters that the members within each cluster have maximum similarity and the members across clusters have minimum similarity
Prediction - Ranking Example
•Using the training data, the prediction model attempts to rank high value cases higher than low value cases. •The actual, produced scores are inconsequential; only the relative order is important. •The most common example of a ranking prediction is a credit score. •Ranking predictions can be transformed into decision predictions by taking the primary decision for cases above a certain threshold while making secondary and tertiary decisions for cases below the corresponding lower thresholds.