GBUS 607 Chapter 10,11,14

What performance measure(s) should we use to filter out the most useful rules?

#1: Support #2: Confidence #3: Lift

Batch Updating

-All records in the training set are fed to the network before updating takes place
-In this case, the error used for updating is the sum of all errors from all records
-Completion of one sweep through the training data is one epoch (also called an iteration)
Note: limit iterations for clear relationships
FROM BOOK:
-The entire training set is run through the network before each updating of weights takes place
-Hundreds or even thousands of sweeps through the training data (T.D.) are executed

Summary

-Association rules (or affinity analysis, or market basket analysis) produce rules on associations between items from a database of transactions
-Widely used in recommender systems
-The method described is called the Apriori algorithm
-Performance of a rule is measured by support, confidence and lift
-Can produce too many rules; review is required to identify useful rules and to reduce redundancy
FROM BOOK:
-Association rules search for generic rules about items that are purchased together
-The main advantage of this method is that it generates clear, simple rules of the form "IF X is purchased, THEN Y is also likely to be purchased."
-The method is very transparent & easy to understand

Why back propagation works

-Big errors lead to big changes in weights
-Small errors leave weights relatively unchanged
-Over thousands of updates, a given weight keeps changing until the error associated with that weight is negligible
-Start off w/ random weights, see how much error the class/pred. makes, then go back & adjust the weights & intercepts so that the error in the 2nd round is smaller than the 1st time
-Stop when the error is minimized
-Use for future classification of validation or prediction?
FROM BOOK:
-Back propagation is the most popular method for using model errors to update weights ("learning")
-Errors are computed from the last layer (the output layer) back to the hidden layers

NOTES FROM BOOK: Chapter 10 LR

-Dependent variable (Y) is categorical
-Can be used for classifying a new observation, where its class is unknown, into one of the classes based on the values of its predictor variables
-Can also be used on data where the class is known, to find factors distinguishing b/w observations in different classes in terms of their predictor variables, or "predictor profile" (called profiling)
Examples:
-Classifying customers as returning or non-returning (classification)
-Finding factors that differentiate b/w male & female top executives (profiling)
-Predicting the approval or disapproval of a loan based on info. such as credit scores (classification)
LR for classification:
-We deal only w/ a binary dependent variable having two possible classes (success/failure, yes/no, buy/don't buy, default/don't default, survive/die), also coded as 0 or 1
-In some cases we choose to convert continuous data or data w/ multiple outcomes into binary data for purposes of simplification
-The goal is to predict which class a new observation will belong to, or simply to classify the observation into one of the classes
Two steps in LR:
1) Estimate the probabilities of belonging to each class
2) Use a cutoff value on these probabilities in order to classify each case into one of the classes

Logistic Regression

-Extends the idea of linear regression to the situation where the outcome variable is categorical
-Useful for classification and profiling
-We focus on binary classification (y = 0 or 1)
-Linear regression will not work directly, since its predicted y can take any value between -∞ and ∞ while the outcome is only 0 or 1
-So we use an S-shaped probability curve that goes through most of the 0 and 1 points

Process of Rule Selection

-Find frequent item sets (i.e., those with support that meets the threshold)
-From these item sets, generate all possible rules
-Calculate the confidence value for each rule
-Select the rules with confidence that meets the threshold
Example: If the confidence threshold is 70%, select only rules 2, 3 and 6.
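The last three steps can be sketched in Python as follows. This is only an illustration: it assumes the frequent item sets are already known, and the function names and the toy baskets are hypothetical, not taken from the course data.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def select_rules(frequent_itemsets, transactions, min_confidence):
    """Split each frequent itemset into antecedent -> consequent in every
    possible way and keep the rules whose confidence meets the threshold."""
    rules = []
    for itemset in frequent_itemsets:
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                confidence = (support_count(itemset, transactions)
                              / support_count(antecedent, transactions))
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules

# Hypothetical toy baskets (illustration only).
baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
           {"milk", "bread", "eggs"}, {"milk"}]
rules = select_rules([frozenset({"milk", "bread"})], baskets, 0.6)
```

With these baskets, bread > milk is kept (confidence 2/3) while milk > bread is dropped (confidence 2/4).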

Summary

-Logistic regression is similar to linear regression, except that it is used with a categorical response. -The predictors are related to the response through a function called the logit. -It can be used for profiling; ie. knowing the class of a response variable, one can identify distinguishing traits of that class in terms of the predictors. -As in linear regression, reducing predictors can be done through variable selection to create a parsimonious model. -The method helps to compare the odds of different predictors and determine those that have the most impact.

Calculating p from Predictor variables

-Using training data, calculate the regression coefficients β0, β1, ... (Supervised Data Mining) in the standard regression equation y = β0 + β1x1 + β2x2 + ...
-Using the above equation, predict outcome y values for the validation data. From the prior slide, remember that these predicted y values are equal to log(Odds)
-Take the anti-log of the log(Odds) values from the above step to get the Odds
-Convert the calculated Odds values to probabilities using the formula p = Odds/(1 + Odds)
Note: We saw earlier that Odds = p/(1-p), where (1-p) represents the prob of failure
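A small Python sketch of these steps, assuming the coefficients have already been estimated. The function names and the 0.5 default cutoff are illustrative assumptions, not part of the notes.

```python
import math

def predict_probability(intercept, coefficients, x):
    """Turn predictor values into a probability via log(Odds) -> Odds -> p."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    odds = math.exp(log_odds)      # anti-log of log(Odds)
    return odds / (1 + odds)       # p = Odds / (1 + Odds)

def classify(p, cutoff=0.5):
    """Apply a cutoff to the probability (the cutoff value is an assumption)."""
    return 1 if p >= cutoff else 0
```

For example, a record whose log(Odds) is 0 has Odds of 1 and therefore p = 0.5.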

Case Updating

-Weights are updated after each record is run through the network (trial)
-After one epoch is completed, return to the first record and repeat the process
FROM BOOK:
-One of the two methods for updating the weights
-In practice, case updating tends to yield more accurate results than batch updating, but requires a longer time

Common Criteria to Stop the Updating

-When weights change very little from one iteration to the next
-When the error reaches a required threshold
-When a limit on runs is reached
FROM BOOK: When does the updating stop, for both:
1. When the new weights are only incrementally different from those of the preceding iteration
2. When the misclassification rate reaches a required threshold
3. When the limit on the # of runs is reached

NOTES example of

A large number of rules can be quickly generated in subsets by combining items in various combinations. E.g., red→white; red→green; white→red; white→green; green→red; green→white; (red,white)→green; (red,green)→white; (white,green)→red; (red,white,blue)→green; ...
Left side (can be any # of items) → right side (only one item for rules on this dataset)

Frequent set

A transaction may contain just one item or many items (1, 2, 3, ..., k items). In general, a set of k items is called a k-itemset. A k-itemset is called a frequent set if it has a support level S ≥ the threshold level. Other possible itemsets such as (red,orange), (red,white,yellow), ... have S < 2 and hence are not frequent. Note that all subsets of a frequent set are also frequent.
-frequent set = items that occur together often enough (support is measured as how many times the items appear together)

Considerations before choosing Neural Net

Advantages
-Good predictive ability
-Can capture complex relationships
-No need to specify a model
IN BOOK: Adv. They are known to have high tolerance to noisy data & the ability to capture highly complicated relationships b/w the predictors and a response.

NOTES FROM BOOK: Chapter 11 Neural Nets

Also called artificial NN; are models for class and pred
-Main strength of NN is their high predictive performance
-Their structure supports capturing very complex relationships b/w predictors and a response, which is often not possible w/ other predictive models
-Idea behind NN is to combine the input info in a very flexible way that captures complicated relationships among these variables and b/w them and the response variable
-Linear regr. & log. regr. can be thought of as special cases of very simple NN that have only input & output layers and no hidden layers
-The most successful applications of NN in data mining have been multilayer feedforward networks: there is an input layer consisting of nodes (called neurons) that simply accept the input values, and successive layers of nodes that receive input from the previous layers. The outputs of nodes in each layer are inputs to nodes in the next layer. Output layer = last layer
-Layers b/w the input & output layers are hidden layers
-A feedforward network is a fully connected network w/ a one-way flow and no cycles
-Input nodes take as input the values of the predictors. Their output is the same as the input
-Hidden layer nodes take as input the output values from the input layer. To compute the output of a hidden layer node, we compute a weighted sum of the inputs & apply a certain function to it
Preprocessing the data
-When using a logistic activation function (XLMiner), NN perform best when the predictors and response variables are on a scale of [0,1]

NOTES FROM BOOK: Chapter 14 Association Rules cont...

Apriori Algorithm
-Key idea of the algorithm is to begin by generating frequent itemsets w/ just one item, then w/ two items, then w/ three items, and so on, until we have generated frequent itemsets of all sizes
-It is easy to generate frequent one-itemsets: all we need to do is count, for each item, how many transactions in the database include the item
-We drop one-itemsets that have support below the desired minimum support to create a list of the frequent one-itemsets
-To generate frequent two-itemsets, we use the frequent one-itemsets (in general, generating frequent k-itemsets uses the frequent (k-1)-itemsets that were generated in the preceding step)
-This algorithm is very fast even for a large # of unique items in a dataset
Selecting strong rules
-From the abundance of rules generated, the goal is to find only the rules that indicate a strong dependence b/w the antecedent & consequent
Data format
-Transaction data are usually displayed in one of two formats: a list of items purchased (each row representing a transaction), or a binary matrix in which columns are items, rows again represent transactions, and each cell has either a 1 or a 0, indicating the presence or absence of an item in the transaction
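The level-by-level idea of the Apriori algorithm can be sketched in Python as below. This is a simplified illustration, not the book's or XLMiner's implementation; the function name, the use of a minimum support count, and the toy baskets are assumptions.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return {itemset: support count} for all frequent itemsets,
    building k-itemsets from the frequent (k-1)-itemsets."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    # Frequent one-itemsets: count each item and drop the infrequent ones.
    current = {frozenset([i]) for i in items
               if count(frozenset([i])) >= min_support_count}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = count(itemset)
        k += 1
        # Candidate k-itemsets: unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if count(c) >= min_support_count}
    return frequent

# Hypothetical toy baskets (illustration only).
baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
           {"milk", "bread", "eggs"}, {"milk"}]
frequent = apriori_frequent_itemsets(baskets, min_support_count=2)
```

Here all three single items and all three pairs are frequent, but the three-item set appears in only one basket and is dropped.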

Chapter 14

Association Rules
A large number of rules can be quickly generated in subsets by combining items in various combinations
-Also called Recommendation System/Market Basket Analysis
Example: Which items are frequently purchased together by customers? Are there any dependencies among products based on similar ...
-Recommend items/coupons for next buy
-Have to have past transactions (items which are frequently purchased)
IN BOOK:
-The goal is to identify item clusters in transaction-type databases
-Association rule discovery is also called market basket analysis; it is aimed at discovering which groups of products tend to be purchased together. These items can then be displayed together, offered in post-transaction coupons, or recommended in online shopping
-Also called affinity analysis, which constitutes a study of "what goes w/ what"
-Also used in fields besides retail. Example: a medical researcher might want to learn what symptoms go with what confirmed diagnoses. In law, word combinations that appear too often might indicate plagiarism

Considerations before choosing Neural Net

Avoiding Overfitting
With excessive iterations, a neural net can easily overfit the data, causing the error rate on V.D. (and, most importantly, on new data) to be too large
To avoid overfitting:
-Track error in validation data
-Limit iterations
-Limit complexity of the network by reducing the number of hidden layers
FROM BOOK: It is important to limit the # of training epochs and not to over-train the data.

NOTES FROM BOOK: Chapter 14 Association Rules cont...

Collaborative Filtering vs Association Rules
-Both are unsupervised methods used for generating recommendations, but they differ in several ways:
Frequent itemsets vs. personalized recommendations:
-Association rules (A.R.) look for frequent item combinations and will provide recommendations only for those items; A.R. require data on a very large # of "baskets" (transactions) in order to find a sufficient # of baskets that contain certain combinations of items
-Collaborative filtering (C.F.) provides personalized recommendations for every item, thereby catering to users w/ unusual taste. C.F. is useful for the "long tail" of user preferences, while A.R. look for the "head"
-C.F. does not require many "baskets", but it does require data on as many items as possible for many users
Transactional data vs. user data:
-A.R. provide recommendations of items based on their co-purchase w/ other items in many transactions/baskets
-C.F. provides recommendations of items based on their co-purchase or co-rating by even a small # of users
Binary data & ratings data:
-A.R. treat items as binary data (1=purchased, 0=not purchased), whereas C.F. can operate on either binary data or numerical ratings
Two or more items:
-In A.R., the antecedent & consequent can each include one or more items; hence, a recommendation might be a bundle of the item of interest w/ multiple items
-In C.F., similarity is measured b/w pairs of items or pairs of users; a recommendation will therefore be either for a single item (the most popular item purchased by people like you, which you haven't purchased), or for multiple single items that do not necessarily relate to each other (the top 2 most popular items purchased by people like you, which you haven't purchased)

NOTES FROM BOOK: Chapter 14 Association Rules

Collaborative filtering:
-Goal is to provide personalized recommendations that leverage user-level information
-Starts w/ a user, then finds users who have purchased a similar set of items or ranked items in similar fashion, and makes a recommendation to the initial user based on what the similar users purchase or like
-Item-based collaborative filtering starts w/ an item being considered by a user, then locates other items that tend to be co-purchased w/ that first item
-Is based on the notion of identifying relevant items for a specific user from a very large set of items ("filtering") by considering preferences of many users ("collaboration")
Generate Candidate Rules:
-Idea behind association rules is to examine all possible rules b/w items in an if-then format, and select only those that are most likely to be indicators of true dependence
-We use the word antecedent to describe the IF part and consequent to describe the THEN part
-In association analysis, the antecedent & consequent are sets of items (called itemsets) that are disjoint (do not have any items in common)
-NOTE that itemsets are not records of what people buy; they are simply possible combinations of items, including single items

NOTES FROM BOOK: Chapter 10 LR cont..

Computing Parameter Estimates
-In LR, the relation b/w Y and the Beta (B) parameters is nonlinear. The B parameters are not estimated using the method of least squares (as in multiple linear regression)
-Instead, a method called maximum likelihood is used. The idea is to find the estimates that maximize the chance of obtaining the data that we have (requires iterations)
-Algorithms to compute the coefficient estimates are less robust than algorithms for linear regression. Estimates are generally reliable for well-behaved datasets where the # of observations w/ outcome variable values of both 0 and 1 is large; their ratio is "not too close" to either 0 or 1; and the # of coefficients in the LR model is small relative to the sample size (no more than 10%)
Evaluating Classification Performance
-The most popular measures are those based on the classification matrix (accuracy alone or combined w/ costs) and the lift chart
LR for more than two classes
-The logistic model for a binary response can be extended to more than two classes. Since the m prob. must add up to 1, we need to estimate only m-1 prob.

Considerations before choosing Neural Net

Disadvantages
-Requires large amounts of data
-Considered a "black box" prediction machine, with no insight into relationships between predictors and outcome
-No variable-selection mechanism, so you have to exercise care in selecting input variables (utilizes all variables & makes up relationships)
-Heavy computational requirements if there are many variables (additional variables dramatically increase the number of weights to calculate)
IN BOOK: Dis.
-Weakest point is in providing insight into the structure of the relationship, hence their black box reputation

Adjusting weights and biases

Goal: Find the weight and bias that improve fit using back propagation (minimum error b/w actual & predicted: what weight to use to find the minimum error)
-Start with a random weight and bias
-At each record, compare the predicted result ŷk to the actual yk and determine the error (yk with a hat = predicted value, yk w/o a hat = actual value); ŷk - yk = error in prediction
-Do this for all records and find the sum of errors, err
-Adjust the weight and bias using the formulas T^new = T^old + l(err) and W^new = W^old + l(err), where T = Theta (the bias) and l is called the Learning Rate; it takes a value between 0 and 1
-Repeat the cycle with the new weight and bias until the error is within an acceptable limit
NOTE: have to adjust both Theta & Weight; CANNOT change the input values!
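The update cycle can be illustrated with a single logistic neuron in Python. This is a sketch under stated assumptions, not the exact formula from the notes: it takes the error as actual minus predicted (so that adding l times the error moves the prediction toward the actual value), multiplies the weight update by the input value (the standard delta rule, which the notes' simplified formula omits), and updates after each record (case updating). The function names, the seed, and the toy records are hypothetical.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_neuron(records, learning_rate=0.5, epochs=1000, seed=0):
    """Fit one logistic neuron; records is a list of (inputs, actual) pairs."""
    rng = random.Random(seed)
    n_inputs = len(records[0][0])
    # Start with small random weights and bias.
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    bias = rng.uniform(-0.05, 0.05)
    for _ in range(epochs):                  # one epoch = one sweep
        for x, y in records:                 # update after every record
            y_hat = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - y_hat                  # assumption: actual minus predicted
            bias += learning_rate * err
            weights = [w + learning_rate * err * xi
                       for w, xi in zip(weights, x)]
    return weights, bias

def predict_class(weights, bias, x, cutoff=0.5):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(s) >= cutoff else 0

# Hypothetical toy records: output is 1 if either input is 1.
records = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(records)
predictions = [predict_class(weights, bias, x) for x, _ in records]
```

Only the weights and the bias change during training; the input values are never modified.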

NOTES

How many rules can we have? depends on # of items sold (performance measures)

Processing Information in ANN

Inputs: X1, X2, ... Weights: W1, W2, ...
Input & Weight > Summation > Transfer Function > Output
Neuron/Summation: s = T + Σ(Xi x Wi), where T (Theta) is the bias (look @ formula on cheat sheet)
Transfer Function: Output = g(s) = 1/(1 + e^(-s)), with s from the above equation (s = sum of input x weight, plus bias)
Each X is between -∞ and +∞. Output g(s) is between 0 and 1.
-Guaranteed a # between 0 and 1 because g is a sigmoid function; standardize the inputs to make values b/w 0 & 1
-If accepted, can be tested on V.D.
-The function g (called the transfer function or activation function) is some monotone function; examples are a linear function, an exponential function, and a logistic/sigmoidal function
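The summation and transfer steps for one neuron, as a minimal Python sketch. The function name and the example numbers are illustrative assumptions.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the logistic function."""
    s = bias + sum(x * w for x, w in zip(inputs, weights))   # summation
    return 1.0 / (1.0 + math.exp(-s))                        # transfer function g(s)

# Hypothetical inputs: s = 0 + (1 * 0.5) + (2 * -0.25) = 0, so g(s) = 0.5
output = neuron_output([1, 2], [0.5, -0.25], 0.0)
```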

Chapter 10

Logistic Regression

NOTES

Logistic Regression
-Used for classification purposes; takes numerical inputs to a class (dependent variable Y is categorical)
-Probability values always fall b/w 0 & 1
-You need numerical input; you can have dummy variables though
-We are classifying the output
-We do not want to just look @ y (as in standard regression)
-Estimate = coefficient

In order to build layers, how many neurons in each hidden layer do we want?

Most popular=1; makes the model simple and makes the weight, bias & coefficient easier to interpret
-Summation function; multiplied a certain way, like LR
FROM BOOK (general guidelines for choosing architecture/user input):
-Number of hidden layers: 1
-Size of hidden layers: the # of nodes in the hidden layer also determines the level of complexity of the relationship b/w the predictors that the network captures. The trade-off is b/w under- and overfitting. Too few nodes = might not be sufficient to capture complex relationships. Too many nodes = might lead to overfitting. A rule of thumb is to start w/ p (# of predictors) nodes and gradually decrease or increase while checking for overfitting
-Number of output nodes: For a binary response, a single node is sufficient and a cutoff is used for classification. For a categorical response with M>2 classes, the # of nodes should equal the number of classes. For a numerical response, a single output node is used unless we are interested in predicting more than one function
-User should also pay attention to the choice of predictors. Since NN are highly dependent on the quality of the input, the choice of predictors should be done carefully using domain knowledge, variable selection and dimension reduction techniques before using the network

Network Structure

Multiple layers
-Input layer
-Hidden layer(s)
-Output layer
Input Layer > Hidden Layer > Output Layer
Nodes represent the neurons
Weights (like coefficients in a regression equation)
Bias values (like the intercept in a regression equation)
NOTES:
-Input layer just passes on external data (passes info to the hidden layer); no processing is done in the input layer
-Output layer = shows results/cutoff for pred/class (only 1 output remains)
-Hidden layer = can have more than 1; something comes in and something needs to be done to it
-More neurons in the hidden layer = output generally more accurate (disadvantage = not good for validation data)
-Should use 1-2 to see relationships more easily

NOTE from exercise

NN
-Tracks errors in validation data (use the lowest possible for V.D.) & tracks the error on training data; check how the T.D. error affects the V.D. error
-Have to standardize data for input** (rescale data)

NOTE

Neural Nets
-Trial and error on data instead of building a systematic way/theory
-Uses random weights to fit the model for pred/class (cannot come up with a theory, unlike regression)
-Focuses on what is important (different from regression)
-Neurons = series of decision points
-Input is transformed to output
-You decide what inputs & how much weight to give them (sum output)
-Same equation as LR (E^-1)
-Easy to create an imitating scenario
-Looking at pattern?
-System doesn't know how much weight to give each input (start w/ random values of weights)
-Output always b/w 0 & 1, like LR
-We use a cutoff after the equation (can use any cutoff)

Chapter 11

Neural Nets Another name RAMN

Performance measure #3: Lift

Not all rules with a high confidence are useful. E.g., the rule X > Milk may have high confidence just because Milk is routinely bought. But X may have no real association with Milk; i.e., Milk and X are independent. In that case:
C(X>Milk) = P(Milk|X) = P(X and Milk)/P(X) = (P(X)*P(Milk))/P(X) = P(Milk)
Here, P(Milk) is called the Benchmark Confidence, since buying X does not increase the probability of buying Milk. Note that P(Milk) is the same value as the Support for the consequent item Milk.
Lift ratio of a rule = Confidence/Support for Consequent, where:
-Support for Consequent is called the Benchmark Confidence
-Support for Consequent = # of transactions w/ Consequent/Total # of transactions in the data
-Confidence = P(Antecedent and Consequent)/P(Antecedent)
A lift ratio greater than 1 indicates there is some usefulness to the rule. In other words, the level of association between the antecedent and consequent itemsets is higher than would be expected if they were independent. The larger the lift ratio above 1, the greater the strength of the association.
NOTE: cannot just use confidence, because it can be high even when the products are independent
FROM BOOK:
-Lift ratio indicates how efficient the rule is in finding consequents, compared to random selection
-The confidence tells us at what rate consequents will be found, and it is useful in determining the business or operational usefulness of a rule
-A rule w/ low confidence may find consequents at too low a rate to be worth the cost of (say) promoting the consequent in all the transactions that involve the antecedent
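The three measures can be computed directly from a list of transactions. The Python sketch below uses support expressed as a probability; the function names and the toy baskets are hypothetical and are not the faceplate data from the course example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(antecedent and consequent) / P(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

def lift(antecedent, consequent, transactions):
    """Confidence divided by the benchmark confidence (support of the consequent)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# Hypothetical toy baskets (illustration only).
baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
           {"milk", "bread", "eggs"}, {"milk"}]
bread_milk_lift = lift({"bread"}, {"milk"}, baskets)
```

In this toy data the rule bread > milk has confidence 2/3, but milk appears in 4 of 5 baskets, so the lift is below 1: buying bread does not raise the chance of buying milk.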

epoch

Note: each time you go through records -more options to fit on Training data but poorer Validation data -typically there are many epochs

User Inputs: Network parameters

Number of hidden layers
-Most popular: one hidden layer
Number of nodes in hidden layer(s)
-More nodes capture complexity, but increase chances of overfit
-A larger number of epochs on training data leads to poorer fit on validation data
Number of output nodes
-For classification, one node per class
-For numerical prediction, use one
Learning Rate (l)
-Low values slow learning
Momentum
-Rate of change in learning rate, utilized for the purpose of attaining the global minimum
FROM BOOK: the idea is to keep the weights changing in the same direction as they did in the preceding iteration. This helps avoid getting stuck in a local optimum. High values of momentum mean that the network will be "reluctant" to learn from data that want to change the direction of the weights, especially when we consider case updating. Values in the range 0 to 2 are used.

Odds vs Odds Ratio

Odds=p/(1-p): represents the ratio of the probability of belonging to one class to the probability of belonging to the other.
-Odds Ratio is the ratio of the odds of two different predictors. Which is more important/parsimonious?

Multiple hidden layers

One can have many neurons stacked in each hidden layer

NOTES

Pick rules w/ highest confidence/lift

NOTES FROM BOOK: Chapter 11 Neural Nets cont...

Preprocessing the data
-For binary variables, no adjustment is needed other than creating dummy variables
-For categorical variables w/ m categories, if they are ordinal in nature, a choice of m fractions in [0,1] should reflect their perceived ordering
-If the categories are nominal, transforming into m-1 dummies is a good solution
-Another operation that improves the performance of the network is to transform highly skewed predictors
Training the Model: means estimating the weights Theta and W that lead to the best predictive results
Black Boxes: the output of NN does not shed light on the patterns in the data that it models

Support, Confidence and Lift in our example

Rule for Row 1: If an Orange faceplate is purchased, then with Confidence 100%, a White faceplate will also be purchased. This rule has a lift ratio of 1.43.
Rule 1, Confidence: = S(A&C)/S(A) = 2/2 = 1 (100%)
Rule 1, Lift Ratio: = 1/(7/10) = 1.43 (there are 10 transactions in our data)

Performance measure #1: Support (Do we have enough transactions in the data set to support the rule?)

Support for a rule (A>B) is written as S(A>B) = number of transactions in the data set that contain both item A and item B. It may also be written as a probability measure, S(A>B) = P(A and B).
What is the value of S(white > blue)? There are 4 transactions (Rows 3, 6, 8 & 9) that contain both colors. Hence, S(white > blue) = 4. Support expressed as a probability, in this case, will be 4/10 = 0.4.
NOTE: Based on conditional probability (what is its value)

Converting y-value into a Probability value

The following equation does this: p = 1/(1+e^-y), where y = β0 + β1x1 + β2x2 + ... The above equation for p is formally known as the Sigmoid Transformation Function. It takes a variable with values between -∞ and ∞ and outputs an S-curve with values between 0 and 1. Note from the graph that y can take any value and p-values are between 0 and 1.

NOTE

-Use training data; start w/ random weights, look @ each record and compare the output with the value of the dependent variable
-If they do not match, try to change the weights in a way that minimizes the errors encountered the 1st time (knows which direction to go to minimize)
-Neuron starts w/ the regression equation (the one with Beta)
-θ/T = Theta; Sigma (Σ) = summation; T + summation = regression equation
-S = summation of I & W products (input value x weight)
-X = actual value of a column for a specific record
-What is the intercept? Theta (also called bias: no influence from any variable, but there is still an influence)
-Larger coefficient = more weight

NOTES FROM BOOK: Chapter 14 Association Rules cont...

User-based collaborative filtering: "people like you"
-Algorithm has two steps:
1. Find users who are most similar to the user of interest (neighbors). This is done by comparing the preferences of our user to the preferences of other users
2. Considering only the items that the user has not yet purchased, recommend the ones that are most preferred by the user's neighbors
Item-based collaborative filtering
-When the # of users is much larger than the # of items, it is computationally cheaper (and faster) to find similar items rather than similar users
-When a user expresses interest in a particular item, item-based collaborative filtering has two steps:
1. Find the items that were co-rated, or co-purchased, (by any user) w/ the item of interest
2. Recommend the most popular or correlated item(s) among the similar items
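The two user-based steps can be sketched in Python for binary purchase data. The notes do not say how similarity is measured, so this sketch assumes Jaccard similarity (shared items divided by all items either user bought); the function names, user names, and purchases are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, purchases, n_neighbors=2, top_n=1):
    """Step 1: find the most similar users. Step 2: recommend the items
    they bought most often that the user has not yet purchased."""
    target = purchases[user]
    neighbors = sorted((u for u in purchases if u != user),
                       key=lambda u: jaccard(target, purchases[u]),
                       reverse=True)[:n_neighbors]
    counts = {}
    for u in neighbors:
        for item in purchases[u] - target:
            counts[item] = counts.get(item, 0) + 1
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:top_n]

# Hypothetical purchase histories (illustration only).
purchases = {"ann": {"a", "b"}, "bob": {"a", "b", "c"},
             "cat": {"a", "c"}, "dan": {"d"}}
suggestion = recommend("ann", purchases)
```

Here ann's nearest neighbors are bob and cat, both of whom bought item c, so c is recommended.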

Performance measure #2: Confidence

We know that a rule is expressed in the form Antecedent > Consequent. But what is the conditional probability that, given an Antecedent appears in the basket, the customer will also buy the Consequent item?
P(Consequent | Antecedent) = P(Consequent and Antecedent)/P(Antecedent), just like the conditional prob. formula

The Logit Function

We know that: Odds = p/(1-p). NOTE: p is given by p = 1/(1+e^-y)
Upon simplification, we get: Odds = e^(β0 + β1x1 + β2x2 + ...)
Taking the natural log of both sides, we can reduce the RHS to a standard linear form: log(Odds) = β0 + β1x1 + β2x2 + ... = y
-Definition: "log(Odds)" is called the logit function. The logit function relates the predictor variables x1, x2, ... to the target variable log(Odds)
-Excel function: =LN(#)
-Can get back the original value: log(Odds) = linear output; taking the log gets rid of the exponential power
-LOGIT = log of units; anti-log = e raised to the value (e^x)
NOTES: the idea behind LR is straightforward: instead of using Y as the dependent variable, we use a function of it (the logit), which can be modeled as a linear function of the predictors. Once the logit has been predicted, it can be mapped back to a probability.
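The "upon simplification" step, written out. This derivation only uses the sigmoid formula and the definition of Odds given in these notes.

```latex
\begin{align*}
p &= \frac{1}{1+e^{-y}}, \qquad 1-p = \frac{e^{-y}}{1+e^{-y}} \\
\text{Odds} &= \frac{p}{1-p} = \frac{1}{e^{-y}} = e^{y}
  = e^{\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots} \\
\log(\text{Odds}) &= y = \beta_0+\beta_1 x_1+\beta_2 x_2+\cdots
\end{align*}
```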

When classifying output into binary groups, all original input variable values must be standardized using the formula,

Xnorm = (X-a)/(b-a), where a ≤ X ≤ b
-a is the smallest value of the variable
-b is the largest value of the variable
-Every variable is given an equal amount of weight
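A minimal Python version of this rescaling for one variable. The function name is illustrative, and returning 0.0 when all values are equal is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (X - a) / (b - a)."""
    a, b = min(values), max(values)
    if a == b:
        # Assumption: a constant column carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - a) / (b - a) for x in values]

scaled = min_max_normalize([2, 4, 6])
```

The smallest value maps to 0, the largest to 1, and 4 lands halfway at 0.5.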

Multiple Predictors

-Brings back dummy variables (for categorical variables)
-Take one less (drop the redundant one): always n-1 dummies for n categories
-If there are three categories, there have to be 2 dummies
Data Preprocessing: we start by creating dummy variables for each of the categorical predictors
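The n-1 rule can be sketched in plain Python. The function name is hypothetical, and dropping the alphabetically first category as the reference is an arbitrary choice for this illustration.

```python
def make_dummies(values):
    """Create n-1 dummy (0/1) variables for a categorical variable with n categories."""
    categories = sorted(set(values))
    kept = categories[1:]   # drop the first category as the redundant reference
    return [{c: int(v == c) for c in kept} for v in values]

dummies = make_dummies(["red", "green", "blue"])
```

With three colors, only two dummies (green, red) are created; a record with both set to 0 must be blue.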

Note from graph:

-At what weight will I find the error to be minimum?
-y axis = amount of error; x axis = weight
-local minimum = slight dip; if the error goes up again = gives you a minimum error/gives an answer or conclusion of what weight to use
-global minimum = minimum over all possible x values of weights
-flat area = misled/missed the minimum

NOTES FROM BOOK: Chapter 14 Association Rules cont...

Process of rule selection
-Process of selecting strong rules is based on generating all association rules that meet stipulated support & confidence requirements
-This is done in 2 steps:
1. First, find all "frequent" itemsets, those itemsets that have the requisite support
2. Second, generate from the frequent itemsets the association rules that meet a confidence requirement
Interpreting the results
-The support for the rule indicates its impact in terms of overall size: how many transactions are affected? If only a small # of transactions are affected, the rule may be of little use (unless the consequent is very valuable and/or the rule is very efficient in finding it)
Rules and Chance
-Two principles can guide us in assessing rules for possible spuriousness due to chance effects:
1. The more records the rule is based on, the more solid the conclusion
2. The more distinct rules we consider seriously (perhaps consolidating multiple rules that deal w/ the same items), the more likely it is that at least some will be based on chance sampling results
Data Type and Format
-Collaborative filtering requires availability of all item-user info.

Learning Rate

-Will be adjusted also; if the learning rate adjusts = change in momentum
-If the difference is large, need to change the weight substantially in the direction of the under- or over-prediction
-Way too low = set the learning rate high; otherwise go in small steps to get closest to the actual value
-The system tries to fit the training data: it uses the independent variables in a training record to generate weights & tries to predict the actual value in the dependent column (checks how far off it is)
Note: goes through a lot of iterations before it finds a smaller error
FROM BOOK:
-l is a learning rate or weight decay parameter, a constant ranging typically b/w 0 and 1, which controls the amount of change in weights from one iteration to the next
-User can control l and the momentum. The learning rate helps avoid overfitting by down-weighting new info. This helps tone down the effect of outliers on the weights and avoids getting stuck in local optima
-Suggestion: start w/ a large l (moving away from the initial weights, thereby "learning quickly" from the data) and then slowly decrease it as the iterations progress and the weights become more reliable

Biological Neural Networks

•Fundamental component is a neuron cell •Receives input charge through dendrites from the input layer •The neuron "weighs" the inputs •The neuron "transforms" to an output charge •Outputs the charge to the next layer

