Decision trees pt 3, neural networks, and whatever is missing
•precision (% of predicted positives that are actually positive)
•recall (% of actual positives that are predicted positive)
•log-likelihood (log of the probability of observing the data given the model)
•Design of a learning element is affected by:
-Which components of the performance element are to be learned
-What feedback is available to learn these components
-What representation is used for the components
Weaknesses of k-means clustering
•Sensitive to initialization: k-means is sensitive to the initial placement of centroids, which can lead to different results for different initializations.
•Assumes spherical clusters: k-means assumes that clusters are spherical in shape and have equal variance, which may not hold in real-world datasets.
•Requires a predefined number of clusters: the user must specify the number of clusters, which is challenging when the optimal number is not known in advance.
•Can converge to local optima: k-means can converge to a local optimum rather than the global optimum. To mitigate this (and the initialization sensitivity), run the algorithm multiple times with different initializations.
k-means algorithm (a Python sketch follows the steps)
1. Choose the number of clusters k
2. Select k random points from the data as the initial centroids
3. Assign each point to the closest cluster centroid
4. Recompute the centroids of the newly formed clusters
5. Repeat steps 3-4 until the assignments stop changing or a maximum number of iterations is reached
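A minimal NumPy sketch of these steps (toy, illustrative code; in practice one would use sklearn.cluster.KMeans):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # X is assumed to be an (n_samples, n_features) array
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2
    for _ in range(max_iters):
        # step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 5: stop on convergence
            break
        centroids = new_centroids
    return labels, centroids
```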
How many distinct decision trees with n Boolean attributes? = number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
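For example, with n = 2 attributes there are 2^(2^2) = 16 distinct Boolean functions; with n = 6 there are already 2^64 ≈ 1.8 x 10^19.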
Bernoulli Distribution
A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial, a random experiment that has only two outcomes (usually called a "success" or a "failure"). For example, the probability of getting heads (a "success") while flipping a fair coin is 0.5. The probability of "failure" is 1 - p (1 minus the probability of success, which also equals 0.5 for a coin toss).
Naive Bayes cons
•Assumes independence of features
•Limited expressiveness: Naive Bayes is a linear classifier, so it cannot capture complex relationships between features.
•Sensitivity to input data: it can be affected by outliers and noise in the data.
•Continuous features need extra handling: the basic model assumes discrete, categorical features, so continuous features must be discretized (or modeled with a variant such as Gaussian Naive Bayes).
•May require a large number of examples: in some cases, many examples are needed to accurately estimate the class probabilities.
Expected value for a Bernoulli random variable
E[X] = p; variance Var[X] = p(1-p)
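(Check: E[X] = 1*p + 0*(1-p) = p, and E[X^2] = 1^2*p + 0^2*(1-p) = p, so Var[X] = E[X^2] - (E[X])^2 = p - p^2 = p(1-p).)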
IG example: entropy of the parent node for the "energy" attribute. 30 instances: 16 go to the gym, 14 don't.
E(parent) = -(16/30)log2(16/30) - (14/30)log2(14/30) ≈ 0.997
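A quick numeric check of this entropy in Python (the entropy helper here is just for illustration):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(entropy([16, 14]))  # ~0.997 bits
```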
Gini impurity
•Suppose a particular split separates the data into two groups
•Group A: 12 instances of class 1 and 5 instances of class 0
GI_A = 1 - (12/17)^2 - (5/17)^2 ≈ 0.415
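And the matching check for the Gini impurity of group A (again just an illustrative helper):

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([12, 5]))  # ~0.415
```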
•To implement Choose-Attribute in the DTL algorithm, use information gain (IG) to decide which feature should be the root and which feature to split on after each split.
Train" model with stochastic gradientdescent for neural netowkr s
In standard Gradient Descent, the objective is to find the minimum of a cost function by iteratively updating the model parameters in the opposite direction of the gradient of the cost function. In contrast, SGD updates the model parameters for each training example (or a small subset of training examples) in a random fashion, rather than using the entire dataset to compute the gradient at each step.
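A minimal SGD sketch on a toy linear model (all names and data here are made up for illustration; a real neural network would use a framework optimizer such as torch.optim.SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 examples, 3 features (toy data)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):   # visit examples in random order
        pred = X[i] @ w
        grad = 2 * (pred - y[i]) * X[i] # gradient of squared error, one example
        w -= lr * grad                  # update immediately (stochastic step)
print(w)                                # should approach true_w
```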
Polytomous logistic regression (PLR) is a statistical technique used for modeling relationships between a categorical dependent variable with more than two categories and one or more independent variables.
It is an extension of binary logistic regression, which is used for modeling binary outcomes (i.e., outcomes with two categories). PLR models the relationship between the dependent variable and the independent variables by estimating a separate set of coefficients for each category of the dependent variable. The coefficients for each category are then compared to estimate the odds of one category relative to another, with a reference category usually chosen for comparison.
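A minimal sketch of a polytomous (multinomial) logistic regression fit, assuming scikit-learn and using the 3-class iris data as a stand-in for "a categorical dependent variable with more than two categories":

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                    # 3 classes, 4 features
clf = LogisticRegression(max_iter=1000).fit(X, y)    # multinomial fit
print(clf.coef_.shape)   # one row of coefficients per class: (3, 4)
print(clf.predict(X[:5]))
```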
k-means clustering
K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into a fixed number of clusters. It is a simple and fast algorithm. It works by iteratively assigning each data point to the nearest cluster centroid, and then updating the centroid based on the mean of the assigned points. The process continues until the cluster assignments no longer change or a maximum number of iterations is reached.
maximum likelihood for Bernoulli: the maximum likelihood method can be used to estimate the unknown parameter p, which represents the probability of success in a single trial
L(p|x) = p^k * (1-p)^(n-k) where k is the number of successes in the sample and (n-k) is the number of failures
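Taking logs and setting the derivative to zero gives the closed-form estimate:
log L(p|x) = k*log(p) + (n-k)*log(1-p)
d/dp log L = k/p - (n-k)/(1-p) = 0, which gives the maximum likelihood estimate p̂ = k/n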
The main advantage of SGD over Gradient Descent is its computational efficiency, especially for large-scale datasets. By updating the parameters for each training example (or a small subset of examples),
SGD can converge much faster than Gradient Descent, as it takes advantage of the stochastic nature of the training data to avoid getting stuck in local minima. However, the resulting trajectory may be more noisy than the one obtained from the full batch method.
Naive Bayes pros
•Simple and easy to implement
•Fast and efficient
•Good performance
•Handles irrelevant features
•Can work with small datasets: it can handle missing data and perform well with a small number of examples
Strengths/weaknesses of k-means
•simple to compute, and the method is easy to explain
•need to specify K, which may be unknown
•only finds convex clusters
Strengths:
•Scalability: k-means is a relatively simple and fast algorithm that can be applied to large datasets.
•Ease of use: k-means is easy to implement and can be applied to a wide range of problems, making it a popular choice for clustering tasks.
•Well-separated clusters: k-means tends to produce well-separated clusters, which makes the results easy to interpret and understand.
Maximum likelihood training
•Tends to overfit
•Not defined if d > n
•Feature selection
Clustering
We want to group data; we may not necessarily have labels.
Naïve Bayes prediction
•Usually add a small constant (e.g. 0.5) to the counts to avoid divide-by-zero problems and to reduce bias.
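A sketch of that smoothing on toy counts (smoothed_prob and the numbers are made up for illustration; this is Laplace/Lidstone-style smoothing with alpha = 0.5):

```python
def smoothed_prob(feature_count, class_count, n_values, alpha=0.5):
    # P(feature value | class), with alpha added to every count
    return (feature_count + alpha) / (class_count + alpha * n_values)

# a feature value never seen with this class still gets nonzero probability
print(smoothed_prob(0, 20, 2))   # 0.5 / 21 ≈ 0.024 instead of 0
```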
Backpropagation starts with the output and works backward: adjust the weights of the neural network so as to minimize the difference between the predicted output and the actual output.
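A tiny hand-rolled backprop sketch (toy data and layer sizes; real code would rely on autograd, e.g. PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # toy inputs
y = rng.normal(size=(8, 1))                 # toy targets
W1 = rng.normal(size=(2, 4))                # input -> hidden weights
W2 = rng.normal(size=(4, 1))                # hidden -> output weights
lr = 0.05
for _ in range(200):
    h = np.tanh(X @ W1)                     # forward: hidden activations
    pred = h @ W2                           # forward: output
    err = pred - y                          # gradient of squared error w.r.t. pred
    gW2 = h.T @ err                         # backward: output layer first...
    gh = err @ W2.T * (1 - h**2)            # ...then back through tanh
    gW1 = X.T @ gh                          # ...to the input-layer weights
    W1 -= lr * gW1                          # gradient step on both layers
    W2 -= lr * gW2
print(float((err**2).mean()))               # loss should have decreased
```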
Training a model on a dataset refers to the process of using a set of input data (features)
along with their corresponding output labels to teach the model to make accurate predictions.
The decision tree algorithm starts with the entire training dataset at the root node, and recursively splits the data into smaller subsets based on the values of the input features. At each node, the algorithm selects the feature
and the threshold value that best separates the data into different classes. The process continues until the subsets are pure, meaning that all instances in a subset belong to the same class, or until a stopping criterion is met.
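A skeletal recursive implementation of that idea, using weighted Gini impurity as the split criterion (illustrative only; in practice use sklearn.tree.DecisionTreeClassifier):

```python
def gini(labels):
    return 1 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def build_tree(rows, labels, depth=0, max_depth=3):
    # stop when the subset is pure or the stopping criterion (depth) is met
    if len(set(labels)) == 1 or depth == max_depth:
        return max(set(labels), key=labels.count)      # leaf: majority class
    best = None
    for f in range(len(rows[0])):                      # try each feature...
        for t in {r[f] for r in rows}:                 # ...and each threshold
            l = [i for i, r in enumerate(rows) if r[f] <= t]
            r_ = [i for i in range(len(rows)) if i not in l]
            if not l or not r_:
                continue
            score = (len(l) * gini([labels[i] for i in l]) +
                     len(r_) * gini([labels[i] for i in r_])) / len(rows)
            if best is None or score < best[0]:        # keep the best split
                best = (score, f, t, l, r_)
    if best is None:                                   # no useful split found
        return max(set(labels), key=labels.count)
    _, f, t, l, r_ = best
    return (f, t,
            build_tree([rows[i] for i in l], [labels[i] for i in l], depth + 1, max_depth),
            build_tree([rows[i] for i in r_], [labels[i] for i in r_], depth + 1, max_depth))

# e.g. build_tree([[2.7, 1.0], [1.3, 3.2], [3.6, 0.4]], [1, 0, 1])
```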
Learning curve = % correct on test set
as a function of training set size
Maximum likelihood is a statistical method used to
estimate the parameters of a probability distribution that best explain the observed data. The maximum likelihood method involves finding the set of parameter values that maximize the likelihood function, which is a function that measures the goodness of fit between the observed data and the theoretical probability distribution.
RBF (radial basis function) neural network
Used for function approximation.
GLM stands for Generalized Linear Models, which is a class of regression models that extends the linear regression framework to
handle non-normal response variables and non-constant variance. GLMs can handle response variables that follow different distributions such as binomial, Poisson, and gamma.
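A small GLM example, assuming statsmodels, with a Poisson response and simulated toy counts:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.7 * x))   # Poisson counts, log link
X = sm.add_constant(x)                   # intercept + one predictor
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.params)                      # should be near [0.3, 0.7]
```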
In machine learning, learning refers to the process of training a model on a dataset
in order to make accurate predictions on new, unseen data. The goal of learning is to enable the model to capture the underlying patterns and relationships in the data, so that it can generalize well to new data.
Patrons has the highest IG of all attributes and so
is chosen by the DTL algorithm as the root
To find the maximum likelihood estimate of the parameters, we need to
maximize the likelihood function with respect to θ. This is often done by taking the logarithm of the likelihood function and then differentiating it with respect to the parameters to obtain the maximum likelihood estimates. The resulting estimates are the values of θ that make the observed data most likely under the assumed probability distribution.
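A quick numeric illustration of this recipe for the Bernoulli case (toy numbers), maximizing the log-likelihood over a grid and comparing with the closed form k/n:

```python
import numpy as np

k, n = 7, 10                          # 7 successes in 10 trials (toy numbers)
p = np.linspace(0.001, 0.999, 999)    # candidate parameter values
loglik = k * np.log(p) + (n - k) * np.log(1 - p)
print(p[np.argmax(loglik)])           # ~0.7, i.e. k/n
```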
Rationale for using neural networks
•nonlinearity
•uncertainty
•complexity
•parallel/distributed processing
•fault tolerance
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.
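A minimal convolutional-layer sketch, assuming PyTorch (the shapes are illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(4, 1, 28, 28)   # batch of 4 single-channel 28x28 images
out = conv(x)                   # each of the 8 filters produces one feature map
print(out.shape)                # torch.Size([4, 8, 28, 28])
```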
In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties independently contribute to the probability that the fruit is an apple, and that is why it is known as "naive".
In R, we use the glm() function to fit a logistic regression model, where response is the binary response variable and predictor1 and predictor2 are the two predictor variables.
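The notes refer to R's glm(); here is a rough Python analogue with statsmodels' formula API, on simulated toy data (the column names follow the text):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"predictor1": rng.normal(size=100),
                   "predictor2": rng.normal(size=100)})
logit_p = 0.5 * df["predictor1"] - 1.0 * df["predictor2"]
df["response"] = (rng.random(100) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.glm("response ~ predictor1 + predictor2", data=df,
                family=sm.families.Binomial()).fit()
print(model.summary())
```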
An important part of every Bernoulli trial is that each action must be independent. That means the
probabilities must remain the same throughout the trials; each event must be completely separate and have nothing to do with the previous event.