ML CS7641 Final
Cons of K Means
- Can get stuck in local minima
- Exponential number of potential iterations
- Have to select the number of clusters in advance
- Memory intensive
- Only supports spherical clusters
The Nash equilibrium leads to a set of beautiful properties about these types of games...
- In an n-player pure-strategy game, if iterated elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique Nash equilibrium.
- More generally, any Nash equilibrium will survive the elimination of strictly dominated strategies.
- If n is finite, and all s_i (the sets of possible strategies for player i) are finite, then there exists at least one Nash equilibrium (which may involve mixed strategies).
Markov Decision Process Structure
- The environment can be described by a particular state, s.
- The agent can take actions in the world based on its state: A(s).
- The model of the world is described by a transition function, T, which (probabilistically) describes how the environment responds to the action.
- The actions are rewarded (or punished) based on their outcome and the resulting state.
K Means Clustering
1. Pick k random cluster centers.
2. Associate each point with its closest center.
3. Recompute the centers by averaging the points assigned to them.
4. Repeat from Step 2 until convergence, that is, until the cluster centers no longer move.
Desirable Properties of Clustering Algorithms
1. Richness: Ideally, our clustering algorithm would be able to associate any number of points with any number of clusters depending on our inputs and distance metric. Formally, richness means that for any clustering c there is some distance metric D that produces it: ∀c, ∃D : P_D = c.
2. Scale-invariance: If we were to upscale our distances by some arbitrary constant, the clustering should not change; intuitively, the "units" of our feature space (miles vs. kilometers, for example) shouldn't matter. Formally, a clustering algorithm is scale-invariant if ∀D, ∀k > 0 : P_D = P_{kD}.
3. Consistency: Given a particular clustering, we would expect that "compressing" or "expanding" the points within a cluster (think squeezing or stretching a circle) would not make any point magically assign itself to another cluster. In other words, shrinking intra-cluster distances or expanding inter-cluster distances should not change the clustering.
Feature Relevance
A feature x_i is:
1. strongly relevant if removing it degrades the Bayes optimal classifier;
2. weakly relevant if it's not strongly relevant (obviously) and there's some subset of features S such that adding x_i to S improves the Bayes optimal classifier; and
3. irrelevant otherwise.
Single Linkage Clustering (overview and complexity)
1. Treat each object as a cluster (start with n clusters).
2. Define the inter-cluster distance as the closest possible distance between the two clusters (that is, the minimum distance over all pairs of points, one from each cluster).
3. Merge the two closest clusters.
4. Repeat n - k times to end up with k clusters.
Complexity: O(n^3).
Good Feature Transformation
A good feature transformation will combine features together (like lumping words with similar meanings or overarching concepts together), providing a more compact, generalized, and efficient way to query things. For example, a query for the word "car" should rank documents with the word "automobile" (a synonym), "Tesla" (a brand), and even "motor" (an underlying concept) higher than a generic term like "turtle."
ICA Projection
Aims to find a projection that minimizes the mutual information between each pair of new features while maximizing the mutual information between the new features and the original features.
Folk Thm
Any feasible payoff profile that strictly dominates the minimax profile can be realized as a Nash equilibrium payoff profile, given a sufficiently large discount factor.
2-player, zero-sum, finite, deterministic game example
At each point in time, Agent A makes an action choice, and then Agent B gets a choice. Eventually, this series of choices leads to some final reward for A and the opposite reward for B.
Hierarchical clustering with complete link
At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.
Hierarchical clustering with average link
In each iteration, average-link clustering merges the pair of clusters with the highest cohesion, i.e., the smallest average distance between their points.
The key equation to RL
Bellman Equation
Principal Component Analysis (PCA)
Breaking down a feature set into principal components
Suppose we ran the prisoner's dilemma multiple times. Intuitively, once you saw that I would defect regardless (and likewise for you), wouldn't it make sense for us to collectively decide to cooperate to reduce our sentences?
Consider the very last repeat of the experiment. At this point you may have built up some trust in your criminal partner, and you think you can trust them to cooperate. Well, isn't that the best time to betray them and get off scot-free? Yes (since guilt has no utility), and by that same rationale your partner would do likewise, and we're back where we started. This leads to another property of the Nash equilibrium: If you repeat the game n times, you will get the same Nash equilibrium for all n times
Why perform feature selection?
Curse of dimensionality: the amount of data that we need grows exponentially with the number of feature dimensions
K Means Complexity
Each iteration takes polynomial time: O(kn). There's a finite (though exponential) number of iterations: O(k^n).
T/F: Tit for Tat is subgame perfect
F, dependent on previous history
T/F: Error always decreases with each K Means iteration
False. Error is only guaranteed to decrease if ties are broken consistently.
T/F: Both K Means and EM are guaranteed to converge
False. K Means is, EM is not. However, EM will never get worse; it might just get better at increasingly slow rates, which can be treated as convergence.
Matrix structure of ICA
First, ICA samples each feature in the dataset to form a matrix. Each row in the matrix is a feature, and each column is a sample
Random Component Analysis
Generates random directions and projects the data onto them. It works remarkably well for classification because enough random components can still capture correlations in the data, though the resulting m-dimensional space is bigger than the one found by other methods. Its primary benefit is speed.
Von Neumann Theorem
In a 2-player, zero-sum, finite game with perfect information, following the minimax strategy is equivalent to following the inverse "maximin" strategy (i.e., it doesn't matter which player commits to a strategy first): minimax ≡ maximin, and there always exists an optimal pure strategy for each player.
When to use model-free approach?
In general, it would be advisable to use model-based approaches if you have access to the model of the world. Unfortunately, in complex situations it can be more difficult to discover and define this model than to use model-free approaches, such as Q-learning. Model-free approaches can work well, too, but will often take many more iterations and more wall clock time. Additionally, they can be less reliable in the optimality of their solutions and require more diligent exploration and tuning of hyperparameters.
the clustering problem
Input: a set of objects X, and a distance metric D(·, ·) defining inter-object distances such that D(x, y) = D(y, x) for x, y ∈ X.
Output: a partitioning P_D of the objects such that P_D(x) = P_D(y) if x and y belong to the same cluster.
Hierarchical clustering with single link
It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other. A drawback of this method is that it tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters. This may lead to difficulties in defining classes that could usefully subdivide the data.
How is an MDP different from RL
With an MDP, we knew everything there is to know about the world: the states, the transitions, the rewards, etc. In reality, though, a lot of that is hidden from us: only after taking (or even trying to take) an action can we see its effect on the world (our new state) and the resulting reward.
Policy
Maps states to actions
How does Q Learning choose actions?
Neither always choosing the optimal action nor always choosing a random action actually incorporates what we've learned into our decision. For this, we can actually use our estimate Q̂, but we need to be careful: relying on our (potentially faulty, unconverged) knowledge of actions might lead us to simply reinforce the wrong actions. So we sometimes explore by choosing a random action.
Non-deterministic Game
Outcomes are not always determined by the agents' choices; chance plays a role.
Q function
Q(s, a) = R(s) + γ * Σ_{s'} [ T(s, a, s') * max_{a'} Q(s', a') ]
Expectation Maximization (EM)
Soft clustering technique similar to k-means clustering. There are two parts iterating hand-in-hand: adjusting the cluster probabilities z_i (expectation) and the chosen means µ (maximization). The process is called expectation maximization. In fact, k-means is a special case of expectation maximization.
Pavlovian Strategy
Start off cooperating: as long as you cooperate when we do, we will continue to cooperate. If you snitch, we will also snitch, but if you then decide to cooperate, we will still snitch. This will continue until you snitch again, in which case we will revert back to cooperation as a sort of olive branch of peace.
Single ML Gaussian
Suppose k = 1 for the simplest possible case. What Gaussian maximizes the likelihood of some collection of points? Conveniently (and intuitively), it's simply the Gaussian with µ set to the mean of the points!
T/F: GMM has the capability of handling overlapping clusters.
T
T/F: Pavlov is subgame perfect
T, the average reward is always mutual cooperation.
Wrapping (Feature Selection): Pros and Cons
The natural perk of involving the learner in a feedback loop of feature search is that model bias and learning are taken into account. Naturally, though, this makes each iteration take much longer.
T/F: In an MDP, we have no understanding of how our immediate actions will lead to things down the road.
True
Bellman Equation
U(s) = R(s) + γ * max_{a} Σ_{s'} [ T(s, a, s') * U(s') ]
Value Iteration Equation
U_{t+1}(s) = R(s) + γ * max_{a∈A(s)} Σ_{s'} [ T(s, a, s') * U_t(s') ]
U_t: the current estimate of a state's true utility; the max over actions picks the best action.
k ML Gaussian
With k possible sources for each point, we'll introduce hidden variables for each point that represent which cluster they came from. They're hidden because, obviously, we don't know them: if we knew that information we wouldn't be trying to cluster them. Now each point x is actually coupled with the probabilities of coming from the clusters
Computational Folk Theorem
You can build a Pavlov-like machine for any game and construct a subgame perfect Nash equilibrium in polynomial time
minimax profile
a pair of payoffs, one for each player, representing the payoff each player can achieve while defending itself from a malicious adversary. In other words, reform the game to be zero-sum.
Prisoner's Dilemma
a particular "game" between two captured prisoners that illustrates why cooperation is difficult to maintain even when it is mutually beneficial; mutual defection is the Nash equilibrium.
Nash Equilibrium
a set of strategies is a Nash equilibrium if no one player would change their strategy given the opportunity. This concept works for both pure and mixed strategies
strictly dominated strategy
a strategy is strictly dominated if it leads to a payoff that is worse than the payoffs from some other available strategy, regardless of the other player's choice
Model-Free RL
a value-based approach that learns utilities to relate states to actions: model → simulator → transitions → policy
What are game theory strategies?
akin to policies for MDP
subgame perfect equilibrium
always take the best response, independent of any historical responses. We can think of subgame perfect strategies as eventually equalizing again, whereas ones that are not subgame perfect are extremely fragile and dependent on very ideal sequences: the difference between the stability of a marble at the bottom of a ∪-shaped parabola vs. one barely balanced at the exact peak of a ∩-shaped one.
Principal Component
direction along which points in that feature space have the greatest variance
Mixed Strategy
a probability distribution over strategies. In the context of mini-poker, for example, Agent A could choose to be a holder half of the time and a resigner the other half.
Soft Clustering
each point now belongs to each of the possible clusters with some probabilistic certainty.
Critical Assumption of Game Theory
everyone behaves optimally, and this means that they behave greedily with respect to their utility values. The key that lets this work in the real world is that the utility value of someone staying out of jail might not be only the raw amount of months they avoided
Feasible Payoff
feasible payoff is one that can be reached by some combination of the extremes (by varying probabilities, etc.). This is simply the convex hull of the points defined in the value matrix on a "player plot"
Downside of PCA
features with high variance don't necessarily correlate to features with high importance. You might have a bunch of random noise with one useful feature, and PCA will almost certainly drop the useful feature since the random noise has high variance
Goal of PCA
find vectors of maximal variance (and correlation) to aid in data reconstruction
Linear Discriminant Analysis
finds a projection that discriminates based on the label. In other words, it aims to find a collection of good linear separators on the data
Unsupervised Learning
focused on inferring patterns from the data alone: we want to make sense of unlabeled data
Types of Wrapping (feature selection)
forward search and backward search
Bellman Equation
fully encodes the utility of a state in an MDP
Clustering
given a set of objects, we want to divide them into groups
mechanism design
if we know the outcome, we can try to manipulate the game itself to change the outcome to what we want
grim trigger
in this strategy, we guarantee cooperation for mutual benefit as long as our partner/opponent doesn't "cross the line"; if they do, we will deal out vengeance forever. In the context of the prisoner's dilemma, A cooperates until B snitches, at which point A will always snitch
Tit for Tat
involves playing cooperatively at first, then doing whatever the other player did in the previous period
Fundamental Problem behind Q Learning
is that we don't have R or T, but we do have samples about the world: ⟨s, a, r, s'⟩ is the reward and new state that result from taking action a in state s. Given enough tuples, we can estimate the Q function.
Optimal Mixed Strategy Probability
it's the probability at the maximum of the lower envelope of the possible profit-vs-probability lines.
Value function
maps states to utilities
Goal of ICA
maximize independence
Why use ICA over PCA?
ICA does not restrict itself to orthogonal components; it seeks statistically independent components rather than merely uncorrelated ones.
Markovian Property
only the present matters: notice that the new state in transition function only depends on the previous state. Furthermore, the environment in which the agent can act stays static
Assumption of Von Neumann Thm
optimality: all agents are behaving optimally, and thus are always trying to maximize their respective reward
Subsequent principal components are...
orthogonal to the previous components and describe the "next" direction of maximum residual variance
Infinite Horizon
our agent can live and simulate forever and their optimal policy assumes there's infinite time to execute its plans
Policy vs Value Iteration Convergence
policy iteration reliably converges in fewer iterations than value iteration, but not always in less wall clock time. In larger worlds (large forest management) value iteration can converge to the same valued result in less wall clock time than policy iteration. However, in smaller worlds, policy iteration clocks in lower for both number of iterations and wall clock time (while achieving same/similar results)
Feature Transformation
process a set of features to create a new, optimized feature set while retaining as much relevant information as possible. This might sound like the same thing as feature selection, but the difference is subtle: the resulting feature set here can be completely new; it doesn't have to just be a subset of the original features.
Value Iteration
repeated iteration until convergence: we start with arbitrary utilities, then update them based on their neighbors until they converge. Because the transitions and rewards are actually true, we'll eventually converge, overwhelming our initially random guesses about utility. This simple process is called value iteration; the fundamental reason why value iteration works is that rewards propagate through their neighbors.
Policy Iteration
start with a guess Π_0 of how we should act in the world. Following this policy results in a particular utility, so we can find U_t = U^{Π_t}. Then, given that utility, we can figure out how to improve the policy by finding the action that maximizes the expected utility. With this formulation, we have n linear equations (the max_a is gone) in n unknowns, which is easily solvable.
The principal components of a matrix are...
the eigenvectors of the data's covariance matrix, ranked in importance by their eigenvalues. The eigenvectors can be found algorithmically through a bunch of methods.
True utility of a state
the true utility of a state is simply its immediate reward plus all discounted future rewards
Hidden Information Game
there is no longer a pure consistent strategy that works for both players, minimax != maximin
minimax
two-way process in which each player picks a strategy that minimizes the damage the opponent can do to them, i.e., maximizes their own worst-case payoff.
Foundation Behind Q Learning
using data that we learn about the world as we take actions in order to evaluate the Bellman equation.
Utility Assumption MDP
utility of sequences is Markovian: if we preferred a particular state today (when we started from s0), we actually always prefer it
Filtering (Feature Selection)
where a search process directly reduces the features to a smaller set and feeds it to a learning algorithm.
Wrapping (Feature Selection)
where the search process interacts with the learning algorithm directly to iteratively adjust the feature set
Fundamental Principle Behind Game Theory
you are not alone; you are often collaborating and competing with other agents to achieve various, potentially conflicting goals.
Policy Iteration Equation
Π_{t+1}(s) = arg max_{a∈A(s)} Σ_{s'} [ T(s, a, s') * U_t(s') ]
U_t: the utility of following the current policy Π_t; the arg max finds the best action.
Tit for Tat behavior when: 1. they always snitch 2. they always cooperate 3. they tit for tat 4. they snitch then cooperate
- Under the always-snitch opponent strategy, we'll cooperate, then snitch, snitch, ...
- If they always cooperate, we'll always cooperate.
- If they also do tit-for-tat, we'll again always cooperate.
- If they snitch, then cooperate, then snitch, etc., we'll do the exact opposite: cooperate, then snitch, then cooperate, etc.