ML CS7641 Final

Cons of K Means

- Can get stuck in local minima
- Exponential number of potential iterations
- Have to select the number of clusters in advance
- Memory intensive
- Only supports spherical clusters

The Nash equilibrium leads to a set of beautiful properties about these types of games...

- In an n-player pure strategy game, if elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique Nash equilibrium.
- More generally, any Nash equilibrium will survive the iterated elimination of strictly dominated strategies.
- If n is finite, and all S_i (the sets of possible strategies for player i) are finite, then there exists at least one Nash equilibrium (possibly involving mixed strategies).

Markov Decision Process Structure

- The environment can be described by a particular state, s.
- The agent can take actions in the world based on its state: A(s).
- The model of the world is described by a transition function, T, which (probabilistically) describes how the environment responds to an action.
- The actions are rewarded (or punished) based on their outcome and the resulting state.

K Means Clustering

1. Pick k random cluster centers.
2. Associate each point with its closest center.
3. Recompute the centers by averaging the points assigned to them.
4. Repeat from Step 2 until convergence, i.e., when the cluster centers no longer move.
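
A minimal NumPy sketch of this loop (the function name, variable names, and shapes are illustrative, not from the notes; empty-cluster handling is omitted for brevity):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # X: (n, d) array of points; k: number of clusters.
        rng = np.random.default_rng(seed)
        # 1. Pick k random points as the initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2. Associate each point with its closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute the centers by averaging the points assigned to them.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Stop when the cluster centers no longer move.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels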

Desirable Properties of Clustering Algorithms

1. Richness: Ideally, our clustering algorithm would be able to associate any number of points with any number of clusters depending on our inputs and distance metric. Formally, richness means that for any clustering c there is some distance metric D that produces it: ∀c, ∃D : P_D = c.
2. Scale-invariance: If we were to upscale our distances by some arbitrary constant, the clustering should not change; intuitively, the "units" of our feature space (miles vs. kilometers, for example) shouldn't matter. Formally, a clustering algorithm is scale-invariant if ∀D, ∀k > 0 : P_D = P_{kD}.
3. Consistency: Given a particular clustering, we would imagine that by "compressing" or "expanding" the points within a cluster (think squeezing or stretching a circle), none of the points would magically assign themselves to another cluster. In other words, shrinking intracluster distances and/or expanding intercluster distances should not change the clustering.

Feature Relevance

A feature x_i is:
1. strongly relevant if removing it degrades the Bayes optimal classifier;
2. weakly relevant if it's not strongly relevant and there's some subset of features S such that adding x_i to S improves the Bayes optimal classifier; and
3. irrelevant otherwise.

Single Linkage Clustering (overview and complexity)

1. Treat each object as a cluster (start with n clusters).
2. Define the inter-cluster distance as the closest possible distance between two clusters (that is, the minimum distance over all pairs of points, one from each cluster).
3. Merge the two closest clusters.
4. Repeat n - k times to end up with k clusters.
Complexity: O(n^3).
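
A hedged sketch using SciPy's hierarchical clustering rather than hand-coding the O(n^3) loop above (the data and the choice of k = 3 are illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(20, 2)                        # 20 illustrative 2-D points
    Z = linkage(X, method='single')                  # inter-cluster distance = closest pair of points
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into k = 3 clusters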

Good Feature Transformation

A good feature transformation will combine features together (like lumping words with similar meanings or overarching concepts together), providing a more compact, generalized, and efficient way to query things. For example, a query for the word "car" should rank documents with the word "automobile" (a synonym), "Tesla" (a brand), and even "motor" (an underlying concept) higher than a generic term like "turtle."

ICA Projection

Aims to find a projection satisfying two constraints: the mutual information between each pair of new features is minimized, while the mutual information between the new features and the original features is maximized.
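
One common way to compute such a projection is scikit-learn's FastICA; a minimal sketch (the data and shapes are illustrative):

    import numpy as np
    from sklearn.decomposition import FastICA

    X = np.random.rand(200, 5)                    # illustrative mixed signals, one sample per row
    ica = FastICA(n_components=5, random_state=0)
    S = ica.fit_transform(X)                      # new features chosen to be (approximately) independent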

Folk Theorem

Any feasible payoff profile that strictly dominates the minimax profile can be realized as a Nash equilibrium payoff profile, given a sufficiently large discount factor.

2-player, zero-sum, finite, deterministic game example

At each point in time, Agent A makes an action choice, and then Agent B gets a choice. Eventually, this series of choices leads to some final reward for A and the opposite reward for B.

Hierarchical clustering with complete link

At the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters until all elements end up being in the same cluster. The method is also known as farthest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place.

Hierarchical clustering with average link

Average-link clustering merges, in each iteration, the pair of clusters with the highest cohesion, i.e., the pair whose average distance over all inter-cluster pairs of points is smallest.

The key equation to RL

Bellman Equation

Principal Component Analysis (PCA)

Breaking down a feature set into principal components

Suppose we ran the prisoner's dilemma multiple times. Intuitively, once you saw that I would defect regardless (and you would likewise), wouldn't it make sense for us to collectively decide to cooperate to reduce our sentences?

Consider the very last repeat of the experiment. At this point you may have built up some trust in your criminal partner, and you think you can trust them to cooperate. Well, isn't that the best time to betray them and get off scot-free? Yes (since guilt has no utility), and by that same rationale your partner would do likewise, so we're back where we started. This leads to another property of the Nash equilibrium: if you repeat the game n times, you get the same Nash equilibrium all n times.

Why perform feature selection?

Curse of dimensionality: the amount of data that we need grows exponentially with the number of feature dimensions

K Means Complexity

Each iteration takes polynomial time: O(kn). There's a finite (though exponential) number of possible iterations: O(k^n).

T/F: Tit for Tat is subgame perfect

F, dependent on previous history

T/F: Error always decreases with each K Means iteration

False. Error is only guaranteed to decrease if ties are broken consistently.

T/F: Both K Means and EM are guaranteed to converge

False. K Means is, but EM is not. However, EM will never get worse; it might just get better at increasingly slow rates, which can be treated as convergence in practice.

Matrix structure of ICA

First, ICA samples each feature in the dataset to form a matrix. Each row in the matrix is a feature, and each column is a sample

Random Component Analysis

Generates random directions and projects the data onto them. It works remarkably well for classification because enough random components can still capture correlations in data, though the resulting m-dimensional space is bigger than the one found by other methods. Primary benefit is speed.
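
A minimal sketch using scikit-learn's Gaussian random projection, assuming illustrative data and an arbitrary choice of 50 components:

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection

    X = np.random.rand(500, 1000)                      # illustrative high-dimensional data
    rp = GaussianRandomProjection(n_components=50, random_state=0)
    X_small = rp.fit_transform(X)                      # project onto 50 random directions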

Von Neumann Theorem

In a 2-player, zero-sum, finite game with perfect information, following the minimax strategy (minimizing the opponent's maximum reward) is equivalent to following the maximin strategy (maximizing your own minimum reward): minimax ≡ maximin, and there always exists an optimal pure strategy for each player.
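
A tiny numeric check of minimax ≡ maximin for an illustrative zero-sum value matrix (the payoffs are made up and chosen so the game has a saddle point):

    import numpy as np

    # Illustrative value matrix: rows are A's pure strategies, columns are B's;
    # entries are A's payoff (B receives the negative).
    V = np.array([[3, -1],
                  [2,  1]])

    maximin = V.min(axis=1).max()   # A maximizes its worst-case (guaranteed) payoff
    minimax = V.max(axis=0).min()   # B minimizes A's best-case payoff
    print(maximin, minimax)         # both 1 here, so an optimal pure strategy exists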

When to use model-free approach?

In general, it is advisable to use model-based approaches if you have access to the model of the world. Unfortunately, in complex situations it can be more difficult to discover and define this model than to use model-free approaches, such as Q-learning. Model-free approaches can work well too, but will often take many more iterations and more wall-clock time. Additionally, they can be less reliable in the optimality of the solution and require more diligent exploration and hyperparameter tuning.

the clustering problem

Input: a set of objects X and a distance metric D(·, ·) defining inter-object distances, such that D(x, y) = D(y, x) for x, y ∈ X.
Output: a partitioning P_D of the objects such that P_D(x) = P_D(y) if x and y belong to the same cluster.

Hierarchical clustering with single link

It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other. A drawback of this method is that it tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters. This may lead to difficulties in defining classes that could usefully subdivide the data.

How is an MDP different from RL

An MDP assumes we know everything there is to know about the world: the states, the transitions, the rewards, etc. In reality, though, a lot of that is hidden from us: only after taking (or even trying to take) an action can we see its effect on the world (our new state) and the resulting reward.

Policy

Maps states to actions.

How does Q Learning choose actions?

Neither always choosing the optimal action nor always choosing a random action actually incorporates what we've learned into our decision. For this, we can actually use Q̂, but we need to be careful: relying on our (potentially faulty, unconverged) knowledge of actions might lead us to simply reinforce the wrong actions. So we sometimes explore by choosing a random action instead.

Non-deterministic Game

Outcomes do not always follow deterministically from the agents' choices.

Q function

Q(s, a) = R(s) + γ * Σ_{s'} [ T(s, a, s') * max_{a'} Q(s', a') ]

Expectation Maximization (EM)

A soft clustering technique similar to k-means clustering. There are two parts iterating hand in hand: adjusting the cluster-membership probabilities z_i and the cluster means µ. The process is called expectation maximization; in fact, k-means is a special case of expectation maximization.
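
A hedged sketch using scikit-learn's GaussianMixture, which fits the mixture with EM (the data and the choice of 3 components are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(300, 2)                            # illustrative data
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    soft = gmm.predict_proba(X)   # soft assignments: each point's probability of belonging to each cluster
    means = gmm.means_            # the fitted cluster means (the mu's)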

Pavlovian Strategy

Start off cooperating: as long as you cooperate when we do, we will continue to cooperate. If you snitch, we will also snitch, but if you then decide to cooperate we will still snitch. This will continue until you snitch again, at which point we will revert back to cooperation as a sort of olive branch of peace.

Single ML Gaussian

Suppose k = 1 for the simplest possible case. What Gaussian maximizes the likelihood of some collection of points? Conveniently (and intuitively), it's simply the Gaussian with µ set to the mean of the points!

T/F: GMM has the capability of handling overlapping clusters.

T

T/F: Pavlov is subgame perfect

True; no matter the history, two Pavlov players return to mutual cooperation, so the average reward is always that of mutual cooperation.

Wrapping (Feature Selection): Pros and Cons

The natural perk of involving the learner in a feedback loop of feature search is that model bias and learning are taken into account. Naturally, though, this makes each iteration take much longer.

T/F: In an MDP, we have no understanding of how our immediate action will lead to things down the road.

True

Bellman Equation

U(s) = R(s) + γ * max_a Σ_{s'} [ T(s, a, s') * U(s') ]

Value Iteration Equation

U_{t+1}(s) = R(s) + γ * max_a Σ_{s'} [ T(s, a, s') * U_t(s') ], where U_t is the current estimate of the true utility of each state.
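
A minimal NumPy sketch of this update, assuming T is an array of transition probabilities indexed by [s, a, s'] and R is a per-state reward vector (names and shapes are illustrative):

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        U = np.zeros(R.shape[0])                      # start from arbitrary utilities
        while True:
            # U_{t+1}(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * U_t(s')
            U_new = R + gamma * (T @ U).max(axis=1)
            if np.max(np.abs(U_new - U)) < tol:       # converged: utilities stopped changing
                return U_new
            U = U_new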

k ML Gaussian

With k possible sources for each point, we'll introduce hidden variables for each point that represent which cluster they came from. They're hidden because, obviously, we don't know them: if we knew that information we wouldn't be trying to cluster them. Now each point x is actually coupled with the probabilities of coming from the clusters

Computational Folk Theorem

You can build a Pavlov-like machine for any game and construct a subgame perfect Nash equilibrium in polynomial time

minimax profile

a pair of payoffs (one for each player) that represent the payoffs that can be achieved by a player defending itself from a malicious adversary; in other words, reformulate the game as zero-sum.

Prisoner's Dilemma

a particular "game" between two captured prisoners that illustrates why cooperation is difficult to maintain even when it is mutually beneficial. at nash equilibrium.

Nash Equilibrium

a set of strategies is a Nash equilibrium if no one player would change their strategy given the opportunity. This concept works for both pure and mixed strategies

strictly dominated strategy

a strategy is strictly dominated if it leads to a payoff that is worse than the payoffs from some other available strategy, regardless of the other player's choice

Model-Free RL

a value-based approach that learns utilities to relate states to actions: model → simulator → transitions → policy

What are game theory strategies?

akin to policies for MDPs

subgame perfect equilibrium

always take the best response independent of any historical responses. We can think of subgame-perfect strategies as eventually equalizing again, whereas ones that are not subgame perfect are extremely fragile and dependent on very ideal sequences: the difference between the stability of a marble at the bottom of a ∪-shaped parabola vs. one at the exact, barely balanced peak of an ∩-shaped one.

Principal Component

direction along which points in that feature space have the greatest variance

Mixed Strategy

a distribution over pure strategies. In the context of mini-poker, for example, Agent A could choose to be a holder half of the time and a resigner the other half.

Soft Clustering

each point now belongs to each of the possible clusters with some probabilistic certainty.

Critical Assumption of Game Theory

everyone behaves optimally, and this means that they behave greedily with respect to their utility values. The key that lets this work in the real world is that the utility value of someone staying out of jail might not be only the raw amount of months they avoided

Feasible Payoff

feasible payoff is one that can be reached by some combination of the extremes (by varying probabilities, etc.). This is simply the convex hull of the points defined in the value matrix on a "player plot"

Downside of PCA

features with high variance don't necessarily correlate to features with high importance. You might have a bunch of random noise with one useful feature, and PCA will almost certainly drop the useful feature since the random noise has high variance

Goal of PCA

find vectors of maximal variance (and correlation) to aid in data reconstruction

Linear Discriminant Analysis

finds a projection that discriminates based on the label. In other words, it aims to find a collection of good linear separators on the data
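
A minimal sketch with scikit-learn's LinearDiscriminantAnalysis on the iris data (the dataset and n_components=2 are just illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_proj = lda.fit_transform(X, y)   # projection chosen to separate the labels, unlike PCA/ICA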

Unsupervised Learning

focused on inferring patterns from the data alone: we want to make sense of unlabeled data

Types of Wrapping (feature selection)

forward search and backward search
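
A possible wrapper-style sketch of forward search using scikit-learn's SequentialFeatureSelector (the estimator, dataset, and n_features_to_select=2 are illustrative assumptions):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Forward search: greedily add the feature that most improves the wrapped learner's CV score.
    sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                    n_features_to_select=2, direction='forward')
    sfs.fit(X, y)
    print(sfs.get_support())   # mask of selected features; direction='backward' gives backward search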

Bellman Equation

fully encodes the utility of a state in an MDP

Clustering

given a set of objects, we want to divide them into groups

mechanism design

if we know the outcome, we can try to manipulate the game itself to change the outcome to what we want

grim trigger

in this strategy, we guarantee cooperation for mutual benefit as long as our partner/opponent doesn't "cross the line"; if they do, we will deal out vengeance forever. In the context of the prisoner's dilemma, A cooperates until B snitches, at which point A will always snitch

Tit for Tat

involves playing cooperatively at first, then doing whatever the other player did in the previous period

Fundamental Problem behind Q Learning

is that we don't have R or T, but we do have samples about the world: ⟨s, a, r, s'⟩ is the reward and new state that result from taking an action in a state. Given enough of these tuples we can estimate the Q function.
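
A minimal sketch of the resulting sample-based update, assuming a small tabular Q and a made-up tuple (the learning rate alpha and all names are illustrative):

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        # One Q-learning step from a sampled tuple <s, a, r, s'>: no R or T required.
        target = r + gamma * np.max(Q[s_next])   # bootstrapped estimate of future value
        Q[s, a] += alpha * (target - Q[s, a])    # move the old estimate toward the target
        return Q

    Q = np.zeros((5, 2))                         # illustrative: 5 states, 2 actions
    Q = q_update(Q, s=0, a=1, r=1.0, s_next=3)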

Optimal Mixed Strategy Probability

it's the probability that maximizes the lower bound on expected profit: on a plot of profit vs. probability, it is the peak of the lower envelope of the opponent's response lines (see the sketch below).
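
A small numeric sketch for an illustrative 2x2 game (the payoff matrix is made up): sweep the probability p, take the worst case over the opponent's responses, and pick the p that maximizes it.

    import numpy as np

    # Illustrative 2x2 value matrix: rows are A's two pure strategies, columns are B's.
    V = np.array([[ 1., -1.],
                  [-2.,  3.]])

    p = np.linspace(0, 1, 1001)                             # probability A plays its first strategy
    expected = np.outer(p, V[0]) + np.outer(1 - p, V[1])    # A's expected payoff vs. each B column
    lower_bound = expected.min(axis=1)                      # worst case over B's responses
    best_p = p[lower_bound.argmax()]                        # peak of the lower envelope
    print(best_p, lower_bound.max())                        # ~0.714 and ~0.143 for this matrix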

Value function

maps states to utilities

Goal of ICA

maximize independence

Why use ICA over PCA?

The main advantage of this method over PCA is that it does not restrict itself to orthogonal components.

Markovian Property

only the present matters: notice that the new state in transition function only depends on the previous state. Furthermore, the environment in which the agent can act stays static

Assumption of Von Neumann Thm

optimality: all agents are behaving optimally, and thus are always trying to maximize their respective reward

Subsequent principal components are...

orthogonal to the previous components and describe the "next" direction of maximum residual variance

Infinite Horizon

our agent can live and simulate forever and their optimal policy assumes there's infinite time to execute its plans

Policy vs Value Iteration Convergence

policy iteration reliably converges in fewer iterations than value iteration, but not always in less wall clock time. In larger worlds (large forest management) value iteration can converge to the same valued result in less wall clock time than policy iteration. However, in smaller worlds, policy iteration clocks in lower for both number of iterations and wall clock time (while achieving same/similar results)

Feature Transformation

process a set of features to create a new, optimized feature set while retaining as much relevant information as possible. This might sound like the same thing as feature selection, but the difference is subtle: the resulting feature set here can be completely new; it doesn't have to just be a subset of the original features.

Value Iteration

repeated iteration until convergence: we'll start with arbitrary utilities, then update them based on their neighbors until they converge. Because the transitions and rewards are actually true, we'll eventually converge by overwhelming our initially random guesses about utility. This simple process is called value iteration; the fundamental reason why value iteration works is that rewards propagate through their neighbors.

Policy Iteration

start with a guess Π_0 of how we should act in the world. Following this policy will result in a particular utility, so we can find U_t = U^{Π_t}. Then, given that utility, we can improve the policy by finding the action that maximizes the expected utility. With this formulation, we have n linear equations (the max_a is gone) in n unknowns, which is easily solvable. We repeat with the improved policy until it stops changing.

The principal components of a matrix are...

the eigenvectors of the points, ranked in importance by their eigenvalues. The eigenvectors can be found algorithmically through a bunch of methods.
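
One such method is a direct eigendecomposition of the covariance matrix; a minimal NumPy sketch (the data and the choice of two components are illustrative):

    import numpy as np

    X = np.random.rand(200, 5)                 # illustrative data, one sample per row
    Xc = X - X.mean(axis=0)                    # center the features
    cov = np.cov(Xc, rowvar=False)             # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the symmetric covariance
    order = np.argsort(eigvals)[::-1]          # rank components by eigenvalue (variance explained)
    components = eigvecs[:, order]             # columns are the principal components
    X_proj = Xc @ components[:, :2]            # project onto the top two components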

True utility of a state

the true utility of a state is simply its immediate reward plus all discounted future rewards

Hidden Information Game

there is no longer a pure, consistent strategy that works for both players; minimax ≠ maximin

minimax

two-way process of picking a strategy such that it minimizes the impact your opponent could have

Foundation Behind Q Learning

using data that we learn about the world as we take actions in order to evaluate the Bellman equation.

Utility Assumption MDP

utility of sequences is Markovian: if we preferred a particular state today (when we started from s_0), we actually always prefer it

Filtering (Feature Selection)

where a search process directly reduces the features to a smaller set and feeds it to a learning algorithm.

Wrapping (Feature Selection)

where the search process interacts with the learning algorithm directly to iteratively adjust the feature set

Fundamental Principle Behind Game Theory

you are not alone, you are often collaborating and competing with other agents to achieve various and potentially conflicting goals

Policy Iteration Equation

Π_{t+1}(s) = arg max_{a∈A(s)} Σ_{s'} [ T(s, a, s') * U_t(s') ], where U_t = U^{Π_t} is the utility of following the current policy; this finds the best action for each state.
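
A minimal NumPy sketch of the two alternating steps, assuming T is indexed by [s, a, s'] and R is a per-state reward vector (names and shapes are illustrative):

    import numpy as np

    def policy_iteration(T, R, gamma=0.9):
        n_states = R.shape[0]
        policy = np.zeros(n_states, dtype=int)               # arbitrary initial guess Pi_0
        while True:
            # Policy evaluation: solve the n linear equations U = R + gamma * T_pi U (no max needed).
            T_pi = T[np.arange(n_states), policy]             # transitions under the current policy
            U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
            # Policy improvement: pick the action maximizing expected utility in each state.
            new_policy = (T @ U).argmax(axis=1)
            if np.array_equal(new_policy, policy):            # converged: the policy stopped changing
                return policy, U
            policy = new_policy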

Tit for Tat behavior when: 1. they always snitch 2. they always cooperate 3. they tit for tat 4. they snitch then cooperate

- Under the always-snitch opponent strategy, we'll cooperate once, then snitch, snitch, . . .
- If they always cooperate, we'll always cooperate.
- If they also do tit-for-tat, we'll again always cooperate.
- If they snitch, then cooperate, then snitch, etc., we'll do the exact opposite: cooperate, then snitch, then cooperate, etc.
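
A tiny simulation of these cases ('C' = cooperate, 'D' = snitch/defect; the encoding is an illustrative choice):

    def tit_for_tat(opponent_history):
        # Cooperate first, then copy the opponent's previous move.
        return 'C' if not opponent_history else opponent_history[-1]

    for opponent in (['D', 'D', 'D', 'D'],      # always snitch
                     ['C', 'C', 'C', 'C'],      # always cooperate
                     ['D', 'C', 'D', 'C']):     # alternate snitch/cooperate
        ours = [tit_for_tat(opponent[:i]) for i in range(len(opponent))]
        print(opponent, '->', ours)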

