COMP 560 Midterm: A.I. Artificial Intelligence
Quiescence search
(Mitigates the horizon effect) Continue the search past the nominal depth limit until the position becomes stable ("quiet"). In chess, this typically means resolving all pending piece captures before applying the static evaluation.
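A rough negamax-style sketch of the idea, not course code: evaluate, generate_captures, and make_move are hypothetical helper names, and the evaluation is assumed to be from the side-to-move's perspective.

# Quiescence search sketch: at depth 0, keep exploring capture moves only,
# so the static evaluation happens on "quiet" positions.
def quiescence(position, alpha, beta, evaluate, generate_captures, make_move):
    stand_pat = evaluate(position)           # static evaluation of the current position
    if stand_pat >= beta:
        return beta                          # already too good; opponent avoids this line
    alpha = max(alpha, stand_pat)
    for capture in generate_captures(position):
        score = -quiescence(make_move(position, capture), -beta, -alpha,
                            evaluate, generate_captures, make_move)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha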
most constraining variable
(Used to improve backtracking) Picks the variable involved in the largest number of constraints on the remaining unassigned variables (the degree heuristic); often used as a tie-breaker for the most constrained variable heuristic.
Least Constraining Value
(Used to improve backtracking) Chooses the value that rules out the fewest values in the variables connected to the current variable by constraints. A good rule of thumb in CSP search: choose the most constrained variable, but the least constraining value.
Admissibility (of A* heuristic)
An admissible heuristic never overestimates the cost of reaching the goal: h(n) is less than or equal to the true cost from n to the goal.
Games: Perfect Information
Monopoly, chess, checkers, tic-tac-toe (you can see the whole board)
Observable Environment
The agent always knows the current state. Examples: chess, crosswords. Partially observable: poker. Counterexample: guessing games.
Known Environment
The agent knows which states are reached by each action. Example: checkers. Counterexample: an unknown environment (which may require some exploration of the solution space).
RL: Basic Process
1. At each time step, take some action. 2. Observe the resulting state and reward. 3. Update the internal representation of the environment and the policy. 4. Upon reaching a terminal state, start over. (Especially at the beginning, this is mostly failure.)
Markov Decision Process
1. States S, with an initial state s_0 2. Actions A(s), the actions available in state s 3. Transition model P(s' | s, a), the probability of ending up in state s' given current state s and action a 4. Reward function R(s, a, s'), the immediate reward for moving from s to s' via action a 5. Policy pi(s), which returns the action to take in a given state. In short: a set of states, a set of actions, a transition probability function, and an immediate reward function.
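To make the pieces concrete, here is a tiny hand-invented MDP written as plain Python dictionaries; all states, probabilities, and rewards below are made up for illustration, not taken from the course.

# A minimal MDP: states, available actions, transition probabilities, rewards, policy.
states = ["s0", "s1", "terminal"]
actions = {"s0": ["stay", "go"], "s1": ["go"], "terminal": []}

# P[(s, a)] -> list of (s_next, probability)
P = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "go"):   [("terminal", 1.0)],
}

# R[(s, a, s_next)] -> immediate reward
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "go", "s1"):   1.0, ("s0", "go", "s0"):   0.0,
    ("s1", "go", "terminal"): 10.0,
}

# A policy maps each non-terminal state to an action.
policy = {"s0": "go", "s1": "go"}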
Local Search: How not to get stuck
1. random restarts 2. multiple random starting positions 3. random sideways jumps 4. simulated annealing 5. beam search 6. genetic algorithm
Basic steps for local search
1. start with an initial assignment 2. iterations are guided by an objective function trying to move state toward the goal state 3. at each iteration, choose any conflicted variable and re-assign it to a value that reduces the number of conflicts (violated constraints)
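A minimal min-conflicts sketch of these steps, assuming a CSP given as variables, per-variable domains, and a conflicts(var, value, assignment) counter (all names are assumptions, not course code):

import random

def min_conflicts(variables, domains, conflicts, max_steps=10_000):
    # 1. Start with a complete (random) initial assignment.
    assignment = {v: random.choice(domains[v]) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables if conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment                     # goal state: no violated constraints
        # 3. Pick any conflicted variable and re-assign it to the value
        #    that minimizes the number of violated constraints.
        var = random.choice(conflicted)
        assignment[var] = min(domains[var],
                              key=lambda val: conflicts(var, val, assignment))
    return None                                   # no solution found within the step budget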
Minimax
Algorithm to choose the next move in a 2-player, zero-sum game. At each decision point in the tree the player to move chooses their best possible move; recursively this is the maximum of minimums of maximums of... In a simple game like tic-tac-toe, exact values can be assigned to win, loss, or draw outcomes. In a more complex game like chess, where it is impossible to fully traverse the game tree, minimax has to terminate at an intermediate depth and evaluate the static position. This evaluation is an estimate of the final outcome given all current information; a very basic chess static evaluator could use the difference in total piece values. For deterministic games with too many possibilities to consider, we can only use a partial tree to estimate our standing. Weaknesses: - Searches nodes that we don't need to search (addressed by alpha-beta pruning) - Time complexity is prohibitive for large problems - Assumes both players play optimally (not the case with chess!) Time complexity: O(b^m), where b is the branching factor and m is the maximum depth of the tree. visualization: https://thimbleby.gitlab.io/algorithm-wiki-site/wiki/minimax_search/
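A short Python sketch of minimax under an assumed game interface (is_terminal, utility, moves, result, and evaluate are placeholder names, not a real library):

def minimax(state, game, maximizing, depth, evaluate):
    # Depth-limited minimax for a generic 2-player zero-sum game.
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return evaluate(state)                 # static evaluation at the search horizon
    values = (minimax(game.result(state, m), game, not maximizing, depth - 1, evaluate)
              for m in game.moves(state))
    return max(values) if maximizing else min(values)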
Arc Consistency
All values in a variable's domain satisfy all relevant binary constraints. Basic steps: 1. Create a queue with all the arcs in the CSP. 2. While the queue is not empty: 3. Pop an arc (Xi, Xj) from the queue. 4. Remove any value from the domain of Xi that has no supporting value in the domain of Xj. 5. If any values were removed from the domain of Xi, add all arcs (Xk, Xi) pointing at Xi back to the queue. If arc consistency is enforced on a CSP and every domain is reduced to a single value, that assignment is the unique solution. If some domains still contain multiple values, a search still needs to be performed. Let the maximum domain size be d and the number of arcs (binary constraints) be e; the worst-case complexity is O(e*d^3).
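A sketch of these steps (AC-3 style), assuming dictionaries for domains and neighbors and a binary constraint test; the data-structure shapes are assumptions:

from collections import deque

def ac3(variables, domains, neighbors, constraint):
    # constraint(Xi, x, Xj, y) -> True if x for Xi and y for Xj are compatible.
    queue = deque((Xi, Xj) for Xi in variables for Xj in neighbors[Xi])
    while queue:
        Xi, Xj = queue.popleft()
        removed = [x for x in domains[Xi]
                   if not any(constraint(Xi, x, Xj, y) for y in domains[Xj])]
        if removed:
            domains[Xi] = [x for x in domains[Xi] if x not in removed]
            if not domains[Xi]:
                return False                  # empty domain: the CSP is unsatisfiable
            for Xk in neighbors[Xi]:
                if Xk != Xj:
                    queue.append((Xk, Xi))    # re-check arcs pointing at Xi
    return True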
Alpha-beta
Alpha represents the value that the "max" player is guaranteed so far on a given branch; beta represents the value that the "min" player is guaranteed so far. The best score found is kept in mind while exploring nodes, and if a branch cannot improve on it, the branch is pruned. Ordering matters very much in these searches: the sooner the optimal move is found, the sooner unpromising branches are discarded. Time: with perfect ordering the runtime drops from O(b^m) to O(b^(m/2)), doubling the depth that can be searched in the same time as minimax. visualization: https://thimbleby.gitlab.io/algorithm-wiki-site/wiki/minimax_search_with_alpha-beta_pruning/
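A matching alpha-beta sketch, using the same assumed game interface as the minimax sketch above (not course code):

def alphabeta(state, game, depth, alpha, beta, maximizing, evaluate):
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for m in game.moves(state):
            value = max(value, alphabeta(game.result(state, m), game,
                                         depth - 1, alpha, beta, False, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                          # beta cutoff: min will never allow this branch
        return value
    else:
        value = float("inf")
        for m in game.moves(state):
            value = min(value, alphabeta(game.result(state, m), game,
                                         depth - 1, alpha, beta, True, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break                          # alpha cutoff: max already has something better
        return value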
Value Iteration
An algorithm for calculating an optimal policy. The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state. https://www.youtube.com/watch?v=KovN7WKI9Y0
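A short value-iteration sketch, assuming an MDP stored in the dictionary form used in the MDP sketch above (states, actions, P, R, and the discount gamma are assumptions):

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    # P[(s, a)] -> list of (s_next, prob); R[(s, a, s_next)] -> immediate reward.
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue                       # terminal state keeps utility 0 here
            best = max(sum(p * (R[(s, a, s2)] + gamma * U[s2]) for s2, p in P[(s, a)])
                       for a in actions[s])
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < eps:
            return U

The optimal policy can then be read off by taking, in each state, the action that maximizes the same expected value.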
Greedy Best First Search
Complete (iff loops are handled) Not optimal (returns the first solution found, which may not be the cheapest) Time O(b^m) Space O(b^m) An informed search: it uses h(n), the estimated cost/distance from n to the goal, and expands the node that appears closest to the goal first. Typically implemented with a priority queue ordered by h(n).
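A sketch using a heap ordered by h(n) only; the neighbors/h interface is assumed, and node labels are assumed hashable and comparable (e.g., strings):

import heapq

def greedy_best_first(start, goal, neighbors, h):
    # Frontier ordered only by the heuristic h(n); g(n) is ignored, which is
    # why the result is not guaranteed to be the cheapest path.
    frontier = [(h(start), start, [start])]
    visited = {start}
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt in neighbors(node):
            if nxt not in visited:             # loop handling
                visited.add(nxt)
                heapq.heappush(frontier, (h(nxt), nxt, path + [nxt]))
    return None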
DFS
Complete (iff search space is finite & we handle loops) Not optimal (returns first solution) Time O(b^m) Space O(bm) b is max number of children for any node m is depth of search space in the worst case Visualization: https://thimbleby.gitlab.io/algorithm-wiki-site/wiki/depth-first_search/
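A recursive sketch (neighbors is an assumed successor function); checking only the current path for repeats keeps space at O(bm) rather than storing every visited node:

def dfs(state, goal, neighbors, path=None):
    # Depth-first search that avoids loops along the current branch only.
    path = path or [state]
    if state == goal:
        return path
    for nxt in neighbors(state):
        if nxt not in path:
            result = dfs(nxt, goal, neighbors, path + [nxt])
            if result is not None:
                return result
    return None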
A*
Complete (iff there are a finite number of nodes with f(n) less than or equal to the optimal cost) Optimal (given an admissible heuristic; a consistent heuristic for graph search) Time O(# nodes where f(n) < optimal cost) Space O(# nodes where f(n) < optimal cost) Essentially combines Uniform Cost with Greedy Best First - it keeps track of how far we've travelled and an estimate of how far remains. An INFORMED SEARCH, since it uses an evaluation function f(n) = g(n) + h(n): total = distance travelled so far + estimated remaining distance. A* is optimally efficient for any given consistent heuristic. That is, no other optimal algorithm is guaranteed to expand fewer nodes than A*.
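A sketch with f(n) = g(n) + h(n); here neighbors(n) is assumed to yield (successor, step_cost) pairs and node labels are assumed hashable and comparable:

import heapq

def a_star(start, goal, neighbors, h):
    frontier = [(h(start), 0.0, start, [start])]     # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):   # keep only the cheapest route found so far
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None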
BFS
Complete (if the branching factor is finite) Optimal when all step costs are equal (returns the shallowest solution) Time O(b^d) Space O(b^d) b is max number of children for any node d is depth of the shallowest solution node Visualization: https://thimbleby.gitlab.io/algorithm-wiki-site/wiki/breadth-first_search/
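A sketch with a FIFO frontier (neighbors is an assumed successor function):

from collections import deque

def bfs(start, goal, neighbors):
    # Nodes are expanded in order of depth, so the first solution found
    # is the shallowest one.
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in neighbors(node):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None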
Resource Constraints
Constraining factors (normally physical, like time or space)
Iterative deepening
Perform searches of increasing depth, using the timing of previous iterations to decide whether it is reasonable to search to the next depth. Store the best move found so far and search it first in the next, deeper iteration (which also improves move ordering).
Higher Order Constraints
Constraints involving three or more variables. Examples: Between(X, Y, Z) means Y is between X and Z; Alldiff(A, B, C, D, E) means all of the variables must take distinct values.
Games: Stochastic
Examples: poker, Monopoly (element of chance)
Games: Deterministic
Examples: chess, checkers (no chance, completely controlled)
Uniform Cost Search
Expands the node n with the lowest path cost g(n) (if all step costs are equal, this is identical to breadth-first search) Complete (iff the branching factor is finite and step costs are at least some epsilon > 0) Optimal (the first goal node selected for expansion is reached by a lowest-cost path) Time: O(b^(1 + C*/e)), where C* is the cost of the optimal solution and e is the smallest action cost Space: O(b^(1 + C*/e))
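A sketch with the frontier ordered by path cost g(n); neighbors(n) is assumed to yield (successor, step_cost) pairs and node labels are assumed hashable and comparable:

import heapq

def uniform_cost_search(start, goal, neighbors):
    frontier = [(0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g                       # first goal popped has the optimal cost
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2, nxt, path + [nxt]))
    return None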
Exploration v. Exploitation (pros/cons)
Exploration: take a new action with unknown consequences. Pros: get a more accurate model of the environment; may uncover higher-value states than you have already found. Cons: might not be maximizing utility; something bad might happen. Exploitation: go with the best strategy you've found so far. Pros: maximize reward as reflected in the current utility estimates; avoid bad outcomes. Cons: might prevent you from finding the true optimal strategy.
Limited Search
Getting a good evaluation function is key. The simplest evaluation functions are a weighted linear sum of features. Deep Blue's evaluation function used roughly 8,000 features.
Dominance
If h1 and h2 are both admissible and h2(n) >= h1(n) for all n, then h2 dominates h1. If h2 dominates h1, A* will expand no more nodes with h2 than with h1. We want as granular a measure of goodness/badness as we can get; a higher estimate (that still does not overestimate) gives more precise information.
Local Policy Search
In some ways, policy search is the simplest of the reinforcement learning methods: keep twiddling the policy as long as its performance improves, then stop. Start with an initial (possibly random) policy. Remember that a policy pi is a function that maps states to actions; we are interested primarily in parameterized representations of pi that have far fewer parameters than there are states in the state space. For example, we could represent pi by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value: pi(s) = the action a maximizing Q(s, a), where each Q-function depends on features of the state (in Pac-Man: whether Pac-Man is in a tunnel, dots eaten, etc.). Policy search needs an accurate model of the domain before it can find a policy.
Model-based v. Model-free RL
Model-based: learns/uses a transition model P and a utility function U - e.g., a passive agent that knows the policy and estimates the transitions and rewards for each state. Model-free: does not learn a transition model; it learns only utility information, via an action-utility function Q. Policy-search methods operate directly on a representation of the policy, attempting to improve it based on observed performance. The variation in performance in a stochastic domain is a serious problem; for simulated domains this can be overcome by fixing the randomness in advance.
With X assumptions, the solution to any problem is a fixed sequence of actions
Observable, discrete, known, and deterministic. Example: the Bucharest-to-Arad route-finding problem.
Passive v. active RL
Passive: KNOWS the policy and wants to determine the utility function for each state. Active: knows which actions are available in each state and wants to determine the best policy (so it must also decide which actions to take, trading off exploration and exploitation).
Games: Imperfect Information
Examples: poker (partially observable), Battleship. Raises the question: how do we handle partially observable games? (See Monte Carlo simulation.)
Constraint Satisfaction Problem
Problem defined as variables with domains, and constraints on the variables
Iterative Deepening Depth First Search
Repeated depth-limited DFS with increasing maximum depth: first 1, then 2, and so on. Complete (if the branching factor is finite). Optimal when all step costs are equal (returns the shallowest solution, like BFS). Time is that of BFS, space is that of DFS: Time O(b^d), Space O(bd).
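A sketch combining a depth-limited DFS with the outer deepening loop (the neighbors successor function and max_depth cap are assumptions):

def depth_limited(state, goal, neighbors, limit, path):
    if state == goal:
        return path
    if limit == 0:
        return None
    for nxt in neighbors(state):
        if nxt not in path:                      # avoid loops along the current branch
            found = depth_limited(nxt, goal, neighbors, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None

def iterative_deepening(start, goal, neighbors, max_depth=50):
    # Repeated depth-limited DFS with limits 1, 2, 3, ...: BFS-like shallowest
    # solutions with DFS-like O(bd) space.
    for limit in range(1, max_depth + 1):
        found = depth_limited(start, goal, neighbors, limit, [start])
        if found is not None:
            return found
    return None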
Unary Constraint
Restricts the value of a single variable
Consistency (of A* heuristic)
The heuristic must satisfy the triangle inequality: the estimated cost of reaching the goal from n can be no greater than the step cost of getting from n to a successor n' plus the estimated cost of reaching the goal from n', i.e. h(n) <= c(n, a, n') + h(n'). Every consistent heuristic is also admissible.
Monte Carlo Simulation
Simulate many games, each time randomly assigning the unobservable information, and estimate your chances over many random outcomes ("I win in more sampled worlds with this move than with that one"). Works for some games, but not for the betting and bluffing in poker, where opponents' bets carry signals that pure random sampling ignores.
Preference Constraints
Soft constraints that express preferences rather than hard requirements. Example: try not to schedule a professor in the afternoon, but don't make it impossible to do so.
Horizon Effect
Something bad is going to happen, but it lies just beyond the horizon of the search (example: doing an iterative-deepening search for the best move with a maximum depth of 6 when disaster is at depth 7). Even worse, the bad outcome may already be visible, but inconsequential delaying moves can be used to push it beyond the horizon of the search.
Policy Iteration/Evaluation v. Passive RL
The main difference is that the passive learning agent does not know the transition model P(s' | s, a), which specifies the probability of reaching state s' from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state. The RL agent is more "blind": it doesn't need a transition model or probabilities, just the utility it observes from following the policy.
How do we find the utility of each state?
The utility of each state equals its own reward plus the (discounted) expected utility of its successor states - the Bellman equation, written out below.
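Written out with the per-state reward R(s) this card uses and a discount factor gamma (the discount factor is an assumption, not stated on the card):

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

With the R(s, a, s') reward form from the MDP card, the reward simply moves inside the sum.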
Transposition tables
There are often multiple paths that lead to the same position, and there is no need to search that position again. Store a hash of the position along with the value and the depth of the search where that position was previously evaluated. When processing a node in the search, look it up in the transposition table and, if it exists (at sufficient depth), use its stored value. Example: endgame positions in chess - once a board has a recognizable, already-solved endgame, we can stop searching.
Move Ordering
Try to order the moves so that the best move is processed first. Unfortunately this is domain dependent and even situation dependent. For alpha-beta pruning, a shallow (breadth-first) look at the branches can be used to guide ordering; unfortunate ordering can result in processing every single node.
most constrained variable
(Used to improve backtracking) Also called the minimum-remaining-values (MRV) heuristic: pick the variable with the fewest legal values left in its domain. It selects the variable most likely to cause a failure soon; if some variable has no legal values left, MRV picks that variable right away and detects the failure immediately.
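A one-function sketch of MRV (the variables/domains/assignment shapes are assumptions; domains here are the current pruned domains):

def mrv_variable(variables, domains, assignment):
    # Minimum-remaining-values: among unassigned variables, pick the one with
    # the fewest legal values left ("fail first").
    unassigned = [v for v in variables if v not in assignment]
    return min(unassigned, key=lambda v: len(domains[v]))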
When does hill climbing work best?
When there is only one solution (a single optimum, with no local optima to get trapped in). Warning: flat regions can feel like progress - it's easy to get stuck on shoulders and plateaus.
Forward Checking
One of the simplest forms of inference. Whenever a variable X is assigned, the forward-checking process establishes arc consistency for it: for each unassigned variable Y connected to X by a constraint, delete from Y's domain any value inconsistent with the value chosen for X. If any variable ends up with an empty domain, no solution can be found with the current assignments and we must backtrack. There is no reason to do forward checking if we have already done arc consistency as a preprocessing step.
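A sketch of forward checking after assigning X := value, assuming list-valued domains and a binary constraint test (all names are hypothetical):

def forward_check(X, value, domains, neighbors, constraint, assignment):
    # Prune inconsistent values from each unassigned neighbor Y of X;
    # report failure if any domain becomes empty.
    pruned = []
    for Y in neighbors[X]:
        if Y in assignment:
            continue
        for y in list(domains[Y]):
            if not constraint(X, value, Y, y):
                domains[Y].remove(y)
                pruned.append((Y, y))            # remember prunings so they can be undone
        if not domains[Y]:
            return False, pruned                 # dead end: backtrack and restore prunings
    return True, pruned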
Uninformed search algorithms
Are given no information about the problem other than its definition. They can find a solution to any solvable problem, but not necessarily efficiently. Examples: BFS, DFS, uniform cost, iterative deepening.
Discrete Environment
At any given state there are only finitely many actions to choose from. Example: poker (broken into discrete moves). Counterexample: taxi driving (time itself is possibly continuous). Antonym: continuous - there are infinitely many options, or the search space is infinitely granular.
Time and space complexity variables b, m, and d
b - Max successors/children of ANY node d - is the depth of the SHALLOWEST GOAL node m - is the maximum length of ANY path in the state space
Informed search algorithms
can do "quite well" given some guidance on where to look for solutions. May have access to a heuristic function that estimates the cost of a solution from n. examples: A* and greedy first
Minimax and alpha-beta algorithms work on games with these features
deterministic, perfect information (fully observable), two-player, zero-sum
Deterministic environment
Each action has exactly one outcome. Examples: chess, crosswords. Counterexamples: poker, taxi driving. Antonym: stochastic (like the agent in the maze with 0.8 probability of moving forward, 0.1 left, 0.1 right).
Binary Constraint
relates the values of two variables. Example: A(value) != B(value)