Course Main Topics
- Learning from Data
- View problem solving as heuristic search through a large space of possible solutions
- Use probabilistic reasoning and learning
- Use logical reasoning – representing knowledge about the world in computers
- Neural networks
- Philosophical Foundation of AI
Class Mottos so far
- Generalize, do not memorize
- Let the data decide
ID3 Algorithm (a code sketch follows this outline)
- base cases
  - no examples left: return the majority class at the parent as a leaf
  - all examples in the same category: return a leaf with that category
  - no features left: return the majority class of the examples as a leaf
- recursive case
  - choose the best feature and use it to label an interior node
  - recur on each possible value of the chosen feature
- overfitting
  - Decision forests
  - Pruning decision trees
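A minimal Python sketch of the outline above, assuming examples are dicts of feature values plus a "label" key; majority_label and the information_gain scorer (sketched after the next list) are illustrative names, not the course's.

from collections import Counter

def majority_label(examples):
    # most common label among the given examples
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def id3(examples, features, parent_examples):
    # base case: no examples left -> majority class at the parent, as a leaf
    if not examples:
        return majority_label(parent_examples)
    # base case: all examples in the same category -> leaf with that category
    labels = {e["label"] for e in examples}
    if len(labels) == 1:
        return labels.pop()
    # base case: no features left -> majority class of these examples, as a leaf
    if not features:
        return majority_label(examples)
    # recursive case: choose the best feature and label an interior node with it
    best = max(features, key=lambda f: information_gain(examples, f))
    node = {"feature": best, "children": {}}
    # recur on each possible value of the chosen feature
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        node["children"][value] = id3(subset, [f for f in features if f != best], examples)
    return node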
General ML issues covered
- supervised ML, i.e. examples have labels
- fixed-length feature vectors, the practical choice in the overwhelming majority (99.9%) of cases
- information theory: gain = info needed - info remaining (see the sketch after this list)
- expected_value calculations
- overfitting reduction
- TRAIN/ TUNE/ TEST sets
- Ensemble
- learning curve
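A small sketch of the gain / needed / remaining bookkeeping referenced in the list above: entropy is the "info needed", the remainder is an expected-value calculation over a feature's values, and information_gain is the scorer the ID3 sketch above assumes.

import math
from collections import Counter

def entropy(examples):
    # info needed to classify the examples: -sum p * log2(p) over the labels
    counts = Counter(e["label"] for e in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def remainder(examples, feature):
    # info still needed after splitting on the feature (an expected-value calculation)
    total = len(examples)
    rem = 0.0
    for value in {e[feature] for e in examples}:
        subset = [e for e in examples if e[feature] == value]
        rem += (len(subset) / total) * entropy(subset)
    return rem

def information_gain(examples, feature):
    # gain = info needed - info remaining
    return entropy(examples) - remainder(examples, feature)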
Search
- fundamental problem-solving method used to create intelligent behavior (heuristics are where the “intelligence” comes in)
- basic idea in search (a generic sketch follows this list)
- OPEN holds the alternatives still to be considered
- key question: which node to consider next?
- CLOSED: deals with infinite loops (avoids re-expanding states we have already seen)
- sometimes arcs have costs and we want a least-cost solution
- OPEN & CLOSED can grow rapidly, so we need to deal with that
- remember: the search space is built as we go
- tree traversal
- beam search
- iterative deepening depth-first search
- best-first search
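A generic Python sketch of the OPEN/CLOSED skeleton described in the list above. is_goal, children_of, and h are problem-specific stand-ins; here lower h is treated as better and OPEN is a priority queue, which is what makes this best-first rather than BFS, DFS, or beam search.

import heapq
from itertools import count

def best_first_search(start, is_goal, children_of, h):
    tie = count()                                # tie-breaker so the heap never compares states
    open_heap = [(h(start), next(tie), start)]   # OPEN holds the alternatives still to try
    closed = set()                               # CLOSED remembers expanded states, so loops don't trap us
    while open_heap:
        # key question: which node to consider next?  best-first answers: the lowest h
        _, _, node = heapq.heappop(open_heap)
        if is_goal(node):
            return node
        if node in closed:
            continue
        closed.add(node)
        # the search space is built as we go: children are generated on demand
        for child in children_of(node):
            if child not in closed:
                heapq.heappush(open_heap, (h(child), next(tie), child))
    return None  # OPEN exhausted without reaching a goal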
hill climbing
weakness: stops at local max
Why stop if not at a GOAL?
- if we really need a goal, then use beam search with beam width=1
- but some tasks do not have explicit goal states, and so we just want to maximize h
- if we can generate multiple start nodes, then an excellent approach is to do hill climbing N times and keep the best (sketched after this list), though this is still not a guarantee
- what if we only have 1 start state?
- could make k random moves, then do hill climbing
- or could sometimes accept bad moves – simulated annealing
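A rough sketch of hill climbing with random restarts, as described above. random_start_state, children_of, and h are problem-specific placeholders, and h here is "higher is better", matching the simulated-annealing pseudocode that follows.

def hill_climb(start, h, children_of):
    current = start
    while True:
        # look at the neighbours; stop when none improves on the current state (a local max)
        best_child = max(children_of(current), key=h, default=None)
        if best_child is None or h(best_child) <= h(current):
            return current
        current = best_child

def random_restart_hill_climb(n, random_start_state, h, children_of):
    # run hill climbing N times from random starts and keep the best result
    # (helps with local maxima, though it is still not a guarantee)
    results = [hill_climb(random_start_state(), h, children_of) for _ in range(n)]
    return max(results, key=h)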
simulated annealing
# Assume h: higher is better; coolingSchedule, randomChildOf, and h are problem-specific
import math
import random

def simulated_annealing(start_state, maxTime):
    current = start_state
    for time in range(1, maxTime + 1):
        temp = coolingSchedule(time)  # as time grows, temp drops
        if temp == 0:
            return current
        next_state = randomChildOf(current)
        delta_h = h(next_state) - h(current)
        if delta_h >= 0:
            current = next_state  # move uphill
        elif random.random() <= math.exp(delta_h / temp):
            current = next_state  # sometimes accept **bad** (downhill) moves
    return current
For a downhill move (delta_h < 0), the acceptance probability is e ** (delta_h / temp):
- if abs(delta_h) >> temp: acceptance probability ≈ 0
- if abs(delta_h) << temp: acceptance probability ≈ 1
- if abs(delta_h) ≈ temp: acceptance probability ≈ 1/e
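A made-up plug-in example of that behaviour: for a downhill move with delta_h = -2, the acceptance probability is e ** (-2 / 20) ≈ 0.90 while the temperature is still 20, but e ** (-2 / 0.5) ≈ 0.02 once it has cooled to 0.5, so bad moves are accepted freely early on and almost never near the end.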
Tradeoffs
- HC / Beam Search / Simulated Annealing
- OPEN can get large, so we limit its size (sometimes to just 1)
- might throw away the best (or even all) goal nodes
- BFS/DFS/Best/Iterative Deepening
Genetic algorithm
- it is a framework more than an algorithm
- choose a knowledge representation and randomly create an initial population of entities
- evaluate each entity (the “fitness function”)
- discard N% (e.g. 50%)
- stochastically grab best parents and combine them (“cross over”) to create new entities
- do some mutation (“bit flips”)
- go back to the evaluation step and repeat (a bit-string sketch of the whole loop follows the next subsection)
- premature convergence
- our population might lose diversity, i.e. come to consist of many very similar entities
- imagine A and child(A) are crossed over; their children will be 75% similar to A
- Genetic Programming: data structures are something other than bit strings
- GA as Searching a Space
GA as manipulating bit strings
- assume we have a knowledge representation that uses k bits per entity, e.g.
  - a: 101101
  - b: 011011
- cross over: pick a random location in the bit string, then copy the first portion (here bits [0:4]) from one parent and the rest ([4:]) from the other:
  - c: 101111 (first four bits of a, last two of b)
  - d: 011001 (first four bits of b, last two of a)
- mutation: randomly flip j bits of some children
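A compact Python sketch of the whole loop on k-bit strings, pulling together the framework steps above. The population size, survival fraction, and the toy "count the 1 bits" fitness are illustrative choices; parents are drawn uniformly from the survivors here, and the fitness-proportional version is sketched after the Fitness Function list.

import random

def crossover(a, b, point=None):
    # pick a random location, copy the first portion from one parent and the rest from the other
    if point is None:
        point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, j=1):
    # randomly flip j bits of a child
    bits = list(bits)
    for i in random.sample(range(len(bits)), j):
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

def genetic_algorithm(fitness, k=6, pop_size=20, generations=50, keep=0.5):
    # randomly create an initial population of k-bit entities
    population = ["".join(random.choice("01") for _ in range(k)) for _ in range(pop_size)]
    for _ in range(generations):
        # evaluate each entity and discard the worst (e.g. 50%)
        population.sort(key=fitness, reverse=True)
        survivors = population[: int(len(population) * keep)]
        # grab parents from the survivors, cross them over, and do some mutation
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            children.append(mutate(crossover(a, b)))
        population = survivors + children
    return max(population, key=fitness)

# e.g. with the illustrative fitness "number of 1 bits", this tends toward "111111":
# best = genetic_algorithm(lambda bits: bits.count("1"))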
Fitness Function
- task specific
- e.g. play M games of chess, fitness = fraction won; or evaluate the bit string on the training set, fitness = accuracy
- higher is better
- we will assume non-negative fitness values
- if fitnesses < 0, transform using
new_fitness = e ** fitness
- fitness-proportional reproduction
- probability of selecting entity i with fitness *Fi* is Fi / sum(Fj)
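A small sketch of fitness-proportional ("roulette wheel") selection: the chance of picking entity i is Fi / sum(Fj), with fitnesses assumed non-negative as noted above.

import random

def select_fitness_proportional(population, fitnesses):
    # spin a roulette wheel whose slice for entity i has width Fi / sum(Fj)
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    running = 0.0
    for entity, f in zip(population, fitnesses):
        running += f
        if spin <= running:
            return entity
    return population[-1]  # guard against floating-point round-off

Python's random.choices(population, weights=fitnesses) gives the same behaviour in one call.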
GA for supervised machine learning
assume we wish to learn models of the form
if sum(Weight_i * Value(Feature_i)) >= threshold:
    predict TRUE
else:
    predict FALSE
Our bit string will hold d+1 numbers (the d weights plus the threshold, where d is the number of features), each represented as P bits
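A sketch of one way to realize that encoding, assuming (my simplification, not the notes') that each of the d weights and the threshold is stored as an unsigned P-bit integer inside one long bit string.

def decode(bits, d, p):
    # split the (d+1)*p bit string into d weights plus a threshold
    numbers = [int(bits[i * p:(i + 1) * p], 2) for i in range(d + 1)]
    return numbers[:d], numbers[d]

def predict(bits, feature_values, p):
    weights, threshold = decode(bits, len(feature_values), p)
    total = sum(w * v for w, v in zip(weights, feature_values))
    return total >= threshold  # TRUE/FALSE prediction of the encoded model

def fitness(bits, train_set, p):
    # train_set: list of (feature_values, label) pairs; fitness = accuracy on the training set
    correct = sum(predict(bits, x, p) == label for x, label in train_set)
    return correct / len(train_set)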
Building-block hypothesis
- if neighbouring bits encode good “subroutines”, then the GA can find good combinations of subroutines
- GA is likely to work better when the fitness functions “rewards”/recognizes valuable subskills
Interesting idea: let the fitness function evolve as well, called coevolution
Another GA example: chess playing
- learn the SBE (static board evaluator)
- fitness: fraction of games won against random opponents
Game
- costs on arcs
- game playing
First Half Recap
- Learn from data
- Search for solution
- Projecting possible futures
- Simulate biology/physics to get intelligent algorithms: genetic algorithms, simulated annealing, neural networks
- d-forests: decreased error rate by 25% (by 8 percentage points)
- Assume the tax rate is 5% and it increases by 1%: the new rate is 5.05%; if it instead increases by 1 percentage point, the new rate is 6% (percent vs. percentage points)
Probability
- Marginal distribution
- Naive Bayes classifier
- M-estimator
- Odds ratio:
OR = P(D | S1…Sn) / P(^D | S1…Sn) = Pi(P(Si | D)) * P(D) / (Pi(P(Si | ^D)) * P(^D)) = Pi(P(Si | D) / P(Si | ^D)) * P(D) / P(^D)
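A made-up plug-in example: with two symptoms whose likelihood ratios are P(S1 | D) / P(S1 | ^D) = 4 and P(S2 | D) / P(S2 | ^D) = 0.5, and prior odds P(D) / P(^D) = 0.1, the odds ratio is 4 * 0.5 * 0.1 = 0.2; since OR < 1, ^D is the more probable conclusion.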
Some Rules of Probability
- 0 <= P(A) <= 1
- P(true) == 1, P(false) == 0
- P(A v B) = P(A) + P(B) - P(A^B)
- P(A^B) = P(B | A) * P(A) = P(A | B) * P(B)
- P(^A) = 1 - P(A)
- conditional independence: P(A | B^C) = P(A | C)
- P(A^B | C) = P(A | B^C) * P(B | C)
- if conditionally independent: P(A^B | C) = P(A | C) * P(B | C)
- multi-symptom: P(D | S1, S2…Sn) = P(S1, S2…Sn | D) * P(D) / P(S1, S2…Sn)
- P(D | S1, S2…Sn) = Pi(P(Si | D)) * P(D) / [Pi(P(Si | D)) * P(D) + Pi(P(Si | ^D)) * P(^D)]
How To Build a Naive Bayes Classifier
Basic Idea
- create probabilities that only involve AND and NOT
- “AND in” all remaining variables
- look up each fully specified world state
- do the arithmetic
- these full joint probability tables can easily become too large, so we cannot do table lookup and need another method
- Bayesian Networks could fill every cell in the full joint probability table
- Full Joint has 2 ** (N+1) cells
- Naive Bayesian has 2 * N + 1 cells
- Naive Bayesian can fill the full joint table
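A rough Python sketch of filling in and using those 2 * N + 1 numbers (with N = 10 symptoms that is 21 numbers, versus 2 ** 11 = 2048 cells in the full joint). The pseudocount smoothing here is my stand-in for the m-estimate idea mentioned earlier, not the course's exact recipe.

def train_naive_bayes(examples, n_symptoms, m=1.0):
    # examples: list of (symptoms, has_disease) pairs, symptoms a list of N booleans
    pos = [s for s, d in examples if d]
    neg = [s for s, d in examples if not d]
    p_d = len(pos) / len(examples)  # P(D)
    # one P(Si | D) and one P(Si | ^D) per symptom: 2 * N + 1 numbers in total,
    # smoothed with a pseudocount m so unseen combinations never get probability 0
    p_s_d = [(sum(s[i] for s in pos) + m) / (len(pos) + 2 * m) for i in range(n_symptoms)]
    p_s_nd = [(sum(s[i] for s in neg) + m) / (len(neg) + 2 * m) for i in range(n_symptoms)]
    return p_d, p_s_d, p_s_nd

def prob_disease(model, symptoms):
    # the multi-symptom formula above: Pi(P(Si|D)) * P(D) / [same + Pi(P(Si|^D)) * P(^D)]
    p_d, p_s_d, p_s_nd = model
    num_d, num_nd = p_d, 1 - p_d
    for s, pd, pnd in zip(symptoms, p_s_d, p_s_nd):
        num_d *= pd if s else (1 - pd)
        num_nd *= pnd if s else (1 - pnd)
    return num_d / (num_d + num_nd)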
Bayesian Networks
- an acyclic, directed graph where nodes are our random variables and arcs represent direct dependencies
- A, B are independent and parents of C; C, D are parents of E:
P(A, ^B, ^C, D, E) = P(A) * (1 - P(B)) * (1 - P(C | A and ^B)) * P(D) * P(E | ^C and D)
- P(A, ^B, C, D) = P(A, ^B, C, D, ^E) + P(A, ^B, C, D, E)
- P(B | ^A, C, ^D, E) = P(^A, B, C, ^D, E) / P(^A, C, ^D, E)
- P(A | B) = P(B | A) * P(A) / P(B)
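A tiny Python sketch of those calculations for the five-variable network above (A, B parents of C; C, D parents of E). The CPT numbers are made up purely so the code runs; the conditional query is answered by summing joint entries, exactly as in the two formulas above.

# made-up CPTs, for illustration only
P_A, P_B, P_D = 0.3, 0.6, 0.5
P_C = {(True, True): 0.9, (True, False): 0.7, (False, True): 0.4, (False, False): 0.1}   # P(C | A, B)
P_E = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.3, (False, False): 0.05}  # P(E | C, D)

def joint(a, b, c, d, e):
    # P(A, B, C, D, E) factored along the network structure
    p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B) * (P_D if d else 1 - P_D)
    p *= P_C[(a, b)] if c else 1 - P_C[(a, b)]
    p *= P_E[(c, d)] if e else 1 - P_E[(c, d)]
    return p

# marginalize out E:  P(A, ^B, C, D) = P(A, ^B, C, D, ^E) + P(A, ^B, C, D, E)
p_marginal = joint(True, False, True, True, False) + joint(True, False, True, True, True)

# a conditional query:  P(B | ^A, C, ^D, E) = P(^A, B, C, ^D, E) / P(^A, C, ^D, E)
numerator = joint(False, True, True, False, True)
denominator = sum(joint(False, b, True, False, True) for b in (True, False))
p_conditional = numerator / denominator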
where does the graph structure come from?
- from expert knowledge
- from search (using some data)
Some Java for Bayesian Network
int[] s1_win = new int[3];
int[] s2_win = new int[3];
// 10 dimensional domains
Going Naive Bayes
- D is the parent of Si, and S2 is the parent of S1
- P(D, S1, …, Sn) = P(S1 | D^S2) * Pi(P(Si | D)) * P(D), for i >= 2
- since P(A^B | C) = P(A | B^C) * P(B | C)
- so = P(S1^S2 | D) * Pi(P(Si | D)) * P(D), for i >= 3
Full Joint Probability as a Bayesian Network
P(A^B^C^D) = P(A^B^C | D) * P(D) = P(A^B | C^D) * P(C|D) * P(D) = P(A | B^C^D) * P(B | C^D) * P(C|D) * P(D)
Markov Blanket
- A node in a BN is conditionally independent of all other nodes given its Markov Blanket
- its parents, its children, and its children’s other parents
Assume we have random variables A to Z; we can answer any query of the form P(A^!H^!M | !B^D^Q^!Z) (unnamed random variables are hidden, intentionally) using the marginalization rules and P(A|B) = P(A^B) / P(B)