AI – 希言自然

Course Main Topics

Learning from Data
View problem solving as heruristic search through a large space of possible solutions
Use probabilistic reasoning and learning
Use logical reasoning – representing knowledge about the world in computers
Neural networks
Philosophical Foundation of AI

Class Mottos so far

Generalize do not memorize
Let the data decide

ID3 Algorithms

no examples left, return majority class at parent as leaf
- all examples same category, return leaf with that category
- no features left, returning majority class in examples as leaf

base cases

overfitting
Decision-forests
Pruning decision-trees

recursive case

choose the best feature and use it label an interior node
recur on each possible value of the chosen feature

General ML issues covered

supervised ML, e.g. examples have labels
fixed-length feature vectors, 99.9% pratical way
info theory/ gain/ needed/ remaining
expected_value calculations
overfitting reduction
TRAIN/ TUNE/ TEST sets
Ensemble
- Why not learn multiple models? Which works well and is work
- Why not score only a small subset of the candidate features
- bagging
- boosting
learning curve

Search

fundamental problem-solving method used to create intelligence behavior (heuristics are where “intelligence” comes in)
basic idea in search
- OPEN holds alternates
- key questions: which node to consider next?
- CLOSED: deal with OO-loops
- sometimes arcs have cost and we want least-cost solution
- OPEN & CLOSED can grow rapidly, so we need to deal with that
- remember search space built as we go
tree traversal
beam search
iterative deepening depth-first search
best-first search

hill climbing

weakness: stops at local max

Why stop if not at a GOAL?

if we really need a goal, then use beam search with beam width=1
but some tests do not have explicit goal states, and so we want to max h
if we can generate multiple start nodes, then an excellent algo is to do hill climbing N times and keep best(though not a guarantee)
what if we only have 1 start state?
- could make k random moves, then do hill climbing
- or could sometimes accept bad moves – simulated annealing

simulated annealing

    // Assume *h* higher is better
    current = start state
    for time = 1 to maxTime
        temp = coolingSchedule(time) // as time higher, temp lower
        if temp == 0:
            return current
        next = randomChildOf(current)
        delta_h = h(next) - h(current)
        if delta_h >= 0:
            current = next // move uphill
        elif random() <= e ** (delta_h / temp):
            current = next // sometimes accept **bad** moves
    return current

    if abs(delta_h) >> temp:
        acceptanceProbability == 0
    if abs(delta_h) << temp:
        acceptanceProbability == 1
    if abs(delta_h) ~~ temp:
        acceptanceProbability == 1/e

Tradeoffs

HC/Beam Search/Simmulated Annealing
- OPEN can get large, so we limit size (sometimes just to 1)
- might throw best (or even all) goal nodes
BFS/DFS/Best/Iterative Deepening

Genetic algorithm

it is a framework more than an algorithm
1. choose a knowledge representation and randomly create an initial population of entities
  - evaluate each entity (the “fitness function”)
  - discard N% (e.g. 50%)
  - stochastically grab best parents and combine them (“cross over”) to create new entities
  - do some mutation (“bit flips”)
  - goto 1
premature convergence
- are population might lose diversity, i.e. consist of a lot of very similar entities
- imagine A + child(A) are crossed over, then children will be 75% similar to A
Genetic Programming: data structures are something other than bit strings
GA as Searching a Space

GA as manipulating bit strings

assume we have a knowledge representation that uses k bits per entity. e.g.
- a 101101
- b 011011 - cross over: pick a random location in the bit string, then copy first portion[0:3] from parent a and the second portion[4:] from parent b:
- d 011001
- c 101111 - mutation, randomly flip j bits of some children

Fitness Function

task specified
e.g. play M games of chess, fitness = fraction won; evaluate bit string an training set, fitness = accuracy
higher better
we will assume non-negative fitness values
- if fitnesses < 0, transform using new_fitness = e ** fitness
fitness-proportional reproduction
- probability of selecting entity i with fitness *Fi* = Fi / sum(Fj)

GA for supervised machine learning

assume we wish to learn models of the form

if sum(Weight_d * Value(Feature_d)) >= threshold:
    then predict TRUE
    else predict FALSE

Our bit string will have d+1 (d is the number of features) real represented as P-bits each

Building-block hypothesis

if local neighbouring bits encode goal “subroutines” then GA can find good combination of subroutines
GA is likely to work better when the fitness functions “rewards”/recognizes valuable subskills

Interesting idea: let the fitness function eval as well, called coevaluation

Another GA Ex-chess playing

learn the SBE
fitness: fraction of games won against random opponents

Game

costs on arcs
game playing

First Half Recap

Learn from data
Search for solution
Projecting possible futures
Simulate biology/physics to get intelligent algorithms: genetic algorithm, simulated annealing, neural networks
d-forests: decreased error rate by 25% (by 8 percentage points)
- Assume tax rate is 5%, it increases by 1%, what is new tax rate?

Probability

Marginal distribution
Naive Bayes classifier
M-estimator

Odds ratio

OR = P(D

S1…Sn) / P(^D

S1…Sn) = Pi(P(Si

D)) * P(D) / (Pi(P(Si

^D)) * P(^D)) = Pi(P(Si

D) / P(Si

^D)) * P(D) / P(^D)

Some Rules of Probability

0 <= P(A) <= 1

P(true) == 1, P(false) == 0
P(A B) = P(A) + P(B) - P(A^B)
P(A^B) = P(B A) * P(A) = P(A B) * P(B)
P(^A) = 1 - P(A)
If A and B are independent events, P(A^B) = P(A) * P(B)

conditional independence: P(A	B^C) = P(A	C)
- P(A^B	C) = P(A	B^C) * P(B	C)

if conditional independece: P(A^B C) = P(A C) * P(B C)

multi-symptom: P(D

S1, S2…Sn) = P(S1, S2…Sn

D) * P(D) / P(S1, S2…Sn)

P(D

S1, S2…Sn) = Pi(P(Si

D)) * P(D) / [Pi(P(Si

D)) * P(D) + Pi(P(Si

^D)) * P(^D)]

How To Build a Naive Bayes Classifier

Basic Idea

create probabilities that only involve AND and NOT
“AND in” all remaining variables
lookup each full specified world states
do the arithmetic
these full joint probability tables can easily become too large, so we cannot do table lookup and need another method
- Bayesian Networks could fill every cell in the full joint probability table
- Full Joint has 2 ** (N+1) cells
- Naive Bayesian has 2 * N + 1 cells
  - Naive Bayesian can fill the full joint table

Bayesian

an acyclic, directed graph where nodes are our random variables and arcs represent direct dependencies
P(A, ^B, ^C, D, E) = P(A) * (1 - P(B)) * (1 - P(C A and ^B)) * P(D) * P(E ^C and D)
- A, B are independent and parents of C; C, D are parents of E
P(A, ^B, C, D) = P(A, ^B, C, D, ^E) + P(A, ^B, C, D, E)
P(B ^A, C, ^D, E) = P(^A, B, C, ^D, E) / P(^A, C, ^D, E)
P(A B) = P(B A) * P(A) / P(B)

where does the graph structure come from?

from expert knowledge
from search (using some data)

Some Java for Bayesian Network

int s1_win[] = new int[3]
int s2_win[] = new int[3]
// 10 dimensional domains

Going Naive Bayes

D is the parent of Si, and S2 is the parent of S1

P(S1

D^S2) * Pi(P(Si

D)) * P(D), i >= 2

since P(A^B

C) = P(A

B^C) * P(B

C) # conditional independence

so = P(S1^S2 D) * Pi(P(Si D)) * P(D), i >= 3

Full Joint Probability as a Bayesian Network

Markov Blanket

A node in a BN is conditionaly independent of all other node given its Markov Blanket
- its parent, its children and its children’s other parents

Assume we have ran variables A to Z, we can answer any query of the form (unnamed ran variables are hiddent, intently) P(A^!H^!M | !B^D^Q^!Z) use marginalization rules and P(A|B) = P(AB) / P(B)