NLP

By Zhaoyu Luo

2014-09-11

2014-09-16

2014-09-18

Quotes Extraction

methods

  • regular expressions & pattern matching
  • machine learning
    • citizen-quotes uses [NLTK Maximum Entropy classifiers](http://www.nltk.org/_modules/nltk/classify/maxent.html) as its ML classifier
      • this is supervised learning, so it needs labeled training paragraphs
      • it uses six features. Each feature is a function that takes the paragraph text as input and returns either a boolean or a categorical value.
        1. whether the paragraph contains common attribution words (said, asked, etc.)
        2. whether it contains quote marks
        3. whether a common attribution word appears within five words of a quote mark (“I love tacos,” Smith said.)
        4. how many words appear inside the quotes (helps deal with the “air quotes” problem)
        5. which five words fall immediately after a closing quote
        6. what the last word in the paragraph is
      • maxent learns which features are most useful and weights them accordingly, then scores the input paragraph
  • Google In Quotes was an experimental tool that has since been shelved
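
The six features above can be sketched as one function that maps a paragraph to a feature dictionary. This is only a rough illustration, not the citizen-quotes code: the attribution word list, the "none/few/many" buckets, the choice to look after the last closing quote, and every function name here are my own assumptions.

```python
import re

# Hypothetical word list: the notes only name "said" and "asked".
ATTRIBUTION_WORDS = {"said", "asked", "says", "told", "added"}
QUOTE_MARKS = {'"', "\u201c", "\u201d"}

# A quoted span: opening mark, text without quote marks, closing mark.
QUOTE_RE = re.compile(r"[\"\u201c]([^\"\u201c\u201d]*)[\"\u201d]")
# Tokens are either a single quote mark or a run of word characters.
TOKEN_RE = re.compile(r"[\"\u201c\u201d]|[A-Za-z0-9']+")


def tokenize(text):
    """Lowercased words and quote marks, in order; other punctuation is dropped."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def words_only(text):
    """Lowercased words with the quote marks filtered out."""
    return [t for t in tokenize(text) if t not in QUOTE_MARKS]


def extract_features(paragraph):
    """Return the six features as booleans or categorical strings."""
    tokens = tokenize(paragraph)
    words = [t for t in tokens if t not in QUOTE_MARKS]
    quote_positions = [i for i, t in enumerate(tokens) if t in QUOTE_MARKS]

    # Feature 3: attribution word within five tokens of any quote mark
    # (approximate, since quote marks themselves count as tokens).
    near = any(
        t in ATTRIBUTION_WORDS
        for i in quote_positions
        for t in tokens[max(0, i - 5): i + 6]
    )

    # Feature 4: bucket the quoted word count (thresholds are assumptions).
    spans = list(QUOTE_RE.finditer(paragraph))
    quoted_words = sum(len(words_only(m.group(1))) for m in spans)
    if quoted_words == 0:
        bucket = "none"
    elif quoted_words <= 2:
        bucket = "few"  # likely "air quotes"
    else:
        bucket = "many"

    # Feature 5: up to five words after the last closing quote (assumption).
    after = words_only(paragraph[spans[-1].end():])[:5] if spans else []

    return {
        "has_attribution_word": any(w in ATTRIBUTION_WORDS for w in words),
        "has_quote_mark": bool(quote_positions),
        "attribution_near_quote": near,
        "quoted_word_count": bucket,
        "words_after_quote": " ".join(after),
        "last_word": words[-1] if words else "",
    }


features = extract_features("\u201cI love tacos,\u201d Smith said.")
```

A list of `(features, label)` pairs built this way is the general shape NLTK classifiers expect as training data, so dictionaries like this one could be fed to the maxent classifier once each paragraph has a hand-assigned label.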

Coreference resolution

Named-entity recognition

Terms to Know

2014-10-23