Taming Text: How to Find, Organize, and Manipulate It

Grant S. Ingersoll

Language: English

Pages: 320

ISBN: 193398838X

Format: PDF / Kindle (mobi) / ePub


Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built.

Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Written for Java developers, the book requires no prior background in statistics or natural language processing.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. All code from the book is also available.

Winner of 2013 Jolt Awards: The Best Books—one of five notable books every serious programmer should read.

What's Inside

  • When to use text-taming techniques
  • Important open-source libraries like Solr and Mahout
  • How to build text-processing applications

About the Authors

Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

"Takes the mystery out of very complex processes." (From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University)

Table of Contents

  1. Getting started taming text
  2. Foundations of taming text
  3. Searching
  4. Fuzzy string matching
  5. Identifying people, places, and things
  6. Clustering text
  7. Classification, categorization, and tagging
  8. Building an example question answering system
  9. Untamed text: exploring the next frontier


Taming Text
HOW TO FIND, ORGANIZE, AND MANIPULATE IT

GRANT S. INGERSOLL
THOMAS S. MORTON
ANDREW L. FARRIS

MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2013 by Manning

setting a boost factor. For instance, it’s common to boost title Fields higher than regular content since matches in titles often yield better results.

  • multiValued: Allows for the same Field to be added multiple times to a document.
  • omitNorms: Effectively disables the use of the length of a field (the number of tokens) as part of the score. Used to save disk space when a Field doesn’t contribute to the score of a search.

Introducing the Apache Solr search server

Solr is slightly more
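The effect of a boost factor and of omitNorms, as described above, can be illustrated with a toy scorer. This is only a sketch under stated assumptions: the class FieldScoreSketch, its methods, and the formula (boost times term count times a 1/sqrt(token count) length factor) are hypothetical and chosen for illustration. They are not Solr's or Lucene's actual scoring code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy scorer; names and formula are assumptions, not the Solr/Lucene API.
class FieldScoreSketch {

    /** Counts how many tokens in the field exactly equal the query term. */
    static int termFrequency(List<String> tokens, String term) {
        int count = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Toy per-field score: boost * termFrequency * lengthFactor.
     * lengthFactor is 1/sqrt(number of tokens), so shorter fields score higher;
     * when omitNorms is true the field length is ignored and the factor is 1.0.
     */
    static double fieldScore(List<String> tokens, String term, double boost, boolean omitNorms) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        double lengthFactor = omitNorms ? 1.0 : 1.0 / Math.sqrt(tokens.size());
        return boost * termFrequency(tokens, term) * lengthFactor;
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("taming", "text");
        List<String> body = Arrays.asList(
                "a", "guide", "to", "taming", "unstructured", "text", "in", "real", "applications");

        // Title boosted 2x over the body; field length is used in both cases.
        System.out.println("title, boost 2.0: " + fieldScore(title, "text", 2.0, false));
        System.out.println("body,  boost 1.0: " + fieldScore(body, "text", 1.0, false));

        // With omitNorms the nine-token body is no longer penalized for its length.
        System.out.println("body,  omitNorms: " + fieldScore(body, "text", 1.0, true));
    }
}
```

In this sketch a match in the two-token title outscores the same match in the nine-token body twice over: once through the boost and once through the length factor. Setting omitNorms removes only the second effect.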

Set cast2) {
    int size = 0;
    for (String actor : cast1)
        if (cast2.contains(actor)) size++;
    return size;
}

Code annotations: Combine scores into single score. Compute intersection using exact string matching.

EVALUATING THE RESULTS

Let’s look at some examples using this approach. An example where you can see the benefits of combining multiple sets of data for matching is shown in tables 4.4 and 4.5.

Table 4.4, Table 4.5: Example of importance of combining multiple sets of data (column headings: Score, ID, Title, Year)
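Since the excerpt above begins partway through the listing, a self-contained sketch of the same idea may help: count cast members shared by two records using exact string matching, then fold that overlap into a single score alongside a title score. The class CastOverlapSketch, the normalization by the smaller cast, and the weighted sum in combinedScore are assumptions made for illustration. The excerpt does not show how the book actually combines the scores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; only the intersection loop mirrors the excerpt above.
class CastOverlapSketch {

    /** Number of actors present in both casts, using exact string matching. */
    static int intersectionSize(Set<String> cast1, Set<String> cast2) {
        int size = 0;
        for (String actor : cast1) {
            if (cast2.contains(actor)) {
                size++;
            }
        }
        return size;
    }

    /** Assumed normalization: shared actors divided by the smaller cast, giving a value in [0, 1]. */
    static double castScore(Set<String> cast1, Set<String> cast2) {
        int smaller = Math.min(cast1.size(), cast2.size());
        if (smaller == 0) {
            return 0.0;
        }
        return (double) intersectionSize(cast1, cast2) / smaller;
    }

    /** Assumed combination: weighted sum of a title score and a cast score. */
    static double combinedScore(double titleScore, double castScore, double titleWeight) {
        return titleWeight * titleScore + (1.0 - titleWeight) * castScore;
    }

    public static void main(String[] args) {
        // Placeholder names, not data from the book's tables.
        Set<String> cast1 = new HashSet<>(Arrays.asList("Actor A", "Actor B", "Actor C"));
        Set<String> cast2 = new HashSet<>(Arrays.asList("Actor B", "Actor C", "Actor D"));

        System.out.println("shared actors: " + intersectionSize(cast1, cast2));
        System.out.println("cast score:    " + castScore(cast1, cast2));
        System.out.println("combined:      " + combinedScore(0.8, castScore(cast1, cast2), 0.5));
    }
}
```

Exact matching treats "Actor B" and "actor b" as different people, so a real matcher would likely normalize names first or use one of the fuzzy techniques the chapter covers.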

the same entities, as described in chapter 4!) and makes intuitive sense to a user. In this chapter, we’ll look at how to perform the task of identifying names in text automatically. We’ll examine the accuracy of a popular open source tool for performing named-entity recognition as well as its runtime performance characteristics in order to assist you in choosing where and when to employ this technology. We’ll also

Approaches to named-entity recognition

Figure 5.1 Snippet of article on

might ask. For instance, Wikipedia or a collection of research papers might be used as a source for finding answers. In other words, the QA system we propose is based on identifying and analyzing text that has a chance of providing the answer based on patterns it has seen in the past. It won’t be capable of inferring an answer from a variety of sources. For instance, if the system is asked “Who is Bob’s uncle?” and there’s a document in the collection with the sentences “Bob’s father is Ola.
