An Interview on Text Analytics

It is my pleasure to interview Seth Grimes, who has agreed to write a monthly column for KDnuggets on Text Analytics.

Seth Grimes is an analytics strategy consultant, a recognized expert on business intelligence and text analytics. He is contributing editor at Intelligent Enterprise magazine, founding chair of the Text Analytics Summit, Data Warehousing Institute (TDWI) instructor, and text analytics channel expert at the Business Intelligence Network. Seth founded Washington DC-based Alta Plana Corporation in 1997. He consults, writes, and speaks on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies.


Gregory Piatetsky-Shapiro: What is text analytics?

Seth Grimes: For me, functionally, text analytics is the same as text mining: applying statistical, linguistic, and machine-learning methods to extract information from text, improve search and information retrieval, and automate document processing. With text analytics, there is a sense that you may want to integrate these functions with or into BI and line-of-business systems. Contrast with (text) data mining where the workbench or programming interface is the only way to go.

 

GPS: Are there other ways text analytics differs from information retrieval and from text mining?

SG: Focusing on applications, "text mining" is perhaps used more often in scientific and technical contexts and "text analytics," a newer term, is found more often in business contexts. I also wonder if people don't think primarily of statistical and machine-learning approaches, extended from the data mining world, when they picture text mining. When you bring in computational linguistics for, say, building better indexes to support conceptual or semantic search and information retrieval (IR), that's not considered text mining.

IR does differ from text mining/analytics. IR is about bringing back documents, where text analytics additionally aims to automate their processing and make sense of their contents. Automated processing: that could involve routing e-mail to the right customer-service rep, it could involve identifying legally discoverable documents, it could involve tagging and selectively forwarding news articles. And sense making involves information extraction -- pulling named and pattern-based entities (e.g., phone and Social Security numbers), topics, concepts, facts, and attitudinal information from text -- and diverse analytical steps such as clustering and classification, data integration, and visualization.

Alright, that was a dense response.

Obviously, text analytics also operates not only on retrieved documents but also on text that comes to you, for instance, e-mail, stuff you get via RSS or Atom feeds, text in databases and enterprise systems, and so on. Similarly, IR covers all information sources and not just textual documents.
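
To make the idea of pattern-based entities concrete, here is a minimal sketch in Python of pulling phone and Social Security numbers out of free text with regular expressions. The patterns, the function name extract_entities, and the sample text are assumptions made for illustration (US-style formats only); they are not taken from the interview or from any particular product, and production extractors handle far more variation.

```python
import re

# Illustrative patterns only (assumptions): US-style formats.
# Phone: optional parentheses around the area code, then 3-3-4 digits.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")
# Social Security number: 3-2-4 digits separated by hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_entities(text):
    """Return the pattern-based entities found in text, grouped by type."""
    return {
        "phone": PHONE_RE.findall(text),
        "ssn": SSN_RE.findall(text),
    }


# Hypothetical sample text with made-up numbers.
sample = "Call (202) 555-0143 or 202-555-0199; SSN on file: 078-05-1120."
print(extract_entities(sample))
```

Named entities, topics, and attitudinal information generally need linguistic or statistical models rather than fixed patterns; this sketch covers only the pattern-based case.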

 

GPS: How did you become interested in text analytics?

SG: In '96-'97, I had a gig with a Web developer as the database guy and technical director. We used Illustra, Michael Stonebraker's Postgres commercialization, which provided character large objects and a couple of text-search options. Illustra let us manage Web-page templates in-database and query text and conventional fields together, so I got a taste of a certain style of unified analysis.

I've done a lot of work with governmental statistics. Coming off a long contract at the Census Bureau -- I designed the 2000 Census analysis system, which the Bureau used to produce hundreds of billions of statistical tables -- I saw text as an area with huge potential. This was back in 2002-3. I started writing on the text-analytics topic. Text analytics resonates with a lot of people so I've focused on it more and more.

My involvement has been really rewarding. I feel I've been able to contribute to making benefits visible to a broad range of potential users, which has accelerated business uptake.

 

GPS: Why is text analytics not yet widely adopted? Is it just a matter of time?

SG: It's perhaps more widely used than you'd think. Consider that Google, Yahoo!, and Live search respond to "43+99", "map philadelphia", and "ORCL" appropriately, as requests for arithmetic help, a map, and Oracle share-price information, rather than as requests for a long list of URLs. That is, they all recognize named entities and query patterns and infer user intent. That's basic text analytics.

I estimate a $350 million 2008 market for text analytics software and vendor support and professional services, up 40% from 2007. Three of the five biggest BI vendors, IBM, SAP, and SAS, have serious text-analytics capabilities, albeit not yet widely deployed in their product lines. Applications such as Voice of the Customer, used for media monitoring and brand and reputation management, are getting a huge amount of attention. So (even) wider adoption is just a matter of time.
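
The query handling described in this answer, where "43+99", "map philadelphia", and "ORCL" are treated as different kinds of request, can be sketched as a small rule-based classifier. This is a toy illustration under stated assumptions: the patterns, the intent labels, the KNOWN_TICKERS lookup table, and the function name classify_query are all hypothetical, and nothing here reflects how any real search engine works.

```python
import re

# Assumed pattern: numbers joined by one or more arithmetic operators.
ARITHMETIC_RE = re.compile(r"^\d+(\.\d+)?(\s*[-+*/]\s*\d+(\.\d+)?)+$")
# Assumed pattern: the word "map" followed by a place name.
MAP_RE = re.compile(r"^map\s+(?P<place>.+)$", re.IGNORECASE)
# Toy gazetteer of ticker symbols (hypothetical lookup table).
KNOWN_TICKERS = {"ORCL": "Oracle", "IBM": "IBM"}


def classify_query(query):
    """Return an (intent, payload) pair for a search query string."""
    q = query.strip()
    if ARITHMETIC_RE.match(q):
        return ("arithmetic", q)
    match = MAP_RE.match(q)
    if match:
        return ("map", match.group("place"))
    if q in KNOWN_TICKERS:
        return ("stock_quote", KNOWN_TICKERS[q])
    # Fall back to an ordinary list-of-URLs search.
    return ("web_search", q)


for q in ["43+99", "map philadelphia", "ORCL", "text analytics"]:
    print(q, "->", classify_query(q))
```

The rules are checked in a fixed order and anything unrecognized falls through to ordinary search, which keeps the sketch predictable at the cost of missing ambiguous queries.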

 

Gregory Piatetsky-Shapiro: Where is text analytics most used right now? Where do you see the biggest potential?

Seth Grimes: The greatest potential for text analytics is in enabling a computer to pass the Turing Test. Text analytics will decode whatever the tester sends to the computer, it will mine a corpus for response material, and it will support generation of a credible and convincing response. It will do all this throughout a contextualized, conversational exchange complete with noise, external references and anaphora, and multiple topics and voices. That is, text analytics has the potential to enable a computer to understand and talk to people.

Alright, that's visionary stuff. The current hot topics are long-standing applications to life sciences, for instance pharmaceutical drug discovery, and intelligence, and newer applications for functions that include customer support; marketing; media and publishing; insurance, risk, and fraud; search enrichment, etc. As I said, I see a $350 million 2008 market, and that figure does not consider the value created by university and industrial research, systems integrators, and custom development, nor the value of the products and capabilities enabled by text analytics.

 

GPS: What is the relationship of text analytics with the Semantic Web and XML?

SG: The Semantic Web is a concept that is being implemented with technologies that include XML, notably RDF (Resource Description Framework) and OWL (Web Ontology Language). But the Semantic Web is about a lot more than those encoding technologies and XML is of course used for many applications, most of which have nothing to do with the Semantic Web.

But it's funny you should ask about the Semantic Web, or perhaps you did so knowing that I recently published a blog article entitled Semantic Web Snake Oil. I got a lot of fire for that one but I stick to my main point, that intentional publication of semantic mark-up is largely not happening -- most of the strongest proponents still aren't doing it -- which is a telling indicator. I need to post a follow-up clarifying my views and looking at Linked Data, a worthy effort but one that I believe will have modest adoption and utility.

The answer to information findability issues isn't an intentional and, I believe, rigid and difficult approach like the Semantic Web's. It's analytics, which a) can make sense of information in whatever form it comes in and b) does not pre-judge the use that the information will be put to.

 

GPS: I understand that you like Python. Is it better than Perl and other scripting languages for text analysis?

SG: Perl, historically, was obscure. Python code is clear(er). Perl is great for pattern matching via regular expressions, but so is Python. I do know that the Natural Language Toolkit for Python, nltk, has a great reputation -- I've played with it only a bit -- and that I find Python powerful and easier to program.

GPS: Any advice for people who want to enter the text analytics area?

SG: Same as for any computing arena: Just do it. There are a couple of open-source options out there, GATE if you prefer a linguistic approach and RapidMiner if you prefer statistics.

Professionally, if you have a steady analytics job, look for ways you can extend your analyses to textual sources and then for appropriate tools and then, again, just give it a shot. It shouldn't be hard for most data miners to extend their work in this fashion.

Text analytics is on a fast growth curve with interesting challenges and clear business benefits, however you define "business." Now's a good time to get started.

 

Gregory Piatetsky-Shapiro: How did you become interested in computers?

Seth Grimes: To be precise, I'm interested in programming rather than in computers.

For someone like me, programming is a captivating activity and it helps you do challenging things. For example, when I was in high school, in 1976, one of the calculus teachers, not mine, offered an A to any student who could compute 1,000 factorial, the product of the integers from 1 to 1,000. One of my bonehead friends started multiplying on paper so I programmed it. This was in Basic on an HP minicomputer with 16-bit arithmetic and a minuscule amount of memory. I got my program to run in 2 minutes, using log approximations for error checking. It was fun and it showed up my friend. What more could you ask for, aside from getting paid to do this kind of stuff?
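
For readers curious how such a computation can work with small-word arithmetic, here is a minimal sketch in Python. It is not a reconstruction of the original Basic program, whose details the interview does not give. It assumes one plausible approach: store the number as a list of decimal digits, multiply digit by digit so that every intermediate value stays small, and use summed base-10 logarithms as an independent check on the number of digits. The function names are hypothetical.

```python
import math


def factorial_digits(n):
    """Return n! as a list of decimal digits, least significant first."""
    digits = [1]
    for k in range(2, n + 1):
        carry = 0
        for i in range(len(digits)):
            # For n <= 1000 this value is at most 9 * 1000 + 999 = 9999,
            # so it fits comfortably in a signed 16-bit integer.
            carry, digits[i] = divmod(digits[i] * k + carry, 10)
        while carry:
            carry, d = divmod(carry, 10)
            digits.append(d)
    return digits


def expected_digit_count(n):
    """Digit count of n! from summed logs: floor(log10(n!)) + 1."""
    return math.floor(sum(math.log10(k) for k in range(2, n + 1))) + 1


digits = factorial_digits(1000)
# Log-based sanity check on the size of the result.
assert len(digits) == expected_digit_count(1000)
print("".join(str(d) for d in reversed(digits)))
```

The log check catches gross errors such as a dropped carry that changes the length of the result, but it does not verify individual digits.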

I suppose I prefer computing over abstract challenges because you can see your results. Textual analysis is especially concrete because it deals with human language and I find a certain appeal in deciphering and deconstructing text.

I'll blame my need for concrete problems for my neglect of my grad school math work. So I was reading stuff like The Sot-Weed Factor and Gravity's Rainbow when I should have been studying commutative rings. Clever stuff, but reading about a character named Sammy Hilbert-Spaess does not make you king of infinite space despite his name.

At least I did learn about linear algebra and vector spaces, which help if you want to understand techniques applied in text analytics, stuff like singular value decompositions.

 

GPS: What do you like to do when away from a computer?

SG: My second job is (volunteer) president of a public-safety non-profit in the community where I live, and I'm involved in a slew of other local activities. I even ran for mayor of the city where I live, Takoma Park, Maryland, which is next to Washington DC, back in 2005. I got 41% of the vote but lost to an entrenched, eight-year incumbent. Fortunately the mayor who beat me stepped down in 2007, and fortunately I think I have running for office out of my system.

Otherwise, I have an inherited love of travel, I spend time with my family, and I leyn every month or two. Leyning is public Torah reading, a learning exercise that actually gets me involved in a tradition of close textual analysis.

 

GPS: What book are you currently reading?

SG: I'm just about done Vineland by Thomas Pynchon and I'm part-way done Mark Twain's Innocents Abroad. I don't read business books and I only occasionally read technical books.
