Useful Tools

Most of the tools below are open source, released under licenses such as the GPL or Apache. Please read each tool's license statement carefully before use.
I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine
2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open-source project, licensed under Apache 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET, and other languages.
http://lucene.apache.org/
http://www.nutch.org/
3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html
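Because Wget is non-interactive, a script can assemble its command line and run it without a terminal. Here is a minimal Python sketch, assuming `wget` is installed and on the PATH. The helper names `build_wget_command` and `fetch` are hypothetical; `-q`, `-O`, and `--tries` are standard Wget options.

```python
import shutil
import subprocess


def build_wget_command(url, output_path, tries=3):
    """Return the argv list for a quiet, non-interactive wget download."""
    return ["wget", "-q", f"--tries={tries}", "-O", output_path, url]


def fetch(url, output_path, tries=3):
    """Run wget and return True when it exits with status 0."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    result = subprocess.run(build_wget_command(url, output_path, tries))
    return result.returncode == 0
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when the URL contains characters such as `&`.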
II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, among them GIZA.
2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
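Franz Josef Och has worked successively at RWTH Aachen University in Germany, ISI (the Information Sciences Institute at the University of Southern California), and Google. GIZA++ now has a Windows port and supports IBM Models 1-5 well.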
3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models
4. OpenNLP:
http://opennlp.sourceforge.net/
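Includes more than 20 tools, such as Maxent.
btw: these SMT tools all seem to like Egypt-related names, such as GIZA, PHARAOH, and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was also developed by someone from ISI, Philipp Koehn. The connections really are tangled.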
5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded for free after filling out a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm
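The precision and recall figures quoted for MINIPAR are computed over dependency relationships. A rough sketch of how such scores can be computed, assuming each dependency is represented as a (head, relation, dependent) triple; the triples and the function name are made up for illustration and do not reflect MINIPAR's output format or the SUSANNE evaluation setup.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted dependency triples against gold triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical triples for "John saw the big dog".
gold = {("saw", "subj", "John"), ("saw", "obj", "dog"),
        ("dog", "det", "the"), ("dog", "mod", "big")}
predicted = {("saw", "subj", "John"), ("saw", "obj", "dog"),
             ("saw", "mod", "big")}
p, r = precision_recall(predicted, gold)
```

Precision penalizes wrong relations the parser proposed; recall penalizes gold relations it missed.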
6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
The latest version of WordNet is 2.1 (for Windows & Unix-like OS), available as binaries, source, and documentation.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn
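The idea of synonym sets linked by relations can be pictured with a small data structure. This is only a toy sketch with hand-made data: the `Synset` class, the synset names, and the entries are hypothetical, and this is not the format or API of the real WordNet database.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One lexical concept: a set of synonymous words plus links to other synsets."""
    name: str
    words: frozenset
    relations: dict = field(default_factory=dict)  # relation name -> list of synset names


# Hand-made toy data, NOT taken from the real WordNet database.
SYNSETS = {
    "car.n": Synset("car.n", frozenset({"car", "automobile"}),
                    {"hypernym": ["vehicle.n"]}),
    "vehicle.n": Synset("vehicle.n", frozenset({"vehicle"}),
                        {"hyponym": ["car.n"]}),
}


def synsets_for(word):
    """Return the names of all synsets that contain the word."""
    return sorted(s.name for s in SYNSETS.values() if word in s.words)


def related(synset_name, relation):
    """Follow one kind of relation link from a synset."""
    return SYNSETS[synset_name].relations.get(relation, [])
```

Looking up "automobile" and "car" lands in the same synset, which is the point of organizing words by concept rather than by spelling.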
7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of CAS (the Chinese Academy of Sciences); it is a resource similar to WordNet.
8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.
9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
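Both toolkits above build statistical language models. The core idea can be sketched as a bigram model with add-one smoothing. This is a toy illustration, not how either toolkit is implemented (real toolkits use better smoothing methods and efficient storage); all function names are hypothetical.

```python
from collections import Counter
import math


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sentence_logprob(tokens, unigrams, bigrams):
    """Log-probability of a sentence as the sum of its bigram log-probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(bigram_prob(p, w, unigrams, bigrams))
               for p, w in zip(padded, padded[1:]))


unigrams, bigrams = train_bigram([["the", "cat"], ["the", "dog"]])
```

A word order seen in training ("the cat") scores higher than an unseen one ("cat the"), which is what a speech recognizer or tagger uses the model for.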
10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.
11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering
III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
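Written by Franz Josef Och. In addition, the OpenNLP project includes a Java MaxEnt tool that estimates parameters with GIS (Generalized Iterative Scaling); Zhang Le of Northeastern University in China (currently studying in the UK) ported it to C++.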
2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU); available in C++, Java, Perl, C#, and other languages.
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and distribution estimation (one-class SVM). It supports multi-class classification.
3. SVM Light
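Developed by Thorsten Joachims (now at Cornell) while he was at the University of Dortmund; it is the best-known SVM package after LibSVM. It is open source, written in C, and can be used for ranking problems.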
http://svmlight.joachims.org/
4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
This package is distributed only as executables and a library; the source code is not available for download.
5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) grew out of HMMs/MEMMs and are widely used in IE, IR, and NLP.
6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, it was developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.
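For a Markov model over tags, such as the one SVMhmm learns, predicting the tag sequence of a new sentence is commonly done with Viterbi decoding. Here is a minimal sketch with hand-set scores. It is not SVMstruct's API and does not cover the learning step; the tag set, the score tables, and the penalty for missing entries are all assumptions made for illustration.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the highest-scoring tag sequence for a non-empty word list.

    start[t], trans[(t1, t2)] and emit[(t, w)] are additive scores (for
    example log-probabilities or learned weights); missing entries are
    treated as a large penalty.
    """
    MISSING = -1e9
    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{t: (start.get(t, MISSING) + emit.get((t, words[0]), MISSING), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + trans.get((p, t), MISSING), p) for p in tags)
            column[t] = (score + emit.get((t, words[i]), MISSING), prev)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical scores for a three-tag toy model.
tags = ["DET", "NOUN", "VERB"]
start = {"DET": -0.5, "NOUN": -1.5, "VERB": -3.0}
trans = {("DET", "NOUN"): -0.2, ("NOUN", "VERB"): -0.3,
         ("DET", "VERB"): -3.0, ("NOUN", "NOUN"): -2.0}
emit = {("DET", "the"): -0.1, ("NOUN", "dog"): -0.5,
        ("VERB", "barks"): -0.4, ("NOUN", "barks"): -2.0}
tagged = viterbi(["the", "dog", "barks"], tags, start, trans, emit)
```

The dynamic program keeps only the best path into each tag at each position, so the cost grows linearly with sentence length rather than exponentially.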
IV. Misc:
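1. Notepad++: an open-source editor with keyword highlighting for dozens of languages, including C#, Perl, and CSS; its features rival those of recent versions of UltraEdit and Visual Studio .NET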
http://notepad-plus.sourceforge.net
2. WinMerge: compares text content, for example to find the differences between two versions of a program
winmerge.sourceforge.net/
3. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
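ps: as far as editors go, the best one I have seen is still VS .NET. There is a +/- sign in front of each function for expand/collapse, it supports region copy/cut/paste, ctrl+c/ctrl+x/ctrl+v operate on a whole line at a time, ctrl+k+c/ctrl+k+u comment/uncomment multiple lines, and there is more... Visual Studio .NET is really kool :D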
4. Berkeley DB
http://www.sleepycat.com/
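Berkeley DB is not a relational database; it is described as an embedded database: in client/server terms, its client and server share a single address space. Because the database originally grew out of the file system, it is more like a dictionary-style database of key-value pairs. The database files are persisted to disk, so it is not limited by memory size. BDB has a variant, Berkeley DB XML, which is an XML database (presumably storing data as XML documents). BDB has been embedded in products from companies including Microsoft, Google, HP, Ford, and Motorola.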
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
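The three basic operations described for Berkeley DB (store a value under a key, check whether a key exists, retrieve the value for a key) can be tried out with Python's standard `dbm.dumb` module. It is used here only as a stand-in for a persistent key-value store and is not Berkeley DB's own API; the file name, keys, and function name are hypothetical.

```python
import dbm.dumb
import os
import tempfile


def demo_key_value_store(directory):
    """Store, check and retrieve byte-string values in an on-disk dictionary."""
    path = os.path.join(directory, "example_store")
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:1"] = b"alice"      # store this value under this key
    # Reopen the files to show that the data was persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        exists = b"user:1" in db      # check if this key exists
        missing = b"user:2" in db
        value = db[b"user:1"]         # retrieve the value for this key
    return exists, missing, value


with tempfile.TemporaryDirectory() as tmp:
    print(demo_key_value_store(tmp))
```

Keys and values are plain byte strings, so any serialization (text, packed structs, and so on) is left to the application, as the description above notes.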
5. LaTeX
LATEX, written as LaTeX in plain text, is a document preparation system for the TeX typesetting program.
It offers programmable desktop publishing features and extensive facilities for automating most aspects of typesetting and desktop publishing, including numbering and cross-referencing, tables and figures, page layout, bibliographies, and much more. LaTeX was originally written in 1984 by Leslie Lamport and has become the dominant method for using TeX—few people write in plain TeX anymore. The current version is LaTeX2ε.
The Chinese distribution can be found at http://www.ctex.org
http://learn.tsinghua.edu.cn:8080/2001315450/comp.html by Wang Yin
6. EditPlus
http://www.editplus.com/
EditPlus is an Internet-ready 32-bit text editor, HTML editor and programmers editor for Windows. While it can serve as a good replacement for Notepad, it also offers many powerful features for Web page authors and programmers.
The latest version of EditPlus is currently 2.21; the BrE and AmE spell checkers must be downloaded and installed separately.
7. GVim: Vi IMproved
http://www.vim.org/index.php
Vim is an advanced text editor that seeks to provide the power of the de-facto Unix editor 'Vi', with a more complete feature set. It's useful whether you're already using vi or using a different editor. Users of Vim 5 should consider upgrading to Vim 6, which is greatly enhanced since Vim 5. Vim is often called a "programmer's editor," and so useful for programming that many consider it an entire IDE. It's not just for programmers, though. Vim is perfect for all kinds of text editing, from composing email to editing configuration files.
Ordinary Windows users can download it from this link: ftp://ftp.vim.org/pub/vim/pc/gvim64.exe
8. Cygwin: GNU + Cygnus + Windows
http://www.cygwin.com/
Cygwin is a Linux-like environment for Windows. It consists of two parts: A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing substantial Linux API functionality. A collection of tools, which provide Linux look and feel.
9. MinGW: Minimalistic GNU for Windows
http://www.mingw.org/
MinGW: A collection of freely available and freely distributable Windows specific header files and import libraries combined with GNU toolsets that allow one to produce native Windows programs that do not rely on any 3rd-party C runtime DLLs.
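Both are used to compile and port Unix/Linux software on Windows. Cygwin emulates a POSIX-compliant layer on top of Windows (the library is cygwin1.dll), whereas MinGW uses Windows' own runtime library (msvcrt.dll) to implement some functionality that follows the POSIX spec, and is not fully POSIX-compliant. MinGW is in fact a branch of Cygwin; because it does not implement a Linux API emulation layer, its overhead is somewhat lower than Cygwin's.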
10. CutePDF Writer
http://www.cutepdf.com
Portable Document Format (PDF) is the de facto standard for the secure and reliable distribution and exchange of electronic documents and forms around the world.  CutePDF Writer (formerly CutePDF Printer) is the free version of commercial PDF creation software. CutePDF Writer installs itself as a "printer subsystem". This enables virtually any Windows applications (must be able to print) to create professional quality PDF documents - with just a push of a button!
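Compared with Acrobat, one big advantage is that it is free. Word figures, tables, and formulas generally convert well: what you see is what you get, haha. A PS2PDF converter may be required; the site provides a download link.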
11. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
Like MATLAB, the R statistical software is used for scientific computing. The difference is that R is open source :)
12. Judy
http://judy.sourceforge.net/
Judy arrays are fast, sometimes even faster than a hash table. And because Judy
arrays are a type of trie, they consume much less memory than hash tables.
Roughly speaking, it is similar to a highly-optimised 256-ary trie data
structure.
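The "256-ary trie" idea can be sketched as a trie that branches on one byte of the key per level. This toy version leaves out the node compression techniques that the real Judy library depends on for its speed and memory savings, so it shows the shape of the structure only; the class and method names are hypothetical.

```python
class ByteTrie:
    """Map byte-string keys to values by branching on one byte (0-255) per level."""

    def __init__(self):
        self.children = {}      # byte value -> ByteTrie; absent bytes take no space
        self.has_value = False
        self.value = None

    def insert(self, key, value):
        node = self
        for byte in key:        # iterating over bytes yields ints in range(256)
            if byte not in node.children:
                node.children[byte] = ByteTrie()
            node = node.children[byte]
        node.has_value = True
        node.value = value

    def get(self, key, default=None):
        node = self
        for byte in key:
            node = node.children.get(byte)
            if node is None:
                return default
        return node.value if node.has_value else default

    def keys_with_prefix(self, prefix):
        """Return all stored keys that start with prefix, in sorted byte order."""
        node = self
        for byte in prefix:
            node = node.children.get(byte)
            if node is None:
                return []
        found = []
        stack = [(node, bytes(prefix))]
        while stack:
            current, key = stack.pop()
            if current.has_value:
                found.append(key)
            for byte, child in current.children.items():
                stack.append((child, key + bytes([byte])))
        return sorted(found)


trie = ByteTrie()
trie.insert(b"cat", 1)
trie.insert(b"car", 2)
trie.insert(b"dog", 3)
```

Keys that share a prefix share nodes, which is where a trie saves memory over a hash table, and prefix queries such as `trie.keys_with_prefix(b"ca")` come for free.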
