Exploring the Power of Links in Data Mining: Notes from Jiawei Han's Talk

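 Jiawei Han (韓家煒) is a towering figure in data mining whose name everyone in the field knows, and today I was lucky enough to see him in person. My first impression, oddly enough, was that he must have been quite handsome as a young man (embarrassing to admit!). Of course, he is still a sharp and energetic scholar today.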

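   The topic of the talk was "Exploring the Power of Links in Data Mining". It mainly covered four papers, all written by his PhD student Xiaoxin Yin. Most of this work was inspired by algorithms such as PageRank and HITS. By exploiting the links among data items, we can extract the information we care about more effectively. In comparisons with related algorithms, the algorithms proposed in all four papers came out clearly ahead.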

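   1. CrossMine: during link propagation it uses controlled propagation, ignoring some of the weaker links. This greatly reduces running time while largely preserving accuracy. With few relations the advantage is not obvious, but with many relations it becomes substantial.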
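   To make "controlled propagation" a bit more concrete, here is a minimal Python sketch. It is my own illustration, not the code from the paper: the function name `propagate_ids`, the `max_fanout` threshold, and the rule "skip a link whose average fan-out is too high" are all assumptions chosen to show the general idea of passing target-row IDs across a join and refusing to follow weak links.

```python
from collections import defaultdict

def propagate_ids(target_rows, other_rows, key, max_fanout=3.0):
    """Attach to each row of other_rows the IDs of the target rows it joins with.

    target_rows, other_rows: dict mapping row_id -> dict of column values.
    key: name of the join column shared by both relations.
    max_fanout: hypothetical cutoff; if a row of other_rows joins with more
    target rows than this on average, the link is treated as too weak and
    None is returned instead of the propagated IDs.
    """
    # Index the target rows by join-key value.
    ids_by_key = defaultdict(set)
    for row_id, row in target_rows.items():
        ids_by_key[row[key]].add(row_id)

    # Propagate: each row in the other relation collects the matching IDs.
    propagated = {}
    for row_id, row in other_rows.items():
        propagated[row_id] = set(ids_by_key.get(row[key], ()))

    # Controlled propagation (assumed rule): drop links with high fan-out.
    total = sum(len(ids) for ids in propagated.values())
    if propagated and total / len(propagated) > max_fanout:
        return None
    return propagated

# Toy data: loans are the target relation, accounts are linked by "account".
loans = {1: {"account": "a"}, 2: {"account": "b"}, 3: {"account": "a"}}
accounts = {"x": {"account": "a"}, "y": {"account": "b"}}
result = propagate_ids(loans, accounts, "account")
```

   The point of the cutoff is that every skipped link saves a whole join, which is where the speedup with many relations would come from.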

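   2. User-Guided Clustering: similar to semi-supervised learning. The user names a feature they consider important, and the clustering is then carried out with that guidance. Here an entire feature column is taken as the hint. The supplied feature is only a soft hint, a point of reference; other factors still have to be taken into account.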

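   3. LinkClus: for example, from the papers people publish we can work out how related different conferences are, since conferences in which the same author publishes are strongly linked. Earlier algorithms were very slow; LinkClus exploits the power law distribution of links. It finds the dense links, and because there are few of them, analyzing only those gives a large efficiency gain. At the same time, most of the information is contained in these dense links, so accuracy stays good.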
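   As a rough illustration of the conference example, the sketch below scores conference pairs by the authors they share, after throwing away the light links. This is not the LinkClus algorithm itself; Jaccard similarity, the `min_weight` cutoff, and the function name `conference_similarity` are my own assumptions, used only to show the idea of "keep the few heavy links and compute similarity from them".

```python
from collections import defaultdict
from itertools import combinations

def conference_similarity(links, min_weight=2):
    """Jaccard similarity between conferences based on shared authors.

    links: iterable of (author, conference, n_papers) tuples.
    min_weight: hypothetical cutoff; links with fewer papers are ignored,
    standing in for "only analyze the dense links".
    """
    # Keep only the heavy author-conference links.
    authors_of = defaultdict(set)
    for author, conf, n_papers in links:
        if n_papers >= min_weight:
            authors_of[conf].add(author)

    # Compare every pair of conferences that kept at least one link.
    sims = {}
    for a, b in combinations(sorted(authors_of), 2):
        shared = authors_of[a] & authors_of[b]
        union = authors_of[a] | authors_of[b]
        sims[(a, b)] = len(shared) / len(union)
    return sims

# Made-up publication counts, for illustration only.
links = [
    ("ann", "KDD", 5), ("ann", "ICDM", 3),
    ("bob", "KDD", 2), ("bob", "SIGMOD", 4),
    ("cat", "ICDM", 1),  # light link, dropped by the cutoff
]
sims = conference_similarity(links)
```

   Under a power law, most links are light, so a cutoff like this removes most of the work while the heavy links still carry most of the signal.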

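   4. Object Distinction: how do we tell apart papers written by different people who share a name? This is especially common for Chinese names, many of which collide once romanized; there are as many as 14 authors named Wei Wang (王偉), for example, and separating them is a real problem. The approach uses coauthor information from the papers. It first trains on names that are very unlikely to be shared, treating them as clean data, and then works outward from those to classify the rest.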
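   The coauthor idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the method from the paper: it skips the training step on clean data entirely and just puts two papers in the same group whenever they share a coauthor (connected components via union-find). The function name `split_by_coauthors` and the toy data are assumptions.

```python
def split_by_coauthors(papers):
    """Group papers that carry the same ambiguous author name.

    papers: dict mapping paper_id -> set of coauthor names
            (excluding the ambiguous name itself).
    Returns a list of sets of paper IDs; papers connected through
    shared coauthors end up in the same set.
    """
    parent = {pid: pid for pid in papers}

    def find(pid):
        # Union-find lookup with path halving.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    # Merge each paper with the first paper seen for each of its coauthors.
    first_paper_with = {}
    for pid, coauthors in papers.items():
        for name in coauthors:
            if name in first_paper_with:
                parent[find(pid)] = find(first_paper_with[name])
            else:
                first_paper_with[name] = pid

    groups = {}
    for pid in papers:
        groups.setdefault(find(pid), set()).add(pid)
    return list(groups.values())

# Four papers all signed "Wei Wang"; coauthor lists are made up.
papers = {
    "p1": {"Jiawei Han"},
    "p2": {"Jiawei Han", "Philip Yu"},
    "p3": {"Haixun Wang"},
    "p4": {"Philip Yu"},
}
groups = split_by_coauthors(papers)
```

   A real system would need something softer than "shares any coauthor", since two different people can share a collaborator; that is presumably where the trained model comes in.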

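    Finally, he described Xiaoxin Yin's most recent research direction: telling true information from false information on web pages. It rests on the assumption that a fact has only one true version, while the false versions vary endlessly.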
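    One simple way to act on that assumption is an iterative weighted vote: values that many trusted sources agree on gain confidence, and sources that assert confident values gain trust. The sketch below is my own toy version, not the algorithm from the talk; the update rules, the starting trust of 0.5, and the name `find_truth` are all assumptions.

```python
from collections import defaultdict

def find_truth(claims, rounds=10):
    """Pick the most credible value per object by iterative weighted voting.

    claims: list of (source, obj, value) tuples.
    Returns (truth, trust): truth maps obj -> chosen value,
    trust maps source -> score in (0, 1].
    """
    sources = {s for s, _, _ in claims}
    trust = {s: 0.5 for s in sources}  # assumed neutral starting trust
    conf = {}
    for _ in range(rounds):
        # Confidence of a value = share of trust behind it for that object.
        score = defaultdict(float)
        for s, obj, val in claims:
            score[(obj, val)] += trust[s]
        total = defaultdict(float)
        for (obj, _), sc in score.items():
            total[obj] += sc
        conf = {k: sc / total[k[0]] for k, sc in score.items()}

        # Trust of a source = average confidence of the values it asserts.
        sums, counts = defaultdict(float), defaultdict(int)
        for s, obj, val in claims:
            sums[s] += conf[(obj, val)]
            counts[s] += 1
        trust = {s: sums[s] / counts[s] for s in sources}

    # For each object keep the value with the highest confidence.
    best = {}
    for (obj, val), c in conf.items():
        if obj not in best or c > best[obj][1]:
            best[obj] = (val, c)
    return {obj: val for obj, (val, _) in best.items()}, trust

# Made-up claims about the authors of two books.
claims = [
    ("a", "book1", "Han"), ("b", "book1", "Han"), ("c", "book1", "Hen"),
    ("a", "book2", "Yin"), ("c", "book2", "Yin"), ("d", "book2", "Yan"),
]
truth, trust = find_truth(claims)
```

    Because the wrong values are scattered ("Hen", "Yan") while the right one is repeated, the agreeing sources reinforce each other and the outliers lose trust over the rounds.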

    To close, my respects once again to a true master!

    Here are the talk abstract and Prof. Han's bio:

ABSTRACT
Algorithms like PageRank and HITS have been developed in late 1990s to
explore links among Web pages to discover authoritative pages and hubs.
Links have also been popularly used in citation analysis and social network
analysis.  We show that the power of links can be explored thoroughly at
data mining in classification, clustering, information integration, and
other interesting tasks.  Some recent results of our research that explore
the crucial information hidden in links will be introduced, including (1)
multi-relational classification, (2) user-guided clustering, (3) link-based
clustering, and (4) object distinction analysis.  The power of links in
other analysis tasks will also be discussed in the talk.
------------------------
Short bio:
Jiawei Han, Professor, Department of Computer Science, University of
Illinois at Urbana-Champaign.  He has been working on research into data
mining, data warehousing, database systems, data mining from spatiotemporal
data, multimedia data, stream and RFID data, Web data, social network data,
and biological data, with over 300 journal and conference publications.  He
has chaired or served on over 100 program committees of international
conferences and workshops, including PC co-chair of 2005 (IEEE)
International Conference on Data Mining (ICDM), Americas Coordinator of
2006 International Conference on Very Large Data Bases (VLDB).  He is also
serving as the founding Editor-In-Chief of ACM Transactions on Knowledge
Discovery from Data.  He is an ACM Fellow and has received 2004 ACM SIGKDD
Innovations Award and 2005 IEEE Computer Society Technical Achievement
Award. His book "Data Mining: Concepts and Techniques" (2nd ed., Morgan
Kaufmann, 2006) has been popularly used as a textbook worldwide.

Prof. Han's home page:

http://www-faculty.cs.uiuc.edu/~hanj/

The four papers covered in the talk:

CrossMine: Efficient Classification from Multiple Heterogeneous Databases

Cross-Relational Clustering with User's Guidance

LinkClus: Efficient Clustering via Heterogeneous Semantic Links

Object Distinction: Distinguishing Objects with Identical Names by Link Analysis

Notes on another talk he gave:

http://users.ir-lab.org/~bill_lang/blog10/archives/001166.html

 