[ChimeText] (note special time and venue) Hang LI (Microsoft Research Asia) / Statistical Learning Methods for Information Retrieval
Min-Yen Kan
knmnyn at gmail.com
Mon Apr 10 21:58:57 SGT 2006
Hi all,
We are having another special session of the CHIME Text Processing
seminar on our usual date but at a different time and venue. Please
attend Hang Li's special talk on his broad work at MSRA.
Best,
Min
TITLE : Statistical Learning Methods for Information Retrieval
SPEAKER : Dr. Hang LI
Microsoft Research Asia, China
TIME : April 12, 2006, 2:30pm - 3:30pm, Wed
VENUE : SR 5 - S16 Level 4, Room 31
Chaired by Dr Ng Hwee Tou (nght at comp.nus.edu.sg)
ABSTRACT:
In this talk, I will introduce two statistical learning methods for
information retrieval. One is about "expert search", and the other
"learning to rank". Expert search is a search task where the user
types a query representing a topic and the search system returns a
ranked list of people who are considered experts on the topic.
Previous studies employed profile-based methods, where the expert
ranking is based only on co-occurrence between people and terms in
documents. We propose a new approach capable of employing many types
of association relationships among query terms, documents and people
(experts). These include relevance between query terms and documents,
co-occurrence between people and terms in documents, co-occurrence
between people and terms in authors and title fields, and
co-occurrence between people and people. We employ a new statistical
model, referred to as the two-stage expert search model, to combine
all the association information in a unified and theoretically sound
way. We used the data from the TREC 2005 expert search task and the data
from an industrial research lab to verify the effectiveness of our
proposal. Our experimental results show that the two-stage model can
significantly outperform the profile-based method.
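
To make the profile-based baseline mentioned above concrete, here is a
minimal sketch that ranks people purely by co-occurrence with query
terms in documents. It is an illustration only, not the speaker's
two-stage model: the function name, the dictionary layout ("text",
"people") and the whitespace tokenization are all assumptions made for
this example.

```python
from collections import Counter


def cooccurrence_expert_scores(query_terms, documents):
    """Rank people by how often query terms occur in documents that mention them.

    `documents` is a hypothetical structure: a list of dicts with a
    "text" string and a "people" list. Nothing here comes from the talk.
    """
    query = {term.lower() for term in query_terms}
    scores = Counter()
    for doc in documents:
        tokens = doc["text"].lower().split()
        # Number of query-term occurrences in this document.
        hits = sum(1 for token in tokens if token in query)
        if hits == 0:
            continue
        # Every person associated with the document gets credit for the hits.
        for person in doc["people"]:
            scores[person] += hits
    # Highest co-occurrence count first.
    return scores.most_common()


docs = [
    {"text": "ranking svm for retrieval", "people": ["alice"]},
    {"text": "svm kernels and svm margins", "people": ["alice", "bob"]},
    {"text": "cooking pasta", "people": ["carol"]},
]
ranking = cooccurrence_expert_scores(["svm"], docs)
```

A baseline like this uses only one kind of association (people and
terms in document text), which is the limitation the abstract says the
proposed approach addresses by adding further association types.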
Learning to rank is an important topic in document retrieval. One
approach to the task is to formalize the problem as ordinal
regression. 'Ranking SVM' is such a method. We point out that there
are two factors one must consider when applying ordinal regression to
document retrieval. First, correctly ranking documents at the top is
crucial for an IR system. One must conduct training in a way such that
the top-ranked results are very accurate. Second, the number of
relevant documents can vary from query to query. One must avoid
training a model biased toward queries with many relevant documents.
Previously, when existing methods including Ranking SVM were applied
to document retrieval, neither of the two factors was taken into
consideration. In our work, we demonstrate that it is possible to
define a new loss function for document retrieval. The loss function
is a natural extension of the conventional 'Hinge Loss' used in
Ranking SVM. With the new loss function, we can overcome the drawbacks
which plague Ranking SVM. We employ two optimization methods to
minimize the loss function: gradient descent and quadratic
programming. Experimental results show that our method can outperform
Ranking SVM and other existing methods for document retrieval on two
data sets.
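
As a rough illustration of the two factors discussed above, the sketch
below computes a pairwise hinge loss of the kind used in Ranking SVM
(without the regularization term) and adds two optional knobs: extra
weight on pairs involving a query's top-graded documents, and
per-query normalization by the number of pairs. The abstract does not
give the actual loss function, so the parameter names `top_weight` and
`normalize_per_query` and the exact weighting scheme are assumptions
made for this example.

```python
def pairwise_hinge_loss(w, queries, top_weight=1.0, normalize_per_query=False):
    """Pairwise hinge loss for a linear ranking function f(x) = w . x.

    `queries` is a list of queries; each query is a list of
    (feature_vector, relevance_grade) tuples. With the default
    arguments this is the plain pairwise hinge loss. The two optional
    arguments are hypothetical and only illustrate the two factors.
    """
    total = 0.0
    for docs in queries:
        max_grade = max(grade for _, grade in docs)
        # All ordered pairs where the first document is more relevant.
        pairs = [(xi, xj, gi)
                 for xi, gi in docs
                 for xj, gj in docs
                 if gi > gj]
        if not pairs:
            continue
        query_loss = 0.0
        for xi, xj, gi in pairs:
            # Score difference f(xi) - f(xj) for the linear model.
            margin = sum(wk * (a - b) for wk, a, b in zip(w, xi, xj))
            # Factor 1: emphasize pairs whose better document is top-graded.
            pair_weight = top_weight if gi == max_grade else 1.0
            query_loss += pair_weight * max(0.0, 1.0 - margin)
        # Factor 2: keep queries with many pairs from dominating training.
        if normalize_per_query:
            query_loss /= len(pairs)
        total += query_loss
    return total


q1 = [([2.0], 2), ([0.0], 0)]
q2 = [([0.5], 1), ([0.0], 0)]
q3 = [([0.0], 1), ([0.0], 0), ([0.0], 0)]
plain = pairwise_hinge_loss([1.0], [q1, q2])
```

With `normalize_per_query=True`, a query with many document pairs
contributes an average rather than a sum, so it cannot outweigh a
query with few pairs simply by size.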
BIODATA:
Hang Li is a researcher and project leader at Microsoft Research Asia.
He is also an adjunct professor at Peking University, Xian Jiaotong
University and Nankai University. His research interests include
statistical language learning, natural language processing,
information retrieval, and data mining. He earned a PhD in computer
science from the University of Tokyo. Hang has many publications in
international journals and conferences. He is on the editorial boards
of 'Journal for Computer and Science Technology' and 'Computational
Linguistics and Chinese Language Processing'. His recent academic
activities include serving as area chair of ACL'05 and program
committee member of IJCAI'05. Hang has been working on the development
of several text mining systems and tools. These include NEC
TopicScope, the Microsoft internal tool TextMiner, Microsoft SQL
Server Text Mining, and Office 2006 metadata extraction.