WING paper reading
2004/07/30
 
Linear Text Segmentation using a Dynamic Programming Algorithm
Athanasios Kehagias, Fragkou Pavlina, Vassilios Petridis
EACL 03

- Also see: http://citeseer.ist.psu.edu/580132.html, and http://acl.ldc.upenn.edu/eacl2003/html/main.htm, and their Int'l J of Intelligent Systems paper
- Relevant to: text segmentation, partition product models.
- Printed, highlighted and filed

A linear segmentation approach that takes into account within-segment cohesion (they call this homogeneity) as well as global information about segment length.  It characterizes global segment length as a normal distribution with parameters (mu, sigma), and uses a balancing parameter, gamma, to weight the two sources of evidence.
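To make the idea concrete for myself, here is a minimal sketch of a dynamic program that trades off a length penalty against within-segment cohesion.  This is my own reading of the approach, not the paper's exact formulation: the binary "shares a word" similarity, the form of the cost, and the names `segment` and `seg_cost` are all assumptions for illustration.

```python
from math import inf
from typing import List, Set


def segment(sentences: List[Set[str]], mu: float, sigma: float,
            gamma: float = 0.2, r: float = 1.0) -> List[int]:
    """Return the (exclusive) end index of each segment.

    Assumed cost per segment (lower is better):
        gamma * (len - mu)^2 / (2 sigma^2)  -  (1 - gamma) * within / len^r
    where `within` counts sentence pairs in the segment that share a word.
    """
    n = len(sentences)
    # sim[i][j] = 1 if sentences i and j share at least one word (assumption)
    sim = [[1 if sentences[i] & sentences[j] else 0 for j in range(n)]
           for i in range(n)]
    # 2-D prefix sums so each block sum is O(1)
    pre = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            pre[i + 1][j + 1] = (sim[i][j] + pre[i][j + 1]
                                 + pre[i + 1][j] - pre[i][j])

    def seg_cost(a: int, b: int) -> float:
        """Cost of one segment covering sentences a .. b-1."""
        length = b - a
        within = pre[b][b] - pre[a][b] - pre[b][a] + pre[a][a]
        length_penalty = (length - mu) ** 2 / (2 * sigma ** 2)
        cohesion = within / length ** r
        return gamma * length_penalty - (1 - gamma) * cohesion

    # best[b] = minimum cost of segmenting the first b sentences
    best = [0.0] + [inf] * n
    back = [0] * (n + 1)
    for b in range(1, n + 1):
        for a in range(b):
            cost = best[a] + seg_cost(a, b)
            if cost < best[b]:
                best[b], back[b] = cost, a

    # Follow back-pointers to recover the segment ends
    ends = []
    b = n
    while b > 0:
        ends.append(b)
        b = back[b]
    return ends[::-1]


# Toy input: three "cat" sentences followed by three "dog" sentences
toy = [{"cat", "sat"}, {"cat", "ran"}, {"cat", "ate"},
       {"dog", "barked"}, {"dog", "slept"}, {"dog", "dug"}]
result = segment(toy, mu=3, sigma=1, gamma=0.2, r=1.0)
```

The point of the sketch is that the search over all segmentations is global (every start/end pair is considered), while gamma decides how much the length prior is allowed to override the cohesion evidence.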

They experimented with Choi's dataset (Choi consolidated much earlier work on segmentation) and used Beeferman's Pk to rate the algorithm's success.  Note that Pk penalizes segmentation mistakes near true boundaries much less than other mistakes (definitely a good thing).
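A small sketch of how I understand Pk, to check the near-miss behaviour.  The representation (one segment id per unit), the default window of half the average reference segment length, and the function name `pk` are my assumptions here, not details taken from the paper.

```python
from typing import List, Optional


def pk(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Fraction of width-k windows where ref and hyp disagree on whether
    the two window ends fall in the same segment.

    ref and hyp give a segment id for each unit, e.g. [0, 0, 0, 1, 1, 1].
    """
    n = len(ref)
    if len(hyp) != n:
        raise ValueError("ref and hyp must have the same length")
    if k is None:
        # Assumed default: half the average reference segment length
        num_segments = 1 + sum(ref[i] != ref[i + 1] for i in range(n - 1))
        k = max(1, round(n / num_segments / 2))
    if n - k <= 0:
        raise ValueError("window is too large for this text")
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)


reference = [0] * 6 + [1] * 6   # true boundary after unit 6
near_miss = [0] * 7 + [1] * 5   # boundary placed one unit late
no_boundary = [0] * 12          # boundary missed entirely

pk_exact = pk(reference, reference)
pk_near = pk(reference, near_miss)
pk_none = pk(reference, no_boundary)
```

On this toy input the near miss scores better (lower) than missing the boundary altogether, which is the behaviour noted above.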

As this approach has a model component that relies on the global text segment length information, it is important to see whether its performance could be an incidental result of using a training/testing corpus that is rather uniform.  From the data in Table 1 it appears that this might be the case, as the average segment length is quite uniform in sets 1, 2 and 3.

From the results, the optimal gamma is set quite low, between .08 and .4, indicating that the segment length model is used as a refinement to the stronger cohesion/homogeneity component.

Two other results of the paper are interesting and need to be investigated in more depth.  One is that the generalized density (parameter r) seems to improve performance.  Why this might be isn't immediately obvious to me.  The other is their observation that segment length is better modeled in words than in sentences.  This intuitively makes sense to me, but again, it's not obvious why sentences perform significantly worse.

Finally, the authors end with a memorable sentence that should be considered: Choi uses local optimization of global information, and Heinonen uses global optimization of local information.

Relevance:
- how do we adapt this segmentation algorithm to do hierarchical segmentation for documents on the web?
- what exactly is the role of a PPM (partition product model) in the algorithm?


 
An Experiment Using Coordinate Title Word Searches
Frederick G. Kilgour
JASIST 51(1):74-80, 2004

- Relevant to: Known Item queries, query rewriting
- Printed and Filed
- Available at: LINC

This paper goes into historical detail on past retrieval studies of known-item queries.  Kilgour investigates known-item query studies from the era of card catalogs.  Some notable results distilled from this survey of earlier work are useful for our current study of known-item queries: Tagliacozzo et al. (1970) note that users had a higher likelihood of having correct title information than correct author information.  Also, title searches are more common in today's OPACs than in the older card catalog systems, although I concur with Kilgour that this is largely an artifact of the card catalog systems having only limited title entries.


2004/07/29
 
Known-Item Online Searches Employed by Scholars Using Surname plus First, or Last, or First and Last Title Words
Frederick G. Kilgour
JASIST 52(14):1203-1209 2001.

- Available through LINC
- Printed and Filed
- Relevant to: known item query project

Part of a long series of (I think six) articles concerning the retrievability of book titles in OPACs using various approaches.  These are known-item searches.  Kilgour and his colleagues are trying to identify and prescribe a useful pattern for performing known-item searches.

This paper redoes an earlier experiment in which Kilgour used a normal keyword search with "surname plus first and last title words (not including stop words)" to retrieve books.  The main finding suggests that in 98% of the cases in which a monograph (single author's work) is sought, the surname plus title words query retrieves the item's record (if it does exist in the catalog / database).
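As a note to self, the query pattern can be sketched as below.  The stop list, the tokenization, and the function name `build_query` are my own assumptions for illustration (Kilgour's actual stop list and system syntax are not reproduced here), and the example title is made up.

```python
import re
from typing import List

# Hypothetical stop list, for illustration only
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "with",
              "and", "for", "to", "by", "or"}


def build_query(surname: str, title: str) -> List[str]:
    """Return [surname, first title word, last title word], skipping stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower())
             if w not in STOP_WORDS]
    terms = [surname.lower()]
    if words:
        terms.append(words[0])
        if words[-1] != words[0]:
            terms.append(words[-1])
    return terms


# Made-up author and title
query = build_query("Smith", "A History of the Card Catalog")
```

Here `query` holds the surname followed by "history" and "catalog"; a title with a single content word contributes only that one word.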

In this later experiment, Kilgour uses limits on the fields in which the words can be matched by using MARC field restrictions.  The new experiments concur with the first, and do not show additional benefit.  As such there is little that is new in the experimental results. 

A limitation of the first experiment is that the work only examines monographs.  Kilgour addresses this by examining multiple authored / edited works.  Surprisingly it is shown in an exploratory experiment that the additional author surnames do not assist in retrieval (Table 5).
Kilgour does suggest, as I have also been musing about, that the search results and the record display (Kilgour uses the terms first and second screens) can be combined in certain cases.  It takes only a little bit of inference to see that known-item searches are such cases.

To do: would be good to look at our local LINC transaction logs to see how many of our queries match the prescribed patterns.

N.B.: Kilgour uses the NOTIS system, which is different from our local INNOPAC and from the DYNIX system in Slone's study.


2004/07/23
 
Encounters with the OPAC: On-line Searching in Public Libraries
Debra J. Slone

- Available from JASIS 51 (8):757-773, 2000
- Printed and Filed (with known-item project bibliography)
- Relevant to: Known Item searching, OPAC, query strategies, user studies.

This paper examines the searching strategies of library users by performing a study and questionnaire of 32 library patrons.  Slone examines three major types of library queries: area searches, known-item searches, and unknown-item searches. 

My summary only concentrates on the known-item searches, since that is the focus of our current research.  In the abstract, Slone writes that known-item searches experience "the most disappointment" and are characterised by "simplicity".

1. Finding results in the OPAC doesn't mean that the known-item query is satisfied. Slone shows that even in cases where a known-item search finds a resource, it may not be the desired one.  A question then is "how do we figure out what the proportion of correct answers is?"

2. Not satisfied with other material.  The patron wants this resource specifically and not others.  There is also dissatisfaction when the book is unavailable.  This leads to critical and negative opinions of the library OPAC.  My opinion: this suggests that OPACs that can infer that the current query is a known-item one should present circulation information right away (and suggest alternatives for finding another copy if one is not available, e.g. hold, ILL, substitution for an area search, alternative titles).

3. Known items may not be retrieved because of spelling errors.  Children in public libraries may be most affected by this.  Also spellings of author names.

4. Confidence and frustration levels differ for known-item vs unknown-item search.  Confidence levels are higher and frustration lower for known-item searching.




 
Scenario Management: An Interdisciplinary Approach
Matthias Jarke, X. Tung Bui and John M. Carroll

- Available at: Requirements Engineering (1998) 3:155-173
- Printed and Filed
- Relevant to: Scenario, Scenario Management

Examines three case studies of the use of scenario management: HCI, RE, and SM.  Not read in detail.  Posits four different views (facets) of the use of scenarios: form, content, purpose and life cycle views. 

 
Formal Approach to Scenario Analysis
Pei Hsia, Jayarajan Samuel, Jerry Gao, David Kung, Yasufumi Toyoshima and Cris Chen 
IEEE Software

- Available through LINC
- Printed and filed
- Relevant to: scenario analysis.

Discusses scenario engineering for system design.  Gives an example of a PBX switch.  An okay introduction to scenario analysis for an outsider.  Uses FSM notation to encode a scenario (that is the "formal" side of the paper).
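For my own reference, a tiny sketch of what encoding a scenario as an FSM could look like.  The states, events and the name `run_scenario` are hypothetical and chosen by me for a simple phone call; they are not taken from the paper's PBX example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transition table: (state, event) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("idle", "lift_handset"): "dial_tone",
    ("dial_tone", "dial_number"): "ringing",
    ("ringing", "callee_answers"): "connected",
    ("connected", "hang_up"): "idle",
    ("dial_tone", "hang_up"): "idle",
    ("ringing", "hang_up"): "idle",
}


def run_scenario(events: List[str], start: str = "idle") -> Optional[str]:
    """Replay a scenario (a sequence of events) through the machine.

    Returns the final state, or None if some event is not allowed
    in the state where it occurs.
    """
    state = start
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            return None
        state = TRANSITIONS[key]
    return state


completed_call = run_scenario(
    ["lift_handset", "dial_number", "callee_answers", "hang_up"])
invalid = run_scenario(["dial_number"])  # dialing before lifting the handset
```

A scenario is then just a path through the machine, and an event sequence the machine rejects flags a scenario that the specification does not cover.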
 
I had to read this for an MSc defense to get an introductory view of the area.




