WING paper reading
2004/06/22
 
Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
Regina Barzilay and Lillian Lee

- Available at: http://www.cs.cornell.edu/home/llee/papers/textstruct.home.html.
Also see: http://www.ling.ohio-state.edu/~rytting/Barzilay_Lee_NAACL2k4.ppt
- Relevant to: text segmentation, topic ordering, sentence ordering, summarization, NLG, ATS, Hidden Markov Model (HMM)
- Printed and filed

Applies HMMs to topic modeling. Topics represent hidden states, and the observations correspond to text span (in the paper, sentences are used). Learns HMM structure from un-annotated documents (unsupervised). Applies model to information ordering (an NLG task) as well as automatic summarization, both with good results. Key points of their paper and algorithm given in first few pages of their paper.

The HMM construction model here uses an iterative approach which is different from the standard EM approach. The algorithm starts by performing complete-link clustering of the text spans into k clusters (k is user specified). Clustering is done using a complete-link approach that uses bigram similarity via cosine model. Once that is complete, clusters smaller than some size are lumped together to form an etcetera cluster; this is the cluster that contains all of the outlier sentences. Note that this step (clustering) does not take into account any positional information. As some clusters of size smaller than k are lumped the resulting number of clusters is less than k. They call the final number of clusters m.

From the clustering, the second step builds a content model (as an HMM). For each cluster in the m clusters, a corresponding state is built. The transition from the hidden state to the observation is based on smoothed counts from the cluster. This is rather straightforward; however, the interesting point is what is done for the etcetera cluster. Here they set the language model to be complementary to all of the other states (that are already modeled by the other clusters). The language model for the etcetera cluster ignores all information about the text spans in the cluster. We can think of this insertion state (etcetera cluster) as modeling all bigrams that are not well-modeled by the other clusters.

The transition between hidden states (the topic shift transitions) do take into account locality and sentence ordering. Here the transition probability is naturally modeled by the probability of seeing text spans from cluster a preceding cluster b. This probability gives a local (bigram in terms of text span) transition probability. Note that self-loops (keeping within the same topic) are a natural consequence of the model.

Finally, the model uses Viterbi re-estimation to have the content-model regenerate the clusters. Note here that the complete-link algorithm was only used to bootstrap the process, and is not re-used in subsequent iterations.

Weaknesses and observations:
- I'm not yet clear exactly how the re-estimation of the clusters is done. Can the number of clusters change from m in the subsequent iterations?

- k needs to be specified
- word bigram for sentences would seem to me to be pretty sparse. Surprised that this works well. Seems like that this method would need a back-off model to unigram probabilities for other types of data besides newswire (where it is well-known that AP/UPI feeds are often edited to create the final published article in different publications).
- the choice of k and the threshold t is crucial. With too high a k, the transition probability between topics in the second step will become too sparse.
- sentences directly before and after a topic shift may give a stronger indication of topic shift (which is currently ignored by their model). This point may push for a model in which multiple states represent a topic, as in NP chunking (e.g., topic_begin, topic_continue, topic_end).


2004/06/21
 
Hierarchically classifying documents using very few words
Daphne Koller and Mehran Sahami
ML 97

- Available at: robotics.stanford.edu/~koller/papers/ml97.html
- Printed and filed
- Relevant to: text classification, hierarhical text classification, clustering, naive bayes, KDB, Reuters-22173, meURLin project

Discusses text classification with a subset of word features. Instead of using all word features for classification, they use a Zipfian filter to remove very frequent and very infrequent words before sending to the bayesian classifier. They experiment with both Naive Bayes and KDB (which models some limited salient inderdependencies between features, unlike Naive Bayes).

The paper discusses hierarchical text classification, in which they induced a hierarchy from the Reuters dataset. Each level in the classification employs separate classifiers (one for each path in the parent. Comparison to naive Bayes shows that NB actually does quite well on this task. The incremental gains by using stronger models of Bayesian networks is signficant, especially in the hierarchical classification model.

Observations:
- topic hierarchy has many documents, nodes are well-populated. Doesn't really discuss whether documents are well-distributed among the nodes in the tree.

Weaknesses:
- topic hierarchy of the paper induced from a flat classification, not really enough details here, and the hierarchy is artificially created. we might expect better performance on categories that are actually constructed by hand or at least mutually exclusive.

2004/06/18
 
Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum and Ronald Rosenfeld
AAAI 1999 Workshop on Machine Learning in Information Extraction

- Printed and Filed
- Relevant to: citation parsing project, hidden markov models (HMM), structure learning, header parsing, information extraction (IE)
- Available at: www.informatik.uni-freiburg.de/ ~ml/teaching/ws03/pll/hmm_ie.pdf, also http://www.informatik.uni-freiburg.de/~ml/teaching/ws03/pll/hmm_ie.ppt

Discusses how to learn the HMM structure from data. Proposes two models for inducing the structure from a maximally specific model (in which each training instance corresponds to a separate path in the HMM model). Neighbor-merging collapses transitions between adjacent two states that have the same hidden label. V-merging collapses two parallel states with the same label if they share transitions from or to common states. See the presentation above for details and Stolcke '94. Both merging techniques can be applied iteratively until the topology of the HMM does not change further.

Another contribution of this paper is that of the integration of labeled with distantly labeled training data. They show that the combination of distantly-labeled and labeled training data is helpful but using an interpolated model using mixture weights. The straightforward application of using the distantly-labeled model proves not to work in their case.

They also examined re-estimation of the transition probabilities based on EM (Baum-Welch) using the unlabeled instances. This does not lead to improvements. They have explained it by saying that the process maximizes the likelihood of the unlabeled data, but not the classification accuracy, a shortcoming of the HMM.

This paper is background to the group's later work with CRFs and gives a basis for understanding their experimental results on which they compare with.

In the conclusion, they propose an internal model structure for certain states that need duration modelling. Contrast this with the approach for NP chunking with NP_start, NP_end, NP_cont and NP_island states, as well as the approaches for duration modeling discussed in Durbin et al.'s book on biological sequence analysis.



2004/06/15
 
Enhancing Maintainability of Source Programs Through Disabbreviation
Kari Laitinen and Jorma Taramaa and Marukku Heikkila and Neil C Rowe
Journal of Systems Software, 1997, 37:117-128

- Available at: library
- printed and filed.
- relevant to: javascript software categorization project, abbreviation expansion, abbreviation, compaction, non-standard words (nsw)

Discusses the disabbreviation (expansion) of variable names in computer programs. Uses orthographic and space characters to do most splitting; the second part learns from user given substitutions, using simple rules similar to the meurlin system; the third part uses both 1-line and multiline comments to yield possible expansions for variables.

Strengths and observations.
- Interesting observation that morphology in computer variables is limited.
- Argues for a domain-specific (programming-specific) dictionary.


 
Abbreviated Text Input
Stuart Shieber and Ellie Baker
IUI 2003

- Relevant to: SMS input project, text compaction, abbreviation expansion, non-standard words, HCI
- Available at: http://www.iuiconf.org/03pdf/2003-001-0064.pdf
- Filed

Discussed duality between compacted and full text (they use the term compressed for compaction). Their word abbreviation system uses vowel dropping (except for in the initial position. The results show that this would be unacceptable (3% error rate) for input of SMS messages. Argues for HCI-based evaluation of
Points to Goodenough-Trepanier and Rosen's work on slowness/unacceptability of prediction methods. Has a useful and good primer bibliography on this work.

Strengths: framework uses a FST model, which is easy to compose. Also pays particular attention to HCI issues.

Weaknesses:
- corpus studied is newswire, not actual input text that people tend to use for their domain (PDA).
- how to effectively handle OOV and other NEs.


2004/06/14
 
The Phrasal Verb in English
Dwight Bolinger
Harvard University Press, 1971

- Available at: NUS Central Library PE 1319 Bol
- Relevant to: phrasal verbs, verb-particle constructions

Treats phrasal verbs (VPCs) in terms of syntax. Examines 9 criteria for their differentiation, especially the definite noun phrase test (a separate chapter, #5). Discusses the semantic features of the particles associated with such phrasal verbs and how they relate to aspect (chapter 8). A key resource for reference in papers.

I re-type the nine criteria for phrasal verbs as listed in Chapter 1 for quick reference:
  1. replaceability by a simple verb
  2. If transitive, the combination should passivize.
  3. If transitive, the combination should yield an action nominal.
  4. If transitive, the particle can either precede or follow the noun object.
  5. If transitive, pronouns usually precede the particle.
  6. Adverbs cannot intervene between the verb and the particle unless the latter appears in its most literal sense.
  7. An adverb can be accented.
  8. If transitive, the particle can precede a simple definite noun phrase (a proper name or "the" plus a common noun) without taking it as its object.
  9. Phrasal verbs can be defined by simply listing them.

Note for #9 he doesn't suggest that such a list could be exhaustive, rather, he notes that the list is continually being added to.


2004/06/10
 
Deletion of Support Verbs During Abstracting
Choy-Kim Chuah
SNLP-O-COCOSDA 2002

- Relevant to: support verbs, abstracting, light verbs, summarization
- Printed and filed
- Available at: http://utmk.cs.usm.my/pdf/snlp2002.pdf

A manual study of 57 scientific articles. Studied correspondence between abstracts and full-text sentences. Found corresponding sentences that expressed similar semantic content. Analyzed these sentences in terms of support verb and other verbal phenomenon.

Found the case that more support verbs were deleted than added during the abstraction process. Substitutions of support verbs came out to be troponyms occasionally.

Weaknesses:
- sample a bit too small to make extensive conclusions

Extensions:
- generative model of abstraction process (a la Jing) in terms of support verb deletion/insertion/substitution
- use paraphrase model to locate support verb and nominal pairs (see Grefenstette and Teufel 1995)
2004/06/08
 
Automatic Identification of Support Verbs: A Step Towards a Definition of Semantic Weight
Mark Dras
partially from IJCAI 1995

- Available at: arxiv.org/abs/cmp-lg/9510007
- Relevant to: support verbs, light verbs, semantic weight, lexical density

Lexical density = how much meaning per word (after Halliday 1985)
Closed class words have low semantic weight, open class words have high(er) semantic weight. Points out it is really a cline: support verbs and light verbs come somewhere in between.

Dras builds on Grefenstette and Teufel (1995) to automatically identify support verbs for nominalizations. Dras incorporates "global information": that "make", "have" and "do" are verbs that often indicate support / lightness. Grefenstette and Teufel (1995) on the other hand, use only "local information", i.e., co-occurrence information on the target verb-noun word itself.

The author would like to weight verbs such as "make", "have" and "do" as more likely to be support verbs and add this as evidence to the local information compiled by Grefenstette and Teufel.

Weaknesses:
- this paper does not show exactly how the formulas for global information was calculated
- also does not show how the formulas for global and local information are combined
 
Corpus-based Method for Automatic Identification of Support Verbs for Nominalizations
Gregory Grefenstette and Simone Teufel
ACL 95

- Available at: arxiv.org/abs/cmp-lg/9503010
- Relevant to: verbs, light verbs, support verbs, nominalizations
- Printed and filed

Presents a method for finding support verbs in a corpus. A support verb is a verb "to be used with a nominalized predicate structure": e.g., have a look, take a look. Their technique is spelled out on page 3 of the text. Basically, we:

1. preprocess a corpus to find all sentences that have an occurrence of either a noun/verb forms of a target word (e.g., "propose")
2. parse them to get syntactic trees
3. for noun uses, extract cases in which the target word was used as a direct object to a verb (e.g., "make a proposal"). Calculate the occurrences of these verbs over a large corpus
4. rank the verb occurrences to recover the candidate support verbs.

Problem: the approach conflates concrete uses (reject a proposal) and nominalized (make a proposal) uses which causes problems. Need to differentiate one from the other. This is the problem addressed in the second half of the paper. They do this by claiming that "nominalizations (versus concrete uses) retain some of the argument/adjunct structure of the verbal predicate". That is nominal uses retain some of the same structure as the original verbal ones. So they examine the most common prepositions of the noun forms and use them to look for verbal forms with the same prepositions as well.
Compare "make a decision in an hour" and "decide in an hour" both take the same preposition.

Also examines in the related work the idea of gradedness (or "cline") of different sorts of nominalizations, including verbal nouns and deverbal nouns. Deverbal nouns "are defined as records of the action having taken place rather than as description of the action themselves" (verbal nouns). E.g., "the advice" (deverbal) w.r.t. "the advising" (verbal).


2004/06/04
 
Accelerated Focused Crawling through Online Relevance Feedback
Soumen Chakrabarti, Kunal Punera, Mallela Subramanyam
WWW 2002

- Available at: http://citeseer.ist.psu.edu/chakrabarti02accelerated.html
- Relevant to: focused crawling, crawling projects

Introduction of reinforcement learning to crawling. The classifier that prioritizes webpages on the frontier is now cut up into two parts. The first part, the apprentice, calculates the relevance of a link on a visited page towards finding another relevance resource. The second part, the critic, provides training examples to try to get in an online fashion. The critic provides an online answer to what content is desired and the apprentice builds a model for how that content is going to be found.

The author's version of the apprentice takes into account certain types of features from the DOM tree of the web page. Their system simplifies the DOM tree and uses co-citation (and some other link structure attributes) in its inventory of features for crawling. They use a naive bayes learner, which seems to be sufficient (with small caveats).

Their experiments ran on the web, so recall not really possible. Use loss (wasted bandwidth) to measure effectiveness.

BUG: Supposedly their crawler is available.


 
The shark-search algorithm - an application: tailored web site mapping
Michael Hersovicia, Michal Jacovia Yoelle S. Maareka, Dan Pellegb Menachem Shtalhaima, and Sigalit Ura
WWW 7, 1998.

- Available at: http://www7.scu.edu.au/programme/fullpapers/1849/com1849.htm
- Relevant to: focused crawling, crawling

Enhancement of basic "Fish search" algorithm by De Bra et al. (1994). De Bra's work seems to have introduced priority queues for crawling. The shark-search allows relevance to be calculated probabilistically, rather than by using the Boolean scores used in Fish search. Relevance scores are inherited from parent nodes.

A second contribution is the use of anchor text in doing this relevance score calculation.

The authors claim that these two improvements suffice to throttle the search effectively such that the width parameter in the original fish search is unnecessary.

Both algorithms implemented, evaluated in the context of the Mappuccino site mapping tool (IBM's product)

Weaknesses:
- not really on topic for focused crawling for structural cues.





 
Understanding Interleaved Code
Rugaber S., Stirewalt K., Wells L.
Journal Automated Software Eng.

-Available at: http://www.cc.gatech.edu/reverse/repository/jase.ps
- Relevant to: javascript program tracing project.
- Printed and filed.

Defines interleaving of code and discusses conditions for why it happens. Does not really discuss how to detect this. Not particularly relevant for our projects.
 
Automatic Extraction of Opinion Propositions and their Holders
Bethard S. et al.
AAAI Spring Symposium 2004

- Available at: http://www.stanford.edu/~jurafsky/SS404BethardS.pdf
- Relevant to: Opinion QA, QA, use of Propbank, EVCA, FrameNet
- Printed, commented, and filed.

Attempts opinion identification, and subsequent identification of opinion clause and holder of opinion. Uses both a one-shot (one tier) approach as well as a 2-tier approach. The two-tier approach tries to identify propositions first and then determine whether they are opinions or not. Approach gets about .5 F1.

Uses opinion word list coming from Wiebe and Levin augmented using WordNet. This consist of strong and weak opinion words. Manually-annotated data was selected from FrameNet and PropBank. The trained algorithm then passes judgment on each constituent in a parse tree to decide whether an opinion is detected or not.

Has an interesting feature ADJP, which looks for complex ADJ phrases. Reported that this ADJP feature does extremely well, although the analysis is not detailed at this point.

Weaknesses:
- doesn't really say what strong and weak words are.
- why does ADJP work well.
- does this generalize outside of the news domain?
 
Focused Crawling Using Context Graphs
Diligenti M., Coetzee F., Lawrence S., Giles C., Gori M.
VLDB 00

- Available at: http://citeseer.ist.psu.edu/diligenti00focused.html
- Relevant to: focused crawling, structural crawling.
- Printed and filed.

Uses google to create backlink neighbors around target page(s). All pages in each layer in backlink neighborhood is used to create a layer model using TFIDF vector space model.

Once the training is completed the models are used to crawl. Given a starting page, all outgoing links are fetched and classified by the classifier into the layers. The page with the least layer number is used as the next page to crawl.

Weaknesses:
- all outgoing links are fetched, why not fewer?
- training requires fetching of large neighborhood

Powered by Blogger