Supervised Categorization of JavaScript using Program Analysis Features

Paper

This is the homepage for the paper "Supervised Categorization of JavaScript using Program Analysis Features", which appeared in the 2nd Asian Information Retrieval Symposium (AIRS 05). A copy of the paper can be downloaded from here.


Presentation

Presentation slides can be found here.


Dataset

Our dataset is derived from the standard WT10G corpus. More details on WT10G can be found on the TREC website and on the CSIRO website. Anyone wishing to use this corpus must sign an individual license agreement before proceeding.

A complete dataset containing 18,683 documents can be downloaded from here. You can also download these documents grouped by servers from here.

We have selected 1,637 instances (also referred to as "JavaScript units") from these documents for our experiments. This is the list of the units. All the JavaScript units are constructed dynamically by our system. Each line of the list file looks something like this:

SERVER1168_WTX098-B36-310_N_0_0$form_validation $

This line means the following: the JavaScript unit was extracted from a document named "WTX098-B36-310" in the WT10G corpus, and it is associated with the server whose ID is "SERVER1168". It is an interactive unit ("N": interactive unit, "D": non-interactive unit) with unit ID "0" (the following digit "0" is for verification purposes). The unit is annotated with the class "form validation".
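
The field layout described above can be read mechanically. The following is a minimal sketch, not part of the distributed system: the class name UnitListEntry, its field names, and the regular expression are hypothetical, and the sketch assumes that the server ID contains no underscore and that the class label sits between the two "$" characters, as in the sample line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the distributed system) for one line of the unit list. */
public final class UnitListEntry {
    // Assumed layout, inferred from the sample line only:
    //   <server>_<document>_<N|D>_<unitId>_<verification>$<label> $
    private static final Pattern LINE = Pattern.compile(
            "([^_]+)_(.+)_([ND])_(\\d+)_(\\d+)\\$([^$]*?)\\s*\\$\\s*");

    public final String serverId;      // e.g. "SERVER1168"
    public final String documentName;  // e.g. "WTX098-B36-310"
    public final boolean interactive;  // true for "N", false for "D"
    public final int unitId;           // unit ID
    public final int verification;     // digit kept for verification purposes
    public final String label;         // annotated class as written in the file

    private UnitListEntry(String serverId, String documentName, boolean interactive,
                          int unitId, int verification, String label) {
        this.serverId = serverId;
        this.documentName = documentName;
        this.interactive = interactive;
        this.unitId = unitId;
        this.verification = verification;
        this.label = label;
    }

    /** Parses one line; throws IllegalArgumentException if it does not fit the assumed layout. */
    public static UnitListEntry parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized unit line: " + line);
        }
        return new UnitListEntry(
                m.group(1),
                m.group(2),
                "N".equals(m.group(3)),
                Integer.parseInt(m.group(4)),
                Integer.parseInt(m.group(5)),
                m.group(6));
    }

    public static void main(String[] args) {
        UnitListEntry e = parse("SERVER1168_WTX098-B36-310_N_0_0$form_validation $");
        System.out.println(e.serverId + " | " + e.documentName + " | interactive=" + e.interactive
                + " | unit=" + e.unitId + " | class=" + e.label);
    }
}
```

The label is kept exactly as it appears in the file (with the underscore), so mapping "form_validation" to the class name "form validation" is left to the caller.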


System

You may construct JavaScript units using our system. The system is written in Java and is released under the GNU license. You can download the source code from here (updated 7th July). You will need to recompile the system yourself. Sample files showing how to use the system can also be found in the source package.

The system contains partially modified code from Rhino 1.6 and HtmlUnit 1.4 (please read their license agreements before proceeding). The modified code is included in the system, so you should NOT download the original code. Some external libraries are also required. You may also find Weka helpful for evaluation tasks.

The system was developed and tested with J2SDK 1.5.0.

Required external libraries (this list may be incomplete):

jaxen-1.0-FCS-full.jar XPath support
saxpath-1.0-FCS.jar XPath support
commons-collections-3.1.jar Collection classes
commons-lang-2.0.jar Core Language Utilities
commons-httpclient-3.0-rc2.jar Provides the actual HTTP support
commons-codec-1.3.jar Provides common encoders
xercesImpl-2.6.2.jar XML Parser
xmlParserAPIs-2.2.1.jar XML Parser
nekohtml-0.9.4.jar Converts HTML into an XML DOM model
commons-logging-1.0.4.jar Logging support
commons-io-1.0.jar IO utilities

This page is maintained by LU Wei (Mr.). Last updated: Thu, July 07, 2005, 00:50:17