Goal
Extract all author and affiliation names from an article
and save the relationships between them. Please check the Format section and follow the Guideline closely to complete the task.
Format
- All articles are in PDF format, so you will need a PDF reader to open them, for example Evince.
- All output files are plain text files, so any text editor is sufficient, for example Notepad++.
- For an article <file name>.pdf (see Example 1):
  - The list of authors and their affiliations is stored in <file name>.txt.
    - One author per line.
    - The colon character ":" is used as the field separator.
    - The first field is the name of the author. The following fields are the author's corresponding affiliations (there may be more than one).
    - No author is duplicated within an article.
  - The list of affiliations is stored in <file name>-aff.txt.
    - One affiliation per line.
    - If an affiliation appears more than once in an article (a duplicated affiliation), every occurrence must be included (see Example 2).
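The two output files described above can be produced with a short script. The following is a minimal Python sketch, not an official tool for this task: the function names (`format_author_line`, `parse_author_line`, `write_outputs`) and the sample names are hypothetical, and it assumes that no author or affiliation text itself contains a colon.

```python
from pathlib import Path


def format_author_line(name, affiliations):
    """Build one line of <file name>.txt: author first, then affiliations."""
    return ":".join([name, *affiliations])


def parse_author_line(line):
    """Split one line of <file name>.txt back into (author, [affiliations])."""
    name, *affiliations = line.rstrip("\n").split(":")
    return name, affiliations


def write_outputs(pdf_path, authors, affiliations):
    """Write <file name>.txt and <file name>-aff.txt next to the PDF.

    authors: list of (name, [affiliation, ...]) pairs, one per author.
    affiliations: affiliations as they appear in the article, with
        duplicates kept (one entry per occurrence).
    """
    names = [name for name, _ in authors]
    if len(names) != len(set(names)):
        raise ValueError("an author must not be listed twice for one article")

    pdf = Path(pdf_path)
    author_file = pdf.with_suffix(".txt")
    aff_file = pdf.with_name(pdf.stem + "-aff.txt")

    author_lines = [format_author_line(name, affs) for name, affs in authors]
    author_file.write_text("\n".join(author_lines) + "\n", encoding="utf-8")
    # Duplicated affiliations are written as-is, one line per occurrence.
    aff_file.write_text("\n".join(affiliations) + "\n", encoding="utf-8")
    return author_file, aff_file
```

For an article `paper.pdf`, calling `write_outputs("paper.pdf", authors, affiliations)` writes `paper.txt` and `paper-aff.txt` in the same directory.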
Guideline
- For extracting affiliations:
  - Remove the email address (if present).
  - Remove the phone number (if present).
  - Keep the affiliation as close as possible to the original text; however, converting some characters to ASCII is acceptable, e.g. université → universite.
- For extracting authors:
  - Remove titles and degrees from the name, e.g. PhD, MSc.
  - Keep the name as close as possible to the original text; however, converting some characters to ASCII is acceptable, e.g. daumé → daume.
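The cleanup rules above can be approximated in code as a first pass before manual review. This is a rough Python sketch under stated assumptions: the helper names, the regular expressions, and the title list are all hypothetical and deliberately simple. In particular, the phone pattern is crude and may also remove long digit runs such as extended postal codes, so the output should still be checked against the original text.

```python
import re
import unicodedata

# Assumed patterns, not part of the task definition.
EMAIL_RE = re.compile(r"\S+@\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# Only the two titles mentioned in the guideline; extend as needed.
TITLES = {"phd", "msc"}


def strip_accents(text):
    """Drop combining marks, e.g. an accented 'e' becomes a plain 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_affiliation(text, ascii_only=False):
    """Remove email addresses and phone numbers from an affiliation."""
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip(" ,;")
    return strip_accents(text) if ascii_only else text


def clean_author(name, ascii_only=False):
    """Remove title tokens such as 'PhD' or 'M.Sc.' from an author name."""
    kept = [
        token for token in name.split()
        if token.strip(",").replace(".", "").lower() not in TITLES
    ]
    name = " ".join(kept).strip(" ,")
    return strip_accents(name) if ascii_only else name
```

Accent stripping is optional here (`ascii_only=False` by default) because the guideline treats it as acceptable rather than required. Note that `strip_accents` only removes combining marks; characters without an accent decomposition are left unchanged.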
Download
- Download the sample collection here
- Download the full collection
- Download the clean collection
Example
Example 1

- List of authors and their corresponding affiliations
Example 2

Upload Tagged Data
Please upload the results as a zip file that uses the same name and directory structure as the file you downloaded.
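One way to build such an archive is sketched below in Python, using the standard `zipfile` module. The function name `zip_results` and the directory layout in the usage note are hypothetical. The sketch assumes the results live in a single top-level folder, and that the zip file is written outside that folder so it does not include itself.

```python
import zipfile
from pathlib import Path


def zip_results(result_dir, zip_path):
    """Zip result_dir, keeping its top-level folder name and subfolders."""
    result_dir = Path(result_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(result_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the archive starts
                # with the folder name itself, mirroring the original layout.
                arcname = path.relative_to(result_dir.parent).as_posix()
                archive.write(path, arcname)
    return zip_path
```

For example, if the downloaded archive unpacked to a folder named `sample`, then `zip_results("sample", "sample.zip")` would produce entries such as `sample/<subfolder>/<file name>.txt`.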