LOCKSS Mini Tutorials

 

Crawl Rules

These rules define the boundaries of an AU (Archival Unit) within the journal's web site. An AU is normally a year's run or a volume of the journal.

  • In the LOCKSS plugin definer window, click the ‘…’ button beside Crawl Rules. A new window, shown below, pops up.

  • In this Crawl Rule editor, enter your rules for crawling. Some familiarity with regular expressions is helpful. Add a new rule by clicking the ‘Add’ button, then click the pattern ‘NONE’ beside the action ‘Include’.

  • A new window, shown below, pops up and lets you define the crawl rule. Select BASE_URL from the combo box and click Insert Parameter. Then, as before, type the String Literal (the exact location of the publisher’s manifest page) in the dialog box and save the changes.

  • Now let us see how to write rules for a more complex structure containing volume numbers. For Virginia Libraries (http://scholar.lib.vt.edu/ejournals/VALib/), consider the PDF file located at http://scholar.lib.vt.edu/ejournals/VALib/v50_n2/v50n2.pdf.

  • In this PDF location, the BASE_URL is followed by a series of characters that can be represented with a regular expression. First, select BASE_URL from the combo box and insert the parameter.

  • Second, enter the String Literal ‘v’, then insert the Volume Number parameter. (When you insert the Volume Number, a new window pops up asking for the padding value. Click OK to accept the default padding value of ‘0’.)

  • Third, continue the rule with an underscore (‘_’) followed by ‘n’ as a String Literal. After inserting the String Literal, select ‘Any Number’ from the combo box and insert the match. [0-9]+ now appears in the window; this is the regular expression for Any Number. It matches one or more digits, so the ‘_n’ can be followed by any non-negative integer, such as 2 or 12.

  • Finally, the String Literal ‘/v’, the Volume Number parameter, the String Literal ‘n’, and an ‘Any Number’ match must be specified again, because the complete URL is http://scholar.lib.vt.edu/ejournals/VALib/v50_n2/v50n2.pdf. Then enter the String Literal ‘.pdf’ to specify the file type to be crawled. You will notice that a backslash is inserted before the dot. In a regular expression an unescaped dot matches any character, so the backslash is needed to make it match a literal dot.

  • Now continue creating the other rules needed to complete the Crawl Rules section of the plugin. The final Crawl Rule template editor should look like the figure shown below.

  • In the last rule shown above, we have added a $ to the end of the pattern, which matches the end of a string. If this $ is omitted, the locations of all the files would be fetched irrespective of the volume number given during the test run.
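
To see how the pieces of the PDF rule fit together, the steps above can be sketched in Python. This is only an illustration of the regular expression that the steps describe, not the way the Plugin Definer stores or evaluates rules. The helpers build_rule and matches are hypothetical names, the BASE_URL value is taken from the Virginia Libraries example, and the default padding of ‘0’ is assumed to mean that the volume number is not zero-padded.

```python
import re

# Assumed value of the BASE_URL parameter, taken from the Virginia Libraries example.
BASE_URL = "http://scholar.lib.vt.edu/ejournals/VALib/"


def build_rule(base_url, volume, anchored=True, escape_dot=True):
    """Assemble the pattern piece by piece, in the order the steps describe.

    Hypothetical helper for illustration only; it is not part of LOCKSS.
    """
    vol = str(volume)                    # Volume Number, assumed not zero-padded
    dot = r"\." if escape_dot else "."   # literal dot vs. "any character"
    pattern = (
        re.escape(base_url)              # BASE_URL parameter, matched literally
        + "v" + vol                      # String Literal 'v' + Volume Number
        + "_n" + "[0-9]+"                # String Literal '_n' + Any Number
        + "/v" + vol                     # String Literal '/v' + Volume Number
        + "n" + "[0-9]+"                 # String Literal 'n' + Any Number
        + dot + "pdf"                    # String Literal '.pdf'
    )
    # The trailing $ requires the URL to end right after ".pdf".
    return pattern + "$" if anchored else pattern


def matches(pattern, url):
    """Return True if the pattern matches starting at the beginning of url."""
    return re.match(pattern, url) is not None


if __name__ == "__main__":
    rule = build_rule(BASE_URL, 50)
    print(rule)
    for url in (
        BASE_URL + "v50_n2/v50n2.pdf",      # the example PDF
        BASE_URL + "v50_n12/v50n12.pdf",    # [0-9]+ accepts more than one digit
        BASE_URL + "v49_n2/v49n2.pdf",      # a different volume
        BASE_URL + "v50_n2/v50n2.pdf.bak",  # extra text after ".pdf"
    ):
        print(matches(rule, url), url)
```

In this sketch the first two URLs match the rule for volume 50 and the last two do not. Calling build_rule with anchored=False lets the ‘.pdf.bak’ URL match, and escape_dot=False lets a name such as ‘v50n2xpdf’ match, which is why both the $ and the backslash before the dot matter in plain regular expression matching.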

Copyright © 2006 Kamini Santhanagopalan, Virginia Tech