Crawl Rules
These rules define the boundaries of an AU in the journal's web site.
An AU is normally a year's run or a volume of the journal.
- In the LOCKSS plugin definer window, click on the ‘…’
beside Crawl Rules. A new window opens, as shown below.
- In this Crawl Rule editor, enter your rules for crawling. Some
familiarity with regular expressions is helpful.
Add a new rule by clicking on the ‘Add’ button. Then click
on the pattern ‘NONE’ beside the action ‘Include’.
- A new window opens, as shown below, in which you enter or
define the crawl rule. Select BASE_URL from the combo box and choose
Insert Parameter. Then, as before, type the String Literal (the exact
location of the publisher’s manifest page) in
the dialog box and save the changes.
- Now, let us see how to write rules for a more complex structure containing
volume numbers. On the Virginia Libraries site (http://scholar.lib.vt.edu/ejournals/VALib/),
consider the PDF file located at http://scholar.lib.vt.edu/ejournals/VALib/v50_n2/v50n2.pdf.
- In the above PDF location, the BASE_URL is followed by a series of
characters, which can be represented using regular expressions. First
select the BASE_URL from the combo box and insert the parameter.
- The second step is to enter a String Literal ‘v’, followed
by the insertion of the Volume Number parameter. (When you insert the
Volume Number, a new window opens, asking you to specify the
padding value. Click OK to accept the default padding value of ‘0’.)
- The third step is to continue the rule with an underscore
(‘_’) followed by ‘n’ as the string literal.
After inserting the String Literal, select ‘Any Number’
from the combo box and choose Insert Match. [0-9]+ now appears
in the window; this is the regular expression for Any Number.
It matches one or more digits, so the ‘_n’ can be followed by any
non-negative integer, such as 2 or 12.
- Finally, the string literal ‘/v’, the Volume Number parameter,
the string literal ‘n’, and an ‘Any Number’ match
have to be specified again, because the complete
URL is http://scholar.lib.vt.edu/ejournals/VALib/v50_n2/v50n2.pdf. Then
enter the String Literal ‘.pdf’ to specify the file type
to be crawled. You will notice that a backslash is inserted
before the dot. In a regular expression an unescaped dot matches any
character, so the backslash makes it match a literal dot.
- Now, continue creating the other rules to complete the Crawl Rules section
of the plugin. The final Crawl Rule template editor looks like
the figure shown below.
- In the last rule shown above, we have added a $ to the end of the
pattern (which matches the end of a string). If this $ is not
given, the locations of all the files would be fetched irrespective
of the volume number given during the test run.
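
The steps for the Virginia Libraries PDF rule can be sketched outside the editor as an ordinary regular expression. The following is a minimal Python sketch, not the Plugin Definer's own implementation: the function names `build_issue_pdf_rule` and `rule_matches` are hypothetical, the BASE_URL and volume values are taken from the example above, and it is an assumption that the rule is applied as a search over the whole URL and that BASE_URL is matched literally.

```python
import re

# Values from the Virginia Libraries example in the text.
BASE_URL = "http://scholar.lib.vt.edu/ejournals/VALib/"
VOLUME = 50


def build_issue_pdf_rule(base_url: str, volume: int, anchored: bool = True) -> str:
    """Assemble the pattern piece by piece, mirroring the editor steps.

    Hypothetical helper for illustration only.
    """
    parts = [
        re.escape(base_url),  # BASE_URL parameter (assumed to match literally)
        "v",                  # string literal
        str(volume),          # Volume Number parameter, default padding
        "_n",                 # string literal
        "[0-9]+",             # 'Any Number' match: one or more digits
        "/v",                 # string literal
        str(volume),          # Volume Number parameter again
        "n",                  # string literal
        "[0-9]+",             # 'Any Number' match again
        r"\.pdf",             # escaped dot: a literal '.', not 'any character'
    ]
    pattern = "".join(parts)
    # The trailing $ pins the pattern to the end of the URL.
    return pattern + "$" if anchored else pattern


def rule_matches(pattern: str, url: str) -> bool:
    """Assumption: a rule applies if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


rule = build_issue_pdf_rule(BASE_URL, VOLUME)
print(rule_matches(rule, BASE_URL + "v50_n2/v50n2.pdf"))
print(rule_matches(rule, BASE_URL + "v49_n2/v49n2.pdf"))
```

Under these assumptions the first check succeeds and the second fails, because the volume number in the URL differs from the configured one. Building the pattern with `anchored=False` shows what the `$` contributes: a URL such as `v50_n2/v50n2.pdf.bak` would then also be accepted.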