DEVPLAN

TARDIS 2.0 Software Development Plan

After testing the limitations of SWORD, examining the Australian METS profile, and pondering new features, Steve Androulakis has written a technical software development plan, proposing how the technologies behind TARDIS 2.0 will function.

The entire workflow of TARDIS 2.0, from packaging data to having it indexed by the TARDIS website, is explained in the sections below.

Packaging Data For Upload

Java code will be produced to write a METS.xml file based on a local folder structure of data, or alternatively on the XDMS data hierarchy. The local folder structure will resemble:

  • /Experiment
    • /Dataset1
      Compressed files (split bzipped archives, using datasetpackager)
      XML dataset information (extracted using datasetpackager)
    • /Dataset2
      Compressed files
      XML dataset information
    • /Dataset3
      Ancillary files
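
As a rough illustration of how this folder layout could be walked before any METS is written, here is a minimal sketch. It is in Python purely for brevity (the plan itself calls for Java), and the function name `scan_experiment` and the shape of the returned dictionary are assumptions made for this example, not part of TARDIS or datasetpackager.

```python
from pathlib import Path


def scan_experiment(root):
    """Map each dataset folder name to its files, as paths relative to root.

    Hypothetical helper: every sub-folder of the experiment folder is
    treated as one dataset; files directly under root are ignored.
    """
    root = Path(root)
    datasets = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f for f in folder.rglob("*") if f.is_file())
        # Forward-slash paths so the same strings can be reused inside a zip.
        datasets[folder.name] = [f.relative_to(root).as_posix() for f in files]
    return datasets
```

Calling `scan_experiment("Experiment")` on the layout above would return one entry per dataset folder, each listing that folder's files.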

METS will be created as follows:

  • Using the Harvard METS Java Toolkit
  • METS will follow the Australian METS Profile. See the Crystallography SIP (ingestion package) for an example.
  • A METS.xml file is produced, including references to all included data, a structure map describing the hierarchy of the data, and inline MODS/custom metadata.
  • All generic annotations for the datasets and the experiment are written from the program (user input) straight to the METS.xml in MODS format, or alternatively sourced from fields in the XDMS data management system.
  • File locations coded into the METS.xml point to locations inside the zip file to facilitate repository upload.
  • This METS.xml file is zipped up with the rest of the actual data. The archive can exceed 4GB, so the web server and the client machine must use file systems without FAT32's 4GB file-size limit. Other considerations, such as web server JVM memory, are still open questions.
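
The overall shape of such a METS.xml (inline MODS, a file section, and a structure map pointing at the files) can be sketched as follows. This is an illustration only: it uses Python's standard library rather than the Harvard METS Java Toolkit, it does not attempt to conform to the Australian METS Profile, and the function name `build_mets`, the IDs, and the `TYPE` values "experiment" and "dataset" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("mods", MODS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)


def build_mets(title, datasets):
    """Return a METS element for one experiment.

    datasets maps a dataset name to a list of zip-relative file paths.
    """
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: an inline MODS record holding the experiment title.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="dmd-experiment")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="MODS")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS}}}title").text = title

    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct_map = ET.SubElement(mets, f"{{{METS}}}structMap")
    experiment = ET.SubElement(
        struct_map, f"{{{METS}}}div", TYPE="experiment", DMDID="dmd-experiment"
    )

    counter = 0
    for name, paths in datasets.items():
        dataset_div = ET.SubElement(
            experiment, f"{{{METS}}}div", TYPE="dataset", LABEL=name
        )
        for path in paths:
            counter += 1
            file_id = f"file-{counter}"
            # The file entry points at a location inside the zip ...
            file_el = ET.SubElement(file_grp, f"{{{METS}}}file", ID=file_id)
            ET.SubElement(
                file_el,
                f"{{{METS}}}FLocat",
                {"LOCTYPE": "URL", f"{{{XLINK}}}href": path},
            )
            # ... and the structure map references it by ID.
            ET.SubElement(dataset_div, f"{{{METS}}}fptr", FILEID=file_id)
    return mets
```

`ET.tostring(build_mets(...), encoding="unicode")` then gives the XML text that would be saved as METS.xml.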

The contents of the zip become:

metspackage.zip:

  • /Experiment
    METS.xml
    • /Dataset1
      Compressed files
    • /Dataset2
      Compressed files
    • /Dataset3
      Ancillary files (or workflow data files etc)
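
A minimal sketch of the zipping step, again in Python for illustration only. The function name `package_experiment` is hypothetical. Two choices here are assumptions rather than decisions from the plan: ZIP64 extensions are enabled so that the archive can grow past 4GB, and files are stored without further compression because the dataset files are already bzipped.

```python
import zipfile
from pathlib import Path


def package_experiment(experiment_dir, zip_path):
    """Zip an experiment folder (METS.xml plus dataset folders) into one archive.

    zip_path should be outside experiment_dir so the archive does not
    try to include itself.
    """
    experiment_dir = Path(experiment_dir)
    # allowZip64 permits archives and members larger than 4GB.
    # ZIP_STORED skips recompressing the already compressed dataset files.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED, allowZip64=True) as archive:
        for path in sorted(experiment_dir.rglob("*")):
            if path.is_file():
                # Name entries relative to the parent folder so that the
                # experiment folder itself is the top level of the archive.
                arcname = path.relative_to(experiment_dir.parent).as_posix()
                archive.write(path, arcname)
    return zip_path
```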

Sending Data to the Repository

The zipped METS package is sent to a SWORD server with a custom file handler to correctly upload information to the repository.

The zip is opened by the SWORD server, and the METS.xml is interpreted using Java/XPath expressions.

The file list is traversed:

  • The top-level element in the structure map (the experiment) is examined. The MODS description data associated with it is translated to Dublin Core (in Fedora's case) and uploaded as a DC stream, which is used to create the top-level repository object (the experiment). This information will be used to represent the experiment outside of TARDIS (i.e., in Arrow, Fez).
  • Files are uploaded from data within the zip to the repository.
  • The file pointer in the METS.xml is changed to the repository's url link to the file.
  • The METS.xml, with its changed file pointers, is uploaded to the repository as a METS.xml object for dissemination.
  • The changed METS.xml is exposed through OAI-PMH for harvesting.
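
Two of the steps above, translating MODS to Dublin Core and rewriting the file pointers, can be sketched as follows. The plan specifies Java/XPath inside a SWORD file handler; this Python version only illustrates the logic. The function names are hypothetical, only the title is translated, the sketch simply takes the first MODS title it finds rather than resolving the experiment's metadata through the structure map, and the `uploaded` mapping stands in for whatever URLs the repository actually returns after upload.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def mods_title_to_dc(mets_root):
    """Build a minimal oai_dc record from the first MODS title found."""
    title = mets_root.find(".//mods:mods/mods:titleInfo/mods:title", NS)
    dc = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(dc, f"{{{DC}}}title").text = title.text if title is not None else ""
    return dc


def rewrite_file_pointers(mets_root, uploaded):
    """Replace zip-relative FLocat hrefs with repository URLs, in place.

    uploaded maps each zip-relative path to the URL of the uploaded file.
    Returns the number of pointers changed.
    """
    changed = 0
    for flocat in mets_root.iterfind(".//mets:FLocat", NS):
        href = flocat.get(XLINK_HREF)
        if href in uploaded:
            flocat.set(XLINK_HREF, uploaded[href])
            changed += 1
    return changed
```

After `rewrite_file_pointers` has run, the same element tree can be serialised and stored as the METS.xml object for dissemination.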

Harvesting the data for the TARDIS website

Once the METS.xml dissemination data is exposed from the repository, the harvester can parse it and add information to a relational database:

The structure map is traversed:

  • The experiment (top-level) element is examined, and any MODS metadata associated with it is extracted and added to the MODS database in TARDIS as a new experiment.
  • The dataset div elements are traversed, and the metadata they reference (oai_ds and MODS) is extracted and added to the TARDIS database.
  • URL links to the files within the repository are taken out of the METS.xml and added to the database.
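
The traversal above can be sketched against a small relational schema. Everything specific here is an assumption made for illustration: the three-table SQLite schema, the function name `harvest`, the use of a `LABEL` attribute for dataset names, and the fact that only the experiment title is stored. The OAI-PMH fetch itself is left out; the sketch starts from the METS XML text.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/", "mods": "http://www.loc.gov/mods/v3"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical schema, not the actual TARDIS database layout.
SCHEMA = """
CREATE TABLE experiment (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dataset (id INTEGER PRIMARY KEY,
                      experiment_id INTEGER REFERENCES experiment(id),
                      label TEXT);
CREATE TABLE datafile (id INTEGER PRIMARY KEY,
                       dataset_id INTEGER REFERENCES dataset(id),
                       url TEXT);
"""


def harvest(mets_xml, conn):
    """Walk the structure map of a METS document and store it in the database."""
    root = ET.fromstring(mets_xml)
    # Index file IDs to their repository URLs for quick lookup.
    urls = {
        f.get("ID"): f.find("mets:FLocat", NS).get(XLINK_HREF)
        for f in root.iterfind(".//mets:file", NS)
    }
    top = root.find("mets:structMap/mets:div", NS)
    # Resolve the experiment's DMDID to its dmdSec and read the MODS title.
    dmd = root.find(f"mets:dmdSec[@ID='{top.get('DMDID')}']", NS)
    title = dmd.findtext(".//mods:title", default="", namespaces=NS) if dmd is not None else ""
    exp_id = conn.execute(
        "INSERT INTO experiment (title) VALUES (?)", (title,)
    ).lastrowid
    for div in top.iterfind("mets:div", NS):
        ds_id = conn.execute(
            "INSERT INTO dataset (experiment_id, label) VALUES (?, ?)",
            (exp_id, div.get("LABEL")),
        ).lastrowid
        for fptr in div.iterfind("mets:fptr", NS):
            conn.execute(
                "INSERT INTO datafile (dataset_id, url) VALUES (?, ?)",
                (ds_id, urls[fptr.get("FILEID")]),
            )
    conn.commit()
    return exp_id
```

A quick way to try it is with an in-memory database: `conn = sqlite3.connect(":memory:")`, then `conn.executescript(SCHEMA)`, then `harvest(mets_text, conn)`.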

Data Display

The TARDIS website displays information from its database in the following structure (much like the current implementation):

  • Experiment
    Generic information
    • Dataset
      Technical data
      Generic information (title, etc)
      File links
    • Dataset
      Generic information (this dataset is a workflow, so relevant information)
      File links

Current development status

An initial SWORD Crystallography METS file handler has been developed for the SWORD Fedora server.

A command-line program has been written that lets a user define the experiment information, dataset details, and file locations on their local machine. The program gathers all of this information, packages the diffraction image datasets, extracts their metadata, and writes everything into METS for ingestion.

Large-scale data testing of the file processing and SWORD deposits is being done on production-level servers.