Preservation of web pages

Sue Steele
Student: 11258381
Subject: LAR5410
Assignment 2
Due: 8 May 1998
Co-ordinator: Ross Harvey
Dept. of Librarianship Archives & Records
Monash University




Introduction

Much of the Digital Library work and research currently underway revolves around the digitisation of existing materials such as books and photographs [1]. Relatively little of this work seems to consider ongoing preservation beyond the initial digitisation. Even less emphasis is placed on the preservation and long-term archiving of web pages. Web pages are a relatively new phenomenon. Very few of them date from before 1993. However, there are now over 300 million of them [2]. Initially much of the data on the web was a mirror of traditionally published information. This has changed, and an increasing amount of information is made available only in online form. Unless steps are taken to preserve and archive this data, it may eventually be lost.

Unlike books and other physical materials, web sites and web pages are prone to change their content, to move from one host to another or, worse, to disappear completely without trace. First-line preservation of web-based materials is not just about the longevity of the storage medium. It is also about ensuring continued access to materials, providing methods for consistency of linkages and citations, and adhering to standards to facilitate access and possible re-formatting for preservation.

This paper is presented in two sections. The first deals with what can be termed routine day-to-day maintenance of web pages and web sites, that is, the steps involved in preserving access to the resources. The second deals with the longer-term preservation and archival issues relating to digital materials.

Preserving access

Standards

A basic awareness and understanding of the standards underpinning the world wide web are a necessary starting point for any serious discussion about the web, even one involving preservation and archiving.

World Wide Web Consortium (W3C)
http://www.w3c.org/

W3C is the peak body for the development of WWW standards (which it calls W3C recommendations). Standards relating to markup languages (for example HTML, XML, MathML) and metadata (for example PICS, RDF) are the most relevant for access and preservation issues. The W3C website contains descriptive and technical information on all aspects of W3C activities, including standards.

Naming and Addressing

Consistent naming and addressing is critical for resource discovery and access to internet-based information. An addressing scheme (URI/URL) is one of the three essential building-blocks of the web (the other two are HTTP and HTML). Within a library, items will generally be given a call number so that users may locate materials. On the internet, an item's address (a hostname for a machine and a URL for a web resource) provides the main access point.

URIs, URLs and URNs

The URI (Uniform Resource Identifier) specification is the standard for addressing notation. It outlines the basic syntax, and acceptable character set, so that humans and machines can uniquely identify resources.
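The basic syntax can be illustrated with a short sketch. The Python fragment below splits one of the addresses cited in this paper into its component parts using the standard library's urllib.parse module. The choice of Python and the helper name describe_url are assumptions made for illustration; neither comes from the URI specification.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Return the main syntactic parts of a URL as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # access method, for example http
        "host": parts.hostname,   # the machine that holds the resource
        "path": parts.path,       # where the resource lives on that machine
    }

if __name__ == "__main__":
    print(describe_url("http://www.w3c.org/Addressing/Addressing.html"))
```

The hostname identifies the machine and the path identifies the resource on it, which together play the role that a call number plays in a library.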

Web naming and addressing overview
http://www.w3c.org/Addressing/Addressing.html

This document is a typical example of the W3C approach to standards information. It contains a simple definition and description of URIs, URLs and URNs, and a useful list of links to more technical resources, work in progress and standards.

URLs (Uniform Resource Locators), used throughout the web, are a form of URI. A URN (Uniform Resource Name), rarely used and only loosely defined after several years of standards discussion, is a more enduring URI, where an institution or organisation has committed to maintaining the availability of a resource for some time to come.

Persistent URL homepage
http://purl.nla.gov.au

A persistent URL (PURL) approximates a URN. Technically a PURL is a URL which is 'guaranteed' to be around for some time and it points to the actual URL of the web-resource. Thus, if the URL changes, the PURL is amended to point to the new location without any other links becoming broken.

PURLs are an OCLC innovation. The National Library of Australia also maintains a PURL service for the Australian community. One drawback (and possibly a reason why they have not proved popular) is that the PURL address is that of the PURL service (in our case the NLA) rather than that of the information provider.
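The indirection behind a PURL can be sketched as a simple lookup table. The Python fragment below is a toy model only: the class name PurlResolver, the path /example/report and the example.org targets are all hypothetical, and a real PURL service answers requests over HTTP with a redirect rather than returning a string.

```python
class PurlResolver:
    """Toy lookup table mapping persistent paths to current URLs."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path, target_url):
        """Create or amend the target that a persistent path points to."""
        self._targets[purl_path] = target_url

    def resolve(self, purl_path):
        """Return the current URL for a persistent path, or None if unknown."""
        return self._targets.get(purl_path)

if __name__ == "__main__":
    resolver = PurlResolver()
    # Hypothetical path and targets, for illustration only.
    resolver.register("/example/report", "http://old.example.org/report.html")
    # The resource moves: only the table entry changes, not the published PURL.
    resolver.register("/example/report", "http://new.example.org/docs/report.html")
    print(resolver.resolve("/example/report"))
```

Links that cite the persistent path keep working after the move because only the table entry is amended.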

Live links

Broken links pointing to non-existent or no-longer-at-this-address URLs are a constant source of frustration to web-maintainers and web-users.

PURLs are one possible solution to this type of problem but only when the information provider has chosen to use them, and continues to maintain them. A more practical solution is to make use of a link-checker, a robot which seeks out all of the links on a web page (or site) and reports on ones which are broken or re-directed. Unfortunately considerable manual effort may be required to re-locate any missing URLs.
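The core of a link-checker can be sketched in a few lines. The Python fragment below collects the links on a page and sorts them into ok, redirected and broken according to the HTTP status code each one returns. It is a minimal sketch, not a description of any tool listed in this paper: the function names are hypothetical, and the status lookup is supplied by the caller (fetch_status) so that the sketch itself makes no network requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href targets of all anchor tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own address.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return every link found in the given HTML, as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

def classify(status):
    """Map an HTTP status code to a link-report category."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirected"
    return "broken"

def report(html, base_url, fetch_status):
    """Check every link; fetch_status is a caller-supplied function url -> status code."""
    return {url: classify(fetch_status(url)) for url in extract_links(html, base_url)}
```

A report of this kind only identifies the problem links. Finding the new home of a missing resource remains manual work.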

NetMechanic link checking and validation
http://www.netmechanic.com/link_check.htm

NetMechanic is a public link-checker. Any web-user can submit a URL for link-checking (either of one page or an entire site). Upon completion of the check, the user is emailed the URL of a results page. This service is excellent for checking a few pages of links, or for small site maintainers who are unable to install such tools locally.

Within Monash, a similar service is available from http://www.monash.edu.au/wwwdev/html/validate/momspiderif.html.

Website test/management tools
http://www.charm.net/~dmg/qatest/qatweb1.html

This is a good list of web-management tools. It includes a long list of link-checkers (some freeware) which can be installed and run on local websites.

Version control

Before the web can become a 'serious' repository of scholarly information the problem of rapidly changing versions needs to be addressed. It is important that a document cited can remain in that form. Later versions of a document may be radically different. This is in addition to the problem of broken and missing links.

Version augmented URIs for reference permanence via an Apache module design
http://web.engr.uark.edu/~djb/me/resume/papers/195.html

Relatively little work has been done to provide a consistent approach to versioning web pages. This paper proposes extending the URI standard so that a revision mark (such as a revision date) is appended to the URI (URL). The authors have developed an Apache module to parse the extended URIs (Apache is the most widely used web-server software and is available for most Unix and Windows NT systems).
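The idea of a revision mark can be sketched as follows. The Python fragment below appends a revision date to a URL and splits it off again. The ";rev=" delimiter and the function names are hypothetical choices made for this illustration; they are not the syntax proposed in the cited paper, nor the behaviour of its Apache module.

```python
from datetime import date

REV_MARK = ";rev="  # hypothetical delimiter, not taken from the cited paper

def add_revision(url, revision_date):
    """Append a revision date to a URL."""
    return f"{url}{REV_MARK}{revision_date.isoformat()}"

def split_revision(versioned_url):
    """Return (plain_url, revision_date); the date is None if no mark is present."""
    base, mark, stamp = versioned_url.rpartition(REV_MARK)
    if not mark:
        return versioned_url, None
    return base, date.fromisoformat(stamp)

if __name__ == "__main__":
    cited = add_revision("http://www.example.org/paper.html", date(1998, 5, 8))
    print(cited)
    print(split_revision(cited))
```

A citation that carries the revision mark can then refer to the document as it stood on that date, whatever later versions contain.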

Metadata schemes

Metadata is a relatively new term. In essence it is 'data about data'. In its simplest terms a library catalogue is a metadata repository because it provides data about the library's holdings and can be used as a finding tool. The explosion in the number of web pages and the limits of the existing fulltext indexing services highlight the need for a consistent approach to the description of web-resources, i.e. the need for metadata.

Metadata: the foundations of resource description
http://www.dlib.org/dlib/July95/07weibel.html

Stuart Weibel's article provides a good introduction to the concepts of metadata and the background to the development of the Dublin Core set of metadata elements. It summarises the first metadata conference which established a framework for future work in the area.
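To make 'data about data' concrete, the Python sketch below renders a small record as HTML meta tags. It uses three Dublin Core element names (title, creator, date) and the common convention of prefixing them with "DC." in the name attribute. The function name dublin_core_meta is hypothetical, and the record values are taken from this paper's own title block, with the due date standing in for a publication date.

```python
from html import escape

def dublin_core_meta(record):
    """Render a dict of element name -> value as HTML meta tags."""
    lines = []
    for element, value in record.items():
        # Escape the value so that quotes and ampersands stay valid in HTML.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)

if __name__ == "__main__":
    record = {
        "title": "Preservation of web pages",
        "creator": "Sue Steele",
        "date": "1998-05-08",  # the due date, used here as a stand-in
    }
    print(dublin_core_meta(record))
```

Tags of this kind sit in the head of a web page, where an indexing service can read them without having to interpret the full text.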

DSTC - Resource Discovery Unit
http://www.dstc.edu.au/RDU/

The Resource Discovery Unit (RDU) of the Distributed Systems Technology Centre (DSTC) is an Australian group at the forefront of research and development in resource discovery and metadata. They provide links to metadata information and resources, and working tools such as a metadata editor and search engine.

RLG Working Group on Preservation Issues of Metadata: Preliminary Report
http://lyra.rlg.org/preserv/presmeta.html

The report lists a series of metadata elements considered essential for providing preservation information about digital materials. A parallel group is working on preservation information enhancements for USMARC.

Preservation and archiving

The references in this section refer to general digital preservation resources, rather than to those specifically aimed at web pages. There are two reasons for this. First: web pages are digital materials and thus digital preservation resources are relevant for the preservation of web pages. Second: there are no resources dealing specifically with the preservation of web pages. A couple of pilot projects are trying to archive internet resources (which translates mostly to web pages).

This is a rather select list of the types of resources available. More comprehensive lists can be found at many of the listed sites.

Electronic publishing: library and archival issues
http://www.adfa.oz.au/EPub/key/Library.html

A chapter in the AVCC commissioned report Key issues in electronic publishing, this document provides a brief overview of library and archival issues surrounding electronic publications.

Preserving digital information
http://www.rlg.org/ArchTF/tfadi.index.htm

In 1996 the Task Force on Archiving Digital Information produced this report, which is an excellent summary of and introduction to the issues surrounding the preservation and archiving of digital objects. One major finding is the recommendation for 'data migration' as a preservation strategy.

Guide for managing electronic records from an archival perspective
http://www.archives.ca/ica/cer/guide_0.html

The International Council on Archives Committee on Electronic Records published this comprehensive guide in 1997. It is "designed to help archival institutions reposition themselves to address the management of archival electronic records". It provides an overview of the technologies and concepts involved and begins to address the issues raised. The intention is to revise the guide as better strategies are developed.

Digital preservation policies

Preservation of and long-term access to Australian digital objects
http://www.nla.gov.au/archive/npo/natco/princ.html

This top-level policy document recognises the importance of digital material to our cultural heritage and stresses the need for co-operative approaches to preservation from information providers, local, state and national bodies.

Keeping electronic records
http://www.aa.gov.au/AA_WWW/AA_Issues/KER/KeepingER.html

The Australian Archives published its policy for electronic record keeping in the Commonwealth Government in 1995. It devolves archival responsibility to the individual government agencies which create the information and outlines policies for managing the creation, retention and provision of access to electronic records.

Desktop management
http://www.records.nsw.gov.au/erk/edm/httoc.htm

The Archives Authority of New South Wales has been active in the area of digital records policy. This policy covers document management for individuals' desktop machines, which are not generally dealt with in more traditional archival policies.

Digital Library Sunsite collection policy
http://sunsite.berkeley.edu/Admin/collection.html

This policy, an adjunct to a print-based collection development policy, outlines some additional collection and preservation designations specific to digital materials. Of particular interest is the definition of four levels of 'preservation' designation: archived, served, mirrored and linked.

Organisations involved in digital preservation activities

Commission on Preservation and Access (CPA)
http://www.clir.org/cpa/

The CPA is part of the US based Council on Library and Information Resources (CLIR). It is an umbrella group concerned with all aspects of preservation. Of course, this includes a substantial digital component. There are also newsletters and reports. Unfortunately the reports are not available online.

European Commission on Preservation and Access (ECPA)
http://www.knaw.nl/ecpa/

The ECPA was formed to deal with the problems of brittle paper. It is also interested in digital preservation. It works in close co-operation with other preservation bodies such as the Commission on Preservation and Access. Its lists of conferences and reports provide a less US-centric approach.

RLG preservation program
http://lyra.rlg.org/preserv/

RLG's PRESERV program, like many preservation groups, is not exclusively for digital preservation. It has a number of digital programs and resources such as a working group on preserving digital resources.

National Library of Australia: preservation activities
http://www.nla.gov.au/niac/pres.html

The NLA engages in a number of preservation activities. Some of them relate to digital resources. For example, there was a National Consultative Meeting on the management of Physical Format Publications. Physical format publications include floppy discs and CD-ROMs.

Digital preservation resources online

PADI: preserving access to digital information
http://www.nla.gov.au/padi/

PADI is a good starting point for information on digital preservation. It includes lists of bibliographies, conferences, electronic journals, mailing lists, policies and organisations pertaining to preservation.

Conservation Online
http://palimpsest.stanford.edu

The CoOL (Conservation Online) site is maintained by the Preservation Department of Stanford University Libraries. It contains links to all kinds of conservation resources including sections for electronic media and electronic records.

Heritage conservation & historic preservation
http://home.vicnet.net.au/~conserv/hp-hc.htm

Vicnet maintains a number of pages relating to conservation and preservation, including a section relating to digital preservation. The emphasis is on culturally significant materials.

Bibliographies

Bibliography on electronic library / digital library issues
http://aultnis.rutgers.edu/texts/ElectLibBib.html

Peter Graham's bibliography includes sections on Metadata and preservation issues. It contains many links to online resources. It is regularly updated. Some entries are annotated.

Preservation of electronic information: a bibliography
http://homes.ukoln.ac.uk/~lismd/preservation.html

Michael Day maintains this rather long list. It includes both online and more traditional resources. Day tries to be more inclusive than Graham.

Information conversion, integrity and preservation
http://info.lib.uh.edu/sepb/lbinteg.htm

This is the relevant section of Charles Bailey's excellent scholarly electronic publishing bibliography. Bailey lists selected works on various aspects of electronic publishing.

Electronic journals

RLG DigiNews
http://lyra.rlg.org/preserv/diginews/

DigiNews focuses on issues and projects relating to digital initiatives with a preservation component. Regular features include calendars of events, announcements and highlighted web pages. Longer articles on particular preservation issues are also included.

Ariadne
http://www.ariadne.ac.uk/

Ariadne is a British ejournal focusing on electronic library issues and projects. Some articles are of relevance to those interested in preservation issues.

D-Lib magazine
http://www.dlib.org/

D-Lib is a magazine focusing on digital library research. Some articles are relevant for those interested in preservation issues.

Digital preservation projects

Arts and Humanities Data Service (AHDS)
http://www.ahds.ac.uk/

The AHDS is a national UK service designed to collect, preserve and provide access to electronic data in the arts and humanities fields. The service has a staff of about 50 and receives datasets from all over the UK. Data is converted into the AHDS preferred format, to assist current and future archiving. The site also provides general information about managing digital collections.

Arches - Archival server and test-bed
http://www.rlg.org/strat/projarch.html

"Project: To build the conceptual and physical infrastructure to contain information in digital form, provide tools for meaningful access, and ensure long-term availability to the information and the means to access it. The Arches project involves both an online repository for digital resources of all types and the software environment that makes this repository internationally accessible and responsibly maintainable. It creates the foundation for a variety of resources and services for researchers."

CEDARS project
http://www.ukoln.ac.uk/services/elib/projects/cedars/

CEDARS is a UK eLib project whose pages are hosted by the UK Office for Library and Information Networking (UKOLN). The project aims to provide a model for best practice in digital preservation.

KULTURARW3
http://kulturarw3.kb.se/

The Kulturarw³ Heritage Project for long-term preservation of published electronic documents aims to collect and preserve Swedish electronic documents, including all Swedish web pages.

PANDORA
http://www.nla.gov.au/policy/plan/pandora.html

"PANDORA is a project initiated by the National Library of Australia to investigate strategies for the storage, preservation and access to digital data in the context of the creation of an electronic archive of library materials." The approach is to selectively archive a range of digital materials designed to provide the 'flavour' of Australian electronic materials for future researchers.


Notes:
1. For example, the Digital Libraries Initiative.
2. S. Lawrence, C. Giles. Searching the world wide web. Science. 280(3), p.98. 1998.