Preservation of web pages
Sue Steele
Student: 11258381
Subject: LAR5410
Assignment 2
Due: 8 May 1998
Co-ordinator: Ross Harvey
Dept. of Librarianship Archives & Records
Monash University
Introduction
Much of the Digital Library work and research currently underway revolves around the digitisation of existing materials such as books and photographs [1]. Relatively little of this work seems to consider ongoing preservation beyond the initial digitisation. Even less emphasis is placed on the preservation and
long-term archiving of web pages. Web pages are a relatively new phenomenon. Very
few of them date before 1993. However, there are now over 300 million of them
[2]. Initially much of the data on the web was a mirror of traditionally published information. This has changed and an increasing amount of information is only made available in online form. Unless steps are taken to
preserve and archive this data, it may eventually be lost.
Unlike books and other physical materials, web sites and web pages are
prone to change in content, to move from one host to another or, worse, to
disappear completely without trace. First-line preservation of web-based
materials isn't just about the longevity of the storage medium; it's
also about ensuring continued access to materials, providing
methods for consistency of linkages and citations, and adhering to standards
to facilitate access and possible re-formatting for preservation.
This paper is presented in two sections. The first deals with what
can be termed routine day-to-day maintenance of web pages and web sites, that is, the steps involved in preserving access to the resources. The second deals with the longer-term preservation and archival issues relating to digital materials.
Preserving access
Standards
A basic awareness and understanding of the standards underpinning the world wide web are a necessary starting point for any serious discussion about the web, even one involving preservation and archiving.
The W3C (World Wide Web Consortium) is the peak body for the development of WWW standards (in this case
they are called W3C recommendations). Standards relating to markup
languages (for example HTML, XML, MathML) and metadata (for example PICS, RDF) are
the most relevant for access and preservation issues. The W3C website contains descriptive and technical information on all aspects of W3C activities, including standards.
Naming and Addressing
Consistent naming and addressing is critical for resource discovery and
access to internet-based information. An addressing scheme (URI/URL)
is one of the three essential building-blocks of the web (the other
two are HTTP and HTML). Within a library, items will generally be
given a call number so that users may locate materials. On the internet,
an item's address (a hostname for a machine
and a URL for a web resource) provides the main
access point.
URIs, URLs and URNs
The URI (Uniform Resource Identifier) specification is the standard for
addressing notation. It outlines the basic syntax, and acceptable character
set, so that humans and machines can uniquely identify resources.
This document is a typical example of the W3C approach to standards information. It contains a simple definition and description of URIs, URLs and URNs
and a useful list of links to more technical resources, work in progress and standards.
URLs (Uniform Resource Locators), used throughout the web, are a form of URI.
A URN (Uniform Resource Name), rarely used and only loosely defined after
several years of standards discussion, is a more enduring URI, where
an institution or organisation has committed to maintaining the availability
of a resource for some time to come.
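As a small illustration of the syntax that the URI specification defines, the following sketch splits a URL into its component parts using Python's standard urllib.parse module. The address used is made up for illustration and does not refer to a real resource.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the parts used to locate a web resource."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # access protocol, e.g. http
        "host": parts.hostname,      # machine that holds the resource
        "path": parts.path,          # location of the resource on that machine
        "fragment": parts.fragment,  # named position within the page, if any
    }


if __name__ == "__main__":
    # Hypothetical address, used only for illustration.
    for name, value in describe_url(
        "http://www.example.org/library/guide.html#purls"
    ).items():
        print(name, "=", value)
```

The hostname identifies the machine and the path identifies the resource on it, which is why a change to either one breaks every existing link to the page.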
A persistent URL (PURL) approximates a URN. Technically a PURL is a URL which
is 'guaranteed' to be around for some time and it points to the actual
URL of the web-resource. Thus, if the URL changes, the PURL is amended to
point to the new location without any other links becoming broken.
PURLs are an OCLC innovation. The National Library of Australia also
maintains a PURL service for the Australian community. One drawback
(and possibly a reason why they have not proved popular) is that
the PURL address is that of the PURL service (in our case the NLA)
rather than that of the information provider.
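The indirection that a PURL provides can be sketched as a simple lookup table. This is only a toy model and not a description of how the OCLC or NLA services are implemented; the class name, the PURL paths, the addresses and the use of the 302 redirect status are all assumptions made for illustration.

```python
class PurlResolver:
    """Toy model of a PURL service: persistent names map to current URLs."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path, target_url):
        """Create or amend the record for a persistent name."""
        self._targets[purl_path] = target_url

    def resolve(self, purl_path):
        """Return (status, location) in the manner of an HTTP redirect."""
        target = self._targets.get(purl_path)
        if target is None:
            return 404, None   # no such persistent name
        return 302, target     # redirect the client to the current URL


if __name__ == "__main__":
    resolver = PurlResolver()
    resolver.register("/net/example/report",
                      "http://old-host.example.org/report.html")
    # The document moves: only the PURL record is amended.
    resolver.register("/net/example/report",
                      "http://new-host.example.org/docs/report.html")
    print(resolver.resolve("/net/example/report"))
```

Pages that cite the persistent name never need to be edited; the cost is that someone must keep the table up to date.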
Live links
Broken links pointing to non-existent or no-longer-at-this-address URLs are a constant source of frustration to web-maintainers and web-users.
PURLs
are one possible solution to this type of problem but only when the
information provider has chosen to use them, and continues to maintain
them.
A more practical solution is to make use of a link-checker, a robot which
seeks out all of the links on a web page (or site) and reports on
ones which are broken or re-directed. Unfortunately considerable
manual effort may be required to re-locate any missing URLs.
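The core of a link-checker can be sketched in a few lines. The version below is a minimal illustration, not a description of NetMechanic or any other particular tool: it extracts the links from one page and classifies each by its HTTP status code. The function that actually fetches a status is supplied by the caller (an assumption made here so that the sketch can be tried without network access), and all addresses are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


def classify(status):
    """Turn an HTTP status code (or None for 'unreachable') into a verdict."""
    if status is None or status >= 400:
        return "broken"
    if 300 <= status < 400:
        return "redirected"
    return "ok"


def check_page(base_url, html, fetch_status):
    """Return a verdict for each link found in html.

    fetch_status is a caller-supplied function that takes a URL and returns
    its HTTP status code, or None if the host could not be reached.
    """
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return {link: classify(fetch_status(link)) for link in extractor.links}


if __name__ == "__main__":
    page = '<a href="a.html">A</a> <a href="http://gone.example.org/">B</a>'
    # Pretend results standing in for real HTTP requests.
    fake = {"http://www.example.org/dir/a.html": 200,
            "http://gone.example.org/": 404}
    print(check_page("http://www.example.org/dir/index.html", page, fake.get))
```

A report of this kind only identifies the problem links; finding where a missing resource has gone remains a manual task.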
NetMechanic is a public link-checker. Any web-user can submit a URL
for link-checking (either of one page or an entire site). Upon
completion of the check, the user is emailed the URL of a results page.
This service is excellent for checking a few pages of links, or for
small site maintainers who are unable to install such tools locally.
Within Monash, a similar service is available from
http://www.monash.edu.au/wwwdev/html/validate/momspiderif.html.
This is a good list of web-management tools. It includes a long list of link-checkers (some freeware) which can be installed and run on local websites.
Version control
Before the web can become a 'serious' repository of scholarly information, the problem of rapidly changing versions needs to be addressed. It is important
that a cited document remains available in the form in which it was cited. Later versions of a document
may be radically different. This is in addition to the problem of
broken and missing links.
Relatively little work has been done to provide a consistent approach
to versioning web pages.
This paper proposes an extension to the URI standard in which a
revision mark (such as a revision date) is appended to the URI (URL).
The authors have developed an Apache module to parse the extended
URIs (Apache is the most widely used web-server software and is available for most Unix and Windows NT systems).
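The general idea of a revision-marked URI can be sketched as follows. The ";rev=" separator and the function names are assumptions made purely for illustration; they are not the syntax proposed in the paper, nor the behaviour of the authors' Apache module. Revision dates are assumed to be written as YYYY-MM-DD so that they sort correctly as text.

```python
def split_revision(uri, marker=";rev="):
    """Separate a URI into (base, revision); revision is None if absent."""
    base, found, revision = uri.partition(marker)
    return (base, revision) if found else (uri, None)


def select_version(versions, revision):
    """Pick the stored copy that matches a requested revision date.

    versions maps dates (YYYY-MM-DD) to stored copies. With no revision the
    latest copy is returned; otherwise the newest copy dated on or before
    the requested revision, or None if no such copy exists.
    """
    dates = sorted(versions)
    if revision is not None:
        dates = [d for d in dates if d <= revision]
    return versions[dates[-1]] if dates else None


if __name__ == "__main__":
    # Hypothetical archive holding two versions of one document.
    stored = {"1997-03-01": "first version", "1998-01-15": "second version"}
    base, rev = split_revision("http://www.example.org/paper.html;rev=1997-12-31")
    print(base, rev, select_version(stored, rev))
```

A citation that carries the revision mark continues to retrieve the text as it was cited, while the plain URI continues to return the current version.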
Metadata schemes
Metadata is a relatively new term. In essence it is 'data about data'. In
its simplest terms a library catalogue is a metadata repository because
it provides data about the library's holdings and can be used as a
finding tool. The explosion in the number of web pages and the limits
of the existing fulltext indexing services highlight the need for
a consistent approach to the description of web-resources, i.e.
the need for metadata.
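To make the idea concrete, the sketch below renders a handful of descriptive elements as HTML meta tags, using the "DC." prefix that is one common convention for embedding Dublin Core elements in a web page. The function name is hypothetical, and the record shown uses only the title and author of this paper.

```python
from html import escape


def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    lines = []
    for element, value in record.items():
        # Escape quotes and angle brackets so the value is safe in an attribute.
        lines.append('<meta name="DC.%s" content="%s">'
                     % (element, escape(value, quote=True)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(dublin_core_meta({
        "title": "Preservation of web pages",
        "creator": "Sue Steele",
    }))
```

Tags of this kind sit in the head of the page, where an indexing service can read them without having to guess the title or author from the full text.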
Stuart Weibel's article provides a good introduction to the concepts of metadata and the background to the development of the Dublin Core set of metadata elements. It summarises the first metadata conference which established a framework for future work in the area.
The Resource Discovery Unit (RDU) of the Distributed Systems Technology Centre (DSTC) is an Australian group at the forefront of research and development in
resource discovery and metadata. They provide links to metadata information
and resources
and working tools such as a metadata editor and search engine.
The report lists a series of metadata elements considered essential for
providing preservation information about digital materials. A parallel
group is working on preservation information enhancements for USMARC.
Preservation and archiving
The references in this section refer to general digital preservation resources,
rather than to those specifically aimed at web pages. There are two reasons
for this. First: web pages are digital materials and thus digital preservation
resources are relevant for the preservation of web pages. Second: there are
no resources dealing specifically with the preservation of web pages. A
couple of pilot projects are trying to archive internet resources (which
translates mostly to web pages).
This is a rather select list of the types of resources available. More
comprehensive lists can be found at many of the listed sites.
A chapter in the AVCC commissioned report Key issues in electronic publishing, this document provides
a brief overview of library and archival issues surrounding electronic publications.
In 1996 the Task Force on Archiving Digital Information produced this report, which is an excellent summary of and introduction to the issues surrounding
the preservation and archiving of digital objects. One major finding is the
recommendation for 'data migration' as a preservation strategy.
The International Council on Archives Committee on Electronic Records published this comprehensive guide in 1997. It is "designed to help archival institutions reposition themselves to address the management of archival electronic records". It provides an overview of the technologies and concepts involved and begins to address the issues raised. The intention is to revise the guide as better strategies are developed.
Digital preservation policies
This top-level policy document recognises the importance of digital material to our cultural heritage and stresses the need for co-operative approaches to preservation from information providers, local, state and national bodies.
The Australian Archives published its policy for electronic record keeping in the Commonwealth Government in 1995. It devolves archival responsibility
to the individual government agencies which create the information and
outlines policies for managing the creation, retention and provision of
access to electronic records.
The Archives Authority of New South Wales has been active in the area of digital records policy. This policy covers document management for individuals' desktop machines, which are
not generally dealt with in more traditional archival policies.
This policy, an adjunct to a print-based collection development policy, outlines some additional collection and preservation designations specific to digital materials. Of particular interest is the definition of four levels of 'preservation' designation:
archived, served, mirrored and linked.
Organisations involved in digital preservation activities
The Commission on Preservation and Access (CPA) is part of the US-based Council on Library and Information
Resources (CLIR). It is an umbrella group concerned with all aspects
of preservation. Of course, this includes a substantial digital
component. There are also newsletters and reports. Unfortunately
the reports are not available online.
European Commission on Preservation and Access (ECPA)
http://www.knaw.nl/ecpa/
The ECPA was formed to deal with the problems of brittle paper. It is also
interested in digital preservation. It works in close co-operation with
other preservation bodies such as the Commission on Preservation and Access. Its
lists of conferences and reports provide a less US-centric approach.
RLG's PRESERV program, like many preservation groups, is not exclusively for
digital preservation. It has a number of digital programs and resources
such as a working group on preserving digital resources.
The NLA engages in a number of preservation activities. Some of them
relate to digital resources. For example, there was a National Consultative Meeting on the management of Physical Format Publications. Physical format publications include
floppy discs and CD-ROMs.
Digital preservation resources online
PADI is a good starting point for information on digital preservation. It
includes lists of bibliographies, conferences, electronic journals, mailing lists, policies
and organisations pertaining to preservation.
The CoOL (Conservation Online) site is maintained by the Preservation Department of Stanford University Libraries. It contains links to all kinds of conservation resources including sections for electronic media and electronic records.
Vicnet maintains a number of pages relating to conservation and preservation,
including a section relating to digital preservation. The emphasis is on
culturally significant materials.
Bibliographies
Peter Graham's bibliography includes sections on Metadata and preservation issues. It contains many links to online resources. It is regularly updated. Some entries are annotated.
Michael Day maintains this rather long list. It includes both online and more traditional resources. Day tries to be more inclusive than Graham.
This is the relevant section of Charles Bailey's excellent scholarly electronic publishing bibliography. Bailey lists selected works on various aspects of electronic publishing.
Electronic journals
DigiNews focuses on issues and projects relating to digital initiatives
with a preservation component. Regular features include calendars of events,
announcements and highlighted web pages. Longer articles on particular preservation issues are also included.
Ariadne is a British ejournal focusing on electronic library issues and
projects. Some articles are of relevance to those interested in
preservation issues.
D-Lib is a magazine focusing on digital library research. Some articles
are relevant for those interested in preservation issues.
Digital preservation projects
The AHDS is a national UK service designed to collect, preserve and provide
access to electronic data in the arts and humanities fields. The service
has a staff of about 50 and receives datasets from all over the UK. Data
is converted into the AHDS preferred format, to assist current and
future archiving. The site also provides general information about
managing digital collections.
"Project:
To build the conceptual and physical infrastructure to contain information in digital form, provide tools for
meaningful access, and ensure long-term availability to the information and the means to access it, The
Arches project involves both an online repository for digital resources of all types and the software
environment that makes this repository internationally accessible and responsibly maintainable. It creates the
foundation for a variety of resources and services for researchers. "
CEDARS is a project of the UK Office for Library and Information Networking (UKOLN). The project
aims to provide a model for best practice in digital preservation.
The Kulturarw³ Heritage Project for long term preservation of published
electronic documents aims to collect and preserve Swedish electronic documents,
including all Swedish web pages.
"PANDORA is a project initiated by the National Library of Australia to investigate strategies for the storage, preservation
and access to digital data in the context of the creation of an electronic archive of library materials." The approach is to selectively archive a range
of digital materials designed to provide the 'flavour' of Australian
electronic materials for future researchers.
Notes:
1. For example the Digital Libraries Initiative.
2. S. Lawrence, C. L. Giles. Searching the world wide web. Science. 280(5360), pp.98-100. 1998.