Electronic Records: Preservation and Access
October 6, 2005
Dr. Charles Dollar
Seventh in the Missouri Electronic Records Education and Training Initiative (MERETI) workshop series
Note: The numbers that appear in the text mark key points discussed by the instructor; each number refers to the corresponding slide in the accompanying PowerPoint presentation and handout.
In this advanced workshop, Dr. Dollar explores the many elements involved in the long-term preservation and use of electronic records. He explains what digital archiving is and why it is so difficult. He discusses the standards that are available for guidance and the significant problems posed by technology obsolescence. Dr. Dollar discusses in depth the important considerations of storage media, file formats, and metadata for long-term preservation and access. He cites case studies of currently operating digital archiving programs and talks about new initiatives to watch. Finally, he talks about analog alternatives or backups as part of a long-term digital archiving strategy, and wraps up the workshop with a summary of the key points.
Dr. Dollar based this workshop primarily on material contained in his book Authentic Electronic Records: Strategies for Long-Term Access, Cohasset Associates, 1999. He is in the process of updating this book, and expects a new edition to be published in May, 2006. Material is also drawn from the ISO Technical Report 18492:2005, Long-term preservation of electronic document-based information, a discussion draft copy of which was provided to class attendees.
Digital objects contain three attributes: the physical, the logical, and the conceptual. Physically, digital objects are made up of a string of binary signals recorded on a storage medium; they have no meaning by themselves. Operating software must be employed to provide a logical organization to the binary data and recognize it as a logical object based on data type. Finally, application software provides a conceptual meaning to the data objects, rendering data in human understandable form, and giving it content, context, and structure. All of these attributes must be considered when planning for digital archiving and access.
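The three attributes can be illustrated with a minimal sketch. The byte values below are arbitrary examples, not taken from the workshop; the point is only that the same physical bit stream yields different logical objects depending on the data type the software assumes.

```python
import struct

# Physical attribute: four bytes on a storage medium (arbitrary example values).
raw = bytes([0x41, 0x42, 0x43, 0x44])

# Logical attribute: the software chooses a data type for the same bytes.
as_text = raw.decode("ascii")               # read as four ASCII characters
as_uint_be = struct.unpack(">I", raw)[0]    # read as one big-endian 32-bit integer
as_uint_le = struct.unpack("<I", raw)[0]    # read as one little-endian 32-bit integer

# Conceptual attribute: an application presents the value in human-readable form.
print(f"text: {as_text}, big-endian: {as_uint_be}, little-endian: {as_uint_le}")
```

Without knowing which interpretation was intended, the bits alone cannot tell a future user whether they hold text or a number, which is why all three attributes have to be preserved together.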
Digital archiving incorporates many activities and considerations. Electronic records must be protected from loss, alteration, and corruption. Their accessibility must be assured across organizational boundaries and across multiple technology changes and environments. Future users must be able to use the records in multiple ways and for many purposes, while retaining the record’s meaning and authenticity. These goals must be accomplished despite ongoing changes over time in recording media, operating systems, file types and specifications, data coding systems, and metadata. 17
Digital archiving is difficult because digital records differ fundamentally from traditional physical records. Physical records are easy to see, touch, understand, and manage. Digital records, by contrast, require hardware and software to give them their logical meaning and interpretation, and to store and retrieve them. Electronic records require a software “interpreter” to make them understandable to humans. They are dependent on both the system operating software that makes the computer function and the application software, and they require the user to have a computer to use them. 24 Digital records can easily be rendered unusable by technological obsolescence, which is inevitable and irreversible.
The International Organization for Standardization (ISO) has published several standards related to long-term preservation of electronic records and data. In particular, ISO 15489, Parts 1 and 2, Records Management, provides the framework for an effective records management program. ISO 14721:2003, Open archival information system – Reference model, describes a high-level model for any electronic records repository. It sets standards for processes of data ingest, archival storage, data management, preservation planning, and access. 32 The OAIS model establishes a shared view of requirements that can lead to an interoperable network of digital archives, a key component in grid computing. 36 43 ISO 18492:2005, Document management applications – Long-term preservation of electronic document-based information, provides methodological guidance for the long-term preservation and retrieval of authentic electronic document-based information when the retention period exceeds the expected life of the technology used to create and maintain the information. It sets long-term preservation goals to ensure information is readable, intelligible, identifiable, retrievable, understandable, and authentic. 46 49-50
Digital preservation requires that we deal with problems caused by technology obsolescence. For currently active electronic records, this will involve media renewal (or “refreshing”) and conversion. Media renewal is the process of reformatting or copying data to new storage media to ensure its continued readability. Conversion involves the shift from one technology environment to another, such as from one version of software to a newer version, or from one application (e.g. WordPerfect) to another (Word), while maintaining the essential qualities of the electronic record. During conversion, we must maintain the processibility of the active records. 65 66
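Media renewal as described above can be sketched as a copy followed by a fixity check. This is an illustration only, not a procedure presented in the workshop; the function names `sha256_of` and `renew_media` are hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def renew_media(source: Path, destination: Path) -> str:
    """Copy a record to new storage and confirm the bit stream is unchanged."""
    checksum_before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    checksum_after = sha256_of(destination)
    if checksum_before != checksum_after:
        raise IOError(f"Copy of {source} does not match the original")
    return checksum_after
```

Storing the returned checksum alongside the record would let a later renewal confirm that nothing changed in the interval between copies.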
For sets of legacy records which we wish to preserve, technology obsolescence will require us to perform not only media renewal to preserve the data, but other long-term strategies as well. The Data Archaeology strategy represents the minimalist approach, in which we would keep the original data bit stream viable, and use reverse engineering in the future to devise a method to access and use that data using then-current technologies. Similar to that is the Museum Perspective, in which original hardware equipment and software versions are saved in operational condition, to be able to utilize legacy data. The Jet Propulsion Laboratory and the Washington State Digital Archives have taken this approach. For some types of evidential/informational records, Viewer Technology may provide access to images of records, without providing full functionality.
Emulation is the process of using today’s computers and software to create a replica of another computer with such fidelity that it can operate in place of the other computer. Dr. Dollar discussed a number of projects designed to demonstrate the feasibility of emulation to provide access to legacy records.
Migration is an essential component of a digital preservation program. Its purpose is to ensure usable and trustworthy electronic records for as long as necessary without regard for the computer technology platform. It presumes that the bit stream remains readable through media refreshment 92 and, whenever possible, involves converting electronic records to technology-neutral file formats. It should provide backward compatibility and should preserve the processibility of records. Risks associated with migration include possible alteration of the “look and feel” of records, possible loss of some data values, the potential to introduce errors without good quality control, the difficulty and cost of migrating complex interactive digital records, and the likelihood that the process will be never-ending. Past migration efforts have shown that projects usually take longer and cost more than planned. 97
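The quality-control risk can be made concrete with a small sketch that migrates text from a legacy character encoding to UTF-8 and checks that no data values were lost. The choice of Windows-1252 as the legacy encoding and the function name `migrate_text` are assumptions made for illustration; real migrations involve far more than character encoding.

```python
def migrate_text(legacy_bytes: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Convert legacy-encoded text to UTF-8 and verify the conversion is lossless."""
    # Decoding raises UnicodeDecodeError if a byte has no meaning in the
    # assumed legacy encoding, which flags the record for manual review.
    text = legacy_bytes.decode(legacy_encoding)
    migrated = text.encode("utf-8")

    # Quality control: converting back must reproduce the original bytes exactly.
    round_trip = migrated.decode("utf-8").encode(legacy_encoding)
    if round_trip != legacy_bytes:
        raise ValueError("Migration altered the record's data values")
    return migrated


# Example with characters that differ between the two encodings.
original = "café – 5€".encode("cp1252")
converted = migrate_text(original)
print(len(original), "bytes before,", len(converted), "bytes after")
```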
When determining the appropriate storage media for large quantities of electronic records, one must consider the speed (data transfer rate) of the selected medium, as well as its cost, capacity, and durability. 104 112 120 After discussing each storage medium in depth, Dr. Dollar concludes that magnetic media are more robust than optical media, that magnetic tape holds advantages over “spinning disk” storage, and that a high data transfer rate is a vital consideration for storage and migration of huge quantities of electronic data.
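A rough calculation shows why the transfer rate matters at scale. The collection size and the two transfer rates below are hypothetical round numbers chosen for illustration, not figures from the workshop, and the helper `transfer_days` ignores overheads such as media handling and verification.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to copy a collection at a sustained transfer rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


collection = 100e12  # hypothetical 100 TB collection

for label, rate in [("30 MB/s", 30e6), ("120 MB/s", 120e6)]:
    print(f"{label}: {transfer_days(collection, rate):.1f} days")
```

Under these assumptions, quadrupling the sustained rate cuts a single-stream migration from roughly five and a half weeks to under ten days, before any time for quality control is added.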
File formats tell the operating system how to interpret the 0s and 1s that comprise the electronic file. They specify the internal logical arrangement of data within digital objects, and provide special instructions such as compression algorithms. Formats also provide information understood by specific application software.
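One simple way software recognizes a format is by the signature bytes at the start of a file. The sketch below checks a handful of well-known signatures; the table is deliberately tiny, the function name `identify_format` is hypothetical, and production format-identification tools use much larger registries.

```python
# Leading byte signatures for a few widely used formats.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP container",
}


def identify_format(data: bytes) -> str:
    """Return a format name based on the first bytes, or 'unknown'."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"


print(identify_format(b"%PDF-1.4\n%..."))   # a PDF 1.4 header
print(identify_format(b"plain text file"))  # no recognizable signature
```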
Two considerations when determining the file format for preserving electronic records are which format to use for specific information content, and whether to choose proprietary or non-proprietary formats. There are several types of electronic files, each of which has multiple formats from which to choose. Types of files include text, vector graphics, graphic images, compressed graphic images, databases, video, and audio, among others. The concern regarding proprietary formats is that the owner of the format may restrict access to the format, or possibly go out of business and not be able to support the format in the future. Non-proprietary, open-source, widely used formats are more likely to remain supported in the future.
An ideal file format, from a preservation point of view, would have these properties:
- Device independence, without regard to the hardware/software platform
- Self-contained, containing all the resources necessary for rendering
- Self-documenting, containing its own description
- Transparency, capable of direct analysis with basic tools
- Absence of technical protection mechanisms, such as encryption, passwords, etc.
- Disclosure, with an authoritative specification publicly available
- Adoption, with widespread use being the best deterrent to obsolescence.
The recently approved PDF/A file format standard, ISO 19005, specifies how to use the Portable Document Format (PDF) 1.4 for long-term preservation of documents (the “/A” suffix denotes archiving). It addresses three primary issues: defining a file format that preserves the static visual appearance of electronic documents over time, providing a framework for recording metadata about electronic documents, and providing a framework for defining the logical structure and semantic properties of electronic documents. 143
In summarizing the file format discussion, Dr. Dollar recommends: 161
- Choose file formats based on recordkeeping requirements, such as integrity and processibility
- Avoid proprietary, single-vendor products
- Use mainstream technology products
- Require transferability functionality to facilitate migration
- Consider XML, PDF, and PDF/A as good choices.
Metadata must be captured for electronic records to provide technical, business, and contextual information about them. Technical information includes data about the creation and use of the record, the software application, and the file formats. Business information includes applicable business rules, integrity rules, and access/authorization rights. Contextual information describes “who, what, when, why” and the linkage between and among records, records preservation information, and offers an audit trail. Metadata is best captured at the creation or receipt of the record. While the system can supply much of the metadata automatically, users are often required to key in anywhere from a small to a large amount of it.
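The three categories can be pictured as a structured record stored with each electronic record. The sketch below is one possible layout, not a published metadata schema; every class name, field name, and sample value is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TechnicalMetadata:
    creating_application: str
    file_format: str
    created: str  # ISO 8601 date string


@dataclass
class BusinessMetadata:
    business_rules: list[str] = field(default_factory=list)
    authorized_roles: list[str] = field(default_factory=list)


@dataclass
class ContextualMetadata:
    creator: str        # who
    purpose: str        # why
    related_records: list[str] = field(default_factory=list)
    audit_trail: list[str] = field(default_factory=list)


@dataclass
class RecordMetadata:
    record_id: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    contextual: ContextualMetadata

    def to_json(self) -> str:
        """Serialize to JSON so the metadata can travel with the record."""
        return json.dumps(asdict(self), indent=2)


example = RecordMetadata(
    record_id="REC-0001",
    technical=TechnicalMetadata("word processor", "PDF/A", "2005-10-06"),
    business=BusinessMetadata(["retain permanently"], ["records officer"]),
    contextual=ContextualMetadata("workshop staff", "training summary"),
)
print(example.to_json())
```

In a layout like this, fields such as the creating application and creation date are candidates for automatic capture by the system, while the purpose and related records are the kind of values users typically have to key in.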
There are presently only three operational digital archives. 170 The OCLC Digital Archives is a fee-based repository service for libraries and other institutions. Institutions can transfer electronic items to OCLC, which will preserve them and provide on-line reference services. DSpace is a digital repository system that captures, stores, indexes, preserves, and redistributes digital research materials. It is designed for academic library repositories, and requires customization to accommodate archives.
The Washington State Digital Archives is the only operating state digital archives. Planning began in 1999 and the facility opened in 2004, at an initial cost of $14.8 million. The concept is based on a well-developed feasibility study, and identifies state agency partners in terms of their level of technological sophistication and ability to transfer archival records in appropriate original formats. The project benefits from funding from a $1.00 recording fee on all filing transactions and additional support from Microsoft, a Washington corporation, and may not be easily duplicated in other states.
Besides the Washington Digital Archives, other projects under development include a demonstration project being undertaken by the Georgia State Archives with NHPRC funding, a collaboration project between the Smithsonian Institution and Rockefeller Archives Center, and the National Archives and Records Administration’s large-scale Electronic Records Archives project.
For records that must be kept long-term, consideration should be given to capturing them in microfilm or other analog format (paper) as well as digital form. The Digital Archive Writer from Kodak produces black and white microfilm from document images, and other equipment can produce color and larger-sized microfilm. A new technology, Datasurance, captures digital images in a non-proprietary 2-D barcode format, and encloses human-readable information on how to decode the barcodes, and incorporates it all on microfilm. When decoded, the barcodes recreate the original digital image. Microfilm capture should be considered for records where reproducibility, rather than processibility, will satisfy your regulatory compliance, business needs, and historical accountability.
Any organization preserving digital records must prepare a mission statement to define its purpose. It needs to define its preservation policy, describing how the mission will be carried out and specifying what activities will be done (and not done) in various circumstances. In determining the strategies to adopt, the organization looks at the convergence of available technology with its policies, as well as applicable published standards to guide them. 215 It must then identify existing best practices which it can adopt.
The threshold issues in digital preservation are to keep digital records readable and to ensure their integrity and trustworthiness over time. We cannot try to preserve everything, we must not substitute quick fixes for long-term solutions, and we should not implement technologies that are on the fringe of the marketplace.