The Web-at-Risk:
A Distributed Approach to Preserving our Nation's Political Cultural Heritage
Content Identification, Selection, and Acquisition Path
 

Glossary of Terms

Term Definition
Acquisition For digital materials, see Capture
Authenticity The genuineness of a digital object. Verification of authenticity requires ascertaining that the object is what it claims to be or is what the metadata associated with the object asserts it to be. Authenticity of a digital object is determined in several ways including checksums, provenance, and digital signatures.
Automated Capture Tool See Crawler
Baseline Metadata Baseline metadata is machine-generated and captured by a crawler at the time of data capture.
Born-digital Created originally in digital format (i.e., a machine-readable format). Examples include scientific databases, sensory data, digital photographs, and digital audio and video recordings. A born-digital resource may or may not have a counterpart analog format but, if it does, the digital version existed prior to the counterpart.
Capture The process of copying digital information from the web to a repository for collection or archive purposes.
Collection A group of resources related by common ownership or a common theme or subject matter. A web collection consists of one or more crawls that harvest a group of related websites (e.g., candidate websites for state election campaigns). Collections are owned and/or maintained by an organization or institution.
Crawl The content associated with a web capture operation that is conducted by a crawler.
Crawler Software that explores the web and collects data about its contents. A crawler can also be configured to capture web-based resources. It starts a capture process from a seed list of entry-point URLs (EPUs).
Curation Process Collection development for web-published materials includes the selection, curation, and preservation processes. In this context, the curation process involves description, organization, presentation, maintenance, and deselection of the materials in the collection.
Dark Archive A digital archive to which no end user access is permitted.
Dark Web See Deep Web
Deep Web Resources available via the World Wide Web that are invisible to or inaccessible by crawlers. These resources may be invisible or inaccessible to crawlers because they (a) are contained in a database or other data store, (b) require information collected from the end-user before they are created, or (c) are password protected.
Digital Archive A digital collection for which an institution has agreed to accept long-term responsibility for preserving the resources in the collection and for providing continual access to those resources in keeping with an archive's user access policies.
Digital Collection A collection consisting entirely of born-digital or digitized materials.
Digital Object Also called a digital information object. Digital objects can be interactive works (e.g., video games), sensory presentations (e.g., music or audio), documents, and data. Two types of digital objects included in digital archives are surrogates of information objects in various original formats (e.g., print books or audio tapes) and born-digital objects.
Dynamic Web Page A web page created automatically by software at the web server. The page may be (a) personalized for the user based on identification via login or based on cookies stored on the user's computer, (b) tailored to fulfill a specific request made by the user, or (c) code-generated (e.g., using PHP, JSP, ASP, or XML). Information used for personalization or tailoring of pages may be retrieved in real time from a database or other data store.
Emulation A method by which newer software interacts with older resources and displays the result using the same commands and formatting that the software that created the resource used. Emulation provides a means of allowing a digital resource to be preserved without altering its binary format.
Enriched Metadata Enriched metadata is generally specific to an organization and contains a mixture of baseline metadata and human-generated metadata added subsequent to data capture.
Entry-Point URL A URL appearing in a seed list as one of the starting addresses a web crawler uses to capture content. Also called a targeted URL.
External Link A hyperlink which takes the user to a new website. For a web archive, an external link is one that takes the user out of the archived collection.
Fixity The extent to which an archived object remains unchanged over time regardless of access and movement due to copying. One common fixity mechanism used to establish and protect the integrity of a digital object (or data) relies on the result of a cyclic redundancy check (CRC). Redundancy checks are sometimes referred to as checksums.
Harvest See Capture
Invisible Web See Deep Web
Light Archive A digital archive accessible to end-users.
Migration A method of preserving digital materials and access to those materials by copying or reformatting the materials while preserving their intellectual content.
Persistent Name A unique name assigned to a web-based resource that will remain unchanged regardless of movement of the resource from one location to another or changes to the resource's URL. Persistent names are resolved by a third party that maintains a map of the persistent name to the current URL of the resource.
Repository The physical storage location and medium for one or more digital archives. A repository may contain an active copy of an archive (i.e., one that is accessed by end users) or a mirror copy of an archive for disaster recovery.
Seed List One or more entry-point URLs from which a web crawler begins capturing web resources. Curators, or others responsible for building collections of web-based resources, specify seed lists for specific crawls.
Spider See Crawler
Targeted URL See Entry-Point URL
Visibility The extent of end user access allowed to a digital archive.
Web Archive A collection of web-published materials for which an institution has either made arrangements for, or accepted long-term responsibility for, preservation and access in keeping with the archive's user access policies. Some of these materials may also exist in other forms, but the web archive captures the web versions for posterity.
Web Archive Service Enables curators to build collections of web-published materials that are stored in local and/or remote repositories. The service includes a set of tools for selection, curation, and preservation of the archives. It also includes repositories for storage, preservation services (e.g., replication, emulation, and persistent naming), and administrative services (e.g., templates for collection strategies, content provider agreements, and repository provider agreements).
Web-published materials Web-published materials are accessed and presented via the World Wide Web. The materials span the cultural heritage spectrum and include a range of material types from text documents to streaming video to interactive experiences. Web-published materials are both dynamic and transient. They are at risk of disappearing. Web archives preserve web-published materials.
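
As an illustration of the Fixity entry above, the following is a minimal sketch of how checksums recorded at capture time might be compared against a later recomputation. It is not drawn from the project's own tooling: the function names, the record layout, and the sample content are hypothetical, and SHA-256 is included alongside CRC-32 only as an assumption about what an archive might choose to store.

```python
import hashlib
import zlib


def fixity_record(data: bytes) -> dict:
    """Compute checksums for captured bytes (hypothetical record layout)."""
    return {
        # CRC-32 as an 8-digit hexadecimal string
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        # A cryptographic digest, assumed here as a stronger companion check
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_unchanged(data: bytes, recorded: dict) -> bool:
    """Return True when freshly computed checksums match the recorded ones."""
    return fixity_record(data) == recorded


# Hypothetical captured resource and the record made at capture time
captured = b"<html><body>Candidate statement</body></html>"
recorded = fixity_record(captured)
```

Calling is_unchanged(captured, recorded) after the object has been copied or moved would indicate whether its bytes still match what was recorded at capture; any mismatch suggests the object has been altered or corrupted.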