Information Technology (IT) Division
The Winner of the 2007 IT Division Jo Ann Clifton
Student Award

|
Forging cultural heritage collections online: The story of An American Tale
Candidate for M.A. Information Resources & Library Science
University of Arizona
Tucson, Arizona
|
Windmill at Sunset - Boone County [Missouri]
Photo Credit: Duane Perry, Columbia, Missouri (http://www.missouri.gov/mo/mophotos/sunsets)
Introduction
In the heartland of 19th-century America, Missouri, the central crossroads of the nation, welcomed emigrants from states and countries near and far. The digital collection An American Tale: 19th-century Folkways to Missouri was created by the author to document that migrant experience through the heritage of one individual, and to understand the process of constructing a cultural heritage collection online.
The purpose of this paper is to reflect upon the extensive planning and execution required to create, from the ground up, a digital repository of three migrant pathways to Missouri, and to draw out best practices in building an online digital collection.
In part 1.0, the reader will learn the initial goals of the project and the work that was undertaken. Part 2.0 describes the corresponding outputs of the effort through a virtual tour of the finished collection. Part 3.0 evaluates the lessons learned from that endeavor.
Like the journey
of early settlers to Missouri,
the road to constructing a premier digital collection is fraught with
danger: potholes, treacherous stream crossings, dangerous
wildlife, bad equipment, limited funds, and all kinds of weather.
From the author's experience, the reader may take away valuable lessons with which to begin the journey toward building a premier
digital collection.
|
|
1.1 Goals
Project goals for the 8-week digital collection project were to: 1) digitize 30 primary and secondary sources from research collected by the author over the past ten years, 2) create an open access online collection of the digitized images with relevant metadata, 3) create an online guide that would include interpretive and educational materials pertaining to the subject, and 4) use the project as a platform to understand the decisions associated with organizing, describing, indexing, classifying, digitizing, presenting, and retrieving items in building a digital collection.
The scope of the collection consisted of thirty items: vintage photographs and primary and secondary records uncovered by the author through correspondence with individuals or on-site research at local cemeteries, public libraries, academic libraries, county courthouses, state departments of health, state historical societies, or federal archives. Types of records selected were photographs, correspondence, vital records, census records, naturalization and immigration records, church records, military records, newspaper clippings, and court, land, and tax records.
Three discrete, topical
themes formed the intellectual
boundary of the project: 1) Slaveholder from Virginia,
2) Union Soldier from Hesse-Darmstadt and 3) Farmer from Iowa. Selection of the material was guided by the
mission of the
collection: to document the 19th-century
migrant experience to Missouri,
through the author’s ancestral heritage. Corresponding
records of patriarchs Henry M. Ogden
(1792-1888), Philip
P. Wilhelm (1827-1909), and Jacob Peters (1831-1918), were selected.
The NISO standard for
building digital
libraries, entitled "A framework of guidance for building good digital
collections," served as the framework for constructing the digital
repository (http://www.niso.org/framework/Framework2.html). The three intended audiences for
An American Tale were academic
historians, graduate
students, and family historians.
1.2 Project Design
The goals set for the project, above, dictated the requirements for its design. The content management system used as the container for the collection was the Greenstone Digital Library software (www.greenstone.org), a free, open access package assigned to all students enrolled in Digital Libraries, the course for which the project was assigned.
A pilot walk-through of sample records revealed the need for a taxonomy to uniquely identify each object. After research and experimentation, a naming standard was created for each file, using the family name, generation number, pedigree placement, and record type. File naming followed the ISO 9660, Level 2 convention, which allows file names of up to 31 characters; within that limit, the project used only lowercase characters a-z, numerical digits, and the special characters period, underscore, and hyphen (http://en.wikipedia.org/wiki/ISO_9660). Spaces and other special characters were not used. The reader will learn later in this paper how critical the early formation of a sound file naming convention is to a project's later success.
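The naming rule described above can be checked mechanically before any file is loaded. The following is a minimal sketch in Python, assuming the rule exactly as stated in this section (at most 31 characters; lowercase a-z, digits, period, underscore, and hyphen). The function name `check_file_name` is hypothetical and is not part of Greenstone or of the original project.

```python
import re

# Assumed reading of the project's rule: at most 31 characters, drawn only
# from lowercase a-z, digits, period, underscore, and hyphen.
MAX_LENGTH = 31
ALLOWED_CHARACTER = re.compile(r"[a-z0-9._-]")


def check_file_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not name:
        problems.append("empty name")
    if len(name) > MAX_LENGTH:
        problems.append(f"too long: {len(name)} characters (limit {MAX_LENGTH})")
    # Collect every character that falls outside the permitted set.
    bad = sorted({ch for ch in name if not ALLOWED_CHARACTER.fullmatch(ch)})
    if bad:
        problems.append(f"disallowed characters: {bad}")
    return problems


if __name__ == "__main__":
    for candidate in ("23646-71_wilhelm_1_lo", "Wilhelm Page 1"):
        print(candidate, "->", check_file_name(candidate) or "ok")
```

Running such a check over a handful of sample names during the pilot walk-through is one inexpensive way to catch length and character problems before metadata is attached to the files.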
1.2.1 Metadata
The collection required a simple metadata standard with modest granularity, owing to the simple nature of the collection and the limited experience of its builder. The metadata standard selected for the project was Dublin Core (DC), which supports standardized access to, and wider use of, the collection. Use of the DC standard retains the context of each record, and provides a 'footprint' for rights status and digital provenance. Dublin Core is also compliant with the Open Archives Initiative Protocol for Metadata Harvesting (http://www.openarchives.org/OAI/openarchivesprotocol.html).
Full bibliographic detail of the preserved items, including structural, administrative, and descriptive metadata, is recorded in a Microsoft Excel spreadsheet file which accompanies the digitized records. Implementation of standard encoding practices for metadata will facilitate sharing among federated archives. Library of Congress Subject Heading authorities were used to standardize descriptive terms.
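To make the harvesting idea concrete, the sketch below serializes one record as simple Dublin Core inside the `oai_dc` container used by the Open Archives Initiative protocol. It is an illustration only, not the way Greenstone stores or exports metadata. The function name `to_oai_dc` is hypothetical, the sample values are taken from descriptions elsewhere in this paper, and a complete `oai_dc` record would also carry a schema location attribute, omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the OAI-PMH "oai_dc" format and of simple Dublin Core.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of simple (unqualified) Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(record: dict[str, list[str]]) -> str:
    """Serialize a record (element name -> list of values) as oai_dc XML."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    # Emit elements in the conventional order; every element is repeatable.
    for element in DC_ELEMENTS:
        for value in record.get(element, []):
            ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only, drawn from descriptions in this paper.
sample = {
    "title": ["Certificate of disability for discharge"],
    "date": ["1863-01-12"],
    "coverage": ["1860-1877 Civil War and Reconstruction Missouri"],
    "rights": ["Creative Commons Attribution Non-commercial No Derivatives"],
}

if __name__ == "__main__":
    print(to_oai_dc(sample))
```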
1.2.2 Copyright protection
Of the six Creative Commons licenses available to individuals, the project used the "Attribution Non-commercial No Derivatives" license (www.creativecommons.org). Others may download works in the collection on the condition that they cite their source, do not alter the material in any way, and do not reuse it for commercial purposes. Access to the original physical photographs and print records is available to the general public upon prior written request for permission.
1.2.3 Custodianship
Custodianship of high-quality
digital master copies of the original
records is retained by the author on compact discs, and on the author’s
local hard drive. FastSum
Integrity Control was used to ensure data integrity of
master files through back-up and any future migration (www.fastsum.com).
Lower
resolution digital surrogates of the high-quality digital master copies
reside
in the Greenstone Digital Library
database for public viewing.
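The integrity check described above can be pictured as recording a checksum for each master file and comparing it again after every back-up or migration. The sketch below shows that general technique using Python's standard library. It is not how FastSum itself works: the choice of SHA-256 is an assumption, and the names `file_digest`, `build_manifest`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Record a digest for every file directly inside the folder."""
    return {
        item.name: file_digest(item)
        for item in sorted(folder.iterdir())
        if item.is_file()
    }


def verify(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or no longer match."""
    failures = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or file_digest(target) != expected:
            failures.append(name)
    return failures
```

A manifest built when the master files are first written to disc can be re-checked with `verify` after each copy; an empty list means every file still matches its recorded digest.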
1.2.4 Image collection
Records were scanned using an A4-standard AcerScan 620U Prisa USB flatbed image scanner. Its maximum resolution of 600x1,200 dots per inch provided adequate viewing of the objects. Images were adjusted for consistency in size using Microsoft Paint. No part of the original digitized record was cut, cropped, or altered in any way in the process. The TEI P5 Guidelines (version 0.4.1, July 2006) for processing and creating images were used to guide digitization of photographic or photocopy images created for uploading to Greenstone (http://www.tei-c.org/release/doc/tei-p5-doc/html/).
Finally, as part of project
management of the digital
library construction, an 8-week timeline was created toward work
completion, auxiliary
personnel were identified, equipment needs were assessed, a proposed
budget was
assembled, and project metrics or means for evaluating the process were
created.
In summary, the ‘magic
formula’ for creating a
premier digital library collection was to clearly state project goals,
identify the scope and
selection policy of the collection, then target the main audience. Next, the best metadata standard was considered,
and
copyright rules suitable to the collection were identified. Then,
ownership and access conditions of the collection were ascertained, and
software and hardware requirements were refined. Finally, a clear
timeline was created, needed personnel were hand-picked, a flexible budget was
formulated, and plans to
measure success were defined.
|
|
The finished Greenstone library collection is a simple mock-up of 30 artifacts, representative of a grander vision of what could be an extensive collection of 19th-century records of immigrants to Missouri. It includes three main features: 1) orienting remarks, 2) search features, and 3) browse features.
The home page, or the About page in Greenstone vernacular, shown in Exhibit 2.1, orients the user to the purpose, scope, selection process, and arrangement of the collection. The collection is divided into three searchable modules, shown on the top task bar of the About page: the titles a-z of the artifacts, the subjects they cover, and their coverage period in Missouri history. The home, help, and preferences buttons, respectively, lead the user back to the University of Arizona master site, help the user form search arguments, and allow the user to select foreign language options, choose a textual or graphical interface, and toggle search preferences. Finally, the search tool bar in the middle of the page lets the user enter search terms belonging to titles in one of the three discrete topical themes: Slaveholder from Virginia, Soldier from Hesse-Darmstadt, or Farmer from Iowa.
Exhibit 2.1: Collection Home Page

On the Titles a-z page, shown below in Exhibit 2.2, each bookshelf icon represents a single document or photograph, sorted alphabetically by topical theme, surname, and then record title. Where did all of the information that fills the Titles a-z index come from? The secret is in the metadata. The elegance of Greenstone lies behind the scenes, buried in the metadata assigned to each record. What the user will not see while navigating An American Tale are the 15 Dublin Core metadata elements which describe each record.
Exhibit 2.2: Titles A-Z Page

Take, for example, the 1863 Certificate of Disability for Discharge issued to Sergeant Philip P. Wilhelm (20th from the top of the list), who was released from the Union Army's Company E, 37th Ohio Volunteer Infantry, at Louisville, Kentucky. All that we know from the Titles A-Z entry, above, is the name of the topical theme, the name of the person concerned, his dates of birth and death, and the title of the record: Certificate of Disability for Discharge. But when we pull back the curtain on Greenstone (see Exhibit 2.3), we see the hidden Dublin Core elements tagged to the certificate, such as the record creator, selected Library of Congress Subject Headings, a full description of the document, who donated it, its date in universal format, the type of record, its file name, where the document originated, its native language, and its time period in Missouri history.
Exhibit 2.3: Greenstone Metadata Screenshot

Wherever the certificate (shown in Exhibit 2.4, below) travels, through data harvesting or other means, users will have full Dublin Core metadata from which to know its provenance. Sergeant Wilhelm may not appreciate the world knowing about the indelicate disease he contracted at the Battle of Fayetteville. But the world will have accessible proof that he was there at the battle, thanks to descriptive, administrative, and structural metadata which comply with generally accepted metadata harvesting standards.
Exhibit 2.4: Document Object - Certificate of Disability for Discharge for Phillip P. Wilhelm, 12 January 1863

The Subjects Page, shown below in Exhibit 2.5, offers some of the most powerful browsing capability within the An American Tale website. Users may browse through detailed Library of Congress Subject Heading authorities to identify the specific document or photograph they seek. Four pages of subject headings give the user over 70 topics and sub-topics from which to choose.
Exhibit 2.5: Greenstone Subjects Page

For example, an interest in carte-de-visite photographs (pronounced cart-du-viZEET), popular during the American Civil War, will net three finds in the photographic medium, shown in Exhibit 2.6, below: one each for Henry M. Ogden, Mary Frances Turpin Ogden, and Captain John James Ogden. Once again, clicking on the thumbnail photograph brings an enlarged, full-screen version of the image into view. Once the enlarged document appears, a navigational icon in the lower right-hand corner allows the viewer to zoom in on the text for better viewing.
Exhibit 2.6: Subject Thumbnails

The final searchable module is the Coverage Page (Exhibit 2.7), which outlines four periods in Missouri history: a) 1812-1819 Territorial Missouri, b) 1860-1877 Civil War and Reconstruction Missouri, c) 1878-1899 Outlaw and Volunteer Missouri, and d) 1900-1929 World's Fair and Lindbergh Missouri. The only period not represented is 1820-1859 Statehood Missouri, for which no documents or photographs exist in the present collection.
Exhibit 2.7: Coverage Page

Once a bookshelf icon is selected on the Coverage Page, for example for Territorial Missouri, two thumbnail images appear for that time frame. Both are legal records associated with the 1816 marriage of Katharine Smith to Henry M. Ogdon [sic]: an Affidavit of Age of Majority (Exhibit 2.8), and a Marriage Bond, providing for financial remuneration to the bride's father should the groom, 24-year-old Mr. Ogdon, choose to flee from the altar.
Exhibit 2.8: Affidavit of age of majority - Katharine Smith, November 4, 1816, Bedford County, Virginia

The finished Greenstone library collection represents a compact vision of what could be an extensive collection of 19th-century records of immigrants to Missouri. A magic elixir of careful collection design and development helped to distill the collection of cultural heritage artifacts into a navigable website.
In the final section, 3.0, the reader will learn of the shortcomings and successes of the project, in an effort to understand best practices in building a premier digital collection online.
|
|
3.0: Lessons learned
The reader may take away five important lessons from the project: 1) planning is crucial, 2) experience counts, 3) choose wisely, 4) be flexible, and 5) keep a sense of humor.
3.1 Planning is crucial
Digitization requires a material long-term financial investment, and draws on often limited organizational resources. Common sense dictates that results be planned well, and measured to make the return on the initial investment eminently clear. Tools like a project schedule, a prospective budget, a mid-project review, and a plan for a file naming convention served to ease the project's execution. The reader should have a roadmap to where he or she is going.
The timeline used in the project, shown in part in Exhibit 3.1, helped the author to frame the project by letting benchmarks lead the way. Instead of wondering throughout the project what remained to be done, the author needed only to refer to the Project Schedule. That is of particular value later in the project, when the tsunami of fine details and unresolved issues presses to take over. A project schedule is a valuable way to stay on track.
Exhibit 3.1: Excerpt from Project schedule (in U.S. dollars)
Deadline Date | Activity | Budgeted Duration (hours) | Actual Duration (hours) | Personnel* & Material Resources | Budgeted Cost ($) | Actual Cost ($) | Cost Variance ($)
Oct 8 | Training in Greenstone Digital Library (GDL) software. | 8 | 50 | Project lead | 200 | 1,250 | +1,050
Oct 10 | Survey file records and images; Group photographs/records for selection in 3 patrilineal groupings, retaining provenance - women will be categorized by their maiden names. | 10 | 10 | Project lead | 250 | 250 | 0
... | | | | | | |
Nov 16 | Evaluate responses to test launch; make adjustments to digital library, accordingly. | 8 | 0 | Project lead | 200 | 0 | -200
Dec 4 | Project Launch: Publish proposed digital library project; track site usage statistics with ChangeDetection [Santa Cruz, California: FreeFind.com]. | 1 | 0 | Project lead | 25 | 0 | -25
Dec 6 | Prepare final report; include comparison of work hours logged against time budgeted | 5 | 9 | Project lead | 125 | 225 | +100
Total | | 147 hours | 220 hours | | $4,425 | $5,500 | +$1,075
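The cost columns of the schedule follow from the hours logged. As a worked sketch, the Python below recomputes the labor cost and cost variance for the rows shown in the excerpt. The hourly rate of $25 is not stated in the schedule; it is inferred here from the rows themselves (for example, 8 budgeted hours against a budgeted cost of $200). The function name `cost_variance` is hypothetical. The totals row cannot be reproduced from the excerpt alone, since it covers rows that are not shown.

```python
# Hourly rate inferred from the excerpted rows (8 hours -> $200); an assumption.
HOURLY_RATE = 25

# (deadline, budgeted hours, actual hours) for the rows shown in the excerpt.
ROWS = [
    ("Oct 8", 8, 50),
    ("Oct 10", 10, 10),
    ("Nov 16", 8, 0),
    ("Dec 4", 1, 0),
    ("Dec 6", 5, 9),
]


def cost_variance(budgeted_hours: int, actual_hours: int,
                  rate: int = HOURLY_RATE) -> tuple[int, int, int]:
    """Return (budgeted cost, actual cost, variance), variance = actual - budgeted."""
    budgeted_cost = budgeted_hours * rate
    actual_cost = actual_hours * rate
    return budgeted_cost, actual_cost, actual_cost - budgeted_cost


if __name__ == "__main__":
    for deadline, budgeted_hours, actual_hours in ROWS:
        budgeted, actual, variance = cost_variance(budgeted_hours, actual_hours)
        print(f"{deadline}: budgeted ${budgeted}, actual ${actual}, variance {variance:+d}")
```

A positive variance marks an activity that cost more than planned, as with the Greenstone training, while a negative variance marks budgeted work that was not carried out.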
Planning a notional budget upfront forced the author to reflect upon actual inputs to the project. How much would be needed for office supplies, long-distance calls, or extra equipment? Has a small amount of contingency funding been embedded in the budget to cover inevitable surprise expenses, like the subscription to the FastSum Integrity Control software? A working budget forces one to ponder the various pieces which go into the effort.
A plan to review the day-to-day journal of activities at the mid-point of the project helped enormously in planning the second part of the project. By keeping a log to consult at mid-point, the author was able to pinpoint logistical problems immediately, and prepare for their resolution in the second part of the project.
Inadequate planning on the author's part in arriving at a file naming convention meant repeating tasks four or five times, which better preparation would have avoided, saving a great deal of time. File names went through six iterations before an adequate construction was reached, as shown in the example in Exhibit 3.2, below. First, the files were named for consistency, by family surname, to provide some order to the chaos. Then, it became clear that a unique object identifier was necessary to reduce confusion regarding similar documents. Thus, a taxonomy was born to assign numbers to each file. The unique identifier consisted of an assigned surname number, a generation number, a unique individual number, a document type, and then the numbered order of the document within its type.
In the example in Exhibit 3.2, the Civil War Pension record JPEG file was assigned the surname number 23 because the first individual bearing that surname in the pedigree chart was numbered 23 (Rosa May Wilhelm). Phillip P. Wilhelm was the sixth generation back, his unique individual number on the pedigree chart was 46, the document type was a military record, assigned the number 7, and the record was the first of its kind.
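The identifier described above can be assembled from its five parts. The sketch below is one hypothetical way to model it in Python; the class name `ObjectId` and its field names are illustrative and do not come from the original project. It produces both the dotted form used in the earlier file names and the form without periods used in the final one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """The five-part identifier: surname, generation, individual, type, order."""
    surname: int      # number assigned to the surname, e.g. 23 for Wilhelm
    generation: int   # generations back in the pedigree chart
    individual: int   # the person's unique number on the pedigree chart
    doc_type: int     # number assigned to the record type, e.g. 7 for military
    sequence: int     # order of the document within its type

    def dotted(self) -> str:
        """Identifier with periods between its parts."""
        return (f"{self.surname}.{self.generation}.{self.individual}"
                f"-{self.doc_type}.{self.sequence}")

    def compact(self) -> str:
        """Identifier with the periods removed."""
        return self.dotted().replace(".", "")


if __name__ == "__main__":
    pension_record = ObjectId(surname=23, generation=6, individual=46,
                              doc_type=7, sequence=1)
    print(pension_record.dotted())   # 23.6.46-7.1
    print(pension_record.compact())  # 23646-71
```

One trade-off worth noting: once the periods are removed, the boundaries between the surname, generation, and individual numbers can no longer be read back from the identifier alone, so the spreadsheet of metadata remains the authoritative record of each part.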
Exhibit 3.2: Iterations of one file name for a Civil War Pension record
1 | consistency | wilhelm_phillip_p_page_1_1898_jpeg
2 | taxonomy | 23.6.46-7.1_wilhelm_phillip_p_page_1_1898
3 | ISO 9660-shortened | 23.6.46-7.1_wilhelm_pp_page_1
4 | low-resolution surrogate | 23.6.46-7.1_wilhelm_page_1_lo
5 | without term ‘page’ | 23.6.46-7.1_wilhelm_1_lo
6 | without period (.) | 23646-71_wilhelm_1_lo
With the taxonomy prefix added, however, the file name became too long. The project's initial research proposal dictated that the ISO 9660, Level 2 convention for naming files would be followed, as mentioned in section 1.2, which allows file names of up to 31 characters. Only lowercase characters a-z, numerical digits, and the special characters period, underscore, and hyphen would be used. Spaces and other special characters would not be used. Thus, the file name was shortened in its third iteration.
For its fourth iteration, a new name was needed for the low-resolution surrogates of the digital master files, which were the files actually uploaded to Greenstone. The suffix 'lo' was appended to the file name, still within the 31-character limit. Then, after creating, building, and previewing the Greenstone collection, the author learned that the term "page" in the file name confused Greenstone and caused compilation of the library to abort. Thus, in its fifth iteration, the term 'page' was removed from the file name.
Finally, the author learned that Greenstone reads a file name only up to the first period to extract the name of the file, and then stops reading. Typically, one would call a file filename dot JPEG, or filename dot BITMAP, and so forth. But Greenstone stopped reading the file name at the first dot, so in our example it read only 23. Thus, the file name was recorded in Greenstone as 23, along with all other files prefixed '23,' excluding the rest of the file name and confusing the user with multiple files titled "23." Therefore, in its sixth and final iteration, periods were removed from file names to allow Greenstone to properly index the entire name of the file.
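The collision described above is easy to reproduce outside Greenstone. The sketch below does not call Greenstone at all; it merely simulates the behavior the author observed (a name read only up to its first period) so that a batch of candidate file names can be screened in advance. The function names, the second file name in each list, and the `.jpg` extension are hypothetical.

```python
from collections import Counter


def name_up_to_first_period(file_name: str) -> str:
    """Simulate the observed behavior: keep only the text before the first period."""
    return file_name.split(".", 1)[0]


def find_collisions(file_names: list[str]) -> dict[str, int]:
    """Return each truncated name shared by more than one file, with its count."""
    counts = Counter(name_up_to_first_period(name) for name in file_names)
    return {stem: count for stem, count in counts.items() if count > 1}


if __name__ == "__main__":
    with_periods = ["23.6.46-7.1_wilhelm_1_lo.jpg", "23.6.46-7.2_wilhelm_1_lo.jpg"]
    without_periods = ["23646-71_wilhelm_1_lo.jpg", "23646-72_wilhelm_1_lo.jpg"]
    print(find_collisions(with_periods))     # both names collapse to "23"
    print(find_collisions(without_periods))  # no collisions
```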
With each iteration of renaming files, all associated metadata needed to be reloaded as well. It wasn't just a matter of renaming one file. The new digital library designer would do well to test a small sample of about five files, running them through the whole process, including compiling the library, before naming all files. Technical anomalies, as with the term 'page' or the 'dot,' were unavoidable. But beware of rushing into a project without giving serious thought to planning the file names. Time invested in planning upfront nets tremendous savings later in the project.
3.2 Experience counts
With that thought in mind,
lesson 2 teaches
us that experience counts, or rather that inexperience can result in
painful
revisions and delays. Problems encountered in the project were,
perhaps, typical of a first effort in creating an online digital
library
collection. Problems arising from inexperience in digital project
planning, as mentioned, and problems of a technical nature were common
in the
effort.
Inexperience in prepping the documents and retaining their provenance added time. With little training, creating a taxonomy for the first time added time. Understanding new and complicated standards, like the TEI P5 Guidelines, added time. Poor equipment selection added time by requiring that some documents be outsourced. Inexperience with Library of Congress Subject Headings meant a long and laborious search for appropriate headings for collection objects, which added time.
Several technical problems resulted from inexperience. The resolution of scanned images was mediocre, owing to the age of the scanner and poor technique, and troubleshooting the Greenstone Librarian Interface while constructing a basic digital library was an ongoing battle that a more experienced builder would not have had to endure.
3.3 Choose wisely
The free, open access Greenstone Digital Library software is a welcome solution for many collections which would otherwise not be mounted to the World-Wide Web. The power of posting to the world an item plus its full bibliographic record for later data harvesting is the stuff of which librarians dream. But the number of difficulties to be surmounted, and the limited support documentation, made the selection one to reflect upon. Greenstone proved a very cramped space in which to build a repository for a beginner. Its assets are great for universality of metadata, but they come at a high cost. Dated and unreadable user manuals meant repeated combing through computerese written in awkward English.
Grand designs of "reading rooms" or side-by-side ASCII text translations for each object in An American Tale, FAQs, a Contact Us page, a Chronological Lifeline, and a User Guide all fell flat against the hard-to-understand capabilities served up in Greenstone.
If the author were to embark on the same project again, the author would invest more time in learning about the potential of other content management systems like DSpace, WebPress, Drupal, Mambo, PostNuke, or Plone. They may not explicitly be identified as "digital library" software, or automatically support library standards like MARC, Dublin Core, or the Open Archives Initiative Protocol for Metadata Harvesting, but their features and documentation may be better suited to the non-programmer making a first attempt at creating a digital library.
3.4 Be flexible
Revisions during the course of the project
meant a better end-product. Original scanning techniques were changed
mid-project to improve clarity. Tighter
chronological groups of records meant more refined presentation. Outsourcing oversized documents meant
inclusion of critical objects. Therefore, flexibility
resulted in a more polished digital library.
3.5 Keep a sense of humor
Above all else, the reader should remember to try to maintain a sense of humor. The task of mounting a digital library to the World-Wide Web is no small feat. The ability to stand back and laugh at one's foibles or mishaps will only help to keep the project, and its creator, on track.
|
|
In Forging cultural heritage collections online: The story of An American Tale, the reader learned how to lay the foundation for building an online digital collection of cultural heritage artifacts, by first defining one's requirements and then defining one's design. The reader learned what the finished product might look like using the Greenstone Digital Library platform and its assorted features. Finally, as a result of the 8-week effort, the reader learned five important lessons concerning digital library design which may save time and heartache in any future attempts.
Like the journey of early settlers to Missouri, the road to constructing a premier digital collection is fraught with danger. Through the mistakes made in the author's own experience, the reader may take away many valuable lessons with which to begin the journey toward building a collection of universally accessible artifacts of cultural heritage, worthy of preservation for generations to come.