Tool Development for Optimized Scalable Ingestion Workflows: The Case of the Japanese American WWII Incarceration Camp Incident Cards

Developer: Magdalena Balderas | Professor: Dr. Kathy Weaver | Client: Dr. Richard Marciano
Cultural institutions are currently responsible for millions to billions of objects. In some cases these objects are accessible to the public, but in most cases they are not, because of limited resources and rules that restrict access.
POSSIBILITIES
If ingestion workflows can be optimized to run at scale, everyone will benefit, including cultural institutions, scholars, researchers, students, and the general public. More digital data allows web databases to be created in which people, places, dates, and other entities can be linked, as shown below. This allows events to be integrated comprehensively with the available data.
Japanese American WWII Incarceration Camps: Tule Lake
The case of the Japanese American WWII Incarceration Camp Incident Cards is an interesting one. The cards were initially accessible but were removed from public access in the fall of 2014. Updates to the access policy in 2015 allow them to be accessed again, provided they do not reference minors and can be automatically appraised. The Digital Curation Innovation Center (DCIC) established a partnership with the National Archives and helped fund the recent digitization of the camp incident cards and their automated appraisal.
Challenges Faced:
• Unobtainable data in the designated timeline
• Lack of funding
• Time constraints
• Uncertainty of findings and possibilities
Internment Card, Box 8 Tule Lake 0307
Provided by the National Archives and Records Administration
1943, “RIOT”
INCIDENT CARDS
Image retrieved from www.vintag.es
Provided by the National Archives and Records Administration
Successes of the project:
• Comprehensive understanding and description of the ingestion workflow for cultural institutions and their objects
• Comprehensive analysis of how commercial and open source software fail to meet the requirements of the ingestion process for cultural institutions
• Creation of a clear vision of the future research and software development needed in digital curation for automated ingestion, especially of data from cultural institutions
INGESTION WORKFLOW
Cultural Object
Digitization
• Provided by National Archives and Records Administration (NARA) Scan Lab
Optical Character Recognition (OCR)
• KoFax Express**
• Tesseract
• Cuneiform Linux
• ABBYY FineReader
Text Extraction
• PDF2Text**
• Zilla PDF
• PDF to Text
• TextfromPDF
Named Entity Recognition (NER)
• Alchemy API
• OpenNLP
• Stanford NER
• OpenCalais
Ingestion into Database
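The later stages of the workflow above (text extraction output, entity recognition, ingestion into a database) can be sketched roughly as follows. This is a minimal illustration and not the project's implementation. The sample card text is invented, the function and table names are hypothetical, the regular expressions are a naive stand-in for a real NER tool such as Stanford NER or OpenNLP, and an in-memory SQLite database stands in for whatever database the project targets.

```python
import re
import sqlite3

# Invented sample text standing in for the output of the OCR and
# text-extraction stages. It is NOT taken from a real incident card.
SAMPLE_CARD_TEXT = "Incident: EXAMPLE. Date: 01/02/1943. Location: Example Camp."


def extract_entities(text):
    """Naive regex stand-in for the NER stage (assumption: not a real NER tool)."""
    entities = []
    # Dates written as M/D/YYYY or MM/DD/YYYY.
    for match in re.finditer(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text):
        entities.append(("DATE", match.group()))
    # Places introduced by a hypothetical "Location:" label.
    for match in re.finditer(r"Location:\s*([A-Z][A-Za-z ]+)", text):
        entities.append(("PLACE", match.group(1).strip()))
    return entities


def ingest(conn, card_id, text, entities):
    """Store one card and its entities so they can be linked by card_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cards (card_id TEXT PRIMARY KEY, text TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities (card_id TEXT, kind TEXT, value TEXT)"
    )
    conn.execute("INSERT INTO cards VALUES (?, ?)", (card_id, text))
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?, ?)",
        [(card_id, kind, value) for kind, value in entities],
    )
    conn.commit()


def run_pipeline(conn, card_id, ocr_text):
    """Run entity extraction and database ingestion for one card's text."""
    entities = extract_entities(ocr_text)
    ingest(conn, card_id, ocr_text, entities)
    return entities
```

Calling `run_pipeline(sqlite3.connect(":memory:"), "sample-1", SAMPLE_CARD_TEXT)` stores one card row plus one row per extracted entity. Keeping entities in their own table keyed by `card_id` is one simple way to support the kind of linking between people, places, and dates described under POSSIBILITIES.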