
Tools

Reading In: Analyzing Embedded Metadata in Digital Images

 

When the news came out about Dylann Roof’s website and the photographs that had been posted there, I visited the site and downloaded the package of digital images out of curiosity, as an archivist, to take a look inside the images. I was interested to see the metadata, to break open the files — so to speak — and evaluate the bytes themselves. As archivists, that is exactly what we do with any collection of digital objects. We look under the hood. We look for patterns and information that will help us understand the context in which the files were created, and to help us identify any evidence that might inform us of the provenance, the originality, the history of the documentary resources we have acquired.

As an archivist, working in a contemporary information landscape, I find that digital files are ultimately the things in which I am most interested. These are the containers of the information we often seek to preserve and make accessible. At a high level of abstraction, a digital file is a stored block of information that is available to a computer program. Computer operating systems consider files as a sequence of bytes, while application software interprets the binary data as, say, text characters, image pixels, or audio samples.
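As a minimal illustration of that last point, the sketch below reads the same four bytes the way an operating system, a text application, and an audio application might. The byte values are arbitrary and chosen only for the example.

```python
import struct

# Four arbitrary bytes, chosen only for illustration.
raw = bytes([0x4A, 0x50, 0x45, 0x47])

# The operating system's view: an ordered sequence of byte values.
as_bytes = list(raw)

# A text application's view: the same bytes decoded as ASCII characters.
as_text = raw.decode("ascii")

# An audio application's view: two signed 16-bit little-endian samples.
as_samples = struct.unpack("<2h", raw)

print(as_bytes)    # [74, 80, 69, 71]
print(as_text)     # JPEG
print(as_samples)  # (20554, 18245)
```

Nothing in the bytes themselves says which reading is correct; that is the job of the file format and the metadata that accompanies the payload.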

If you find yourself in the library or archive world then you have probably long since tired of hearing the word metadata. Metadata undefined is a useless word. In this article, I am focused on metadata embedded in a file header (most file types also allocate a few bytes for metadata, which allows a file to carry some basic information about itself separate from the binary payload). This is the information contained within the file that helps a piece of software understand what the file is and how to decode the bits so that a human can understand them as they were intended to be understood. It can also include embedded chunks of data that provide additional information (descriptive or ancillary) about the contents of the file. This information is very unlikely to change when the files are copied to new storage environments.

There are a variety of tools that make it easier to extract packets of embedded metadata for certain file types. A hex editor can help you look at the actual bytes, but without a clear understanding of a particular file format specification at the level of bytes and offsets, it is extremely difficult to make sense of the embedded information. A few developers have taken the initiative to build automated tools that help users translate the embedded bytes into human-readable information. Examples include ExifTool (optimized for still image files), MediaInfo (optimized for audio and video files), and ffprobe (optimized for video files). For this evaluation, I used ExifTool because its strength is the extraction of metadata from digital images.
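To give a sense of what working "at the level of bytes and offsets" involves, here is a simplified sketch that walks the header segments of a JPEG and checks for an Exif block. It assumes a well-formed file and ignores edge cases (padding bytes, markers without a length field) that a tool like ExifTool handles; the function names are my own, not part of any library.

```python
import struct

def list_jpeg_segments(data: bytes):
    """Return (marker, payload) pairs for the header segments of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing Start of Image marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:  # Start of Scan: compressed image data follows
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((marker, data[pos + 4:pos + 2 + length]))
        pos += 2 + length
    return segments

def has_exif(data: bytes) -> bool:
    """True if an APP1 segment (marker 0xE1) carries the Exif identifier."""
    return any(marker == 0xE1 and payload.startswith(b"Exif\x00\x00")
               for marker, payload in list_jpeg_segments(data))
```

Calling `has_exif(open(path, "rb").read())` on an image would report whether an Exif segment is present; decoding the tags inside it is the much larger job that the tools above automate.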

What follows is an embedded metadata evaluation of the digital images posted to the website of Dylann Roof (http://lastrhodesian.com/). Downloaded 2015–06–22. Available for download from the Internet Archive here.

 

Overview of the Files

Zipped file downloaded from website: 103600296_19.zip.

Decompressed on local MacBook Air to folder entitled 103600296_19.

Folder contains 60 digital images.

Contents:

 

 

Evaluation of the Files

Upon inspection with ExifTool (using the following command: exiftool -csv -a -r ./ > out.csv), the following observations are made.
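For readers who want to work with the resulting out.csv programmatically, the following is a rough sketch of one way to group files by camera. It assumes the CSV has ExifTool's SourceFile column plus Make and Model columns; files lacking those tags fall into an "unknown" group. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def group_by_camera(csv_text: str):
    """Map (Make, Model) to the list of files reporting that camera."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed column names; empty or absent values count as unknown.
        key = (row.get("Make") or "unknown", row.get("Model") or "unknown")
        groups[key].append(row["SourceFile"])
    return dict(groups)

# Typical use against the file produced by the command above:
# with open("out.csv", newline="", encoding="utf-8") as f:
#     for camera, files in group_by_camera(f.read()).items():
#         print(camera, len(files))
```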

These 60 images were likely taken with two different cameras. Two cameras are actually visible in 100_1611.JPG, and the extracted metadata corroborates this observation (e.g., one set of files shares similar metadata fields that differ from those of the other set, and the two sets follow different file-naming logic).

The observation is that 34 images came directly from what we will call Camera 1, and 26 images came from what we will call Set 2 (we are confident that these images were not created with Camera 1, but we do not know for sure whether all of them came from the same camera).

 

Camera 1 Files

Images from Camera 1 have the following creation and modification dates:

 

 

From this set, one photo was taken in 2014: 2014–08–03 @ 16:56:55 to be precise [100_1443.JPG]. The remaining photos were taken over the span of three months, from 2015–03–18 to 2015–05–11.

Four of these images from Camera 1 were modified using Microsoft Windows Photo Viewer 6.3.9600.16384 on 2015–06–17 at the following times:

14:45:38–05:00 (CDT) [100_1611.JPG]

14:48:10–05:00 (CDT) [100_1688.JPG]

14:49:30–05:00 (CDT) [100_1706.JPG]

14:51:46–05:00 (CDT) [100_1808.JPG]

 

Set 2 Files

Images from Set 2 have no date of creation in the files themselves, but file modification dates exist:

 
 

 

Twenty-two of the twenty-six images were last modified (likely extracted from a camera and transformed to JPEG with computer software) on 2015–03–16 between 21:30:10–05:00 and 21:34:18–05:00: roughly four minutes to extract and convert 22 images.
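The length of such a window is easy to check in code. The sketch below assumes timestamps in the form ExifTool prints for file modification dates (colon-separated date, then time and UTC offset) and requires Python 3.7 or later for the offset parsing; the function name is my own.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2015:03:16 21:30:10-05:00".
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S%z"

def modification_window(timestamps):
    """Return (earliest, latest, elapsed) for a list of timestamp strings."""
    parsed = sorted(datetime.strptime(t, EXIF_FORMAT) for t in timestamps)
    return parsed[0], parsed[-1], parsed[-1] - parsed[0]

first, last, elapsed = modification_window(
    ["2015:03:16 21:34:18-05:00", "2015:03:16 21:30:10-05:00"]
)
print(elapsed)  # 0:04:08
```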

Two images, 103753459_18.jpg and 103753459_21.jpg, were last modified on 2015–03–24 at 23:01:50–05:00 and 23:02:10–05:00 respectively. These happen to be the same images (modified) as 100_1636.JPG and 100_1644.JPG from Camera 1. Comparing file modification dates, it appears that the Camera 1 images are the originals and that the Set 2 images are modified versions of the two Camera 1 photographs.

Two images, 103600296_4.jpg and 103600296_3.jpg, were last modified on 2015–06–17 at 16:53:10–05:00 and 16:53:20–05:00 respectively. Exif metadata supports this observation. Both files were adjusted with Microsoft Windows Photo Viewer 6.3.9600.16384 (likely the orientation was changed from horizontal to vertical).

Additionally, color coded above, it appears there are three separate batches of photographs generated from Set 2, based on the machine-generated filenames (and the modify dates).

 

Overview of Cameras Used by Dylann Roof

Camera 1: Blue 14-MP C1530 EasyShare Digital Camera

Set 2: Based on Exif data for ExifByteOrder that read “Big Endian (Motorola MM),” it is possible that images from Set 2 could be from one of these cameras:

Also, based on Exif data for two of the images associated with Set 2, it is likely that these photos were imported into a PC with a Microsoft operating system and saved as JPEGs using Microsoft software: PhotoViewer.dll 6.3.9600.16384.
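For context on the byte-order value mentioned above: Exif data is stored in a TIFF structure whose first two bytes declare the byte order, "MM" for big-endian (Motorola) or "II" for little-endian (Intel). A minimal sketch of reading that header follows; the function name and label strings are my own.

```python
import struct

def read_tiff_header(tiff: bytes):
    """Return (byte-order label, offset of the first IFD) from a TIFF header."""
    order = tiff[:2]
    if order == b"MM":
        prefix, label = ">", "Big-endian (Motorola, MM)"
    elif order == b"II":
        prefix, label = "<", "Little-endian (Intel, II)"
    else:
        raise ValueError("unrecognized byte-order mark")
    # The next 6 bytes are the magic number 42 and the first IFD offset,
    # both stored in the byte order just declared.
    magic, ifd_offset = struct.unpack(prefix + "HI", tiff[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number")
    return label, ifd_offset

print(read_tiff_header(b"MM\x00\x2a\x00\x00\x00\x08"))
# ('Big-endian (Motorola, MM)', 8)
```

Byte order reflects how the camera or software wrote the file, which is why it can help narrow down, but not identify, the device.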

 

Additional Analysis

Six of the 60 files were modified on the day of Roof’s terrorist actions: four from Camera 1 and two from Set 2.

 

The most recent photographs taken in the batch were from 2015–05–11. There were three photographs taken that day. 

 

 

The earliest photograph in the batch is from 2014–08–03 and was taken with Camera 1.

 

 

Create and modification dates suggest that most images from Set 2 were taken on or before 2015–03–16, whereas images from the Camera 1 set begin being generated as of 2015–03–17.

I stop here because these are not my photographs. These are not photographs that I intend to collect. However, I include a link to the raw extracted metadata for others to evaluate and examine.

MediaSCORE & MediaRIVERS

A free, open source media preservation prioritization web application created in a collaboration between AVPreserve and Indiana University. MediaSCORE (Media Selection: Condition, Obsolescence, and Risk Evaluation) enables a detailed analysis of degradation and obsolescence risk factors for most analog and physical digital audio and video formats. MediaRIVERS (Media Research and Instructional Value Evaluation and Ranking System) guides a structured assessment of research and instructional value for media holdings.


MDQC

MDQC reads the embedded metadata of a file or directory and compares it against a set of rules defined by the user, verifying that the technical and administrative specs of the files are correct. This automates and minimizes the time needed to QC large batches of digitized assets, increasing the efficiency of managing digitization projects. MDQC can be used on any file type supported by ExifTool and MediaInfo. Both ExifTool and MediaInfo will need to be installed on your system in order for MDQC to work.
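The general idea of rule-based metadata checking can be sketched in a few lines. This is not MDQC's rule format or code; it assumes the metadata has already been extracted into a dictionary and that every rule is a simple "tag must equal value" test. All names are hypothetical.

```python
def check_rules(metadata: dict, rules: dict):
    """Return (tag, expected, actual) for every rule the metadata fails."""
    failures = []
    for tag, expected in rules.items():
        actual = metadata.get(tag)  # None when the tag is absent
        if actual != expected:
            failures.append((tag, expected, actual))
    return failures

# Hypothetical extracted values and project specs.
extracted = {"ImageWidth": 4000, "ColorSpace": "sRGB"}
specs = {"ImageWidth": 4000, "ColorSpace": "Adobe RGB", "XResolution": 300}
print(check_rules(extracted, specs))
# [('ColorSpace', 'Adobe RGB', 'sRGB'), ('XResolution', 300, None)]
```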

Fixity

Fixity is a utility for the documentation and regular review of stored files. Fixity scans a folder or directory, creating a manifest of the files including their file paths and their checksums, against which a regular comparative analysis can be run. Fixity monitors file integrity through generation and validation of checksums, and file attendance through monitoring and reporting on new, missing, moved and renamed files. Fixity emails a report to the user documenting flagged items along with the reason for a flag, such as that a file has been moved to a new location in the directory, has been edited, or has failed a checksum comparison for other reasons. Supplementing tools like BagIt that review files at points of exchange, when run regularly Fixity becomes a powerful tool for monitoring digital files in repositories, servers, and other long-term storage locations.
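The manifest-and-compare approach described above can be sketched as follows. This is an illustration of the technique, not Fixity's implementation: it assumes SHA-256 checksums, reads each file fully into memory, and uses function names of my own.

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def compare_manifests(old: dict, new: dict):
    """Report changed, moved, new, and missing files between two scans."""
    report = {"changed": [], "moved": [], "new": [], "missing": []}
    old_by_digest = {digest: path for path, digest in old.items()}
    for path, digest in new.items():
        if path in old:
            if old[path] != digest:
                report["changed"].append(path)
        elif digest in old_by_digest and old_by_digest[digest] not in new:
            # Same content at a different path: treat as moved or renamed.
            report["moved"].append((old_by_digest[digest], path))
        else:
            report["new"].append(path)
    moved_from = {source for source, _ in report["moved"]}
    for path in old:
        if path not in new and path not in moved_from:
            report["missing"].append(path)
    return report
```

Saving the output of `build_manifest` after each scan and passing the previous and current manifests to `compare_manifests` gives the kind of flagged-item report described above.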

Open Reel Audio Duration Calculator

A simple Excel spreadsheet that shows the total capacity for 1/4 inch open reel audio, using variables for Track Configuration, Sound Field Configuration, Tape Thickness, and Reel Size. Assumes full tape reels and full use of capacity. Look for an online app version coming soon.
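The underlying arithmetic is tape length divided by playback speed. The sketch below is a simplification and not the spreadsheet's logic: it takes tape length directly (in practice determined by reel size and tape thickness) and a number of passes (in practice determined by track and sound field configuration). Parameter names are my own.

```python
def reel_duration_minutes(tape_length_ft, speed_ips, passes=1):
    """Recording time in minutes for a full reel.

    tape_length_ft: tape length in feet
    speed_ips: playback speed in inches per second
    passes: how many separate programs the track layout allows
    """
    seconds_per_pass = tape_length_ft * 12 / speed_ips  # feet to inches
    return seconds_per_pass * passes / 60

# Example: 1200 feet of tape at 7.5 inches per second.
print(reel_duration_minutes(1200, 7.5))     # 32.0
print(reel_duration_minutes(1200, 7.5, 2))  # 64.0
```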

Archival Management System (AMS)

The Archival Management System (AMS) is a multi-functional tool that supports management of the digitization workflow, especially useful for projects involving multiple departments or organizations. Existing functions in this open source software include: 1) Aggregation and normalization or refinement of collection inventories. Using tools such as Open Refine and MINT, data cleanup can be done analytically and in bulk, though the system also allows individual record-level editing. 2) Prioritization and selection of items for digitization. 3) Scheduling and system alerts to inform users when it is time to begin packing materials for shipping to digitization vendors, shipping dates, when the vendor has completed a batch, and when the materials will be shipped back. 4) Record level search which includes an audio or video player for playback of the digitized item. 5) Bulk ingest of technical and preservation metadata generated by the vendor. 6) Dashboard reporting that tracks project progress, including number and types of items, percentage of project completed, departments/locations, and other pertinent information. AMS was originally developed to support the digitization of 40,000 hours of audiovisual materials from 120 public media stations as part of the Corporation for Public Broadcasting American Archive project, and AVPreserve has also customized an instance for the Flemish Institute for Archiving (VIAA) to manage nationwide digitization from broadcasters, universities, and museums. The source code is available for download from GitHub, and AVPreserve can also provide services to customize it for project particulars such as organizational structure, workflow specifics, language, other material types (such as newspapers), reporting, systems integration, and more. A sandbox instance will be available here soon.

Catalyst Inventory Software

Catalyst is an innovative method of creating item-level inventories of audiovisual collections. The process uses a team of photographers onsite to image each item in a collection, capturing all information-carrying sides of a cassette/reel/disk, its housing, and any paper inserts. The photos are uploaded daily to our central server, where they are sorted into item records and fields for Unique ID, Location, and Format are automatically generated. After this, the database records are immediately accessible to a team of offsite catalogers who use the images to enter further metadata. Taking advantage of automated processing and minimal datasets, even a small team can work through hundreds or thousands of items a day. Catalyst data can be exported to generate reports for preservation planning and selection, or to become the basis of a finding aid or more complete catalog record. The benefit of the photos is that materials can be searched for and reviewed without the need to pull tapes until the correct items are identified, minimizing handling and staff time. Further descriptive cataloging can also take place at a more reasonable pace or after reformatting has been completed. The Catalyst Inventory software is currently only available as part of our inventory services, but screenshots are posted below or here and here.

AVCC

AVCC is an open source web application and guideline developed to enable collaborative, efficient item-level cataloging of audiovisual collections. The application incorporates built-in reporting on collection statistics, digital storage calculations, shipping manifests, and other data critical to prioritizing and planning preservation work with audiovisual materials.

Based on years of experience with how audiovisual collections are typically labeled and stored, AVCC establishes a minimal set of required and recommended fields for basic intellectual control that are not entirely dependent on playback and labeling, along with deeper descriptive fields that can be enhanced as content becomes accessible. The focus of AVCC is two-fold: to uncover hidden collections via record creation and to support preservation reformatting in order to enable access to the content itself.

interstitial

interstitial is a tool designed to detect dropped samples in audio digitization processes. These dropped samples are caused by fleeting interruptions in the hardware/software pipeline on a digital audio workstation. The interstitial tool follows up on our work with the Federal Agencies Digitization Guidelines Initiative (FADGI) to define and study the issue of Audio Interstitial Errors.

interstitial compares two streams of digitized audio captured to a digital audio workstation and a secondary reference device. Irregularities that appear in the workstation stream and not in the other point to issues like Interstitial Errors that relate to samples lost when writing to disc. This utility will greatly decrease post-digitization quality control time and help further research on this problem.
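As a toy illustration of the comparison idea only, the sketch below aligns two lists of samples and reports where the workstation stream skips ahead of the reference. It is not the interstitial tool's algorithm and assumes idealized, bit-identical streams; real captures from two devices would need tolerance for level and timing differences. The function name and parameters are hypothetical.

```python
def find_dropped_samples(reference, workstation, max_gap=8, window=4):
    """Return (workstation_index, samples_dropped) for each detected gap."""
    drops = []
    r = w = 0
    while r < len(reference) and w < len(workstation):
        if reference[r] == workstation[w]:
            r += 1
            w += 1
            continue
        # Look ahead in the reference for the point where the workstation
        # stream resumes; the distance skipped is the number of lost samples.
        for gap in range(1, max_gap + 1):
            if reference[r + gap:r + gap + window] == workstation[w:w + window]:
                drops.append((w, gap))
                r += gap
                break
        else:
            # A difference that no small gap explains: step past it.
            r += 1
            w += 1
    return drops

reference = list(range(20))
workstation = reference[:5] + reference[8:]  # samples 5, 6, 7 lost
print(find_dropped_samples(reference, workstation))  # [(5, 3)]
```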

AVI MetaEdit & reVTMD

AVI MetaEdit supports embedding and validating metadata in RIFF-based AudioVisual Interleave format (AVI) video files. AVI is currently the target format for creation of Preservation Masters within the Digitization Services Branch at the National Archives.

reVTMD is an XML schema tailored to include fields that address the creation and long-term management of reformatted videos, especially within the cultural heritage community. It is a concise subset of the large array of technical metadata available, structured in a way to make it highly usable for accessing and managing all types of video files beyond AVI.

Both tools were developed by NARA in collaboration with AudioVisual Preservation Solutions. AVI MetaEdit is available for download at NARA’s GitHub site, and reVTMD is available on NARA’s website.

BWF MetaEdit

BWF MetaEdit is a free, open source tool that supports embedding, validating, and exporting of metadata in Broadcast WAVE Format (BWF) files. BWF MetaEdit is available for download at SourceForge and was developed by the Federal Agencies Digitization Guidelines Initiative to support its guideline for embedded metadata in the bext and INFO chunks. The application was developed by AudioVisual Preservation Solutions.

Users of BWF MetaEdit can:

  1. Import, edit, embed, and export specified metadata elements in WAVE audio files
  2. Export technical metadata from Format Chunks and minimal metadata from bext and INFO chunks as comma-separated values and/or XML, across a set of files or from individual files
  3. Evaluate, verify and embed MD5 checksums, as applied to the WAVE file’s data chunk (audio bitstream only)
  4. Enforce the guideline (above) developed by the Federal Agencies Audio-Visual Working Group, as well as specifications from the European Broadcasting Union (EBU), Microsoft, and IBM
  5. Generate reports that show errors in the construction of WAVE files
  6. Choose from command line and GUI, for Windows/PC, Macintosh OS, Linux.
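The point of checksumming only the data chunk is that the digest stays the same when embedded metadata is edited. A minimal sketch of the idea follows; it is not BWF MetaEdit's code, it holds the whole file in memory, and it assumes a plain RIFF/WAVE layout (no RF64). The function name is my own.

```python
import hashlib
import struct

def data_chunk_md5(wave_bytes: bytes) -> str:
    """MD5 of the audio payload (the 'data' chunk) of a RIFF/WAVE file."""
    if wave_bytes[:4] != b"RIFF" or wave_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(wave_bytes):
        # Each chunk: 4-byte ID, 4-byte little-endian size, then the body.
        chunk_id, size = struct.unpack("<4sI", wave_bytes[pos:pos + 8])
        if chunk_id == b"data":
            return hashlib.md5(wave_bytes[pos + 8:pos + 8 + size]).hexdigest()
        # Chunk bodies are padded to an even number of bytes.
        pos += 8 + size + (size & 1)
    raise ValueError("no data chunk found")
```

Because the bext and INFO chunks are skipped, two copies of a file with different descriptive metadata but identical audio yield the same checksum.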

A Study of Embedded Metadata Support in Audio Recording Software

This report presents the findings of an ARSC Technical Committee study, coordinated and authored by AVPS, which evaluates support for embedded metadata within and across a variety of audio recording software applications. This work addresses two primary questions: (1) How well does embedded metadata persist, and is its integrity maintained, as it is handled by various applications, and (2) How well is embedded metadata handled during the process of creating a derivative? The report concludes that persistence and integrity issues are prevalent across the audio software applications studied. In addition to the report, test methods and reference files are provided for download, enabling the reader to perform metadata integrity testing.

PBCore Instantiationizer

PBCore Instantiationizer is part of a toolset for conforming extracted technical metadata to the instantiation element set of the PBCore 1.2.1 metadata standard. The automated approach to extraction and conformance allows for consistent application of standards to fields that require a strict level of control for usability, while also relieving the cataloger of the burden of documenting what can be a large dataset that is not easy for humans to read. This draft version of the Instantiationizer Toolset contains an XSL stylesheet as well as a Mac drag-and-drop application for an even simpler conformance process.

**New Version 1.2 Available**
The update of the PBCore Instantiationizer tool to version 1.2 presents refinements that improve usability and user control. Follow the links below for downloading elements of the Toolset and further information on the latest version and the development and use of the tool.

DV Analyzer

DV Analyzer is a technical quality control and reporting tool that examines DV streams in order to report errors in the tape-to-file transfer process. DV Analyzer also reports on technical metadata and patterns within DV streams such as changes in DV time code, changes in recording date and time markers, first and last frame markers within individual recordings, and more. To those concerned with preservation and archiving, this means that you now have the ability to automatically monitor integrity during reformatting of DV tapes and extract meaningful metadata from DV files.