Digitization (& access), Data Management, & Clean Data

“What do we keep, when do we let it go, and who has the authority, expertise, and resources to digitize materials?” These are some of the questions that came up during the breakout session, and they sparked an excellent discussion. We don’t often think about the environmental impact of data storage or the hidden costs of electronic access. In some ways it’s a tradeoff: print consumes physical resources (deforestation), and sharing documentation means literally shipping materials (gas, packing materials, and transportation labor), while digital access carries costs of its own. There are conflicting statistics about how much better an eBook may (or may not) be. However, we can’t escape the reality that the world’s digital footprint is contributing significantly to the climate crisis. According to slightly dated research from 2019, global data transfers, including streaming media like Netflix, accounted for at least 4% of CO2 emissions.[1] With a global pandemic that has made folks rely on virtual meetings and the like, that amount has surely increased. However, there is also likely an offset from reduced fossil fuel emissions related to transportation.

It is important for us to assess the environmental impact of our archival collections (digital and physical), along with information disparities. There are regions of the world where digitizing is cost prohibitive and/or inaccessible. Being asked “who” is able to digitize really made me think. Besides creating digital content with my social media footprint, texts, e-mails, and website, I scan physical works nearly every day in a library setting. However, this is not the case for all libraries across the world, or for areas in turmoil where even having access to a library is a privilege. In many places affected by colonialism and war, materials can be scant, written through the lens of oppressors, destroyed by violence, or not a priority for people struggling to meet basic needs.[2]

In my bubble of plenty, access to resources and research materials is generally not a problem. I’ve worked in Resource Sharing for over a decade and find that it’s rare for me to have to cancel a request to borrow materials. There is a culture of reciprocity and helpfulness in this nearly[3] worldwide slice of the library world. However, there are gaps in access. My job is facilitated by having access to OCLC (Online Computer Library Center), a membership-based non-profit cooperative that produced and maintains WorldCat, the world’s largest library catalog.[4] Member libraries can upload their holdings information, which is then searchable worldwide. It’s why, less than a month ago, I received a request to scan an out-of-copyright rare book for a researcher at a library in Cape Town, South Africa. My library was one of two institutions that owned a copy, and because our holdings were uploaded and searchable globally, another library knew to contact us. That request and task wouldn’t have been possible 20 years ago.

Thinking about the advances in searching and access, I found myself at times frustrated with Lara Palmer’s article. I agreed with the parts about the side-glance and the importance of context, but I think that having more access to content is generally a benefit to researchers and that understanding context is something responsible historians should be mindful of. Things that were once inaccessible are so much more readily available. BUT, that’s not true of ALL locations, and there are areas of the world where researchers need to travel physically to access materials that have not been digitized or added to WorldCat’s database. I was looking through the lens of privilege and not thinking about how many libraries don’t lend internationally through OCLC (or other means).

This doesn’t get at the heart of what should be retained digitally. That question is going to have a million different answers influenced by local infrastructure, culture, storage and curation privilege, environmental impact, etc. However, I’m personally in favor of making information as accessible and available to researchers as possible (and hopefully the environmental impact will be lessened with technological and storage advances). This is the way in which voices that have traditionally been silenced can resurface. I personally want to do the work to tell the stories of those who were marginalized throughout history and sometimes that means digging through obscure references.

Which leads me to the technical exercise! I wish I’d found OpenRefine sooner. I’m still learning and bumbling through the wording needed to make some of the “transform” functions do what I want, but I believe I have the basics. For the assignment, I found that John was the most common name, with roughly 1860 records that included a variation of John: John (1571 records), Sir John (242 records), John Baptist (3 records), and so on with other middle names.[5]

Sorting John names, including instances of “Sir John”

While John is the most common name in the dataset, it drew my attention to the incompleteness of the records. Very few women are represented here. In fact, this dataset contains approximately 610 women, less than 5% of the records (4.56%). It’s possible that the most popular name for this time and region belonged to a woman, potentially Elizabeth (89 records) or Mary (81 records), but without a larger and more inclusive sample, it’s hard to know whether Elizabeth was more popular than John.
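For anyone who would rather check the name tally outside OpenRefine, here is a minimal Python sketch of the same idea: strip a leading title such as “Sir” and count what is left as the first name. The sample names, the list of titles, and the `first_name` helper are all made up for illustration; the real dataset’s columns and titles may look different.

```python
import re
from collections import Counter

# Hypothetical sample rows; the real dataset's layout is not shown in this post.
names = [
    "John Example",
    "Sir John Example",
    "John Baptist Example",
    "Elizabeth Example",
    "Mary Example",
    "Sir Thomas Example",
]

# Assumed list of leading titles to ignore when counting first names.
TITLES = re.compile(r"^(?:Sir|Lady|Dame|Lord)\s+", re.IGNORECASE)

def first_name(full_name: str) -> str:
    """Return the first word of a name after removing one leading title."""
    stripped = TITLES.sub("", full_name.strip())
    return stripped.split()[0] if stripped else ""

# Tally first names across the sample.
counts = Counter(first_name(n) for n in names)
print(counts.most_common(3))
```

Whether “Sir John” should count as “John” is a judgment call, which is likely why different people doing this exercise land on slightly different totals.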

For the 2nd part of the assignment, I split the date columns using the separators “,” and “-”. This reduced the number of death dates that turned up when searching for 1533. I manually reviewed the remaining 33 records and eliminated “flourishing” dates and death-only dates. The result was 25 remaining records, including Elizabeth’s record #12172.
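The split-and-filter step can be sketched in Python as well. This is only a rough approximation of what I did by hand: it assumes the date cells look something like “1533-1603”, “fl. 1533”, or “d. 1533”, and those “fl.” and “d.” prefixes are my guess at the notation rather than a documented format. The function names are hypothetical.

```python
import re

def split_dates(raw: str) -> list[str]:
    """Split a raw date cell on ',' and '-' and trim whitespace."""
    return [part.strip() for part in re.split(r"[,-]", raw) if part.strip()]

def is_birth_in(raw: str, year: int) -> bool:
    """True if the first date part looks like a birth (or baptism) in `year`."""
    parts = split_dates(raw)
    if not parts:
        return False
    first = parts[0].lower()
    # Skip "flourished" and death-only entries (assumed "fl." / "d." prefixes).
    if first.startswith(("fl.", "d.")):
        return False
    # Match the year as a whole number inside the first part only.
    return re.search(rf"\b{year}\b", first) is not None

# Made-up cells: only the first one should count as a 1533 birth.
cells = ["1533-1603", "1500-1533", "fl. 1533", "d. 1533"]
born_1533 = [c for c in cells if is_birth_in(c, 1533)]
print(born_1533)
```

Looking only at the first part after the split is what keeps death years like the one in “1500-1533” from showing up as false hits.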

Finding out how many people in this dataset were born between 1533 and 1665 was a little trickier. With the easy-to-sort numerical data, I found 2822 records. An additional 478 baptism records meet these date criteria, along with at least 1242 records based on circa dates. That makes at least 4542 people in this dataset born during these dates. I know there must be an easier way to search and manipulate the data, and I need to read more to find those tools.
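One possible “easier way” is to sort each record into a category (exact, baptism, or circa) and count those that fall in the range. The sketch below does that in Python. It is not what I actually ran, and it assumes prefixes like “bap.”, “c.”, “fl.”, and “d.” that may not match the dataset’s real notation; the sample cells and function names are invented.

```python
import re
from collections import Counter

def birth_kind_and_year(raw: str):
    """Classify the first date part as exact, baptism, or circa; None if unusable."""
    first = re.split(r"[,-]", raw)[0].strip().lower()
    match = re.search(r"\b(\d{4})\b", first)
    # Ignore cells with no year, plus flourished and death-only entries.
    if not match or first.startswith(("fl.", "d.")):
        return None
    year = int(match.group(1))
    if first.startswith("bap."):
        return ("baptism", year)
    if first.startswith("c."):
        return ("circa", year)
    return ("exact", year)

def tally_births(cells, start=1533, end=1665):
    """Count births per category whose year falls within start..end inclusive."""
    tally = Counter()
    for raw in cells:
        parsed = birth_kind_and_year(raw)
        if parsed and start <= parsed[1] <= end:
            tally[parsed[0]] += 1
    return tally

# Made-up cells covering each case, including two that fall outside the range.
sample = ["1533-1603", "bap. 1600, d. 1650", "c. 1640-1700",
          "fl. 1550", "d. 1600", "1700-1750", "1665-1720"]
tally = tally_births(sample)
total = sum(tally.values())
print(dict(tally), total)
```

Keeping the three categories separate before adding them up mirrors the totals above (2822 exact, 478 baptism, and 1242 circa records, for 4542 in all) and makes it easy to report the count with or without the less certain dates.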

[1] https://www.ecowatch.com/netflix-bad-for-environment-2639174138.html?rebelltitem=1#rebelltitem1 Accessed September 4, 2021.

[2] I’m not really sure where to put this without going on a huge tangent, but CRL (the Center for Research Libraries) has one of the best collections of newspapers from the Middle East. They haven’t been digitized and are available for library-to-library circulation on microfilm. HOWEVER, in order to access these materials, your library either needs to be a member of CRL (which is VERY expensive) or pay $175 for the request. This adds a level of gatekeeping and privilege to accessing rare materials.

[3] There are definitely parts of the world where access to international library resources is less common. OCLC has member libraries in approximately 109 countries, but that means there is no presence in 86 countries, leaving out a lot of the world.

[4] The official term is OPAC which stands for online public access catalog.

[5] I wasn’t sure if I should include the few records that had John as a middle name. Middle-name records and Fitz John (son of John) account for only 6 of the 1860 records associated with the name “John” in this dataset.

One Comment

  • Olivia Holly- Johnson

    I am so glad that someone in the class actually liked OpenRefine. I can see its potential, but it was very frustrating for me to work with. That may be because I am not a big fan of Excel and it reminded me of that. I got similar results to you, with John being the most popular name at 1,780, but that is after removing titles. I didn’t even think about middle names. I would like to know where the data in this set was gathered from; that might shed light on why so few women were included. Is it because the initial gatherers did not include many women, or is it because women’s data was not noted during the time period?
