Digital History: Source Availability, Retention, & the Shifting Possibilities of Scholarship
Well, well, well, what have we here? Just 588 billion archived websites, 28 million books and texts, 14 million audio recordings (including 220 thousand live concerts), 6 million videos (including 2 million television shows), 3.5 million images, and 580 thousand software programs, including some classic video games like The Oregon Trail. Amidst the hum of the Internet Archive’s server rooms in San Francisco resides this vast quantity of digitally archived resources, a source collection that is unparalleled in human history. The incredible amount of digitally born and digitized historic source material is enough to make one’s head spin. In fact, the dizzying statistics may have been enough to make you gloss over those staggering numbers! If that vastness and source possibility makes your heart skip a beat, you’re not alone. Historians, librarians, and humanities scholars are currently grappling with the implications and uses of this relatively new plethora of source materials, which is not limited to the Internet Archive. While there is so much potential for scholarship, those working with the raw data need to consider standards for curation and access, metadata, algorithmic bias, evolving digital humanities methodologies, creator consent, and the software available to present research/data to the public.
Thanks to the Internet Archive’s foresight in starting to collect in the 1990s, we now have an unfathomable amount of digitally born archived source material from around the world. The Internet Archive collects materials and metadata in all languages with characters that can be read in UTF-8, the most common encoding, which accounts for 97% of all websites. My understanding is that this standardized encoding and across-the-board collecting make this archive the most inclusive post-1996 digital archive of geographically and culturally diverse sources, and that they also reduce the Western-centric collecting practices and older metadata descriptions that often had roots in colonialism and antiquated subject headings, issues that many academic libraries are currently trying to rectify by updating their metadata standards and catalog records.
The current “age of abundance,” as Dr. Ian Milligan aptly describes it, raises a lot of questions about the quantity and quality of digitally archived source materials; about archival consent, intent, and privilege; and about the mass accessibility and longevity of long-abandoned digitally born sources, including GeoCities websites. Some of what’s been retained would historically have been considered ephemera by archivists. In the case of the Internet Archive, the Wayback Machine gives historians a wider potential window into that era, but right now researchers must have a specific web address instead of keywords to search. It’s a limitation, but also a potential privacy tool, since most folks in the 1990s/early 2000s were not creating for future consumption.
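For anyone who wants to script that address-based lookup, here is a minimal Python sketch. It assumes the Internet Archive's public availability endpoint at https://archive.org/wayback/available and the JSON response shape it returned as far as I know; the function names and the example.com address are placeholders of my own, not anything official.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's public availability lookup.
API = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the lookup URL for one specific web address.

    timestamp is optional, in YYYYMMDD (or longer) form, and asks for
    the capture closest to that date.
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return (snapshot_url, timestamp) from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


def lookup(page_url, timestamp=None):
    """Query the live service (needs network access)."""
    with urlopen(build_availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling lookup("example.com", "19990101") would ask for the capture closest to January 1, 1999. Notice that the only way in is still a known address, which is exactly the limitation (and the accidental privacy buffer) described above.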
As an elder millennial, I used the Wayback Machine to look at the web crawl captures of the personal website that I built in high school, which was a combination of Angelfire and GeoCities. The teenage version of myself could not fathom that a little page I made for friends, prior to the existence of social media giants like Myspace or Facebook, would still be viewable in time capsule form. It had been a while since I’d looked at my archived site, but I noticed on one capture that I’d added a page promoting an artist that I had met in NYC. I’d completely forgotten about that and was pleasantly surprised that a street artist I bought some prints from in 1999 was still creating art and now had a following of about 600k across some platforms. While in some ways it’s cool that archived versions of my website are still viewable, sometime after 2005 I made an effort to remove the link to my emo, goth teenage poetry page, which, to my dramatic dismay, still exists in an archived form. Le sigh. Broken crystal hearts and all. This is an example of how an author can attempt to change what is available/visible to the public but has little control over previously archived material.
Although I’m pursuing a degree in history, for the past 13 years I’ve worked in an academic library as an Interlibrary Loan Borrowing and Distance Learning Coordinator. I’ve become an expert at verifying citations and locating materials from all over the world that my library doesn’t own or have access to, which adds another perspective on source access. I use archive.org and Google Books all the time to access pre-1925 historic books. A lot of this digitization work is wonderful for providing immediate access to resources, but one needs to be mindful of the scans and their potentially reduced quality. Not all digital access and readability is equal.
During my time in this field, I’ve gotten to watch the increasing shift towards digitized material availability and be involved in copyright discussions. At this point, there’s a lot of material that is still only available in print or in increasingly obsolete mediums, such as microfilm/fiche. There’s a lot of work to be done in the realm of digitizing non-digitally born/pre-1996/post-1925 materials. However, in many instances there is an additional copyright barrier that creates a gap in data-mining and source availability for works published after 1925. I think it’s important for historians and librarians to be mindful of this gap and seek out grants to increase access when copyright lapses. Additionally, eBooks and scholarly articles (even many that are digitally born) are restricted for loans/access by the publishers. The high paywalls and restricted access have contributed to the Open Access movement, which will also benefit future scholars.
I’m looking forward to further discussion of how digital humanities ethics, access, qualitative data collection, and algorithmic bias shape the field, along with how various software can be used to present research to the public.
Also, this adventure in WordPress has been an exercise in failing gloriously. I’m much more familiar with Wix and have been stumbling through this new-to-me medium. I hope the page will look a little nicer later this week.
The Library of Congress is another good example of curating digitally born materials. They archived all public Tweets from 2006 to 2017. After more than 10 years, they reassessed their archival practices and changed their retention policy to include Tweets only on a selective basis. https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/
A lot of these topics will be the focus of future modules, so it’s nice to start thinking now about them and how they all weave together.
 Please don’t call me a “geriatric” millennial. *cries*