Module 1 – Paula F. Green

Digital History: Source Availability, Retention, & the Shifting Possibilities of Scholarship

Well, well, well, what have we here? Just 588 billion archived web pages, 28 million books and texts, 14 million audio recordings (including 220 thousand live concerts), 6 million videos (including 2 million television shows), 3.5 million images, and 580 thousand software programs[1], including some classic video games, like The Oregon Trail. Amidst the hum of the Internet Archive’s server rooms in San Francisco resides this vast quantity of digitally archived resources, a source collection that is unparalleled in human history. The incredible amount of digitally born and digitized historic source materials is enough to make one’s head spin. In fact, the dizzying statistics may have been enough to make you gloss over those staggering numbers! If that vastness and source possibility makes your heart skip a beat, you’re not alone. Historians, librarians, and humanities scholars are currently grappling with the implications and usage of this relatively new plethora of source materials, which is not limited to the Internet Archive.[2] While there is so much potential for scholarship, those working with the raw data need to consider standards for curation and access, metadata, algorithmic bias, evolving digital humanities methodologies, creator consent, and the software available to present research and data to the public.[3]

Thanks to the Internet Archive’s foresight to start collecting in the 1990s, the world currently has an unfathomable amount of digitally born archived source materials from around the globe. The Internet Archive collects materials and metadata in all languages with characters that can be read in UTF-8, the most common character encoding, which is used by 97% of all websites.[4] My understanding is that this standardized encoding and across-the-board collecting make the archive not only the most inclusive post-1996 digital archive of diverse geographical and cultural sources, but also a counterweight to Western-centric collecting practices and older metadata descriptions that often had roots in colonialism and antiquated subject headings, issues that many academic libraries are currently trying to rectify by updating their metadata standards and catalog records.

When we consider the quantity and quality of certain digitally archived source materials, the current “age of abundance,” as Dr. Ian Milligan aptly describes it, raises a lot of questions about archival consent, intent, and privilege, along with the mass accessibility and longevity of long-abandoned digitally born sources, including GeoCities websites. Some of what’s been retained would historically have been considered ephemera by archivists. In the case of the Internet Archive, the Wayback Machine gives historians a wider potential window into that era, but right now researchers must search by a specific web address rather than by keyword. It’s a limitation, but also a potential privacy tool, since most folks in the 1990s and early 2000s were not creating for future consumption.
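For anyone curious what “searching by web address” looks like in practice, here is a minimal sketch, not an official Internet Archive tool. It assumes the commonly seen Wayback Machine address pattern (web.archive.org/web/, then a timestamp, then the original address) and the public availability lookup at archive.org/wayback/available. The function names and the GeoCities address are hypothetical placeholders of my own, and the sketch only builds the addresses; it does not contact the archive.

```python
from urllib.parse import urlencode

# Assumed address patterns for the Wayback Machine (not taken from the post itself).
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a capture of original_url.

    timestamp is a string of digits in YYYYMMDDhhmmss order; as I understand it,
    a shorter prefix such as a year is resolved to the nearest capture.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def availability_query(original_url: str, timestamp: str) -> str:
    """Build the lookup address that asks whether a capture exists near timestamp."""
    query = urlencode({"url": original_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{query}"


if __name__ == "__main__":
    # Hypothetical page address, used purely for illustration.
    page = "http://www.geocities.com/example/"
    print(wayback_url(page, "1999"))
    print(availability_query(page, "19990101"))
```

Notice that both functions need the original address up front. There is nowhere to put a keyword, which is exactly the limitation (and accidental privacy buffer) described above.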

A condensed view of my webpage from 1999, along with the Wayback Machine header. I laughed when I saw the term “computer art.” Linguistic and terminology changes are another aspect of info that can be found through digital archiving.

As an elder millennial[5], I used the Wayback Machine to look at the web crawl captures of the personal website that I built in high school, which was a combination of Angelfire and GeoCities. The teenage version of myself could not fathom that a little page I made for friends, before social media giants like Myspace or Facebook existed, would still be viewable in time capsule form. It had been a while since I’d looked at my archived site, but I noticed in one capture that I’d added a page promoting an artist I had met in NYC. I’d completely forgotten about that and was pleasantly surprised that a street artist I bought some prints from in 1999 was still creating art and now had a following of about 600k across some platforms. While in some ways it’s cool that archived versions of my website are still viewable, sometime after 2005 I made an effort to remove the link to my emo, goth teenage poetry page, which, to my dramatic dismay, still exists in an archived form. Le sigh. Broken crystal hearts and all. This is an example of how an author can attempt to change what is available and visible to the public but has little control over previously archived material.

Although I’m pursuing a degree in history, for the past 13 years I’ve worked in an academic library as an Interlibrary Loan Borrowing and Distance Learning Coordinator. I’ve become an expert at verifying citations and locating materials from all over the world that my library doesn’t own or have access to, which adds another perspective on source access. I use archive.org and Google Books all the time to access pre-1925 historic books. A lot of this digitization work is wonderful for providing immediate access to resources, but one needs to be mindful of the scans and a potential reduction in quality. Not all digital access and readability are equal.

During my time in this field, I’ve gotten to watch the increasing shift towards digitized material availability and be involved in copyright discussions. At this point, there’s a lot of material that is still only available in print or in increasingly obsolete mediums, such as microfilm and microfiche. There’s a lot of work to be done in the realm of digitizing materials that were not digitally born, whether pre-1996 or post-1925. However, in many instances there is an additional copyright barrier that creates a gap in data mining and source availability for works published after 1925. I think it’s important for historians and librarians to be mindful of this gap and seek out grants to increase access when copyright lapses. Additionally, publishers restrict loans of and access to eBooks and scholarly articles (even many that are digitally born). The high paywalls and restricted access have contributed to the Open Access movement, which will also benefit future scholars.

I’m looking forward to further discussion of digital humanities ethics, access, qualitative data collection, and how algorithmic bias shapes the field, along with how various software can be used to present research to the public.

Also, this adventure in WordPress has been an exercise in failing gloriously. I’m much more familiar with Wix and have been stumbling through this new-to-me medium. I hope the page will look a little nicer later this week.

[1] Statistics taken from Archive.org’s “About” page, https://archive.org/about/. Accessed 8/29/2021.

[2] The Library of Congress is another good example of curating digitally born materials. They archived all Tweets from 2006-2017. After 10 years, they reassessed archival practices and changed their retention policy to only include Tweets on a selective basis. https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/

[3] A lot of these will be the focus of future modules, so it’s nice to start thinking about it now and how it all weaves together.

[4] https://w3techs.com/technologies/cross/character_encoding/ranking

[5] Please don’t call me a “geriatric” millennial. *cries*

6 Comments

Dr. Otis

August 31, 2021 at 5:40 pm

Ha, I hear you on the Elder vs. Geriatric Millennial. Elder Millennial is a fun term (it always sounds to me like an eldritch horror about to rise from the depths, Cthulhu-like) though I’ve also a fondness for “The Oregon Trail Generation” – a la this blog post https://socialmediaweek.org/blog/2015/04/oregon-trail-generation/

I was lucky in that all my 1990s websites appear to have vanished from the IA (phew) although I suspect there’s some code banging around in a GeoCities database somewhere. It’s definitely banging around on my hard drive and back up hard drives but I control those 🙂 At least those were intended to go on the web from the get go – there’s a lot of interesting conversations about things that were “published” in physical zines in days when you could assume a small circulation among a community of like-minded individuals that can (in part due to badly worded contracts and copyright agreements) now be put up online for hundreds of millions of people to see, and what the ethics are of that kind of digitization out of context.

The IA is a great resource, but I confess I am always eyeing it sideways wondering if it’s as adequately backed up as it seems because imagine what would happen if the non-profit went belly-up and the IA was destroyed? There’s always the occasional flash of concern about a Digital Dark Age (a concern that is definitely appropriate, especially for 20th century materials) but I don’t think that people truly understand how ephemeral our internet content can be sometimes.

Ellie Canning

August 31, 2021 at 6:14 pm

Hi Paula!

I really like this response. I also had to include personal ruminations on using the internet in my response because, well, we live here now. Of course fluency and familiarity with the internet come down to questions of access, but enough people (in the United States) have access that we call an entire generation of children “digital natives” because they will grow up alongside the internet. I would consider myself a very twilight millennial, aka the older side of Gen Z, because I know how to rewind a VHS tape and play a cassette, but just barely. My personal ramblings aside, the point that most stuck with me from Ian Milligan’s book was the question of when recent events become history. My tiny comment on your blog, on one small slice of the internet, might become relevant if it is archived and someone else accesses it again for their own reasons. The digital age is so fast moving from the user side: our brains are always moving as we live with computers in our pockets. On the flip side of that, as historians, we struggle to deal with the massive influx of data made possible through the internet. I am very interested in ideas of intent and consent this semester because it is 2021, and if we follow Milligan’s rule of thumb, 2001 is on the books as history. So many people use the internet without considering what they do “historical” because, as I said earlier, the internet is how humans communicate nowadays. There is so much data floating around out there that was not intended to be historicized or analyzed, but digital humanists understand that all things can be data and data can show trends. I will be very interested to see how current DH methodology deals with internet ethics and consent.

Stephen Reiter

August 31, 2021 at 10:22 pm

This was a great post! I feel like it really gets to the crux of the challenges related to the study of digital history. You are correct that historians in the digital age are faced with a dizzying array of resources and the amount of information at their disposal is seemingly endless. I just think about all the people like yourself who created their own webpages over the past 25 years or so, and the “digital imprint” they left behind. What are historians to make of these in the future? Presumably, they could be used as great reflections of how society looked at the time of their creation, but with such an endless supply of them, what could get lost in the shuffle? It’s such an interesting issue to ponder.

Hayley Madl

September 1, 2021 at 12:56 am

Haha, I think I just barely scrape into that nice cusp category between true millennial and Gen Z. I remember VHS tapes (pretty sure we still have our collection up in our attic, but alas, nothing to play them on) and I even remember learning to use the catalog cards at the school library. However, I also remember growing up with the advent of technology and digitization. I remember when my parents first got flip phones, and then Blackberries, and eventually smartphones. I can vividly recall the evolution of our school’s computer lab, going from CRT monitors and towers to flat monitors until eventually the lab was retired completely and the whole school switched to distributing iPad Minis to the student body. Thinking back on the rate at which technology advanced, especially going from zero technology in the classroom in elementary school to relying on iPads for 90% of our classwork in high school, really cements in my brain just how rapidly technology is continuing to grow. And this rate of technological growth, and more importantly technological use, makes me wonder what resources like the Internet Archive are going to look like in ten or fifteen years.

Paola Torrico

September 1, 2021 at 3:02 am

The Wayback Machine that you introduced to us in our group discussion last week is so interesting! It’s amazing how you can essentially freeze time and view archived websites, including your personal website!

I also had a hard time adjusting to the inner workings of WordPress. I’m very familiar with Blogger and setting up the WordPress website was a bit of a challenge. I think it’s important to adapt to new mediums, so I am grateful that this class is allowing us to do so!

Gail V Coleman

September 1, 2021 at 3:22 am

I love the way your website looks! You were much more successful than I was! What theme did you use?
I also loved reading about your personal reflections. I would like to know more about Wayback?? How or when would your old website have been archived??
As an elder boomer, my experience with computers etc. goes way back! In college we learned Fortran or BASIC and had those mainframes. In my first job I typed punch cards. My first home computer used DOS. I never mastered that! My favorite word processing program, WordPerfect, has disappeared. It is astonishing how accessible everything seems to be now, provided you have the tools to access the digital world.

You talk about digitization of print materials. But what is important to me is digitization of handwritten 18th- and 19th-century manuscripts that currently can be accessed only in person or through microfilm!