Web Archiving at BPL: Saving Brooklyn's Web Content One URL at a Time

Did you know that Brooklyn Public Library has a web archive? In 2017, the Brooklyn Collection (now part of the new Center for Brooklyn History) joined the Internet Archive’s Community Webs program, in which public libraries around the country are given the funding and support to start and sustain web archives. We have been archiving Brooklyn web content through this program for over three years now. 

Screenshot of Center for Brooklyn History web archive homepage

Web archiving is the process of saving and preserving websites and web content in a stable, static archival format. This is accomplished with a tool called a web crawler, which digitally “crawls” through a website’s many layers to document and save its content.
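
For curious readers, here is a rough sketch of the core idea behind a crawler, in a few dozen lines of Python. It is only an illustration, not the software we actually use: our captures run through the Internet Archive’s Archive-It service, whose crawler also records full HTTP transactions in standard WARC files, respects robots.txt, and throttles its requests. The start URL below is just a placeholder.

```python
# A toy crawler illustrating the basic idea: fetch a page, save the
# raw HTML, and follow links on the same site. Real archival crawlers
# are far more sophisticated (WARC output, politeness rules, media,
# scripts, redirects), so treat this purely as a sketch.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url, staying on the same site."""
    site = urlparse(start_url).netloc
    queue, seen, saved = [start_url], set(), {}
    while queue and len(saved) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        saved[url] = html  # a real crawler would write a WARC record here
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == site:  # stay on this site
                queue.append(absolute)
    return saved

pages = crawl("https://example.com")  # placeholder start URL
print(f"Captured {len(pages)} page(s)")
```

Even this toy version hints at the hard parts: deciding how deep to go, how often to return, and what to do with pages that are assembled by JavaScript rather than served as static HTML.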

You may be thinking: isn’t the Internet forever? Don’t we always get warned that one embarrassing picture, blog post, or tweet could live on in infamy? While it’s true that some sites save your data even if you think you’ve deleted it, and that screenshots can make something embarrassing last even longer, that’s still not “forever.” When we talk about web archiving, or any kind of archiving, we’re talking about saving things for posterity, and in perpetuity. For web content, that goal poses many challenges: the average lifespan of a web page is often cited as only about 100 days. Even though the Internet enables us to create information at an unprecedented rate, we’re also losing more information than ever before.

As an example of content disappearing from the web, some of you may remember that BPL’s own website used to look very different.

Screenshot of Brooklyn Public Library website in March 2017, via the Wayback Machine

In April 2017, we launched an updated website, and our homepage changed completely.

Screenshot of Brooklyn Public Library website in 2021

As we built the new website, and especially after it launched, those of us on staff worked hard to examine each page and make sure it functioned. A website for a library system with dozens of branches and numerous departments holds so much information that, in those first days of the new site, we often found ourselves on an error page. Everyone has encountered broken links before; they are one of the consequences of the dynamic, ever-changing nature of the web.

The Internet Archive is an American non-profit digital library that aims to provide “universal access to all knowledge.” Toward that end, they provide free public access to much of their vast collections of digitized materials, including music, movies, software and video games, and over three million books. Our own digitized city directories, high school newspapers and some audiovisual materials are made available through the Internet Archive. One of their more popular initiatives has been resurrecting old video games from obscurity and making them playable online, including the much-beloved game Oregon Trail. 

One of the Internet Archive’s features is the Wayback Machine, which is a web archive, or, as a 2015 New Yorker article put it, “the web archive.” The Wayback Machine has archived more than 430 billion web pages. The world’s other large web archives are mostly national libraries; for context, the Library of Congress has archived only 9 billion web pages (and contracts with the Internet Archive to do so) and the British Library, 6 billion. On the Wayback Machine site, you can type in any URL and, if it has been archived, view past versions of that site by the date on which they were captured. You can also, with the “Save Page Now” feature, request that any URL be crawled and saved.
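
If you would rather script those lookups than use the website, the Internet Archive also offers a simple, publicly documented availability API. The sketch below, using only Python’s standard library, asks for the archived snapshot of a URL closest to the present; the endpoint is the one documented at archive.org/help/wayback_api.php at the time of writing, but treat the exact response fields as illustrative.

```python
# Ask the Wayback Machine for the most recent archived snapshot of a
# URL, via the availability API documented at
# https://archive.org/help/wayback_api.php (fields may change).
import json
from urllib.parse import quote
from urllib.request import urlopen

def latest_snapshot(url):
    api = "https://archive.org/wayback/available?url=" + quote(url, safe="")
    with urlopen(api, timeout=10) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    if snapshot:
        return snapshot["url"], snapshot["timestamp"]  # e.g. "20210312..."
    return None, None

archived_url, timestamp = latest_snapshot("https://www.bklynlibrary.org")
if archived_url:
    print(f"Closest capture ({timestamp}): {archived_url}")
else:
    print("No capture found; you could request one with Save Page Now.")
```

“Save Page Now” can be scripted in the same spirit: at the time of writing, requesting https://web.archive.org/save/ followed by the URL you want captured asks the crawler to save that page.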

In 2017, the Internet Archive launched the Community Webs program, funded by the Institute of Museum and Library Services. Brooklyn Public Library was awarded a place in the Community Webs cohort, along with 26 other public libraries across the U.S. The program is designed to help public libraries begin or continue their web archiving efforts. In their words:

Local history archives have long served as vital resources for preserving the stories of communities. With many records now published on the web, the ability to preserve collections of online newspapers, local blogs, civic websites, social media, and other platforms is an increasingly important skill for librarians in fulfilling their role as information custodians and community anchors. Locally-focused web archiving can also diversify the historical record and preserve the voices of those often excluded from the archive. 

While the initial IMLS grant ran out after two years, the program is continuing thanks to funding from the Andrew W. Mellon Foundation, and it is currently adding more libraries, with an eventual goal of more than 150 participating public libraries.

Our Center for Brooklyn History web archive has shifted somewhat over the past three years. We initially focused quite heavily on local news sources on the web. As part of this, I hosted a panel discussion during Endangered Data Week, “Saving Local News on the Web,” and wrote a follow-up blog post for the Archive-It blog. The focus on news content grew out of our knowledge of which analog sources are used most frequently in our collections. In fact, the seed from which the library’s local history division initially grew was the library’s 1957 acquisition of the records of the Brooklyn Daily Eagle newspaper. Our Brooklyn Newsstand portal, which offers the Eagle and over 40 other local newspapers in digital form, is used heavily by researchers all around the world. So, going beyond a digital clippings file, I wanted to try to save entire “newspapers,” especially given that many local news publications now exist entirely online.

Screenshot of Caribbean Life website in the Center for Brooklyn History web archive

I quickly realized, however, that our data budget would soon be overwhelmed if I were to continue attempting to comprehensively capture the many local Brooklyn news sources on the web. As a result, I turned my focus to smaller sites, which also tended to be in more imminent danger of disappearing. I experienced some “near-misses” during the relatively short timeframe of the project, where I captured a site only to realize it had disappeared from the live web soon thereafter. Unfortunately, I also experienced some actual misses, where I failed to capture sites in time. Nonetheless, the Brooklyn Blogs collection and especially the Brooklyn Organizations & Projects collection provide a valuable record that would otherwise not have been captured.

We also capture institutional web pages, such as the BPL website and related Twitter accounts, in our Brooklyn Public Library Institutional Web Archive collection. In addition, we capture the websites and Twitter feeds of local politicians in our Brooklyn Politics collection. And we’ve just added a new collection, the Brooklyn Jewish History Project, as part of a larger project collecting Brooklyn Jewish history with funding from the David Berg Foundation. 

Web archiving is still very new, relative to other kinds of archiving, and definitely imperfect, even by the admission of those who are most expert at it. The goal of the Community Webs program is to get the ball rolling at public libraries around the country. So if, after reading this, you are interested in web archiving, my advice would be to just go for it. The best way to learn is by diving in. The web is vast, and the more local institutions and individuals take responsibility for archiving their own corners of it, the broader a picture we will have saved.

Comment from James D. Keeline, Fri, 03/12/2021:

When the web was mainly static pages, making an archive was a matter of writing a program to visit a URL and follow each link it found. This is the basis of a search engine web “spider.” However, many pages built in the past couple of decades are not static: they may be Drupal or WordPress sites whose content is generated at the moment of the visit. Imagine what happens when you see a search field. For the visitor, it can be a useful way to reach into a site’s content, but how can a spider program interact with it? There are clever attempts at workarounds, but most captures will be a partial record, as any archive or library would be.