Web archiving: The importance of collecting born-digital materials

Recently I had the privilege of sitting in on the Board of Regents meeting at the National Library of Medicine (NLM). At this meeting, the History of Medicine and Technical Services Division presented a report on an initiative to expand the NLM’s collection to include born-digital web materials. The presentation described a preliminary trial in which the team collected twelve specific doctor and patient blogs for preservation. I thought this was an incredible idea, so naturally I ran up to them immediately after the presentation and asked if I could participate in the project as part of my Associate Fellowship. What I liked most about their presentation was their methodology and the tools they used to collect this content, and it seemed like a good opportunity to write a blog post giving an overview of both.

Strategy & Guidelines

What first caught my eye was that the NLM Web Collecting and Archiving Working Group has recommended that the NLM follow the ARL Code of Best Practices in Fair Use for Academic and Research Libraries. This code was created in February of 2012, and Section 8 is devoted entirely to “Collecting Material Posted on the World Wide Web and Making It Available.” According to the code, collecting web material is valuable because it creates an accessible archive of what is available on the web, an environment that contains an enormous amount of important historical and research-related content. The code states that:

Selecting and collecting material from the Internet in this way is highly transformative. The collecting library takes a historical snapshot of a dynamic and ephemeral object and places the collected impression of the site into a new context: a curated historical archive.

The ARL also places certain limitations on how this content should be collected and presented. This is important because it sets a standard for other libraries and archives to follow. Furthermore, it provides guidance to institutions on how to approach the creators of this content. In accordance with fair use, the ARL states that:

Captured material should be represented as it was captured, with appropriate information on mode of harvesting and date.

To the extent reasonably possible, the legal proprietors of the sites in question should be identified according to the prevailing conventions of attribution.

Libraries should provide copyright owners with a simple tool for registering objections to making items from such a collection available online, and respond to such objections promptly.

These limitations support traditional archival theory in the sense that the goal is to preserve the integrity and authenticity of the website that is captured. They also reflect the importance of acknowledging the creator of the content and of giving copyright owners a simple way to object when the material is made available to the public.
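To make the first two limitations concrete, here is a minimal sketch in Python of the kind of provenance record a library might keep alongside each capture. The class and field names are my own assumptions for illustration; this is not Archive-it's data model or anything the NLM team described.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CaptureRecord:
    """Provenance for one captured page (all field names are hypothetical)."""
    url: str
    captured_at: datetime  # date of capture
    harvest_mode: str      # how the page was harvested, e.g. "automated crawl"
    proprietor: str        # attribution for the site's legal proprietor

    def citation(self) -> str:
        """Build a one-line label to display next to the archived page."""
        date = self.captured_at.strftime("%Y-%m-%d")
        return f"{self.url} ({self.proprietor}), captured {date} via {self.harvest_mode}"


# Example values are placeholders, not a real capture.
record = CaptureRecord(
    url="http://example.org/blog",
    captured_at=datetime(2012, 9, 16, tzinfo=timezone.utc),
    harvest_mode="automated crawl",
    proprietor="Example Blog Author",
)
print(record.citation())
```

Making the record immutable (`frozen=True`) mirrors the idea that captured material should be represented as it was captured, without later edits to its provenance.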

What I love about the NLM’s consideration of the ARL Code of Best Practices in Fair Use is that it is one library institution collaborating with another to access and preserve content for the benefit of others. The ARL is an excellent resource for academic and research libraries, and it should be used more often. I would also love to see more libraries collaborating with one another on this topic. Because collecting born-digital material should reflect an institution’s own collection development policies, it is important that libraries communicate with one another to avoid duplicating captured content.

Now that I have gone over the guidelines and standards that were used to approach the content, I would like to speak briefly about the technology the NLM used to gather this material: Archive-it. 


Archive-it is a subscription web archiving service developed by the Internet Archive that helps institutions build, harvest and preserve collections of digital content from the web. The program allows users to collect and manage this content in a way that preserves the original qualities of a web page, keeping its integrity in place. The program can essentially run 24 hours a day in order to harvest and capture the material on a web page. Once the content is captured, it is stored in the Internet Archive data centres. The NLM staff who presented this report praised Archive-it for its ease of use and outstanding institutional support. Many other libraries have begun using Archive-it as well, including the Library of Congress, University of Michigan and Tufts University. Although I have not tried the program yet, when I browse through some of the collections it supports, it appears to do an excellent job of maintaining the look and feel of the web pages. Take a look for yourself and see; they have a large number of collections available.

Why is this important?

The staggering amount of research available on the web is the most obvious reason for collecting this material. Every day we surf the web, gather information and use it as evidence for solving problems, answering questions and enhancing research. What we don’t consider is that if the web were ever to disappear, we would no longer have this amazing research to refer to. Grey literature, social media and at-risk content are just a few types of research content for which a historical record would be very useful. With grey literature, many websites that provide valuable reports may only last for a limited amount of time; capturing this content will provide an opportunity for increased exposure and safeguard against the loss of valuable research. Similarly, social media provides a wealth of information about the collaborative and interactive nature of doctors and patients. Preserving this material can be a valuable resource for understanding trends and issues among these groups. Finally, capturing at-risk content can help government agencies track and gather web content that provides information about threats to public health; websites of early responders to disasters; and social media (blogs, Twitter, Facebook) that documents individuals’ responses to health crises. All of this information is valuable and can provide a historical record for those interested in researching it in the future.

Not enough has been done to preserve the valuable information born on the web. So many of us use the Internet for many of our daily tasks, yet we don’t think about how ten years from now we may never be able to access that material again. I believe that each library needs to think about gathering born-digital content that aligns with its own collection development policy. The fact that the NLM has launched an initiative to preserve biomedical born-digital material demonstrates that this work is deemed important on a national level. Smaller health science libraries could start archiving regional web research as well as their own institutions’ webpages that provide unique research to their patron base.

What do you think? Should libraries be trying to do more work in this area? Should this be a part of their strategic plan? 

It is important to note that these opinions are mine and not those of the National Library of Medicine. I wrote this post out of appreciation for the project and for the opportunity to share my beliefs on the important role libraries can play in collecting born-digital materials.


Archive-it: A web archiving service to harvest and preserve digital collections. 2012. Retrieved from http://www.Archive-it.org on September 16, 2012.

Association of Research Libraries. Code of Best Practices in Fair Use for Academic and Research Libraries. Jan 19, 2012. Retrieved from: http://www.arl.org/pp/ppcopyright/codefairuse/index.shtml on September 16, 2012.

