Library of Congress and Maine.gov

By Web Capture Team, Library of Congress

Recently, the Office of Information Technology was contacted by the Library of Congress about archiving portions of Maine.gov. Approval was granted, and the CIO's Office asked the Library of Congress for an article explaining the process. The article follows.

Web archiving is still in its infancy. The Library is a founding member of the International Internet Preservation Consortium (http://www.netpreserve.org), whose members include other national libraries. They are still working out how and what to archive from the web, but all agree that preserving these objects is critical to the historical record of the world. The Internet Archive (http://www.archive.org), a 501(c)(3) non-profit group founded by Brewster Kahle in 1996, was one of the first organizations to capture materials from the Web.

“With the rapid growth of the Internet and the World Wide Web, millions of people have grown accustomed to using these tools as resources to acquire information; and the availability of electronic information is taken for granted,” Kahle said. “However, it is a fallacy that if something is on the web, it will be there forever. The average lifespan of a web page is 44 to 75 days. There's an urgent need for people to understand that the web is who we are. It's our culture and our social fabric, and we don't want to lose any of it. What is here today might be gone tomorrow.” (http://www.archive.org/web/web.php)

A mission of the Library of Congress is to preserve the nation's cultural artifacts and provide enduring access to them. To foster education and scholarship, the Library acquires, catalogs, preserves and serves collection materials of historical importance to the Congress and the American people, and these traditional functions extend to digital materials, including websites. Since 2000, the Library has developed almost 30 thematic web archives on such topics as the United States National Elections, the Iraq War, and the events of September 11, 2001. Many of these collections are available through the Library of Congress Web Archives (LCWA) website (http://www.loc.gov/lcwa).

One of the ongoing collections is the Presidential Transition During a Time of Crises Web Archive, a selective collection of web materials produced by domestic and foreign political groups, state and federal governments, community and religious organizations, advocacy groups, and foreign and domestic news sources, along with independent websites.

The web archiving process encompasses selection, review, capture, cataloging, and access. Once a website is nominated for a collection, the content owner is sent a notification of the Library's intent to capture that site for inclusion in the collection. At that time, the Library also asks permission to allow off-site display of the archived website. The Library's Web Archiving Team then reviews each selected website to scope out its depth and breadth and translates that into instructions for the web capture tool. Once the initial capture is done, the team reviews the archived website to make sure that what has been captured is consistent with the collection plan. When the collection is complete, the website is reviewed again to identify any access issues and prepare it for cataloging. The bibliographic records and the archived website are made available from the LCWA website. If the content owner has not granted off-site display permission, the archived website is viewable only onsite at the Library. The catalog records are indexed by search engines such as Google. To ensure that users understand that they are looking at an archived website, a banner notice provides information about the capture. For example, the Maine.gov site was captured on May 14, 2009.

Maine.gov screen capture
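
For readers who think in code, the workflow and the off-site display rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names Stage, ArchivedSite, advance and can_view are hypothetical and do not describe the Library's actual systems.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The five steps named in the article, in order."""
    SELECTION = 1
    REVIEW = 2
    CAPTURE = 3
    CATALOGING = 4
    ACCESS = 5


@dataclass
class ArchivedSite:
    """Hypothetical record for one nominated website."""
    url: str
    stage: Stage = Stage.SELECTION
    # The content owner's answer to the off-site display request.
    offsite_permission: bool = False

    def advance(self) -> None:
        """Move to the next stage; stay at ACCESS once it is reached."""
        if self.stage is not Stage.ACCESS:
            self.stage = Stage(self.stage.value + 1)


def can_view(site: ArchivedSite, viewer_is_onsite: bool) -> bool:
    """A site is viewable only after it reaches ACCESS. Viewing from
    outside the Library additionally needs the owner's permission."""
    if site.stage is not Stage.ACCESS:
        return False
    return viewer_is_onsite or site.offsite_permission
```

Under these assumptions, a site whose owner never answered the permission request would still be viewable in the Library's reading rooms once cataloged, but not from elsewhere.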

There are a number of distinct challenges for archiving and access to web archives:

  • with the advent of social networking sites and inexpensive domain names, it is sometimes difficult to identify a website's boundaries.
  • it is difficult and time-intensive to find content owner email addresses to which the Library's notice can be sent.
  • the capture tool developed by the Internet Archive, an open-source, archival-quality crawler used by most national libraries, is very capable, but it cannot capture many Web 2.0 objects such as Google map mashups and YouTube and Google Flash video.
  • the Wayback, an open-source web archives viewer developed by the Internet Archive, has difficulty interpreting some links and will send the user to the live web, which can be disconcerting.
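
The last challenge, links that lead out of the archive to the live web, can be illustrated with a small check. The sketch below is not how the Wayback works internally; it assumes a hypothetical archive address (ARCHIVE_PREFIX) and simply lists the links in an archived page that do not stay under that address. The names LinkCollector and find_live_links are likewise made up for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Hypothetical address under which archived pages are served (an assumption).
ARCHIVE_PREFIX = "http://archive.example.org/web/"


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_live_links(html: str, page_url: str) -> List[str]:
    """Return links that, once resolved against the archived page's own
    address, point outside the archive and so would reach the live web."""
    collector = LinkCollector()
    collector.feed(html)
    resolved = (urljoin(page_url, href) for href in collector.links)
    return [url for url in resolved if not url.startswith(ARCHIVE_PREFIX)]
```

A reviewer could run a check of this kind over a captured page to spot links that were not rewritten to point back into the archive.
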

At this time, the Library has almost 100 terabytes of data comprising three billion objects. Future efforts will be focused on users, and we welcome comments and suggestions. Please visit the project page, http://www.loc.gov/webcapture.

Article posted on: June 15, 2009