Capturing Web Content with Archive-It

You know what they say – once you post something online, you can’t take it down. “The internet is forever” – except when it’s not. Ever clicked on a link only to receive the pesky message “404 Error: Page Not Found”? Web records such as websites and social media are only “forever” if they are properly, and promptly, preserved.

Most Alabama state agencies maintain a website so that citizens can access content and get things done online without having to make a call or come by the office. State agencies also use websites and social media to communicate with citizens. These websites and social media pages are updated frequently, however, and may one day disappear. Websites and social media serve state agencies and citizens in the present but may also be of interest to future researchers.

The State Records Commission has identified all state agency websites as permanent records per the Records Disposition Authorities (RDAs). Yet the archivists at the ADAH (talented though we may be) cannot manually capture the constantly evolving websites of around 200 state agencies. Instead, since 2005, the ADAH has used a service called Archive-It to capture state agency websites.

What is Archive-It?

Archive-It is a subscription-based web archiving service from the Internet Archive, a 501(c)(3) non-profit and digital library. The Internet Archive provides free access to archived websites and other digital artifacts to researchers, historians, and the general public.

The Internet Archive also works with over 600 libraries and other partner organizations to harvest, build, and preserve collections of digital content, such as websites, blogs, and social media sites. The Archive-It service takes “snapshots” of a website’s appearance and top-level content throughout the year through a process called web crawling.

Webcrawling: How does it work?

Have you ever wondered how Google provides just the search result you need? Search engines like Google use webcrawlers. A webcrawler, sometimes called a spider, is software that systematically browses (or “crawls”) and automatically indexes the web.

Webcrawlers are always at work. Each crawl starts from a targeted or “seed” URL. Usually a site’s home page, the seed is the webcrawler’s starting address for capturing content. From there, the crawler follows links and extracts data and documents. If it comes across a new webpage, it indexes the page. If the webpage has already been indexed, the crawler determines whether re-indexing is warranted.
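The crawl loop described above can be sketched in a few lines of Python. This is a simplified illustration, not how Heritrix or any search engine is actually implemented: the tiny in-memory "web" (SITE), the example.gov-style addresses, and the rule that a page is re-indexed only when its content differs from the stored copy are all assumptions made for the example.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to (content, outgoing links).
SEED = "https://agency.example/"
SITE = {
    "https://agency.example/": (
        "Welcome",
        ["https://agency.example/about", "https://agency.example/forms"],
    ),
    "https://agency.example/about": ("About us", ["https://agency.example/"]),
    "https://agency.example/forms": (
        "Forms",
        ["https://agency.example/forms/notice.pdf", "https://agency.example/missing"],
    ),
    "https://agency.example/forms/notice.pdf": ("PDF contents", []),
}


def crawl(seed, site, index=None):
    """Crawl breadth-first from the seed.

    Returns (index, changed): the updated index of url -> content, and the
    list of URLs that were newly indexed or re-indexed on this pass.
    """
    index = dict(index or {})   # copy so the caller's index is not mutated
    changed = []
    queue = deque([seed])       # pages waiting to be visited
    seen = {seed}               # pages already queued, to avoid loops
    while queue:
        url = queue.popleft()
        if url not in site:
            continue            # broken link: nothing to capture
        content, links = site[url]
        # New page, or (assumed rule) content differs from the stored copy.
        if index.get(url) != content:
            index[url] = content
            changed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index, changed
```

Calling crawl(SEED, SITE) on an empty index captures every reachable page. Passing the resulting index into a later crawl re-indexes only the pages whose content has changed since the previous pass.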

Archive-It uses Heritrix, a webcrawler developed by the Internet Archive. Heritrix crawls all the seeds provided by the ADAH simultaneously and copies and saves the information as it goes. Archived websites are stored as “snapshots” but can be read and navigated as if they were live. They are full-text searchable within seven days of capture. The Internet Archive stores a primary and back-up copy at its data centers on multiple servers.

Note: All web crawlers, including Heritrix, fall short of making a complete index. There is no guarantee that documents placed on agency websites will be captured. Documents with permanent retention must be transmitted to the ADAH separately.

How does the ADAH use Archive-It?

The ADAH pays a subscription to collect a certain number of URLs. To archive a website, we provide its seed URL. The ADAH crawls the websites and selected social media sites of all state agencies, as well as the social media sites of Alabama Representatives and Senators. Social media crawls occur four times a year, while website crawls occur twice a year.

The ADAH has assigned descriptive metadata to each seed including website name, agency name, and short descriptions to aid access for researchers. The ADAH generates quarterly reports with statistics such as the total number of seeds crawled, the total number of documents crawled, and the total amount of data crawled in bytes.
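As a rough illustration of how such report totals could be tallied, the sketch below sums per-seed crawl records. It does not reflect the ADAH's or Archive-It's actual reporting tools: the SeedCrawl record, its field names, the summarize function, and every figure shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SeedCrawl:
    """One crawled seed with its descriptive metadata (hypothetical fields)."""
    seed_url: str
    website_name: str
    agency_name: str
    documents: int    # documents captured for this seed
    data_bytes: int   # data captured for this seed, in bytes


def summarize(crawls):
    """Total the three statistics named in the quarterly report."""
    return {
        "seeds_crawled": len({c.seed_url for c in crawls}),
        "documents_crawled": sum(c.documents for c in crawls),
        "data_crawled_bytes": sum(c.data_bytes for c in crawls),
    }


# Made-up sample records, for illustration only.
sample = [
    SeedCrawl("https://agency-a.example/", "Agency A Home", "Agency A", 1200, 350_000_000),
    SeedCrawl("https://agency-b.example/", "Agency B Home", "Agency B", 800, 150_000_000),
]
report = summarize(sample)
```

Counting distinct seed URLs rather than records means a seed crawled more than once in the quarter is only counted once. That is a design choice for this sketch, not a documented rule.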

How do I access archived websites?

Websites currently preserved by the ADAH are accessible here. If your agency’s website is not being captured, has been redesigned, or its URL has changed, please email a list of the URLs to the following:

Rachel Smith at Rachel.Smith@archives.alabama.gov

Becky Hebert at Becky.Hebert@archives.alabama.gov

Note: Universities and local governments are responsible for archiving snapshots of their own websites.

Imagine surfing circa 1999 and looking back on the Y2K hype, or revisiting an older version of your favorite Web site. Use the Wayback Machine to see billions of archived websites including vintage games, grab original source code from archived web pages, or visit websites that no longer exist. Simply type in a URL, select a date range, and begin surfing.
