Capturing Web Content with Archive-It

You know what they say – once you post something online, you can’t take it down. “The internet is forever” – except when it’s not. Ever clicked on a link only to receive the pesky message “404 Error: Page Not Found”? Web records such as websites and social media are only “forever” if they are properly, and promptly, preserved.

Most Alabama state agencies maintain a website so that citizens can access content and get things done online without having to make a call or come by the office. State agencies also use websites and social media to communicate with citizens. These websites and social media pages are updated frequently, however, and may one day disappear. Websites and social media serve state agencies and citizens in the present but may also be of interest to future researchers.

The State Records Commission has identified all state agency websites as permanent records per the Records Disposition Authorities (RDAs). Yet the archivists at the ADAH (talented though we may be) cannot capture the constantly evolving websites of around 200 state agencies. Since 2005, the ADAH has used a service called Archive-It to capture state agency websites.

What is Archive-It?

Archive-It is a subscription-based web archiving service from the Internet Archive, a 501(c)(3) non-profit and digital library. The Internet Archive provides free access to archived websites and other digital artifacts to researchers, historians, and the general public.

The Internet Archive also works with over 600 libraries and other partner organizations to harvest, build, and preserve collections of digital content, such as websites, blogs, and social media sites. The Archive-It service takes “snapshots” of a website’s appearance and top-level content throughout the year through a process called web crawling.

Web crawling: How does it work?

Have you ever wondered how Google provides just the search result you need? Search engines like Google use web crawlers. A web crawler, sometimes called a spider, is software that systematically browses (or “crawls”) and automatically indexes the web.

Web crawlers are always at work. A crawler starts with a targeted URL, or “seed” URL. The seed, usually a site’s home page, is the crawler’s starting address for capturing content. From there, the crawler follows links and extracts data and documents. If it comes across a new webpage, it indexes the page. If the webpage has already been indexed, the crawler determines whether re-indexing is warranted.
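
For readers who like to see the idea in code, the crawl loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Heritrix or any search engine is actually implemented: the SITE dictionary is a made-up stand-in for live web pages, and the crawl and fetch_links functions are hypothetical names chosen for this example.

```python
from collections import deque

# Hypothetical stand-in for a live website: each URL maps to the links on that page.
SITE = {
    "https://agency.example/": [
        "https://agency.example/about",
        "https://agency.example/forms",
    ],
    "https://agency.example/about": [
        "https://agency.example/",
        "https://agency.example/staff",
    ],
    "https://agency.example/forms": ["https://agency.example/forms/w9.pdf"],
    "https://agency.example/staff": [],
    "https://agency.example/forms/w9.pdf": [],
}


def fetch_links(url):
    """Return the links found on a page (a real crawler would download and parse it)."""
    return SITE.get(url, [])


def crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from the seed and return URLs in capture order."""
    captured = []          # pages captured so far, in order
    seen = {seed}          # URLs already queued, so no page is captured twice
    queue = deque([seed])  # pages waiting to be visited
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        captured.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


order = crawl("https://agency.example/", fetch_links)
for url in order:
    print(url)
```

The sketch simply skips any page it has already seen. A production crawler would instead decide whether a previously captured page has changed enough to be captured again.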

Archive-It uses Heritrix, a web crawler developed by the Internet Archive. Heritrix crawls all the seeds provided by the ADAH simultaneously and copies and saves the information as it goes. Archived websites are stored as “snapshots” but can be read and navigated as if they were live. They are full-text searchable within seven days of capture. The Internet Archive stores a primary and back-up copy at its data centers on multiple servers.

Note: All web crawlers, including Heritrix, fall short of making a complete index. There is no guarantee that documents placed on agency websites will be captured. Documents with a permanent retention must be transmitted to the ADAH separately. 

How does the ADAH use Archive-It?

The ADAH pays a subscription to collect a certain number of URLs. To archive a website, we provide its seed URL. The ADAH crawls the websites and select social media sites of all state agencies, as well as the social media sites of Alabama Representatives and Senators. Social media crawls occur four times a year, while website crawls occur twice a year.

The ADAH has assigned descriptive metadata to each seed including website name, agency name, and short descriptions to aid access for researchers. The ADAH generates quarterly reports with statistics such as the total number of seeds crawled, the total number of documents crawled, and the total amount of data crawled in bytes.

How do I access archived websites?

Websites currently preserved by the ADAH are accessible here. If your agency’s website is not being captured, has been redesigned, or its URL has changed, please email a list of the URLs to the following:

Rachel Smith at Rachel.Smith@archives.alabama.gov

Becky Hebert at Becky.Hebert@archives.alabama.gov

Note: Universities and Local Governments are responsible for archiving snapshots of their own websites.

You Don’t Need to Keep It All: Start Decluttering Your Email

To the phrase “You don’t need to keep it all,” I often receive this common response: “I would need to hire an assistant full-time just to manage my email.” While storage may be relatively cheap, think about how long it takes your search engine to find an email among twenty thousand messages. The value of information lies in its accessibility.

How do you begin to declutter your email account? Start by deleting transient emails, which are records that are not essential to documenting agency activities. In “First Steps to Better Email Management,” we previously discussed deleting unsolicited SPAM, distributed messages such as reminders about getting your flu shot, and reference copies.

Listserv messages are another type of email that requires no documentation for destruction. Set up rules to automatically sort these messages into a separate folder. Also, unsubscribe from groups or promotional emails that you do not need.

Other types of email easily identified for deletion without documentation include transient records such as accepted/declined meeting requests and read receipts. Even meeting arrangements can be placed on the calendar, and the back-and-forth coordination emails can then be deleted.

To start finding these types of messages, arrange your emails by “from,” which allows you to select groups of emails and delete them with one click. Arranging your account by sender also helps you identify individuals who do not send email related to the day-to-day operation of government. Some users only send you messages such as “Are you ready for lunch?” Delete emails from these senders in batches.
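
If you are comfortable with a little scripting, the same “arrange by sender” idea can be sketched in Python. This is an illustration only: the inbox list below is invented sample data rather than a connection to a real mail account, and count_by_sender is a hypothetical helper name. Your email program’s built-in sort does the same job without any code.

```python
from collections import Counter

# Invented sample data: (sender, subject) pairs standing in for real messages.
inbox = [
    ("listserv@records.example", "Weekly digest"),
    ("coworker@agency.example", "Are you ready for lunch?"),
    ("director@agency.example", "Signed grant agreement"),
    ("listserv@records.example", "Weekly digest"),
    ("coworker@agency.example", "Lunch tomorrow?"),
    ("listserv@records.example", "Conference reminder"),
]


def count_by_sender(messages):
    """Return (sender, message count) pairs, busiest sender first."""
    counts = Counter(sender for sender, _subject in messages)
    return counts.most_common()


by_sender = count_by_sender(inbox)
for sender, total in by_sender:
    print(f"{total:3d}  {sender}")
```

Senders at the top of the list are the best candidates for a batch review. A person still has to decide whether a given sender’s messages are transient before deleting them.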

Email management is not saving all email forever. Set aside as little as fifteen minutes every day, before or after lunch, to clean up your inbox. Deleting transient emails will help you identify those messages that document your important work in government and will build your confidence as you take additional steps to declutter your account.

First Steps to Better Email Management

Many emails you receive at work are transient records and thus can be deleted. Managing those emails properly can be done in as little as fifteen minutes a day. So where do you begin? Start with emails that you know can be deleted, such as unsolicited SPAM or distributed messages sent to groups.

Microsoft Outlook has tools that can help you capture SPAM messages before they reach your inbox. Those messages are stored in a special SPAM or junk mail folder. SPAM messages that arrive in your inbox can be flagged to help your account identify similar future messages, so they go directly to your junk folder. You should check your SPAM folder weekly to ensure that emails created in the ordinary course of business were not misdirected. Otherwise, you can delete your entire SPAM folder weekly to reduce your email account’s clutter and make more efficient use of your account space.

Once you have deleted your SPAM messages, you can then tackle a common transient email: the distributed list email. These are emails where you have been cc’d or bcc’d as part of a larger group of recipients such as all department/agency employees. Often, such messages originate within your immediate workplace and include mass reminders such as “cake in the break room” or “flu shots available today.” Such emails distributed to you (not by you) are transient, non-substantive messages of short-term usefulness and often are not created as part of the normal functions/activities of your agency. Transient emails can be deleted.

Distributed meeting minutes are also considered transient records. Usually after meetings, minutes are sent to all attendees. As a recipient of these minutes, you can delete those messages because the minutes should be preserved permanently by their creator. You are not obligated to preserve those emails because you are merely receiving a reference copy.

Email management is NOT saving all email forever. Email will not manage itself; managing it is your responsibility, so be proactive. Don’t attempt to clean up your email all at once; instead, set aside small intervals of time. See future blogs for additional email clean-up strategies.

What about Government Email?

Can you imagine being able to conduct government business in the State of Alabama without the use of email? In general, public officials are legally obligated to create and maintain records that adequately document agency activities. These government records — including email — facilitate the efficient conduct of government programs and services; ensure effective management of government information; and provide documentation of government business. Considering this, what rules and best practices apply to email when it comes to managing and retaining government records?

Let’s define government email. The state of Alabama issues a professional email account for each new employee and public official. However, Alabama law stipulates that any document is a government record when it is created by a government employee in the course of conducting public business, not just those documents created with and/or stored on government property. If an employee is engaging in government activities with his or her personal email account, those emails are government records. This is one reason that the use of official government email accounts is encouraged when conducting public business.

How long must email be maintained by government agencies? It depends. Email itself is not a record type but a format. Because records retention relies on the information in a record and not on the format, agencies cannot apply one retention to all email messages.

Government email messages must be retained and disposed of according to the Records Disposition Authority (RDA) approved by the appropriate records commission for that agency. For example, if your agency’s RDA requires grant project files to be maintained “six years after submission of the final federal financial report,” then email associated with the grant project file would have that same six-year retention. Email messages are subject to the same retention requirements as the same type of records in another format or medium. Keep in mind that the retention of email is dependent upon the content of the email, not where the email account resides.
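
The six-year example above comes down to simple date arithmetic, sketched here in Python. The submission date is invented for illustration, retention_end is a hypothetical helper name, and the sketch assumes the retention period is counted in calendar years from the triggering event. Your agency’s RDA, not this sketch, governs how the period is actually counted.

```python
from datetime import date


def retention_end(trigger, years):
    """Return the date that falls `years` calendar years after the trigger event."""
    try:
        return trigger.replace(year=trigger.year + years)
    except ValueError:
        # The trigger was Feb 29 and the target year is not a leap year; use Mar 1.
        return date(trigger.year + years, 3, 1)


# Invented example: final federal financial report submitted on this date.
report_submitted = date(2020, 9, 30)
eligible = retention_end(report_submitted, 6)
print(f"Grant project email may be destroyed on or after {eligible.isoformat()}")
```

The same date applies to every record in the grant project file, whether it is an email, a paper form, or a spreadsheet, because retention follows the content rather than the format.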

Email records are also subject to the same legal requirements regarding access as an agency’s other government records, as established by Alabama’s open records law. Because email created in the conduct of state or local business is public record and rarely subject to restrictions, written communications should be articulated clearly and professionally, leaving banter to the break room.