
#CRL_Leviathan Session 2: Libraries and the Information of Governments

Keynote: Approaching Leviathan: The Dangers and Opportunities of “Big Data” 
John S. Bracken, Director, Journalism and Media Innovation, Knight Foundation

  • How to deal with big data is only half the story > we must also focus on organizational culture and adaptation or we lose track of the importance of culture and people
  • There is so much data now > what’s important is the process, what you do with it, and the talent you build around it > we must adapt and create a bridge between traditional skills and new quantitative approaches
  • There is skepticism about technology and our reliance on it, and this is colliding with the emerging culture of “break things” and of focusing on the future and the next challenges
  • How does the civic sector do a better job of adapting to build the tools that people want and need?
  • The “make something people want or move on” outlook is much harder to put into practice in civil society
  • The biggest cognitive switch we need to make is enabling ourselves to make mistakes
  • The Knight Foundation works in the space of news and journalism, but links it to the community > learn more about the Knight Foundation here: http://www.knightfoundation.org/

Government Records and Information: Real Risks and Potential Losses 
James A. Jacobs, Data Services Librarian Emeritus, UC San Diego, and technical advisor for CRL Certification Advisory Panel

  • There are many gaps in what we know: no list of born-digital government information, no list of all government websites, no list of preserved born-digital government information
  • What we do know: FDLP libraries have preserved millions of volumes of non-digital government information, and most born-digital information is not held, managed, organized, or preserved by libraries
    • Preservation is at the mercy of budgets and social priorities > risk increases if the preserving agency is the creator and doesn’t have preservation as its mission, or if the preserving agency is governed by politicians
  • The production of digital documents is far outpacing what’s being done to preserve these documents
  • Key issues:
    • Versioning
    • The need for persistent URLs
    • The need for temporal context (ex: link to version of document or site that author linked to at time of publication and not updated version)
    • E-government issues (e-gov often hides information behind services > how do we preserve this information?)
    • Relying on government for preservation and free access (most agencies do not have the mandate to preserve indefinitely – this is even the case for GPO)
    • Collections need services to provide important context for interpretation
  • When we create dark archives we’re not creating a value for our community > we need to create immediate value for our users
  • Who should preserve?
    • Option one: the government alone
    • Option two: the government with non-governmental partners (ex: GPO + LOCKSS-USDOCS)
    • Option three: non-governmental organizations without government cooperation (ex: Internet Archive)
  • There are different methods for selecting what needs to be preserved (the solutions should be mixed and the issue should be tackled collaboratively)
    • Broad web harvesting (ex: Internet Archive)
    • Focused selection (ex: by agency or title by title)
    • Digital deposit (ex: deposit by creators to memory institutions)
  • When planning for preservation focus on different user-communities: don’t look at the web and decide what to preserve, look at the web and preserve based on what users will need
  • Every library should participate in digital preservation > it’s about building the value of libraries > collections and services should be reliable and useful > shared collections and services can be built with different contributions – not all libraries have to be data centres
  • Summary of key points:
    • Preserve born digital government information – the technology exists
    • Every library can and should participate
    • We can add value to the information by building collections of use to our user communities
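
The “temporal context” issue noted above (linking to the version of a document that existed when the author cited it, not the updated version) can be illustrated with a minimal sketch. It assumes a web archive that labels each capture with a 14-digit timestamp (YYYYMMDDhhmmss), as the Internet Archive's Wayback Machine does; the function name and the sample captures are hypothetical and are not taken from the presentation.

```python
from datetime import datetime


def closest_capture(captures: list[str], cited_on: str) -> str:
    """Return the capture timestamp nearest to the citation date.

    captures: timestamps in YYYYMMDDhhmmss form (assumed archive convention).
    cited_on: the date the author made the citation, as YYYY-MM-DD.
    """
    target = datetime.strptime(cited_on, "%Y-%m-%d")
    return min(
        captures,
        key=lambda ts: abs(datetime.strptime(ts, "%Y%m%d%H%M%S") - target),
    )


# Hypothetical capture history for a single government web page.
captures = ["20130105120000", "20140420093000", "20150301000000"]
print(closest_capture(captures, "2014-04-24"))
```

Here the capture from April 20, 2014 is selected, since it is the one nearest to the citation date. A real service would also need a rule for ties and for pages with no capture near the date.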

The Digital Future of FDsys and the Federal Depository Library Program: A Public Policy Analysis 
R. Eric Petersen, Specialist in American National Government, Congressional Research Service

  • Challenges
    • Access and service (tangible, digital, or both?)
    • Costs (less print distribution, but collections still cost libraries money to maintain)
    • FDsys – there is no good model for permanent digital retention > we will have to update software and touch digital assets to make sure access continues > ongoing investment and responsibility required > entire overhauls and updates will be required every 8-10 years
    • Born digital materials > identification, retention, preservation, service
    • Tangibles > retention, digitization, consolidation, service
  • Lack of consensus around:
    • What is to be captured > how to count – websites / documents vs. records
    • How to capture and by whom > GPO / FDsys, originating agencies, third parties
  • Legislative change is slow without clear agreement regarding the solutions among stakeholders
  • Before Congress will engage, we need clear proposals that are broadly supported and offered by stakeholders and interested parties > they must cover issues such as enduring standards for digital retention, who collects and retains born digital content and tangible content, and how the costs will be managed

Panel Discussion: New Models of Access: The Role of Third Party Aggregators and Publishers

Susan Bokern, VP, Information Solutions, ProQuest

  • We all have different roles to play and there’s enough content to go around
  • ProQuest’s essential role is to add value to content
  • ProQuest is focusing on researchers and the improvement of workflow processes to create new research output > enabling researchers to access content more efficiently, providing tools to improve workflow, visualization and analysis tools, not just about content but also about context
  • The process of adding value begins with market research (surveys, advisory boards, focus groups to identify known and unknown needs) > creating acquisitions strategy to develop collection > preserving content or data > keeping the technology up to date > identifying where and how to obtain the content
  • ProQuest takes preservation seriously > content is stored on their own servers > currently exploring a longer-term storage and preservation solution (ex: Iron Mountain)

Robert Lee, Director of Online Publishing and Strategic Partnerships, East View Information Services

  • East View is an aggregator for academic institutions and a variety of international governments
  • Some example projects: GIS, big data, ephemera from political rallies
  • Big focus on content from Russia and China > not usually seeking or producing translations, but going after the information and data that’s not always available elsewhere or not the same as what’s provided in English
  • There is an operational risk that the information received could later be reclassified
  • In China, content can be made available and digitized very quickly but it can also disappear or be blocked quickly, too
  • Interested in exploring cross-platform solutions for content

Robert Dessau, CEO, voxgov

  • Voxgov harvests materials from over 10,000 web destinations each day > every 6 minutes the system looks for new URLs > 49 different types of documents (fact sheet, social media, congressional, federal register, speeches, etc.)
  • The collection process has evolved rapidly > learned to identify when a website’s format has changed to maintain quality intake of data > 18-22%, depending on the group, falls into the broken link category
  • Interested in tracking conversations from beginning to end to allow a much deeper and more comprehensive level of research
  • The involvement of third parties in the preservation and access process is inevitable
  • The value of mining the text we have has not yet been realized
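
The intake process described above (polling for new URLs and tracking what share of links is broken) can be sketched in a few lines. This illustrates the idea only, not voxgov's system: the function names, the example URLs, and the rule that any HTTP status of 400 or above counts as broken are all assumptions.

```python
def new_urls(previous: set[str], current: set[str]) -> set[str]:
    """URLs seen in the current poll that were absent from the previous one."""
    return current - previous


def broken_share(statuses: dict[str, int]) -> float:
    """Fraction of URLs whose last HTTP status was an error (400 or above)."""
    if not statuses:
        return 0.0
    broken = sum(1 for code in statuses.values() if code >= 400)
    return broken / len(statuses)


# Hypothetical poll results; no network access is performed here.
earlier = {"https://agency.example/a", "https://agency.example/b"}
latest = earlier | {"https://agency.example/c"}
print(sorted(new_urls(earlier, latest)))

statuses = {
    "https://agency.example/a": 200,
    "https://agency.example/b": 404,
    "https://agency.example/c": 200,
    "https://agency.example/d": 200,
    "https://agency.example/e": 200,
}
print(f"{broken_share(statuses):.0%}")
```

With one broken link out of five, the sample reports a 20% broken share, which falls within the 18-22% range quoted in the talk.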

#CRL_Leviathan Session 1: Libraries and the Records of Governments

The CRL Leviathan conference, Libraries and government information in the age of big data, took place in Chicago on April 24 and 25, 2014. CLA-GIN’s Co-moderator, Catherine McGoveran, attended the conference and has compiled notes of the key points from each presentation. The following are notes from the first session, Libraries and the records of governments. Notes from session two, Libraries and the information of governments, will be posted in the coming days. 


Welcome and Keynote: Information, Transparency and Open Government: A Public Policy Perspective
Thomas S. Blanton, Executive Director of the National Security Archive at George Washington University

  • Born-digital document production is far outpacing the physical documents we have in our government archives from the past two centuries
  • There are many barriers that limit an accurate understanding and adoption of open government
  • The Electronic Records Archive – a flagship legacy system – doesn’t come close to being comprehensive
  • Documents from the Clinton administration that were ordered declassified have not yet reached the shelves of the National Archives – a lot more support is needed (financial and personnel) to speed up the declassification process and make these documents available
  • Research libraries are on a trajectory from collecting and preserving special collections to curating data > our future is in interactive collaboration with others and in crowd-sourcing to make sure the data is available
  • The opening of government data can help make incredibly important revelations, which can lead to better government and consumer decisions
  • The National Archives and Records Administration (NARA) will fail to meet its mission unless it becomes an offsite backup for electronic government records
  • Only 2% of what the government creates gets saved at NARA
  • We know of the huge power of the National Security Agency (NSA) to retrieve, store, and link records > NSA does records management well and this expertise should be used by the National Archives for offsite backup of government information > this would fit well with the NSA’s national security mandate and could be a way for the agency to restore trust and engage in the civic duty of preserving and making available government information

Historical Research and Government Records in the Era of Big Data: a Historian’s Perspective
Matthew J. Connelly, Professor of History, Columbia University

  • The government information available is a function of a political process > it is the relationship between knowledge and power > where do electronic records fit into this?
  • There is a crisis in democratic accountability and national security > when departments and agencies don’t have functioning, accurate archives, it is a national security issue: it opens the government to foreign threats and is an issue about which every citizen should care
  • It is doubtful that, as historians, we will ever be able to render a complete account of government documents, records, and decisions
  • We don’t know what we don’t know
  • The government should move more aggressively to use data mining to manage records
  • Archives are also sites of expectation, not just memory > they’re about the future

“To bring together the records of the past and to house them in buildings where they will be preserved for the use of men and women in the future, a Nation must believe in three things. It must believe in the past. It must believe in the future. It must, above all, believe in the capacity of its own people so to learn from the past that they can gain in judgment in creating their own future.” (Franklin D. Roosevelt)

  • The system is so overloaded that the information that should be protected has suffered because of overprotection
  • Historically, secrecy has been in the eye of the beholder, which makes it difficult to set a classification standard that satisfies everyone
  • With the transition to electronic, hundreds of thousands of paper records were lost, because they were not migrated to digital and not kept
  • The budget for declassification has diminished and the budget for keeping secrets has skyrocketed > though there has been a huge growth in the amount of information created, there has been a steep decline in the amount of records declassified
  • The amount the government is currently spending on declassification is 15% of what was spent in the late 90s
  • Data mining can help us identify gaps in the documents released and withheld by government and text analysis can help us identify the trends of declassified words and issues
  • By comparing redacted documents with later unredacted releases, we can see the patterns of official secrecy > this could help government find which topics are more sensitive, and aid the classification and declassification process
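
The comparison of redacted and later unredacted documents mentioned above can be illustrated with a word-level diff. This is a minimal sketch built on Python's standard `difflib` module, not the method used by Connelly's project; the function name, the `[REDACTED]` marker, and the sample sentences are hypothetical.

```python
import difflib


def recovered_passages(redacted: str, released: str) -> list[str]:
    """Return word runs present in the later release but absent from the redacted copy."""
    old_words = redacted.split()
    new_words = released.split()
    # autojunk=False keeps frequent words from being ignored in long documents.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    passages = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark text that only the newer version contains.
        if tag in ("replace", "insert"):
            passages.append(" ".join(new_words[j1:j2]))
    return passages


# Hypothetical example texts.
redacted = "The ambassador met with [REDACTED] to discuss the shipment"
released = "The ambassador met with the defense minister to discuss the shipment"
print(recovered_passages(redacted, released))
```

Running this over many document pairs and counting the recovered words would give a rough picture of which topics were withheld most often. Scanned documents would first need reliable text extraction, which this sketch does not address.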

Read Matthew Connelly’s article on declassification policies, “The Ghost Files”, in a recent issue of Columbia magazine. Visit Connelly’s Declassification Engine: http://www.declassification-engine.org/

Panel Discussion: Preserving the Electronic Records of Governments: Issues and Challenges
Paul Wester, Jr., Chief Records Officer for the United States Government, National Archives and Records Administration

  • There is high-level administrative support for records management (presidential memo) > we need to change how we manage records and make them available to the public, and we cannot do things on an individual level any more
  • A directive was developed with deadlines and guidelines to transform how records management is done across the government
  • Directive goals: transform the entire record keeping function from an analog to a digital, automated approach
    • Federal agencies must manage all permanent electronic records in an electronic format by December 31, 2019
    • All agencies must manage both permanent and temporary email records in an accessible electronic format by December 31, 2016
  • Agencies must manage documents in automated ways to be effective
  • Training, awareness, and accountability are the main focus of the directive
  • How we manage email will transfer to how we manage other types of electronic records
  • We need to focus on records of relevance and work with universities and archives to do research to set records free and build new connections to records (collaborate to build exhibits, showcase research projects, provide context, etc.) > value could be short term but visibility will be long term

William A. Mayer, Executive for Research Services for the National Archives and Records Administration

  • Focusing on building a national framework for archives research services to reach more people (now limited to physical archives locations)
  • NARA is changing the way staff access and interact with the web and with records > as learning is human-to-human we still need humans to be involved with dealing with records
  • Archives are seen as the end of the records management pipeline, but involvement needs to start farther upstream
  • NARA still has 30 years of paper records yet to arrive at the Archives
  • While we may need to consider how to bring in more records to do data analysis, we also need to figure out how to get rid of those records that really don’t have value because we don’t have the capacity to preserve everything
  • In terms of web harvesting, one issue for consideration is the capture of content-rich intranet sites
  • NARA would like to engage in more small partnerships around building context-rich interfaces for resources

Cecilia Muir, Chief Operating Officer, Library and Archives Canada

  • Our government is planning to ensure that over 98% of Canadians will have access to high-speed internet, even in remote parts of the country, by 2017 (Digital Canada 150, p. 7)
  • Library and Archives Canada (LAC) has a link to at least three initiatives in the government’s Action Plan on Open Government
  • Shared Services Canada is consolidating the government’s digital services
  • LAC mandate: ensure the documentary heritage of Canada is preserved, be the source of enduring knowledge accessible to all, facilitate cooperation among library and archive communities, serve as the continuing memory of the Government of Canada
  • Departments and agencies focus on managing information of “business value” and LAC receives records of “enduring value”
  • There has been a shift in thinking about the separation between government records and publications > people are less and less concerned about format and now focus on content and value of content
  • A risk-based approach needs to be implemented to manage the increasing amount of information
  • LAC is interested in collaborating with various partners to support research, access, and context building

Paul Wagner, Director General and Chief Information Officer, Information Technology Branch, Library and Archives Canada

  • Documents must be accessible, or at least discoverable, for us to meet our mission
  • We are moving towards a digital curation model > the goal is to become a trusted digital repository > this used to simply be a tech-based solution (buy the right system), now it’s about capacity and ability to work in a digital world
  • We need to have the same rigour for digital assets that we have for physical documents and we’re not there yet
  • Not all digital assets need to be in Government of Canada data centres, as not everything is private > we are going to work with the private sector to see how we can manage and preserve these records
  • Context is key, as many clients don’t know how to interact with the data / information > we need to provide the context to them so they can understand > LAC can create a user experience to provide context and access while the data may be held somewhere else
  • The data and information we have is valuable, but only in context > user-contributed content and analysis that take massive amounts of data and find trends and stories are what create the value in that information

_________________________

Updates on NARA records activities can be found on the NARA Records Express Blog: blogs.archives.gov/records-express

Government of Canada Web Archive

Launch of the Government of Canada Web Archive

Library and Archives Canada (LAC) will launch the “Government of Canada Web Archive” on November 20, 2007.  The site can be found at: http://www.collectionscanada.gc.ca/webarchives/

The Library and Archives of Canada Act received Royal Assent on April 22, 2004, allowing Library and Archives Canada (LAC) to collect and preserve a representative sample of Canadian websites. To meet its new mandate, LAC began to harvest the Web domain of the Federal Government of Canada starting in December 2005. As resources permit, this harvesting activity will be undertaken on a semi-annual basis. The harvested website data is stored in the “Government of Canada Web Archive” (GCWA). Client access to the content of the GCWA is provided through full-text searching by keyword, by department name and by URL. It is also possible to search by specific format type (e.g., *.PDF). By fall 2007, approximately 100 million digital objects (over 4 terabytes) of archived federal government website data will be made accessible via the LAC website.

Library and Archives Canada (LAC) has implemented this first significant Canadian Web archive through the use of open source tools, developed by the International Internet Preservation Consortium (http://www.netpreserve.org), of which LAC is a member. The goal of this organization is to collect, preserve and ensure long-term access to Internet content from around the world through the collaborative development of common tools and techniques for developing Web archives.

——————

Launch of the Government of Canada Web Archive

Library and Archives Canada (LAC) will launch the “Government of Canada Web Archive” on November 20, 2007. The site will be available at http://www.collectionscanada.gc.ca/archivesweb/
The Library and Archives of Canada Act received Royal Assent on April 22, 2004. It allows LAC to collect and preserve a representative sample of Canadian websites. To fulfil its new mandate, LAC began collecting Government of Canada websites in December 2005. As long as resources permit, it will collect sites twice a year. The data from the collected websites is stored in the Government of Canada Web Archive (GCWA). Client access to GCWA content is offered through full-text search by keyword, department name and URL. It is also possible to search by a specific format type, e.g., *.PDF. When the site is launched, approximately 100 million digital objects (more than 4 terabytes) of archived website data will be accessible from the LAC website.
Library and Archives Canada has implemented this first significant Canadian Web archive using open source tools developed by the International Internet Preservation Consortium (www.netpreserve.org), of which LAC is a member. The goal of this organization is to collect, preserve and ensure long-term access to Internet content from around the world through the development of common tools and techniques for building Web archives.