Analysis of the Czech web archive: Provenance, authenticity and technical parameters

Vol. 11, No. 1 (2019)

Abstract

Purpose – The article gives an overview of the input criteria that can be set when web pages are archived in a web archive, and describes how these settings affect the content, format, and technical characteristics of the resulting archive data. The input parameters for web archiving directly determine the resulting form of the archived content, so researchers working with these data need to know how the data originated. Without this knowledge, researchers cannot use archival data as a representative source.

Design/Methodology/Approach – The main method was a data analysis of the index, i.e. the list of all digital objects, of the Czech Web Archive (the Webarchiv) of the National Library of the Czech Republic, together with the input variables involved in creating the archival data. Specifically, the provenance, authenticity, and content of the data were investigated, as was the technical side of archiving, for example the settings of the harvesters. The analysis is based on practical experience and was performed on actual harvested data.

Results – The article summarises the factors that influence the resulting form of archive data. First, there are factors that directly affect data collection, such as technical settings, the resource collection policy, and legislation. Second, there are factors concerning the handling of archive data, in particular the rules for deleting content and limiting access to it. The article also describes an analysis of the web archive index, which gave a quantified view of the archive: the number of digital objects, the distribution of file formats, the domain composition, and the development of the archive over time.

Originality/Value – The main contribution of the article is a comprehensive overview of the data stored in the Webarchiv, of how they are created, and of what affects their creation. This is crucial for all potential researchers who are interested in working with Webarchiv data and who need to know the source of the data for their research.


Keywords:
web archiving; Webarchiv; big data; data mining; data analysis; digital archiving; web resources; web archiving methods