Assistive software for the creation of image dataset in a digital library using machine learning
Vol. 15, No. 2 (2023)
Purpose – This paper describes how assistive software can be used to create image datasets efficiently from digital library documents. In addition to the usual ways of working with data, the software uses machine learning features that can both ease the work of annotators and change annotation practices. The emphasis throughout is on the simplicity and openness of the whole process. The aim is to highlight these elements through practical examples.
Design / methodology / approach – After an introductory section, the paper presents the possibilities for selecting and separating data from digital library documents and points out the limitations of these approaches. Based on these insights, it then explores possible approaches, including the use of assistive software, to overcome these limitations. The methods are described on the basis of practical use of the software in the annotation process. The machine learning features are validated using, among other techniques, the Class Activation Mapping visualization technique and the F-score metric.
Results – The described approaches and the use of assistive software with machine learning features proved very beneficial. The software makes the annotators' work not only easier but also considerably faster and more accurate. The versatility of the tested machine learning model also proved to be a major advantage: it allows the annotation processes to be extended beyond the initially assumed use and thus leaves room for further research in this area.
Originality / value – This technical paper highlights possible ways of using assistive software to facilitate the creation of datasets from documents with a limited number of identifiers, such as those in a digital library, without the need for commercial tools. It also shows practical examples of how machine learning can make these processes more efficient, and examples of how the processes can be applied more universally.
datasets; software; machine learning; digital library; annotation
Filip Jebavý
Moravská zemská knihovna v Brně
Filip Jebavý studies the analytical capabilities of artificial neural networks. In this field he also takes part in several research projects focused on the humanities and machine learning. He currently works as head of the Department of Digital Document Management (Odbor správy digitálních dokumentů) at Moravská zemská knihovna v Brně.
API Specifications—International Image Interoperability Framework™. (n.d.). Retrieved July 20, 2023, from https://iiif.io/api/
API v7 · ceskaexpedice/kramerius Wiki. (n.d.). Retrieved July 20, 2023, from https://github.com/ceskaexpedice/kramerius/wiki/API-v7
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. The MIT Press.
Meier, B., Stadelmann, T., Stampfli, J., Arnold, M., & Cieliebak, M. (2017). Fully Convolutional Neural Networks for Newspaper Article Segmentation. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 414–419. https://doi.org/10.1109/ICDAR.2017.75
Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks (arXiv:2103.14749). arXiv. http://arxiv.org/abs/2103.14749
Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., & Ré, C. (2017). Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, 11(3), 269–282. https://doi.org/10.14778/3157794.3157797
Ratner, A., De Sa, C., Wu, S., Selsam, D., & Ré, C. (2017). Data Programming: Creating Large Training Sets, Quickly (arXiv:1605.07723). arXiv. http://arxiv.org/abs/1605.07723
Registr Krameriů. (n.d.). Retrieved July 19, 2023, from https://registr.digitalniknihovna.cz/
Ying, X. (2019). An Overview of Overfitting and its Solutions. Journal of Physics: Conference Series, 1168(2), 022022. https://doi.org/10.1088/1742-6596/1168/2/022022
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization (arXiv:1611.03530). arXiv. http://arxiv.org/abs/1611.03530
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright © 2023 Filip Jebavý