Tutorial-DHN2019

View the Project on GitHub Yoonsen/Tutorial-DHN2019

Mining digital libraries

We will show how researchers within the humanities can access and use the cloud-based corpus at the National Library of Norway from within Jupyter Notebook. We cover the following topics:

  1. How copyrighted material can be used for corpus studies.
  2. Tools, built as modules over an API, for defining and accessing a corpus for analyses such as character modelling, collocation analysis, clustering, growth diagrams and more.
  3. The benefits of Jupyter Notebook for researchers without a programming background.
  4. The connection between texts and library metadata.

A problem for many researchers is the use of copyrighted material. However, the actual text is often not required; some features of it may suffice, such as a bag of words, a participle count or a character model. None of these features challenges the copyright holder. A centralized repository of copyrighted material can provide feature sets that suffice for many kinds of analyses.
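To make the idea of a feature set concrete, here is a minimal sketch of a bag of words in plain Python. It is not the tutorial's own code: the function name `bag_of_words` and the sample sentence are assumptions made for illustration, and a real repository would compute such features on the server side.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word frequencies for a text, discarding word order."""
    # Lowercase the text and split it into word tokens
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence standing in for a copyrighted text
sample = "The library lends the book and the reader returns the book"
features = bag_of_words(sample)

# Only the counts are exposed; the original word order cannot be read off them
print(features["the"], features["book"])
```

The counts are enough for many frequency-based analyses, while the running text itself never has to leave the repository.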

An API can be used by researchers without programming skills as well as by programmers. The latter need documentation of the low-level interface to the cloud, the actual API. The former want an accessible interface for doing corpus analysis, and Jupyter Notebook provides one through top-level functions and commands expressed in a programming language, e.g. Python or R.

Ready-made library metadata, such as Dewey decimal codes or topic words, can be used to build corpora. We will show how metadata can be used to build, select and compare corpora. The participants will be able to build a corpus and run analyses on it.
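As a rough illustration of metadata-driven corpus building, the sketch below selects documents by Dewey code prefix and publication year. The records, the field names (`id`, `title`, `year`, `dewey`) and the function `select_corpus` are all hypothetical; they do not reflect the library's actual metadata schema or API.

```python
# Hypothetical metadata records; ids, titles and codes are invented for illustration
records = [
    {"id": 101, "title": "Novel one", "year": 1925, "dewey": "839.82"},
    {"id": 102, "title": "History one", "year": 1930, "dewey": "948.1"},
    {"id": 103, "title": "Novel two", "year": 1965, "dewey": "839.82"},
    {"id": 104, "title": "Novel three", "year": 1932, "dewey": "839.82"},
]


def select_corpus(records, dewey_prefix, first_year, last_year):
    """Return the ids of records matching a Dewey prefix and a year range."""
    return [
        r["id"]
        for r in records
        if r["dewey"].startswith(dewey_prefix)
        and first_year <= r["year"] <= last_year
    ]


# Two corpora from the same period, defined purely by metadata
literature = select_corpus(records, "839", 1920, 1940)
history = select_corpus(records, "948", 1920, 1940)
print(literature, history)
```

Once two corpora are defined this way, the same feature extraction can be run on each and the results compared.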

The participants will experiment with the API and get hands-on experience with the tools.

The repository can be used directly from a browser by following this link: dhn2019 tutorial myBinder version. It runs on MyBinder, which is documented here: Binder docs.

A link to this page: Jupyter and corpus tutorial dhn2019