Open Research Computing

2018-05-02

The GUI-based framework of the LCM enables end-users to conduct standardized text mining workflows without any programming skills. However, individual and innovative research designs often demand more flexibility which is hard to achieve with generic pre-defined workflows accessed by point-and-click GUIs. Instead, such research designs can be supported well with script programming languages. The iLCM infrastructure integrates an Open Research Computing (ORC) environment to allow for the extension of LCM’s generic analysis procedures by program scripts. In this way, the iLCM targets requirements of two user groups: beginners of text mining methods who benefit most from GUI-guided workflows as well as more advanced analysts who demand the flexibility of freely customizable scripts.

The ORC component is based on established open-source software such as Jupyter, Docker and Kubernetes, which are adapted to the requirements of the iLCM infrastructure. The development of these open-source frameworks is widely supported and hence provides the reliability of a well-maintained code base.

Notebooks

The ORC component extends the iLCM with an editor environment for program scripts which is operated via a web browser. The web editor enables the creation of scripts along with their documentation. Furthermore, scripts can be directly executed and the produced results can be visualized as output in the same document. The execution of the scripts themselves does not take place in the browser but on a server. This allows the processing of large data objects when server-side resources (memory, CPU) are available.

To fulfill important requirements for CSS, scripts can be extensively documented with markup code within the editor. Results from script executions such as numerical measures, tables or plots can be embedded in the same script document which allows tracing every single step of an analysis. Such actively executable documents, a combination of script code, its documentation, and corresponding results, are called ‘notebooks’. For researchers this allows having the data, analysis scripts and their documentation available in one notebook for publication, sharing and re-use.

The ORC component extends the iLCM in two ways. First, the notebook environment is embedded in a virtual machine ensemble together with the LCM to closely integrate the GUI-paradigm for text analysis and the script paradigm to analyse structured data. Second, an ORC environment is publicly hosted by Gesis, the infrastructure partner of the iLCM project to allow for archiving, sharing and re-using of notebooks.

Various options are available to implement the notebook functionality, most notably R Notebooks and Jupyter Notebooks. For the iLCM, we opted for Jupyter Notebooks. Jupyter Notebooks have a wide community support and enable the use of numerous scripting languages, including R, Python, and Julia.

Notebook Gallery

Gesis will host a public instance of the ORC environment. This ORC instance is then extended with a repository on which notebooks can be published by users. The repository not only secures the long-term availability of published notebooks, it also serves as a notebook gallery that users can browse and search via keywords and metadata. The notebooks can be shared, edited and executed in a SaaS manner. The execution of shared notebooks is enabled by the import of the notebooks into the ORC environment by a process referred to as ‘cloning’. This supports and enhances the reproducibility of research in the field of CSS.

The notebook gallery supports the research process in a number of ways. First of all, the gallery allows researchers to not only easily reproduce results of other researchers, but it also makes it easier to build on existing research through the cloning of notebooks. Furthermore, generic versions of different methods and analysis processes will be published as notebooks, enabling researchers to clone them and adapt them according to their own needs. Finally, the gallery makes it easy to discover learning materials published as notebooks, which supports researchers in learning new methods and analysis techniques.

We expect that the synergetic use of script-based analyses and generic text processing will support researchers in the development of analysis processes according to their project-specific requirements. We also expect a high impact of the architecture for teaching CSS methods.