Intelligent, adaptive self-service Data Science

- Master's Thesis -

Description:

This master's thesis has two main objectives: the requirements engineering for, and the implementation of, a holistic platform that gives data scientists and engineers intuitive and easy access to the distributed resources of large compute clusters for interactive and efficient data analysis.

First, the thesis will investigate the current state of interactive work on compute clusters with frameworks for distributed data. The investigation is meant to answer questions such as: What kind of interactive setup are data scientists using on their clusters? Are they using third-party solutions or custom-made ones of their own? How satisfied are they with their current setup? What is missing and what needs to be improved? What are their use cases? An important aspect of the investigation is to find the niche for a custom-made solution that will be implemented as part of this thesis.

The information gathered during the investigation needs to be analyzed in order to derive the requirements and the architecture for the platform. Once the requirements are specified and the architecture is defined, the platform should be implemented. The main focus shall be on requirements that address the disadvantages of current solutions identified during the investigation, as well as properties or features found to be missing in them. Such a platform should provide an intelligent and adaptive way for users within a company to access large, shared computation resources in a self-service manner.

It should reduce the amount of work for users (e.g. data scientists, engineers) to the necessary minimum and allow them to focus fully on their main tasks. Cluster administrators should benefit from the platform as well, since it should make it easier for them to control resources and give them an overview of what is happening inside the cluster. The platform should be flexible and scalable, running on a small cluster of two nodes as well as on one with hundreds of them. It should provide the necessary tools for efficient processing and analytics, and, if needed, users should be able to easily add their own libraries and packages. Cluster resources are supposed to be managed smartly and transparently so that as many users as possible can work simultaneously.

A core aspect of the platform should be the ability to work with a framework for distributed data, e.g. Apache Spark, from an interactive development environment, together with a central GUI from which users can start their interactive sessions, specify the resources needed for their environment, or monitor cluster usage.
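
To make the idea of self-service sessions more concrete, the following is a minimal sketch of how a resource request from the GUI might be checked against the remaining cluster capacity and translated into Spark settings. It is an illustration only, not the design of the thesis: the names SessionRequest, Cluster and to_spark_conf are hypothetical, and the admission rule (admit if enough cores and memory are free) is an assumed, deliberately simple policy. The property keys spark.app.name, spark.executor.instances, spark.executor.cores and spark.executor.memory are standard Spark configuration properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRequest:
    """Resources a user asks for when starting a session (hypothetical type)."""
    user: str
    executors: int
    cores_per_executor: int
    memory_gb_per_executor: int

    @property
    def total_cores(self) -> int:
        return self.executors * self.cores_per_executor

    @property
    def total_memory_gb(self) -> int:
        return self.executors * self.memory_gb_per_executor


@dataclass
class Cluster:
    """Tracks shared capacity and current usage (hypothetical type)."""
    total_cores: int
    total_memory_gb: int
    used_cores: int = 0
    used_memory_gb: int = 0

    def can_admit(self, req: SessionRequest) -> bool:
        # Assumed policy: admit only if both cores and memory still fit.
        return (
            self.used_cores + req.total_cores <= self.total_cores
            and self.used_memory_gb + req.total_memory_gb <= self.total_memory_gb
        )

    def admit(self, req: SessionRequest) -> bool:
        # Reserve the resources if the request fits; otherwise change nothing.
        if not self.can_admit(req):
            return False
        self.used_cores += req.total_cores
        self.used_memory_gb += req.total_memory_gb
        return True


def to_spark_conf(req: SessionRequest) -> dict:
    """Map a request to Spark configuration properties (values are strings)."""
    return {
        "spark.app.name": f"interactive-{req.user}",
        "spark.executor.instances": str(req.executors),
        "spark.executor.cores": str(req.cores_per_executor),
        "spark.executor.memory": f"{req.memory_gb_per_executor}g",
    }


if __name__ == "__main__":
    cluster = Cluster(total_cores=16, total_memory_gb=64)
    request = SessionRequest("alice", executors=2, cores_per_executor=4,
                             memory_gb_per_executor=8)
    if cluster.admit(request):
        print(to_spark_conf(request))
```

In such a setup, the resulting dictionary could be passed on to whatever component launches the user's Spark session, while the Cluster object would provide the usage overview for administrators.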

Requirements/Skills:

Apache Spark, Cloud Computing Infrastructures, Data Science Notebooks ...

Carried out by:
Rafal Lokuciejewski

Result:
The thesis can be requested from the Institut für Informationssysteme.

Supervision:

Prof. Dr. rer.nat. habil. Sven Groppe
Institut für Informationssysteme
Ratzeburger Allee 160 (Gebäude 64 - 2. OG)
23562 Lübeck
Phone: 0451 / 3101 5706