FADE (A Framework for Distributed Execution)

Description

Given the massive amounts of data handled at CERN, there is often a need to distribute a set of tasks across a cluster of machines. These tasks may take varying forms, such as running tests for commissioning machines in the accelerator, running Monte Carlo simulations and aggregating the results, or searching through millions of records of data. Irrespective of the problem at hand, the generic skeleton remains the same: leverage multiple machines to distribute computation. Solving this generalized problem incurs technical hurdles, so it is often considered easier to make domain-specific assumptions and build a custom in-house solution at a higher cost.

This project aims to provide a generic framework for dispatching tasks to various machines, and for managing and monitoring them with a unified view of the system irrespective of scale. This functionality would be exposed via a web API which can be programmed against in one's favourite language. Provided sufficient interest, we also plan to provide a Java/Python client front-end to this API. The result of this effort would be a Java/Python framework for implementing a distributed work pool.
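
To make this concrete, here is a rough sketch of how the Python client front-end could look from a user's perspective. Nothing here is final: the endpoint URL, the JSON payload and the field names (module, function, args, job_id) are all placeholder assumptions, since the actual web API will only be designed during the weekend.

    import json
    import urllib.request

    # Placeholder endpoint -- the real FADE web API does not exist yet.
    FADE_URL = "http://fade.example.cern.ch/api/jobs"

    def submit_job(module, function, args):
        """Ask the FADE master to schedule a task on the cluster; returns a job id."""
        payload = json.dumps({
            "module": module,      # code unit the workers should run
            "function": function,  # entry point within that module
            "args": args,          # arguments shipped to the workers
        }).encode("utf-8")
        request = urllib.request.Request(
            FADE_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)["job_id"]

    def job_status(job_id):
        """Poll the master for the current state of a submitted job."""
        with urllib.request.urlopen(FADE_URL + "/" + str(job_id)) as response:
            return json.load(response)  # e.g. {"state": "running", "progress": 0.4}

    if __name__ == "__main__":
        job = submit_job("simulations", "estimate_pi", {"samples": 10000000})
        print(job_status(job))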

Unlike single-machine systems, distributed systems face the threat of communication (network) and hardware failures. As a result, a robust application running on multiple machines needs to be fault-tolerant. Solving such a vast problem in a popular programming language over a weekend would be a nightmare! In this project, we aim to build the backend in Erlang - a language known for building robust distributed systems. In the process, we also put the claims behind Erlang to the test and evaluate its usability for applications at CERN. In fact, this idea is born out of a challenge that we have with one of our project supervisors ;)

And finally, to show the efficacy of the framework, we will run and manage Monte Carlo simulations using FADE on a cluster of machines.
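
As an illustration of why Monte Carlo workloads fit this model, here is a minimal Python sketch (not tied to FADE's eventual API) of estimating pi: each task is independent and only a small aggregation step is needed at the end, so the tasks can be dispatched to different worker nodes and the partial results combined by the master.

    import random

    def estimate_pi_task(samples, seed):
        """One independent Monte Carlo task: count random points inside the unit quarter-circle."""
        rng = random.Random(seed)
        return sum(
            1 for _ in range(samples)
            if rng.random() ** 2 + rng.random() ** 2 <= 1.0
        )

    def aggregate(hit_counts, samples_per_task):
        """Combine the partial counts from all workers into a single estimate of pi."""
        total_samples = samples_per_task * len(hit_counts)
        return 4.0 * sum(hit_counts) / total_samples

    if __name__ == "__main__":
        # Run sequentially here; with FADE each estimate_pi_task call would be
        # dispatched to a different worker and only aggregate() would run locally.
        samples_per_task = 100000
        counts = [estimate_pi_task(samples_per_task, seed) for seed in range(8)]
        print(aggregate(counts, samples_per_task))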

For more motivation about this project, check out: http://blog.nachivpn.me/2017/07/challenge-accepted.html

You can also join our Slack channel here: https://join.slack.com/t/cern-webfest17-fade/shared_invite/MjE0OTcxOTUz… :)

Goals of the project

A simple distributed, load-balanced and fault-tolerant implementation of a workpool has already been done by one of us here: https://github.com/nachivpn/pfp-chalmers/blob/master/labC/ft_worker_poo… 
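
For readers who prefer code to prose, the sketch below shows, in Python rather than Erlang, the shape of the abstraction we want to generalize: workers pull tasks from a shared queue (load balancing) and a task that crashes is simply retried (fault tolerance). It is only a single-machine illustration of the idea, not the linked implementation.

    import queue
    import threading

    def run_pool(tasks, num_workers=4, max_retries=2):
        """Run callables on a pool of worker threads; retry any task that raises."""
        todo = queue.Queue()
        for task_id, task in enumerate(tasks):
            todo.put((task_id, task))
        results = {}

        def worker():
            while True:
                try:
                    task_id, task = todo.get_nowait()  # pull work from the shared queue
                except queue.Empty:
                    return                             # no work left for this worker
                for _ in range(max_retries + 1):
                    try:
                        results[task_id] = task()
                        break                          # task succeeded
                    except Exception:
                        continue                       # "crash": retry the task

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results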

During the weekend, we hope to:

  • Generalize the Erlang implementation further and build a server on top of it
  • Provide a programmable Java/Python client front-end 
  • Provide a web front-end for monitoring the jobs and the cluster

And most important of all:

  • Eat free food at R1
  • Learn some functional programming! ;)
  • Investigate the suitability of Erlang for use in distributed applications at CERN

Skills being sought

(One or more of)

  • Java/Python: serialization, build management (gradle?), HTTP client libraries
  • Monte carlo simulations and their application at CERN
  • Docker / Swarm (or Kubernetes if it offers some considerable advantage)
  • Basic web technologies for building a webapp front-end for monitoring
  • Erlang (or background in any other functional language)

Prerequisites

Some interest in large-scale computing and the ability to speak fluent Dothraki

Contacts
Nachi, nachivpn@gmail.com
Fabio, fabiolu90@gmail.com