Skip to content

Malocher is a lightweight python library for running jobs on a cluster where nodes are accessed via SSH and share a common network storage like traditional NFS or mountable cloud storage.

Notifications You must be signed in to change notification settings

Whadup/malocher

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Malocher

"Malocher" [maˈloːxɐ] is a German colloquialism from the Ruhr-Area for "worker", particularly used for miners and steel workers.

Malocher is a lightweight python library for running jobs on a cluster where nodes are accessed via SSH and share a common network storage like traditional NFS or mountable cloud storage. We

  • use SSH and paramiko for communication between workers,
  • rely ondill for serializing code and data to a shared filesystem and
  • assume that all python libraries and interpreters are available on all nodes, like when they're also in the NFS.

This way we do not need to use large cluster computing libraries, e.g. from the Apache universe.

Getting started

Installation

Simplest way is to use pip

pip install git+https://github.com/Whadup/malocher

Setting up the malocher-workers

This one's easy:

  • Make sure you can access each malocher node from the supervising node using the same SSH key ssh_private_key.
  • Make sure each malocher-worker, including the supervisor, has access to a shared directory malocher_dir
  • Make sure every malocher-worker, including the supervisor, has the same python environment, e.g. put it into a shared directory.

BTW, we chose the terminology worker/supervisor, because words matter.

Sample

import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import malocher


def fake_experiment(model, data_path=None):
    train = pd.read_csv(os.path.join(data_path, "train.csv"))
    y_train = train["class"]
    X_train = train.drop(columns="class")
    test = pd.read_csv(os.path.join(data_path, "test.csv"))
    y_test = train["class"]
    X_test = train.drop(columns="class")
    model.fit(X_train, y_train)
    return dict(accuracy = model.score(X_test, y_test))

if __name__ == "__main__":
    print("running")
    CONFIGS = {}
    for D in range(1,10):
        MODEL = RandomForestClassifier(max_depth=D)
        # Store our Configuration under the Job's ID
        JOB = malocher.submit(fake_experiment, MODEL, data_path="/home/share/datensaetze/pamono")
        CONFIGS[JOB] = D

    RESULTS = malocher.process_all(
        jobs=CONFIGS.values(),
        ssh_machines=["ls8ws020", "ls8ws021", "ls8ws022", "ls8ws023", "ls8ws024", "ls8ws025"],
        ssh_port=22,
        ssh_username="dummy",
        ssh_private_key="malocher_id_rsa"
    )
    # Retrieve the config by the result's ID
    for JOB, RESULT in RESULTS:
        print(CONFIGS[JOB], RESULT)

Pitfalls

Malocher starts each job in a new python interpreter. When libraries have changed during the experiment, these changes will be reflected in the outputs of the jobs. Try to avoid updating libraries and avoid installing libraries you're currently working on with pip install -e.

Arguments and globals will be stored on disk per job. Avoid loading large amounts of data in the main process to pass to the workers, as this will result in large files and long io waits. It's often better to load the data on the workers. If pre-processing is necessary in the main process, you can serialize it to disk once and manually reload it in the workers.

Software-Cosmos

Malocher is used in Experiment Runner to execute experiments on a number of machines. For experiment tracking, we advice the use of meticulous.

About

Malocher is a lightweight python library for running jobs on a cluster where nodes are accessed via SSH and share a common network storage like traditional NFS or mountable cloud storage.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages