infrastructure-as-opt

The need for faster processing of large amounts of data, in some cases with a notion of locality, drives the design of distributed approaches to machine learning. Two common considerations are: i) which parts of a computation to distribute and reduce/synchronize, and ii) how to minimize communication latencies.

These considerations pose an optimization problem in the “infrastructure vs. model architecture” search space.

To effectively traverse this search space, it is necessary to:

iterate across infrastructures and model architectures, and
measure the “fit” between an infrastructure and a model architecture.
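The two steps above can be sketched as an exhaustive search over pairs. This is a minimal illustration, not the repository's implementation: `best_pair`, the `fit` callback, and the infrastructure and architecture names are all hypothetical, and the cost values are made-up placeholders rather than measurements.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def best_pair(
    infrastructures: Iterable[str],
    architectures: Iterable[str],
    fit: Callable[[str, str], float],
) -> Tuple[str, str, float]:
    """Score every (infrastructure, architecture) pair and return the best one.

    `fit` is assumed to return a cost, so lower values mean a better fit.
    """
    scored = (
        (infra, arch, fit(infra, arch))
        for infra, arch in product(infrastructures, architectures)
    )
    return min(scored, key=lambda item: item[2])


# Placeholder costs for illustration only; real values would come from
# measurements on provisioned infrastructure.
toy_costs = {
    ("cpu-cluster", "model-a"): 3.0,
    ("cpu-cluster", "model-b"): 5.0,
    ("gpu-cluster", "model-a"): 2.0,
    ("gpu-cluster", "model-b"): 1.5,
}

result = best_pair(
    ["cpu-cluster", "gpu-cluster"],
    ["model-a", "model-b"],
    lambda infra, arch: toy_costs[(infra, arch)],
)
```

An exhaustive search is only practical for a handful of candidates; with a larger space, the same `fit` callback could be handed to any other search strategy.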

Towards this goal, as an Insight Fellow, I developed:

one-step provisioning of an on-premises-capable Kubernetes (k8s) infrastructure, to streamline infrastructure iteration, and
a set of proof-of-concept modules for constructing objective functions based on latency tracing.
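As one possible reading of "objective functions based on latency tracing", the sketch below computes the share of traced time spent on communication. It is an assumption for illustration, not the code in this repository: the `Span` record, its `kind` labels, and `communication_fraction` are hypothetical names, and a real trace would come from whatever tracing backend is in use.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Span:
    """One traced operation (hypothetical record layout)."""
    name: str
    kind: str  # assumed labels: "compute" or "communication"
    duration_ms: float


def communication_fraction(spans: Iterable[Span]) -> float:
    """Return the fraction of total traced time spent in communication spans.

    Lower is better: a value near 0 means the run is compute-bound,
    a value near 1 means it is dominated by communication latency.
    """
    spans = list(spans)
    total = sum(s.duration_ms for s in spans)
    if total == 0:
        return 0.0  # an empty trace carries no communication cost
    comm = sum(s.duration_ms for s in spans if s.kind == "communication")
    return comm / total


# Illustrative trace with made-up durations.
trace = [
    Span("forward", "compute", 30.0),
    Span("allreduce", "communication", 10.0),
]
score = communication_fraction(trace)
```

A ratio like this makes runs of different lengths comparable, though it ignores overlap between compute and communication; an objective that accounts for overlap would need span start and end times as well.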

infrastructure-as-opt/infrastructure

See further instructions in infrastructure-as-opt/infrastructure.

infrastructure-as-opt/tracing-ml

See further instructions in infrastructure-as-opt/{cpu, gpu}-build-jobs.
