Batch

The AWSBatchManager allows you to use the AWS Batch service as a Julia cluster.

Requirements

  • An IAM role that allows batch:SubmitJob, batch:DescribeJobs, and batch:DescribeJobDefinitions.
  • A Docker image registered with AWS ECR which has Julia and AWSClusterManagers.jl installed.

The AWSBatchManager requires that AWS Batch jobs run with "networkMode=host", which is the default for AWS Batch; this is mentioned only for completeness.

Usage

Let's assume we want to run the following script:

# demo.jl
using AWSClusterManagers: AWSBatchManager
using Distributed

addprocs(AWSBatchManager(4))

println("Num Procs: ", nprocs())

@everywhere id = myid()

for i in workers()
    println("Worker $i: ", remotecall_fetch(() -> id, i))
end

The workflow for deploying it on AWS Batch will be:

  1. Build a Docker image for your program.
  2. Push the image to ECR.
  3. Register a new job definition which uses that image and specifies a command to run.
  4. Submit a job to Batch.

Overview

Batch Managers

The client machines on the left (e.g., your laptop) begin by pushing a docker image to ECR, registering a job definition, and submitting a cluster manager batch job. The cluster manager job (JobID: 9086737) begins executing julia demo.jl which immediately submits 4 more batch jobs (JobIDs: 4636723, 3957289, 8650218 and 7931648) to function as its workers. The manager then waits for the worker jobs to become available and register themselves with the manager by executing julia -e 'sock = connect(<manager_ip>, <manager_port>); Base.start_worker(sock, <cluster_cookie>)' in identical containers. Once the workers are available the remainder of the script sees them as ordinary julia worker processes (identified by the integer pid values shown in parentheses). Finally, the batch manager exits, releasing all batch resources, and writing all STDOUT & STDERR to CloudWatch logs for the clients to view or download.
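The final step above, viewing the job output in CloudWatch, can be sketched with the AWS CLI. This is a minimal sketch under stated assumptions: batch_job_logs is a hypothetical helper name, the log group is assumed to be /aws/batch/job (the AWS Batch default), and only the first page of log events is fetched.

```shell
# Hypothetical helper: print the CloudWatch log messages of a Batch job.
# Usage: batch_job_logs <job_id> [log_group]
# Assumes the job logs to /aws/batch/job unless a log group is given.
batch_job_logs() {
    # Look up the log stream that Batch assigned to the job's container.
    stream=$(aws batch describe-jobs --jobs "$1" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1

    # Fetch the first page of events, oldest first, one message per row.
    aws logs get-log-events \
        --log-group-name "${2:-/aws/batch/job}" \
        --log-stream-name "$stream" \
        --start-from-head \
        --query 'events[].[message]' --output text
}
```

Calling batch_job_logs with the ID of the manager job should show the output of demo.jl, provided the job has started and your credentials allow batch:DescribeJobs and logs:GetLogEvents.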

Building the Docker Image

To begin we'll want to build a docker image which contains:

  • julia
  • AWSClusterManagers
  • demo.jl

Example:

FROM julia:1.6

RUN julia -e 'using Pkg; Pkg.add("AWSClusterManagers")'
COPY demo.jl .

CMD ["julia", "demo.jl"]

Now build the Docker image with:

docker build -t 000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest .
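The image name used here follows the ECR pattern <account_id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>. As a small convenience, the name can be composed once and reused for both the build and the push. The sketch below assumes a hypothetical helper named ecr_image_uri and uses the placeholder account ID from this page.

```shell
# Hypothetical helper: compose an ECR image URI from its parts.
# Usage: ecr_image_uri <account_id> <region> <repository> [tag]
# The tag defaults to "latest" when omitted.
ecr_image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# Placeholder values matching the example on this page.
IMAGE=$(ecr_image_uri 000000000000 us-east-1 demo)
echo "$IMAGE"
```

With IMAGE set, the build becomes docker build -t "$IMAGE" . and the later push becomes docker push "$IMAGE".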

Pushing to ECR

Now we want to get our Docker image onto ECR. Start by logging into the ECR service (this assumes you have awscli configured with the correct permissions):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 000000000000.dkr.ecr.us-east-1.amazonaws.com

Now you should be able to push the image to ECR:

docker push 000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest

Registering a Job Definition

Let's register a job definition now.

NOTE: Registering a batch job requires the ECR image (see above) and an IAM role to apply to the job. The AWSBatchManager requires that the IAM role have access to the following operations:

  • batch:SubmitJob
  • batch:DescribeJobs
  • batch:DescribeJobDefinitions

Example:

aws batch register-job-definition --job-definition-name aws-batch-demo --type container --container-properties '
{
    "image": "000000000000.dkr.ecr.us-east-1.amazonaws.com/demo:latest",
    "vcpus": 1,
    "memory": 1024,
    "jobRoleArn": "arn:aws:iam::000000000000:role/AWSBatchClusterManagerJobRole",
    "command": ["julia", "demo.jl"]
}'

NOTE: A job definition only needs to be registered once and can be re-used for multiple job submissions.
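Because a definition only needs to be registered once, a deployment script may want to check for it before registering again. The following is a minimal sketch, assuming a hypothetical helper named job_definition_registered and assuming the list of revisions fits in a single page of results.

```shell
# Hypothetical helper: succeed (exit status 0) when at least one ACTIVE
# revision of the named job definition exists.
# Usage: job_definition_registered <job_definition_name>
job_definition_registered() {
    # length(...) is a JMESPath function, so the CLI prints a plain number.
    count=$(aws batch describe-job-definitions \
        --job-definition-name "$1" \
        --status ACTIVE \
        --query 'length(jobDefinitions)' \
        --output text) || return 1
    [ "${count:-0}" -gt 0 ] 2>/dev/null
}

# Example usage (requires AWS credentials, so left commented out):
# if ! job_definition_registered aws-batch-demo; then
#     echo "aws-batch-demo is not registered yet"
# fi
```

Note that running register-job-definition again with the same name is not an error; it creates a new revision, so the check is only needed when extra revisions are unwanted.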

Submitting Jobs

Once the job definition has been registered we can run the AWS Batch job. Running a job requires a compute environment with an associated job queue (aws-batch-queue in this example):

aws batch submit-job --job-name aws-batch-demo --job-definition aws-batch-demo --job-queue aws-batch-queue
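After submission the job moves through the Batch states (SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING) before ending in SUCCEEDED or FAILED. A script that needs to block until the job finishes could poll describe-jobs, as in the sketch below. The helper name wait_for_batch_job and the 15 second default interval are assumptions made for illustration.

```shell
# Hypothetical helper: poll a Batch job until it reaches a terminal state.
# Usage: wait_for_batch_job <job_id> [poll_interval_seconds]
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the job cannot be queried.
wait_for_batch_job() {
    while true; do
        status=$(aws batch describe-jobs --jobs "$1" \
            --query 'jobs[0].status' --output text) || return 2
        echo "Job $1: $status" >&2
        case "$status" in
            SUCCEEDED) return 0 ;;
            FAILED)    return 1 ;;
            None)      return 2 ;;  # the CLI prints None when no job matched
        esac
        sleep "${2:-15}"
    done
}

# Example usage (requires AWS credentials, so left commented out):
# JOB_ID=$(aws batch submit-job --job-name aws-batch-demo \
#     --job-definition aws-batch-demo --job-queue aws-batch-queue \
#     --query jobId --output text)
# wait_for_batch_job "$JOB_ID"
```

Capturing the job ID with --query jobId at submission time avoids having to look the job up by name afterwards.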

Running AWSBatchManager Locally

While it is generally preferable to run the AWSBatchManager as a batch job, it can also be run locally. In this case, worker batch jobs are submitted from your local machine and need to connect back to your machine from Amazon's network. Unfortunately, this may result in networking bottlenecks if you're transferring large amounts of data between the manager (your local machine) and the workers (batch jobs).

Batch Workers

As with the previous workflow, the client machine on the left begins by pushing a docker image to ECR (so the workers have access to the same code) and registering a job definition (if one doesn't already exist). The client machine then runs julia demo.jl as the cluster manager, which immediately submits 4 batch jobs (JobIDs: 4636723, 3957289, 8650218 and 7931648) to function as its workers. The client machine waits for the worker machines to come online. Once the workers are available, the remainder of the script sees them as ordinary julia worker processes (identified by the integer pid values shown in parentheses).

NOTE: Since the AWSBatchManager is not being run from within a batch job we need to give it some extra parameters when we create it.

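using AWSClusterManagers: AWSBatchManager
using Distributed

# Not running inside a batch job, so these values are passed explicitly.
mgr = AWSBatchManager(
    4,
    definition="aws-batch-worker",
    name="aws-batch-worker",
    queue="aws-batch-queue",
    region="us-west-1",
    timeout=5
)

# Launch the workers, as in demo.jl.
addprocs(mgr)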