Tech stack: Apache Airflow, AWS Batch, AWS ECR, AWS S3, Docker, Python
The problem we were trying to solve was to build a scalable system that could run any number of jobs in series or in parallel. These jobs could be Python processes, bash scripts, binary executables, or essentially any piece of software that can be triggered from a command line interface.
Initially, our system ran all of our Python processes on a single EC2 machine hosted in AWS, where each job was triggered via cron. That did not scale: as the number of jobs grew, the CPU and memory on the machine became a bottleneck and required manual resizing.
Airflow to the rescue! Airflow is a great workflow orchestration system with a great visual component. It allows us to describe our jobs programmatically as DAGs (Directed Acyclic Graphs). Along with that, Airflow provides out-of-the-box solutions for monitoring, retries, logging, and analytics.
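The DAG idea itself is easy to see in plain Python. As a minimal sketch (job names here are made up, not from our actual pipeline), the standard library's graphlib can order a set of dependent jobs into a valid run sequence, which is exactly the shape Airflow schedules for us:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: 'extract' runs first, two transforms can run
# in parallel once it finishes, and 'load' waits for both transforms.
jobs = {
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

# static_order() yields one valid execution order for the DAG.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

Airflow adds the scheduling, retries, and UI on top, but under the hood it is resolving exactly this kind of dependency graph.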
AWS Batch provided the scalability. Integrating the AWS Batch service with Airflow gave us a way to run each job on its own EC2 machine, which is terminated after the run. Thus, each job gets its own CPU and memory resources, which can be configured per job.
Using the 'BatchOperator', we were able to integrate the AWS Batch service with Airflow.
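A sketch of what such a DAG file can look like, assuming a recent version of the Airflow Amazon provider package (the import path and some parameter names have changed across Airflow versions, and the queue and job definition names below are placeholders, not our real ones):

```python
# Sketch: wiring one AWS Batch job into an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

with DAG(
    dag_id="nightly_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_job = BatchOperator(
        task_id="run_my_job",
        job_name="my-job",                   # name shown in the Batch console
        job_definition="my-job-definition",  # points at the Docker image in ECR
        job_queue="my-job-queue",            # Batch queue / compute environment
        overrides={},                        # optional per-run command/env/resource overrides
    )
```

The operator submits the job to Batch and polls until it finishes, so from Airflow's point of view it behaves like any other task: it can be retried, chained after other tasks, and monitored in the UI.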
Eventually, our system looked like this: a single EC2 machine to host Airflow (the webserver and scheduler). Another machine, called the 'Airflow Release' box, held the GitHub repos and allowed us to push code changes. For any code change, we would build a Docker image on that machine and push it to ECR (the AWS service that serves as a repository for container images) for the AWS Batch service to consume. Each job triggered from Airflow would spin up its own machine, run, and then terminate. Pretty cool, eh?
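The release flow on that box boils down to a few CLI steps. A rough sketch, where the account ID, region, and image name are all placeholders:

```shell
# Build the job's image from the repo checkout (names are placeholders).
docker build -t my-job:latest .

# Authenticate Docker to ECR (AWS CLI v2), then tag and push the image.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag my-job:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest
```

Once the image is in ECR, the Batch job definition that references it picks up the new tag on the next run, so no change is needed on the Airflow side for routine code updates.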