Suppose an ML repository contains 3 training applications; briefly, the layout is:
root
|__Dockerfile
|__requirements.txt (contains **heavy dependencies**, e.g., numpy, sklearn, needed by all 3 apps)
|__app_0
| |__training_0.py
| |__Dockerfile0
|__app_1
| |__training_1.py
| |__Dockerfile1
|__app_2
| |__training_2.py
| |__Dockerfile2
|__heavy_utils
|__utils.py
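For context, each training script imports the shared module and the heavy dependencies, which is why all 3 apps need the same requirements. A minimal sketch of app_0/training_0.py (the helper names and file paths are illustrative, not the real code):

```python
# app_0/training_0.py -- illustrative sketch only; helper names are hypothetical
import numpy as np             # heavy dependency from the root requirements.txt
from heavy_utils import utils  # shared module at the repository root

def main():
    # SageMaker mounts the "train" channel here by convention
    features = np.load("/opt/ml/input/data/train/features.npy")  # hypothetical file
    model = utils.fit_model(features)                            # hypothetical helper
    utils.save_model(model, "/opt/ml/model")                     # model artifacts dir

if __name__ == "__main__":
    main()
```
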
There are two approaches to building app_0, app_1, and app_2:
- One container for multiple apps - build a single image from the Dockerfile at the repository root, with a few COPY commands at the end of that Dockerfile:
COPY app_0 .
COPY app_1 .
COPY app_2 .
- Multiple containers for multiple apps - build one image per app, using the individual Dockerfile$i inside each app_$i.
I tried both approaches; here are the pros and cons I found:
- One container for multiple apps
Pros: the image uploaded to AWS ECR is smaller overall, since all 3 apps share the heavy dependencies in a single image.
Cons: when I plug this image into SageMaker training jobs, SageMaker cannot tell the 3 apps apart, because a Docker image ends up with only one effective ENTRYPOINT.
- Multiple containers for multiple apps
Pros: I can give a different ECR image to each SageMaker training job, each with its own ENTRYPOINT (see the sketch after this list).
Cons: the heavy dependencies are duplicated across those ECR images.
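For reference, with the multi-container approach I wire the jobs up roughly like this (the account ID, region, image URIs, role ARN, and S3 path are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# One Estimator per app, each backed by its own ECR image whose ENTRYPOINT
# was baked in at `docker build` time from app_$i/Dockerfile$i.
estimators = {
    app: Estimator(
        image_uri=f"123456789012.dkr.ecr.us-east-1.amazonaws.com/{app}:latest",  # placeholder
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )
    for app in ("app_0", "app_1", "app_2")
}

# estimators["app_0"].fit({"train": "s3://my-bucket/app_0/train"})  # placeholder S3 URI
```
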
I'd like to know:
- Which approach is more conventional, or is there a better practice?
- Can I specify a custom ENTRYPOINT for a SageMaker training job (as I can for a processing job) after the Docker image has been built? Specifically, I'm using the SageMaker SDK (sagemaker.estimator.Estimator) to build a SageMaker pipeline. AFAIK, the entry_point option only takes effect outside the container, i.e., it runs an external script from local disk or S3, which behaves differently from the entrypoint argument of sagemaker.processing.Processor. A sketch of the difference I mean follows below.
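To make the second question concrete, this is roughly the contrast, as I understand it (image URI, role ARN, instance settings, and script paths are placeholders):

```python
from sagemaker.estimator import Estimator
from sagemaker.processing import Processor

image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-apps:latest"  # placeholder
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"             # placeholder

# Processing job: `entrypoint` overrides the image's ENTRYPOINT at run time,
# so a single shared image could serve all 3 apps.
processor = Processor(
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entrypoint=["python3", "/opt/ml/code/app_0/training_0.py"],  # placeholder path inside the image
)

# Training job: `entry_point` points at a script outside the image (local or S3)
# that the SDK uploads and runs inside the container; it does not let me pick
# a different in-image entrypoint per app, which is what I want here.
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entry_point="train_launcher.py",  # hypothetical local script
)
```
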