I have a requirement to ingest data from multiple on-prem data sources into my Redshift cluster. The ingestion will be a scheduled job that runs every 6 hours. The process should identify the delta records and load only new or changed records into Redshift. A restart option should also be available for every step of the process. I am trying to do this either entirely with AWS services or with a combination of Python programs and AWS services.
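To make the delta and restart requirements concrete, this is roughly what I have in mind: a minimal sketch, assuming every source row carries a last-modified timestamp. The field name `updated_at`, the JSON checkpoint file, and all function names here are placeholders of mine, not anything that exists yet. The watermark only advances after the load succeeds, so a failed run can simply be rerun.

```python
import json
import os
import tempfile
from datetime import datetime


def load_watermark(path):
    """Return the last successfully loaded timestamp, or None on the first run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return datetime.fromisoformat(json.load(f)["watermark"])


def save_watermark(path, watermark):
    """Write the watermark via a temp file and rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"watermark": watermark.isoformat()}, f)
    os.replace(tmp, path)


def select_delta(rows, watermark, ts_field="updated_at"):
    """Keep only rows changed after the watermark (every row when there is no watermark yet)."""
    if watermark is None:
        return list(rows)
    return [r for r in rows if r[ts_field] > watermark]


def run_once(rows, checkpoint_path, load):
    """Load the delta, then advance the watermark. Returns the number of rows loaded."""
    watermark = load_watermark(checkpoint_path)
    delta = select_delta(rows, watermark)
    if not delta:
        return 0
    # If load() raises, the watermark is left untouched and the next run retries the same delta.
    load(delta)
    save_watermark(checkpoint_path, max(r["updated_at"] for r in delta))
    return len(delta)
```

In the real setup `rows` would come from a source extract and `load` would write to S3, but I am not sure whether a timestamp watermark like this is the right way to detect changes, or whether it should be kept in a file at all.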
My idea is to set up a data flow from the external sources to S3, temporarily launch an EC2 instance for any data processing/wrangling that is needed, write the curated data back to S3, terminate the EC2 instance, and then load the data into Redshift using AWS Data Pipeline.
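For the final load step, I am thinking of a staging-table merge rather than a plain append, so that changed records replace their old versions. Below is a rough sketch that only builds the SQL statements; it assumes the curated files are CSV with a header row, that the target table has a single key column, and that the table name is not schema-qualified. The function name and all arguments are hypothetical, and the identifiers are expected to come from trusted configuration since they are interpolated directly.

```python
def build_merge_statements(target, key, s3_uri, iam_role):
    """Return the SQL statements for a delete-then-insert merge through a temp staging table."""
    stage = f"{target}_stage"
    return [
        "BEGIN;",
        # Temp table with the same column definitions as the target.
        f"CREATE TEMP TABLE {stage} (LIKE {target});",
        # Bulk-load the curated delta files from S3 into the staging table.
        f"COPY {stage} FROM '{s3_uri}' IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;",
        # Remove the old versions of any rows that arrive again in this batch.
        f"DELETE FROM {target} USING {stage} WHERE {target}.{key} = {stage}.{key};",
        # Insert both the new rows and the fresh versions of the changed rows.
        f"INSERT INTO {target} SELECT * FROM {stage};",
        "COMMIT;",
    ]
```

Because everything runs in one transaction, my understanding is that a failed load would roll back and could be rerun against the same S3 files, but I would appreciate confirmation that this fits with Data Pipeline.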
Can you suggest some pointers to start with? If you have experience with a similar project, please share it. Also, if possible, please share a design and associated code for reference.