How do you handle failed deployments in a CI/CD pipeline without disrupting production?

Question

This question asks how deployment failures in a CI/CD pipeline can be handled without adversely affecting the production environment. The focus is on rolling back failed deployments as quickly as possible to limit the impact on live services. Answers would cover techniques such as blue-green deployments, canary releases, and automated rollbacks, along with best practices for maintaining uptime and stability when a deployment fails.

Gagana · Answer 1 · Nov 3, 2024

Handling failed deployments in a CI/CD pipeline without disrupting production involves several strategies. Here are some best practices:

Blue-green or canary deployments: A blue-green deployment keeps two identical environments, one live and one idle with no traffic. During deployment, you release the new version to the idle environment (conventionally "green", while the live one is "blue") and route traffic to it only once it has tested stable. If problems appear afterwards, you switch traffic back. With a canary deployment, you release the update to a small percentage of users, monitor for issues, and then roll out fully.
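To make the two ideas concrete, here is a minimal sketch in Python. It is not tied to any real load balancer or platform: `BlueGreenRouter`, `cut_over`, `roll_back`, and `in_canary` are hypothetical names, and the health check is whatever function you pass in.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives traffic (hypothetical)."""
    live: str = "blue"
    idle: str = "green"

    def cut_over(self, is_healthy: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if it passes the health check."""
        if not is_healthy(self.idle):
            return False  # keep serving from the current live environment
        self.live, self.idle = self.idle, self.live
        return True

    def roll_back(self) -> None:
        """Send traffic back to the previously live environment."""
        self.live, self.idle = self.idle, self.live


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return bucket < percent
```

Hashing the user ID means the same user always lands in the same group, so a user does not flip between the old and new version from one request to the next.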

Implement a Rollback Mechanism: Define automated rollback policies that trigger when a deployment fails. Tools such as Kubernetes support rolling updates and can roll back to the previous stable version if a new deployment fails.
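The rollback logic itself can be sketched in a platform-neutral way. In the example below, `apply` and `verify` are hypothetical callbacks standing in for whatever your tooling does to release a version and check that it is healthy. This is an illustration of the pattern, not how Kubernetes implements it.

```python
from typing import Callable


def deploy_with_rollback(
    current: str,
    candidate: str,
    apply: Callable[[str], None],
    verify: Callable[[], bool],
) -> str:
    """Deploy `candidate`; if verification fails, re-apply `current`.

    Returns the version that is running when the function finishes.
    """
    apply(candidate)
    try:
        healthy = verify()
    except Exception:
        # Treat a crashing verification step as a failed deployment.
        healthy = False
    if healthy:
        return candidate
    apply(current)  # automatic rollback to the last known-good version
    return current
```

The key design choice is that the previous version is recorded before anything changes, so the rollback never depends on the state of the failed release.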

Feature Toggles: Feature flags let you turn features on or off without redeploying. This enables release on demand and makes it easy to disable a feature if something goes wrong, without reverting the entire deployment.
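As a rough illustration, a feature flag can be as simple as a lookup that defaults to "off". The `FeatureFlags` class, the `new_checkout` flag name, and the `checkout` function below are all made up for this example. Real systems usually load flags from a config service or database so they can change at runtime.

```python
from typing import Dict, Optional


class FeatureFlags:
    """In-memory flag store (hypothetical); unknown flags are treated as off."""

    def __init__(self, flags: Optional[Dict[str, bool]] = None) -> None:
        self._flags: Dict[str, bool] = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout(flags: FeatureFlags) -> str:
    """Pick the code path based on the flag instead of on what was deployed."""
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Both code paths ship in the same build, so turning the flag off is the "rollback" and no redeploy is needed.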

Testing Environments: Test extensively in a staging environment that mirrors production as closely as possible. This includes unit, integration, and load tests, which help catch problems early and lower the chance of breaking production.

Monitor and Alert: With tools like Prometheus and Grafana, you can monitor critical metrics and configure alerts so that problems are spotted quickly. That way, failures can be caught early, ideally before end users are affected.
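The kind of check an alert encodes can be sketched as an error rate over a sliding window of recent requests. This is only an illustration of the idea and not the Prometheus or Grafana API. `ErrorRateMonitor`, the window size, and the 5% threshold are assumptions chosen for the example.

```python
from collections import deque
from typing import Deque


class ErrorRateMonitor:
    """Tracks the outcome of the last `window` requests (hypothetical helper)."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._results: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record one request outcome; the oldest outcome drops out when full."""
        self._results.append(ok)

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        failures = sum(1 for ok in self._results if not ok)
        return failures / len(self._results)

    def should_alert(self) -> bool:
        """True when the recent error rate exceeds the threshold."""
        return self.error_rate() > self.threshold
```

A signal like `should_alert()` is also a natural trigger for pausing a canary rollout or starting a rollback.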

CI/CD Pipeline Guardrails: When defining your pipeline, implement guardrails at critical steps, such as automatically running smoke tests immediately after each deployment. This early testing can quickly detect issues, preventing faulty code from progressing further down the pipeline and reaching production.
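A smoke-test gate could look roughly like the sketch below. The check names and functions are placeholders. In a real pipeline each check would hit a health endpoint or run a quick end-to-end request, and the step would exit with the returned code so the pipeline stops on failure.

```python
from typing import Callable, Dict, List


def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            # A check that raises counts as a failure rather than crashing the gate.
            passed = False
        if not passed:
            failed.append(name)
    return failed


def gate_exit_code(failed: List[str]) -> int:
    """Non-zero exit code tells the pipeline to halt before the next stage."""
    return 1 if failed else 0
```

For example, a pipeline step might call `sys.exit(gate_exit_code(run_smoke_tests(checks)))` after the deploy step, so a single failing check blocks promotion.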