Open Access
https://doi.org/10.37547/ijmef/Volume06Issue01-07
Engineering Practices For Ensuring Resilience In Scalable Cloud Systems
Abstract
The article systematizes engineering practices that ensure resilience in scalable cloud architectures: designing for failure, automated recovery, observability, reliability management through SLOs and error budgets, and experimental verification of stability using chaos engineering methods. It proposes a practice-oriented taxonomy of these approaches and demonstrates how to link technical measures to manageable reliability goals.
Keywords
Resilience, reliability, cloud systems
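As an illustration of the SLO and error-budget idea named in the abstract, the following is a minimal sketch, not the article's own method. It assumes a 30-day evaluation window and a request-based availability indicator; the function names `error_budget_minutes` and `budget_remaining` are hypothetical and chosen only for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO tolerates.

    slo_target is a fraction such as 0.999; the 30-day default window
    is an assumption made for this sketch.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    The budget is the number of requests allowed to fail under the SLO.
    A negative result means the budget is exhausted and the SLO is violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(error_budget_minutes(0.999))
    print(budget_remaining(0.999, total_requests=1_000_000, failed_requests=250))
```

Under these assumptions, a 99.9% target over 30 days tolerates roughly 43.2 minutes of downtime, and 250 failures out of one million requests leave about three quarters of the budget unspent. A team could use the remaining fraction as a simple signal for when to slow releases in favor of reliability work.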
Copyright License
Copyright (c) 2026 Damir Rakhmaev

This work is licensed under a Creative Commons Attribution 4.0 International License.