AWS SRE Strategy
Site reliability engineering (SRE) is the practice of using software
tools to automate IT infrastructure tasks such as system management and
application monitoring. Organizations use SRE to ensure their software
applications remain reliable amidst frequent updates from development teams.
SRE especially improves the reliability of scalable software systems because
managing a large system using software is more sustainable than manually
managing hundreds of machines.
Why is site reliability engineering important?
Site reliability describes the stability and quality of service that an
application offers after being made available to end users. Software maintenance
sometimes affects software reliability if technical issues go undetected. For
example, when developers make new changes, they might inadvertently impact the
existing application and cause it to crash for certain use cases.
The following are some benefits of
site reliability engineering (SRE) practices:
Improved collaboration:
SRE improves collaboration between
development and operations teams. Developers often have to make rapid changes
to an application to release new features or fix critical bugs. On the other
hand, the operations team has to ensure seamless service delivery. Hence, the
operations team uses SRE practices to closely monitor every update and promptly
respond to any issues that arise due to changes.
Enhanced customer experience:
Organizations use an SRE model to
ensure software errors do not impact the customer experience. For example,
software teams use SRE tools to automate the software development lifecycle.
This reduces errors, meaning the team can prioritize new feature development
over bug fixes.
Improved operations planning:
SRE teams accept that there is a realistic chance of software failure. Therefore, teams plan the appropriate incident response to minimize the impact of downtime on the business and end users. They can also better estimate the cost of downtime and understand the impact of such incidents on business operations.
The following are some
key principles of site reliability engineering (SRE):
Application monitoring:
SRE teams accept that errors are a
part of the software deployment process. Instead of striving for a perfect
solution, they monitor software performance in terms of service-level
agreements (SLAs), service-level indicators (SLIs), and service-level
objectives (SLOs). They observe and monitor performance metrics after deploying
the application in production environments.
Gradual change implementation:
SRE practices encourage the release of frequent but small changes to maintain system reliability. SRE automation tools use consistent, repeatable processes to do the following:
o Reduce risks due to changes
o Provide feedback loops to measure system performance
o Increase the speed and efficiency of change implementation
Automation for reliability
improvement:
SRE uses policies and
processes that embed reliability principles in every step of the delivery
pipeline. Some strategies that automatically resolve problems include the
following:
o Developing quality gates
based on service-level objectives to detect issues earlier
o Automating build testing
using service-level indicators
o Making architectural
decisions that ensure system resiliency at the outset of software development
Site reliability engineering (SRE) teams measure the quality of service delivery and reliability using the following metrics:
Service-level objectives:
Service-level objectives (SLOs) are
specific and quantifiable goals that you are confident the software can achieve
at a reasonable cost to other metrics, such as the following:
o Uptime, or the time a
system is in operation
o System throughput
o System output
o Download rate, or the
speed at which the application loads
An SLO is a promise the software makes to the customer. For example, you might set a 99.95% uptime SLO for your company's food delivery app.
Service-level indicators:
Service-level indicators (SLIs) are
the actual measurements of the metric an SLO defines. In real-life situations,
you might get values that match or differ from the SLO. For example, your
application is up and running 99.92% of the time, which is lower than the promised
SLO.
Service-level agreements:
Service-level agreements (SLAs) are legal documents that state what happens when one or more SLOs are not met. For example, an SLA might state that the technical team will resolve your customer's issue within 24 hours after a report is received. If your team cannot resolve the problem within the specified duration, you might be obligated to refund the customer.
Error budgets:
Error budgets are the noncompliance tolerance for the SLO. For example, an uptime SLO of 99.95% means the allowed downtime is 0.05%. If the software downtime exceeds the error budget, the software team devotes all resources and attention to stabilizing the application.
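As a worked example of the error budget arithmetic (assuming a 30-day month):
Error budget = (1 − 0.9995) × 30 days × 24 hours × 60 minutes ≈ 21.6 minutes of allowable downtime per month.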
Availability Definition
High availability means running instances of the same application across multiple Availability Zones (for example, a multi-AZ Auto Scaling group behind a multi-AZ load balancer). High availability usually goes hand in hand with horizontal scaling: it means running your application or system in at least two data centers (that is, Availability Zones). The goal of high availability is to survive the loss of a data center. High availability can be passive (a Multi-AZ standby, for example) or active (as with horizontal scaling).
The following elements help you implement highly available systems:
Redundancy: ensure that critical system components have an identical backup component with the same data that can take over in case of failure. Horizontally scale your application to improve reliability, and dynamically acquire computing resources to meet the demand you are monitoring.
Monitoring: identify problems in production systems that may disrupt or degrade service. Monitor the demand, capacity, utilization, and size of your application using appropriate tools.
Failover: the ability to switch from an active system component to a redundant component in case of failure, imminent failure, degraded performance, or degraded functionality.
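As an illustration of failover, a hedged Terraform sketch of a Route 53 DNS failover pair; the zone ID, record names, and endpoints are placeholders, not values from this document:

```hcl
# Health check drives automatic failover from primary to standby.
resource "aws_route53_health_check" "primary" {
  fqdn              = "app.example.com" # placeholder endpoint
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "www.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "www.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["standby.example.com"]
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY" # receives traffic only when the primary is unhealthy
  }
}
```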
For a typical microservices architecture, the focus for disaster recovery should be on the downstream services that maintain the state of the application, such as file systems, databases, and queues. When creating a disaster recovery strategy, organizations most commonly plan for the recovery time objective and the recovery point objective.
Recovery time objective is the maximum
acceptable delay between the interruption of service and restoration of
service. This objective determines what is considered an acceptable time window
when service is unavailable and is defined by the organization.
Recovery point objective is the maximum acceptable amount of time since the last data recovery point. This objective determines what is considered an acceptable loss of data between the last recovery point and the interruption of service and is defined by the organization.
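The recovery point objective is bounded largely by backup frequency. A hedged Terraform sketch using AWS Backup, where an hourly schedule bounds the RPO at roughly one hour (the vault, plan names, and retention are illustrative):

```hcl
resource "aws_backup_vault" "main" {
  name = "main-vault"
}

resource "aws_backup_plan" "hourly" {
  name = "hourly-backups"

  rule {
    rule_name         = "every-hour"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 * * * ? *)" # top of every hour -> RPO of about 1 hour

    lifecycle {
      delete_after = 14 # retain backups for 14 days
    }
  }
}
```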
Automatic recovery and failback: to mitigate disruptions, the ability to switch back from a redundant component to the primary active component once it has recovered from failure.
Deployments / automate change: use automation to deploy, develop, and modify your application. Manual steps lead to poor results; reduce toil wherever possible. Implement change management in a way that de-conflicts potential changes. Aim for the ability to deploy applications with minimal downtime (for example, using Terragrunt/Terraform).
Test: create and regularly test failure and recovery procedures.
Availability best practices
For each service or pattern below, Description explains what it is, Service Implementation explains how it provides high availability, and Usage explains how to apply it.
Load Balancer
Description: Distributes incoming connections across a group of servers/services. When designing applications, use a load balancer whenever possible; this is the first step toward high availability.
Service Implementation: Network Load Balancer supports TCP/TLS/UDP, handles millions of requests, and works at layer 4. A load balancer serves as the single point of contact for clients. Clients send requests to the load balancer, and the load balancer sends them to targets in one or more Availability Zones. To configure your load balancer, you create target groups and then register targets with your target groups.
Usage: A load balancer is most effective when each enabled Availability Zone has at least one registered target, so configure more than one AZ. When you enable an Availability Zone for the load balancer, it starts routing traffic to the registered targets in that zone. Load balancer endpoints should be in public subnets, so make sure there is one public subnet in each enabled Availability Zone. Create security groups, associate them with the load balancer, and allow traffic on the required port in both directions. The load balancer continually checks the health of its targets and stops sending requests to unhealthy targets. Make sure the EKS cluster has been created before configuring the load balancer.
- Create two AZs and enable them for the load balancer
- Configure a target group
- Register targets in each Availability Zone
- Configure a load balancer and a listener
- Create security groups for your ALB
- Associate the security groups with the load balancer
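A minimal Terraform sketch of this checklist, assuming a VPC with one public subnet per AZ and an ACM certificate already exist (the var.* references and names are placeholders):

```hcl
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids # one public subnet per enabled AZ
}

resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip" # "IP mode", suitable for pods running in EKS

  health_check {
    path = "/healthz" # unhealthy targets stop receiving traffic
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}
```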
Compute
Description: Compute services are also known as Infrastructure as a Service (IaaS). AWS compute services provide virtual server instances, storage, and APIs that let users migrate workloads to virtual machines.
Service Implementation: Amazon EC2 and other services that let you provision computing resources provide high-availability features such as load balancing, auto scaling, and provisioning across Availability Zones (AZs), which are isolated locations within an AWS Region.
Usage: Use when consumers may send a huge volume of requests within a defined time period and the system is deployed in multiple Regions with multiple AZs. EC2 Auto Scaling groups are configured to launch instances that automatically join the Kubernetes cluster.
- Use two AZs in the same Region
- Spread worker nodes and workload across multiple AZs
- Create the EKS cluster
- Create EC2 Auto Scaling groups and attach them to EKS
AWS Compute Service Level Agreement:
- Region-level SLA: at least 99.99%
- Instance-level SLA: at least 99.5%
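A hedged Terraform sketch of a multi-AZ Auto Scaling group; the AMI, instance type, and subnet IDs are placeholders:

```hcl
resource "aws_launch_template" "worker" {
  name_prefix   = "worker-"
  image_id      = var.ami_id # placeholder AMI
  instance_type = "m5.large"
}

resource "aws_autoscaling_group" "worker" {
  name                = "worker-asg"
  min_size            = 2
  max_size            = 6
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids # subnets in at least two AZs
  health_check_type   = "ELB" # replace instances the load balancer marks unhealthy

  launch_template {
    id      = aws_launch_template.worker.id
    version = "$Latest"
  }
}
```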
Databases
Description: A key-value NoSQL database designed to run high-performance applications at any scale.
Service Implementation: Creates a replica set, a group of mongod instances that hold the same data. The purpose of replication is to ensure high availability in case one of the servers goes down. For replica sets, the reference deployment launches multiple servers in different Availability Zones. When a primary instance fails, one of the secondary instances from another Availability Zone becomes the new primary node, guaranteeing automatic failover.
Usage: For new platforms, create replicas deployed in multiple Availability Zones.
Amazon EKS
Description: Amazon Elastic Kubernetes Service (Amazon EKS) runs Kubernetes control plane and data plane instances across multiple Availability Zones to ensure high availability within an AWS Region.
Service Implementation: Amazon EKS automatically detects and replaces unhealthy control plane instances, and it provides automated version upgrades and patching for them. The control plane consists of at least two API server nodes and three etcd nodes that run across three Availability Zones within a Region. Amazon EKS uses the architecture of AWS Regions to maintain high availability.
Usage: Managed node groups automate the provisioning and lifecycle management of EC2 nodes. Use the EKS API (for example, through Terraform) to create, scale, and upgrade managed nodes. A VPC and subnets must exist, or be created from a Terraform template, before creating an Amazon EKS cluster; each cluster runs in its own VPC. In Amazon EKS, the maximum number of pods per node depends on the node type and ranges from 4 to 737; in this solution the number of pods is defined inside the Terraform script. When a new Kubernetes version is available, update the EKS cluster to the latest version. Delete the resources associated with an EKS cluster if it is no longer used.
- Create two AZs and a VPC spanning them
- Create two or more subnets in a single VPC
- Create the EKS cluster and use the Kubernetes Cluster Autoscaler to scale nodes
- Create the control plane and worker nodes, and create pods under a node group
- Create at least two nodes in two separate Availability Zones
- Keep the EKS cluster balanced across AZs
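A minimal Terraform sketch of an EKS cluster with a managed node group spread across multiple AZs; the IAM role ARNs and subnet IDs are placeholders:

```hcl
resource "aws_eks_cluster" "main" {
  name     = "main"
  role_arn = var.cluster_role_arn # cluster IAM role, assumed to exist

  vpc_config {
    subnet_ids = var.private_subnet_ids # subnets in at least two AZs
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids # spreads nodes across the AZs

  scaling_config {
    min_size     = 2
    desired_size = 2
    max_size     = 6
  }
}
```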
AWS Lambda
Description: AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you.
Service Implementation: Lambda runs your function in multiple Availability Zones to ensure that it is available to process events even during a service interruption in a single zone.
Usage: If you configure your function to connect to a virtual private cloud (VPC) in your account, specify subnets in multiple Availability Zones to ensure high availability.
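A hedged Terraform sketch of the multi-AZ VPC configuration for a function; the handler package, role, and security group are placeholders:

```hcl
resource "aws_lambda_function" "handler" {
  function_name = "handler"
  role          = var.lambda_role_arn # execution role, assumed to exist
  runtime       = "python3.12"
  handler       = "app.handler"
  filename      = "handler.zip" # placeholder deployment package

  vpc_config {
    subnet_ids         = var.private_subnet_ids # subnets in multiple AZs
    security_group_ids = [var.lambda_sg_id]
  }
}
```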
Apache Kafka
Description: Apache Kafka is an open-source distributed event streaming platform. Achieve high availability with a Kafka cluster that spans multiple Availability Zones or Regions.
Service Implementation: This solution creates a cluster that spans two Regions; if the main Availability Zone is unavailable for some reason, the service automatically fails over to the other Availability Zones.
Usage: This ensures that other copies remain available even if an Availability Zone experiences failures. One typical (all-active) deployment pattern is a single AWS Region with two Availability Zones: one Kafka cluster is deployed in each AZ along with Apache ZooKeeper and Kafka producer and consumer instances.
AWS and Kafka cluster deployment:
- Kafka producers and a Kafka cluster are deployed in each AZ.
- Data is distributed evenly across the two Kafka clusters.
- Kafka consumers aggregate data from the different Kafka clusters.
MSK
Description: Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides the control plane operations, such as those for creating, updating, and deleting clusters. It uses Apache Kafka data plane operations and runs open-source versions of Apache Kafka.
Service Implementation: Amazon MSK automatically provisions, configures, and manages your Apache Kafka cluster operations and Apache ZooKeeper nodes. All clusters are distributed across multiple AZs (three is the default), are covered by the Amazon MSK service-level agreement, and are supported by automated systems that detect and respond to issues within the cluster infrastructure and Apache Kafka software.
Usage: If a component fails, Amazon MSK automatically replaces it without downtime to your applications. Amazon MSK manages the availability of the Apache ZooKeeper nodes, so you don't need to start, stop, or directly access them. It also automatically deploys software patches as needed to keep your cluster up to date and running smoothly.
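A minimal Terraform sketch of a three-AZ MSK cluster; the Kafka version, instance type, and subnet variable are illustrative assumptions:

```hcl
resource "aws_msk_cluster" "events" {
  cluster_name           = "events"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3 # one broker per AZ, the MSK default layout

  broker_node_group_info {
    instance_type  = "kafka.m5.large"
    client_subnets = var.private_subnet_ids # three subnets, three AZs

    storage_info {
      ebs_storage_info {
        volume_size = 100 # GiB per broker
      }
    }
  }
}
```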
Service Mesh (Gloo Mesh)
Description: Gloo Mesh is enterprise Istio with multi-cluster and multi-mesh management capabilities across multiple clusters and VMs. It controls, secures, and observes the traffic flow between your microservices, regardless of where they are running.
Service Implementation: A Gloo Mesh setup consists of one management cluster and one or more workload clusters that run service meshes registered with and managed by the management cluster. The management cluster serves as the management plane, and the workload clusters serve as the data plane. Gloo Mesh can discover meshes and workloads, establish federated identity, and enable global traffic routing and load balancing.
Usage: Gloo Mesh can discover services, coordinate service meshes, configure and observe behavior, federate policies, and enforce security consistently. Istio provides the core building blocks of external authorization and request routing. Load balancer endpoints provide access for clients to the application over the internet and allow reachability to pods deployed in EKS, where the Envoy proxy, oauth2-proxy, and application run. Install the Istio agents on the EKS cluster.
- Configure a VirtualService to route traffic
- Enable an OIDC provider
- Use the ingress gateway (an Envoy reverse proxy) to transparently relay authorization requests
- Create a namespace for the deployment
- Deploy the Envoy proxy in a dedicated namespace
- Deploy the OIDC proxies in a dedicated namespace
- Deploy the Istio services in the istio-system Kubernetes namespace
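One way to bootstrap Istio on the EKS cluster is through its official Helm charts; a hedged Terraform sketch using the Helm provider's v2 set-block syntax (chart versions and release names are assumptions):

```hcl
resource "helm_release" "istio_base" {
  name             = "istio-base"
  repository       = "https://istio-release.storage.googleapis.com/charts"
  chart            = "base"
  namespace        = "istio-system"
  create_namespace = true
}

resource "helm_release" "istiod" {
  name       = "istiod"
  repository = "https://istio-release.storage.googleapis.com/charts"
  chart      = "istiod"
  namespace  = "istio-system"
  depends_on = [helm_release.istio_base] # control plane needs the base CRDs first
}
```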
AWS API Gateway
Description: Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and protecting REST APIs at any scale.
Service Implementation: Supported by Route 53 routing policies that direct traffic from the APIs to more than one infrastructure in different Regions, the service allows you to balance requests according to the infrastructure capacity of each Region.
Usage: To prevent your APIs from being overwhelmed by too many requests, API Gateway throttles requests to your APIs. Specifically, API Gateway sets a limit on the steady-state rate and on bursts of request submissions against all APIs in your account.
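A hedged Terraform sketch of stage-level throttling, assuming an existing REST API and stage (aws_api_gateway_rest_api.api and aws_api_gateway_stage.prod are placeholders); the limits are illustrative:

```hcl
resource "aws_api_gateway_method_settings" "throttle" {
  rest_api_id = aws_api_gateway_rest_api.api.id
  stage_name  = aws_api_gateway_stage.prod.stage_name
  method_path = "*/*" # apply to all methods in the stage

  settings {
    throttling_rate_limit  = 100 # steady-state requests per second
    throttling_burst_limit = 200 # short burst allowance
  }
}
```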
Mongo DB
Description: MongoDB is an open-source NoSQL database that provides support for JSON-styled, document-oriented storage systems.
Service Implementation: It supports a flexible data model that enables you to store data of any structure and provides a rich set of features, including full index support, sharding (distribution of data across multiple nodes), and replication (replica sets in different Availability Zones).
Usage: Self-service deployment: build a MongoDB cluster by automating configuration and deployment tasks. The Quick Start reference deployment provides a self-service deployment of the MongoDB replica set cluster.
MongoDB Atlas
Description: MongoDB Atlas is a global cloud document database service for modern applications. Deploying fully managed MongoDB helps ensure availability, scalability, and security compliance.
Service Implementation: For a fully managed database, use MongoDB Atlas instead of deploying the Quick Start yourself. MongoDB Atlas creates a new VPC for your managed databases and automates potentially time-consuming administration tasks such as managing, monitoring, and backing up your MongoDB deployments.
Usage: Database replica sets (primary and secondary), failover of a primary replica, deploying Atlas across multiple cloud zones and cloud providers, and adjusting the level of availability guarantees using write and read concerns.
Amazon ECS
Description: Amazon ECS is a regional service that simplifies running containers in a highly available manner across multiple Availability Zones within an AWS Region.
Service Implementation: Amazon ECS includes multiple scheduling strategies that place containers across your clusters based on your resource needs (for example, CPU or RAM) and availability requirements.
Usage: Single points of failure (SPOF) are commonly eliminated with an N+1 or 2N redundancy configuration, where N+1 is achieved via load balancing among active-active nodes, and 2N is achieved by a pair of nodes in active-standby configuration. Use a scalable technique such as a load-balanced cluster, or assume an active-standby pair.
Dynamo DB
Description: Amazon DynamoDB is a fast and flexible NoSQL database service with high availability and high durability. DynamoDB enables customers to offload to AWS the administrative burdens of operating and scaling distributed databases (hardware provisioning, setup and configuration, throughput capacity planning, replication, software patching, and cluster scaling).
Service Implementation: DynamoDB is designed internally to automatically partition data and spread replication and incoming traffic across multiple partitions. Partitions are stored on numerous backend servers distributed across multiple Availability Zones within a single Region.
Usage:
- Use global tables: replicate DynamoDB tables automatically in selected AWS Regions, with multi-master read/write capability and eventual consistency
- Use DynamoDB Accelerator (DAX): DAX is an in-memory caching service (roughly 10x faster than DynamoDB alone)
- Configure client wait: change the DynamoDB client's wait time to 3 seconds for a response before timing out
AWS Service Level Agreements (SLAs):
- AWS promises a monthly uptime percentage of 99.99% for DynamoDB
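A hedged Terraform sketch of a global table (global tables require streams enabled; the table schema and replica Region are illustrative assumptions):

```hcl
resource "aws_dynamodb_table" "orders" {
  name             = "orders" # hypothetical table
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "order_id"
  stream_enabled   = true # required for global table replication
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "order_id"
    type = "S"
  }

  replica {
    region_name = "us-west-2" # second Region, example value
  }
}
```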
Scalability Definition
Scalability is the ability of a workload to perform its agreed function when load or scope changes: an application or system can handle greater loads by adapting. Scalability is linked to, but different from, high availability. There are two kinds of scalability:
- Vertical Scalability: increasing the size of the instance. There is usually a limit to how much you can vertically scale (a hardware limit). Vertical scaling means increasing or decreasing instance size (scale up / scale down).
- Horizontal Scalability (elasticity): increasing the number of instances or systems for your application. Horizontal scaling implies distributed systems and is very common for web and modern applications, which are easy to scale horizontally. Horizontal scaling means increasing or decreasing the number of instances (scale out / scale in).
- Auto Scaling: in AWS, a scaling plan is a set of instructions for scaling your resources up or down. Use a scaling plan to configure auto scaling for related or associated scalable resources in a matter of minutes. First, determine the consistency of usage patterns, as well as the frequency and intensity of traffic spikes. Then define your priorities. The types of autoscaling are:
· Demand/Reactive Scaling: with a reactive autoscaling method, resources are scaled up and down in response to surges in traffic. Demand-based scaling is highly responsive to fluctuating traffic and helps accommodate traffic spikes you cannot predict.
· Predictive Scaling: a predictive autoscaling method uses machine learning and artificial intelligence tools to evaluate traffic loads and anticipate when you'll need more or fewer resources. Combine AWS Auto Scaling with Amazon EC2 Auto Scaling to scale resources across many applications with predictive scaling. This includes three sub-options:
o Load Forecasting: this predictive method analyzes up to 14 days of history to forecast demand for the following two days. The data is updated every day and reflects one-hour intervals.
o Scheduled Scaling Actions: this option adds or removes resources according to a load forecast, keeping resource use stable and set at your pre-defined value.
o Maximum Capacity Behavior: designate a minimum and a maximum capacity value for every resource, and AWS Auto Scaling will keep each resource within that range. This gives AWS some flexibility within set parameters, and you can control whether applications can add more resources when demand is forecast to exceed maximum capacity.
· Scheduled Scaling: users choose the time range in which additional resources will be added. Scheduled autoscaling is a hybrid approach that operates in real time, predicts known changes in traffic loads, and responds to such changes at predetermined intervals. Scaling events can be set to occur automatically at a certain date and time, which is especially helpful when you can accurately forecast demand. What is different about this strategy is that following a schedule fixes the number of available resources at a given time in advance (see the Terraform sketch after this list).
· Manual Scaling: the number of instances is adjusted by hand. You can manually increase or decrease the number of instances through a CLI or the console. Manual scaling is a good choice when your users don't need automatic scaling.
· Dynamic Scaling: another type of auto scaling in which the number of EC2 instances changes automatically depending on the signals received. Dynamic scaling is a good choice when there is a high volume of unpredictable traffic.
- AWS Cluster Autoscaler: responsible for ensuring that the cluster has enough nodes to schedule your pods without wasting resources.
· Scale-up event: the Cluster Autoscaler watches for pods in a pending state due to insufficient resources, creates new worker nodes, and schedules the pods on those worker nodes.
· Scale-in event: the Cluster Autoscaler watches for nodes that are underutilized and terminates them, avoiding wasted resources.
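A hedged Terraform sketch of the scheduled and dynamic scaling options above, reusing the hypothetical aws_autoscaling_group.worker from the compute sketch earlier; schedules and thresholds are illustrative:

```hcl
# Scheduled scaling: add capacity every weekday morning.
resource "aws_autoscaling_schedule" "business_hours" {
  scheduled_action_name  = "business-hours"
  autoscaling_group_name = aws_autoscaling_group.worker.name
  recurrence             = "0 8 * * 1-5" # cron, UTC: 08:00 Monday-Friday
  min_size               = 4
  max_size               = 10
  desired_capacity       = 6
}

# Dynamic scaling: target-tracking on average CPU utilization.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target"
  autoscaling_group_name = aws_autoscaling_group.worker.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # keep average CPU around 60%
  }
}
```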
Scaling best practices
For each practice or service below, Description explains what it is and When to use explains when to apply it.
Async over Sync
Description: When designing applications, use asynchronous communication across assets when possible. This is the first step in scaling individual assets according to usage.
When to use: A distributed architecture with two or more assets communicating with each other to share data.
Caching
Description: Make use of caching services to avoid unnecessary scaling for repetitive requests.
When to use: When consumers may request the same data repeatedly within a defined time period.
Monitoring
Description: Every asset should follow observability best practices and by default incorporate monitoring of CPU, memory, I/O, and network usage.
When to use: For new platforms, incorporate the observability practices mentioned on the previous page.
Databases
Description: A key-value NoSQL database designed to run high-performance applications at any scale.
When to use: Creates a replica set, a group of mongod instances that hold the same data. The purpose of replication is to ensure high availability in case one of the servers goes down. For replica sets, the reference deployment launches multiple servers in different Availability Zones. It also provides failover.
Dynamo DB
Description: Amazon DynamoDB is a fast and flexible NoSQL database service with high availability and high durability. DynamoDB enables customers to offload to AWS the administrative burdens of operating and scaling distributed databases (hardware provisioning, setup and configuration, throughput capacity planning, replication, software patching, and cluster scaling).
When to use: Configure auto scaling in DynamoDB: set the minimum and maximum levels of read and write capacity in addition to the target utilization percentage. Use database replication across different AZs.
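A hedged Terraform sketch of DynamoDB read-capacity auto scaling via Application Auto Scaling (this applies to tables in provisioned billing mode; the table name and limits are illustrative):

```hcl
resource "aws_appautoscaling_target" "reads" {
  service_namespace  = "dynamodb"
  resource_id        = "table/orders" # hypothetical table name
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  min_capacity       = 5
  max_capacity       = 100
}

resource "aws_appautoscaling_policy" "reads" {
  name               = "orders-read-scaling"
  service_namespace  = aws_appautoscaling_target.reads.service_namespace
  resource_id        = aws_appautoscaling_target.reads.resource_id
  scalable_dimension = aws_appautoscaling_target.reads.scalable_dimension
  policy_type        = "TargetTrackingScaling"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value = 70 # target utilization percentage
  }
}
```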
Replication
Description: Replicate databases for recovery as well as to offload reads to multiple instances. When necessary, create a read replica of the transactional database. When there is a valid case, implement the CQRS pattern to make the best use of data and resources.
When to use: When read-only consumers need a copy of the transactional data in the same or a different format, with a high throughput/volume SLA.
Prefer Eventual Consistency (BASE) over ACID
Description: A BASE data store values availability (since that's important for scale) but doesn't offer guaranteed consistency of replicated data at write time.
When to use: In cases where data can be momentarily stale, prefer a BASE database over an ACID database. This helps in scaling databases horizontally.
Sharding / tenancy
Description: Split the application/databases (row-wise or table-wise) by function so they can scale individually.
When to use: When different functions or tenants have different SLAs and transactional needs. Your application may have compliance requirements to segregate data owned by different tenants; if the scaling requirements vary by tenant, apply this principle.
Canary deployments
Description: The ability to deploy applications with minimal downtime (for example, with Terragrunt/Terraform). Roll out new versions to only a small subset of the servers and redirect users to the new servers without bringing the entire application down. As users migrate to the new version, implement autoscaling for both the new and the old infrastructure.
When to use: When the application has a high availability SLO and near-zero downtime expectations. Canary deployments also help roll out features quickly and roll them back as needed, helping with stability.
Rollback
Description: Incorporate the ability to roll back a specific feature of an application from source control. As part of the rollback, any infrastructure template (Terraform, etc.) associated with the previous version of the application should also be executed to deploy the resources accordingly.
When to use: Tag every asset with release version information to be able to roll back as needed.
Load Balancer
Description: Distributes incoming connections across a group of servers/services. When designing applications, use a load balancer whenever possible; this is the first step toward high availability. The Network Load Balancer is API-compatible with the Application Load Balancer.
When to use: Each Network Load Balancer provides a single IP address per Availability Zone. The IP-per-AZ feature reduces latency, improves performance, and improves availability through isolation and automatic failover. Use an EC2 Auto Scaling group with Elastic Load Balancing across multiple Availability Zones for a load-balanced application. Use "IP mode" for load balancers.
Compute
Description: Compute services are also known as Infrastructure as a Service (IaaS). AWS compute services provide virtual server instances, storage, and APIs that let users migrate workloads to virtual machines.
When to use: The ability to increase or decrease the compute capacity of your application: instruct an Auto Scaling group to either launch or terminate Amazon EC2 instances. To scale AWS resources, including Amazon EC2 instances and Amazon DynamoDB tables and indexes, use AWS Auto Scaling, and set up application scaling for multiple resources across multiple services.
Amazon EKS
Description: Amazon Elastic Kubernetes Service (Amazon EKS) runs Kubernetes control plane and data plane instances across multiple Availability Zones to ensure high availability in an AWS Region. The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in the cluster.
When to use: When pods fail or are rescheduled onto other nodes. The Cluster Autoscaler is typically installed as a Deployment in an existing cluster. It uses leader election to ensure high availability, but scaling is done by only one replica at a time. Use the Kubernetes Cluster Autoscaler to scale an Amazon EKS cluster; it automatically adjusts the number of nodes when pods fail or are rescheduled onto other nodes. Use it for clusters with large numbers of worker nodes.
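One common way to install the Cluster Autoscaler is through its official Helm chart; a hedged Terraform sketch using the Helm provider's v2 set-block syntax (the cluster name and Region variables, plus the required IAM permissions, are assumed to exist):

```hcl
resource "helm_release" "cluster_autoscaler" {
  name       = "cluster-autoscaler"
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"

  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name # EKS cluster name placeholder
  }
  set {
    name  = "awsRegion"
    value = var.region
  }
}
```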
AWS Lambda
Description: AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you.
When to use: When a function receives a request while it is processing a previous request, Lambda launches another instance of the function to handle the increased load. Lambda automatically scales to handle 1,000 concurrent executions per Region, and this quota can be increased.
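Lambda concurrency can also be managed explicitly. A hedged Terraform sketch reserving provisioned concurrency for a published alias (the alias resource and the earlier aws_lambda_function.handler are assumptions, not from this document):

```hcl
resource "aws_lambda_provisioned_concurrency_config" "live" {
  function_name                     = aws_lambda_function.handler.function_name
  qualifier                         = aws_lambda_alias.live.name # published alias, assumed to exist
  provisioned_concurrent_executions = 50 # pre-warmed instances, illustrative value
}
```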
Async / Apache Kafka
Description: Apache Kafka is an open-source distributed event streaming platform. Running your Kafka deployment on Amazon EC2 provides a high-performance, scalable solution for streaming data. Deployment considerations include factors like the number of messages, message size, monitoring, failure handling, and other operational issues.
When to use: Creates a cluster that spans two different Regions; if the main Availability Zone is unavailable for some reason, the service automatically fails over to the other Availability Zones. Use an auto-scaling policy for scaling the Kafka cluster. Use consumer groups to scale data consumption from a Kafka topic. Start a new instance of the application to scale a Kafka Streams application.
MSK
Description: Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides the control plane operations, such as those for creating, updating, and deleting clusters.
When to use: To automatically expand the cluster's storage in response to increased usage, configure an Application Auto Scaling policy for Amazon MSK. To scale up, set up a three-AZ cluster.
Service Mesh (Gloo Mesh)
Description: Gloo Mesh is enterprise Istio with multi-cluster and multi-mesh management capabilities across multiple clusters and VMs. It controls, secures, and observes the traffic flow between your microservices, regardless of where they are running. An Istio service mesh is logically split into a data plane and a control plane. Service meshes can direct egress traffic from a pod to endpoints in the local Availability Zone. Istio is a service mesh built on the Envoy proxy that provides routing features, and Istio service mesh options are available for EKS. A service mesh sits on top of the Kubernetes infrastructure and is responsible for communication between services: it manages the network traffic between services, and Istio enables routing traffic to the pods or services within the same Availability Zone.
When to use: Multi-cluster dynamic routing and dynamic scaling to multiple nodes in different Availability Zones. Use dynamic scaling to multiple nodes in different Availability Zones. Use a two-node EKS cluster and set up Istio on top of it. Enable Istio and associate it with application services; this enables Istio to use Availability Zone information.
AWS API Gateway
Description: Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and protecting REST APIs at any scale. API Gateway acts as a proxy to the backend operations that you have configured.
When to use: Amazon API Gateway automatically scales (across multiple Availability Zones and different Regions) to handle the amount of traffic your API receives.
Mongo DB
Description: MongoDB is an open-source NoSQL database that provides support for JSON-styled, document-oriented storage systems.
When to use: Creates a MongoDB replica set. Use the self-service deployment of the MongoDB replica set cluster. Use horizontal scaling to overcome the limitations of single nodes and avoid single points of failure.
MongoDB Atlas
Description: MongoDB Atlas is a global cloud document database service for modern applications.
When to use: Deploying fully managed MongoDB helps ensure availability, scalability, and security compliance. MongoDB Atlas supports cluster auto-scaling, an intelligent and fully automated capacity management service.