Amazon Web Services (AWS) is perhaps the largest cloud provider on the web, though it faces fierce competition from Google Cloud Platform and Microsoft Azure. This page collects notes on AWS and pointers to related resources.
Tooling
- Official AWS CLI.
- Official AWS Serverless Application Model Framework.
- Visual Studio Code Official AWS Toolkit Extension.
Fundamentals: Core Concepts Course
- Amazon has a free fundamentals course one can use to learn about AWS. I’d recommend it. This section provides some brief highlights from the materials.
Introduction
- The course consists of five modules, each of which has the same sections: Intro, Mental Model, Concepts, Conclusion, and Further Reading.
- AWS has five core (pillar) mental models which are covered in the course:
- 1. Operational Excellence
- 2. Security
- 3. Reliability
- 4. Performance Efficiency
- 5. Cost Optimization
Operational Excellence
- “The operational excellence pillar focuses on how you can continuously improve your ability to run systems, create better procedures, and gain insights.”
- This often involves significant automation.
- Infrastructure as Code (IaC) – “the process of managing your infrastructure through machine-readable configuration files.”
- “Instead of manually provisioning services, you create templates that describe the resources you want.”
- Implemented on AWS through CloudFormation (consumes JSON or YAML) and Cloud Development Kit (CDK) which allows the use of one’s preferred languages (e.g. JS, Python, Java).
- Observability – “the process of measuring the internal state of your system…usually done to optimize it to some desired end state.”
- Steps to achieve observability:
- 1. Collection
- 2. Analytics
- 3. Action
- Collection – “the process of aggregating all metrics necessary when assessing the state of your system.”
- Types of Metrics
- Infrastructure
- Emitted and collected by AWS CloudWatch.
- Structured logs emitted by AWS that can be collected by AWS CloudWatch Logs.
- Application
- Created within one’s software, can be collected by AWS CloudWatch Custom Metrics.
- Regular software logs that can be managed using AWS CloudWatch Logs.
- Account
- Automatically logged by AWS, managed via AWS CloudTrail.
- Analytics – The analysis of gathered data.
- AWS offers a number of ways to analyze this data (some overlapping):
- CloudWatch Logs Insights – “lets you interactively search and analyze your CloudWatch data”.
- Athena – “a serverless query service”, can be used with logs stored on S3.
- RDS – managed relational database service, can be used to analyze structured data.
- Redshift – managed petabyte-scale data warehouse service, for large amounts of structured data.
- Elasticsearch Service – managed Elasticsearch, can be used to analyze log-based data.
- Action – Take action based on what you learn from your analysis of the data you have collected.
- Monitoring & Alarming
- CloudWatch Alarms – “notify you when a system has breached the safety threshold for a particular metric….can set off either a manual or automated mitigation.”
- Dashboard
- CloudWatch Dashboards – “track and improve service performance over time”.
- Data-driven Decisions – “track performance and business KPIs to make data-driven product decisions”.
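The Infrastructure as Code and CloudWatch Alarms notes above can be tied together in a small sketch. It builds a CloudFormation template in Python and prints it as JSON, one of the two formats CloudFormation consumes. The logical ID `HighCpuAlarm`, the instance ID, and the 80% threshold are illustrative assumptions, not values from the course.

```python
import json


def build_template(instance_id: str, cpu_threshold: float = 80.0) -> dict:
    """Return a minimal CloudFormation template declaring one CloudWatch alarm."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: alarm when average EC2 CPU stays above a threshold",
        "Resources": {
            # "HighCpuAlarm" is a hypothetical logical ID chosen for this example.
            "HighCpuAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    "Statistic": "Average",
                    "Period": 300,           # seconds per evaluation window
                    "EvaluationPeriods": 2,  # consecutive windows that must breach
                    "Threshold": cpu_threshold,
                    "ComparisonOperator": "GreaterThanThreshold",
                },
            }
        },
    }


if __name__ == "__main__":
    # The instance ID below is a placeholder, not a real resource.
    print(json.dumps(build_template("i-0123456789abcdef0"), indent=2))
```

The template only declares the desired resource. Creating it would still require deploying the JSON through CloudFormation, which this sketch does not attempt.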
Security
- Zero Trust Model – “all application components and services are considered discrete and potentially malicious entities.”
- To reiterate, “we need to apply security measures at all levels of our system.”
- Key Concepts:
- 1. Identity and Access Management (IAM)
- 2. Network Security
- 3. Data Encryption
- Identity and Access Management (IAM) – “the service responsible for tracking identities and access in [a] system.”
- AWS IAM is Amazon’s service for managing identity and authentication.
- “Access is managed using IAM policies which enforce access boundaries for agents within AWS.”
- Three Fundamental Components to an IAM Policy:
- 1. “the PRINCIPAL(s) specifies WHO permissions are given to”
- 2. “the ACTION(s) specifies WHAT is being performed”
- 3. “the RESOURCE(s) specifies WHICH properties are to be accessed”
- Principle of Least Privilege – “every agent should only have the minimal permissions necessary to accomplish their function.”
- Identity-Based Policies – “associated to a principal”.
- Resource-Based Policies – “associated to a resource”.
- “Whether a principal has the permission to perform an action for a particular resource depends on whether the principal’s identity-based policy allows them to do so and whether the resource’s resource-based policy (if it exists) does not forbid them to do so.”
- “Note that this is a major simplification of the IAM permission model. There are many additional policy types that affect whether access can be granted. These can include permission boundaries, organization service control policies, access control lists, and session policies.”
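The simplified rule quoted above (the identity-based policy must allow the action, and the resource-based policy, if any, must not forbid it) can be sketched as a tiny evaluator. This is a toy model under stated assumptions, not AWS's actual evaluation logic: the `Statement` class, the `is_allowed` function, and the use of shell-style wildcards are all inventions for illustration, and the additional policy types the course mentions are ignored.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str     # "Allow" or "Deny"
    principal: str  # WHO (wildcards permitted)
    action: str     # WHAT (wildcards permitted)
    resource: str   # WHICH (wildcards permitted)


def _matches(stmt: Statement, principal: str, action: str, resource: str) -> bool:
    """True when the statement's patterns cover this request."""
    return (
        fnmatchcase(principal, stmt.principal)
        and fnmatchcase(action, stmt.action)
        and fnmatchcase(resource, stmt.resource)
    )


def is_allowed(principal, action, resource, identity_policy, resource_policy=()):
    """Simplified rule: identity policy allows AND resource policy does not deny."""
    identity_allows = any(
        s.effect == "Allow" and _matches(s, principal, action, resource)
        for s in identity_policy
    )
    resource_denies = any(
        s.effect == "Deny" and _matches(s, principal, action, resource)
        for s in resource_policy
    )
    return identity_allows and not resource_denies


if __name__ == "__main__":
    # Hypothetical principal and bucket names.
    identity = [Statement("Allow", "alice", "s3:Get*", "arn:aws:s3:::notes-bucket/*")]
    bucket = [Statement("Deny", "*", "s3:*", "arn:aws:s3:::notes-bucket/secret/*")]
    print(is_allowed("alice", "s3:GetObject",
                     "arn:aws:s3:::notes-bucket/readme.txt", identity, bucket))
```

Granting only `s3:Get*` on one bucket, rather than `s3:*` on everything, is the Principle of Least Privilege in miniature.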
Network Security
- Network Level Security
- “The fundamental network-level primitive in AWS is the Amazon Virtual Private Cloud (VPC). This is a logical network which you define and can provision resources into.”
- Some of its components include subnets, route tables, and an internet gateway.
- “To safeguard your traffic in your VPC, you can divide your resources into public-facing resources and internal resources.”
- Application Load Balancer (ALB) – Can be used as “a proxy service” to “reduce the attack surface” by “handl[ing] all internet-facing traffic.”
- AWS Web Application Firewall (WAF) – Used “to further restrict traffic into your network.”
- Resource Level Security
- Some “Individual AWS resources also have network security controls…The most common control is…a security group.”
- “Security groups are virtual firewalls you can use to enforce traffic flowing into and out of your resource.”
- “Use security groups to only allow traffic from specific ports and trusted sources to your instance.”
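The advice to allow traffic only from specific ports and trusted sources can be modeled in a few lines. This is a minimal sketch: `IngressRule` and `is_ingress_allowed` are hypothetical names, each rule covers a single port and one CIDR block, and anything not matched by a rule is rejected. Real security groups also support protocols, port ranges, and references to other security groups, none of which is modeled here.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class IngressRule:
    port: int  # single destination port this rule opens
    cidr: str  # trusted source range, e.g. "203.0.113.0/24"


def is_ingress_allowed(rules, source_ip: str, port: int) -> bool:
    """Allow traffic only if some rule matches both the port and the source."""
    addr = ip_address(source_ip)
    return any(
        rule.port == port and addr in ip_network(rule.cidr)
        for rule in rules
    )


if __name__ == "__main__":
    # Assumed policy: HTTPS from anywhere, SSH only from one office range.
    rules = [IngressRule(443, "0.0.0.0/0"), IngressRule(22, "203.0.113.0/24")]
    print(is_ingress_allowed(rules, "198.51.100.7", 443))  # HTTPS from any source
    print(is_ingress_allowed(rules, "198.51.100.7", 22))   # SSH from untrusted source
```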
Data Encryption
- “Data encryption is the process of encoding information in such a way that it is unintelligible to any third party that does not possess the key necessary to decipher the data.”
- Zero Trust Model for Data – “means encrypting our data everywhere, both in transit and at rest.”
- Encryption in Transit
- “All storage and database services within AWS provide HTTPS endpoints that support the encryption of data in transit.”
- “AWS also offers network services that can help enforce encryption in transit for your own services.”
- Example: AWS Application Load Balancer (ALB).
- Encryption at Rest – “involves encrypting the data within systems.”
- “All AWS storage and database services support encryption at rest. Most of these services have encryption turned on by default.”
- AWS Key Management Service (KMS) – “This is a central key management service that gives you the ability to create Customer Managed Keys (CMK) to encrypt your data.”
- Customer Managed Keys provide the following benefits in addition to encryption:
- Use a custom key store
- Create an audit trail for encrypted resources
- Using AWS CloudTrail
- Enforcement of automatic key rotation
Reliability
- Building “services that are resilient to both service and infrastructure disruptions.”
- “When thinking about reliability in the cloud, it is useful to think in terms of blast radius…you want to minimize the blast radius of any individual component.”
- Failure is not an if but a when – it will happen!
- Limit blast radius by:
- 1. Fault Isolation
- 2. Limits
- Fault Isolation – “using redundant independent components separated through fault isolation zones.”
- Fault Isolation Zones on AWS:
- 1. Resource and Request
- 2. Availability Zone
- 3. Region
- Resource and Request
- “AWS services partition all resources and requests on a given dimension like the resource ID. These partitions are referred to as cells. Cells are design[ed] to be independent and contain failures inside a single cell.”
- AWS uses multiple techniques to make this isolation happen; it “happens transparently every time you make a request or create a resource…”
- Availability Zone
- “An AWS availability zone (AZ) is a completely independent [geographically distant] facility with dedicated power, service, and network capabilities.”
- “Fault isolation is achieved at the AZ level by deploying redundant instances of your service through multiple AZs.”
- Region
- “Each region is a completely autonomous data center, comprised of two or more AZs.”
- Having multi-region resources increases availability but also complexity.
- AWS has some services that reduce this complexity, including Route 53, DynamoDB Global Tables, and S3 Cross-Region Replication.
- Limits – “constraints that can be applied to protect your services from excessive load.”
- This load may be from an external source (DDoS) or internal (misconfiguration).
- Service Quota – “service-specific limits on a per-account per-region basis.”
- Two Types of Limits
- Soft Limits – can be increased by request to AWS
- Hard Limits – cannot be increased
- Use AWS Service Quotas to monitor one’s limits and to request increases.
- Tooling: CloudWatch, manual, scripts, AWS Trusted Advisor, and awslimitchecker.
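The "scripts" option for monitoring limits can be as simple as comparing usage against each quota. The sketch below assumes the usage and limit numbers have already been gathered some other way (for example from AWS Service Quotas or Trusted Advisor). The function name, the quota names, and the 80% warning ratio are all hypothetical.

```python
def quotas_near_limit(usage: dict, limits: dict, warn_ratio: float = 0.8) -> list:
    """Return the sorted names of quotas whose usage is at or above warn_ratio."""
    flagged = []
    for name, limit in limits.items():
        used = usage.get(name, 0)  # treat missing usage data as zero
        if limit > 0 and used / limit >= warn_ratio:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Illustrative numbers only; real quotas vary per account and per region.
    limits = {"ec2-running-instances": 20, "vpcs-per-region": 5, "elastic-ips": 5}
    usage = {"ec2-running-instances": 17, "vpcs-per-region": 2, "elastic-ips": 5}
    for name in quotas_near_limit(usage, limits):
        print(f"{name}: consider requesting an increase if this is a soft limit")
```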
Performance Efficiency
- “The performance efficiency pillar focus[es] on how you can run services efficiently and scalably in the cloud.”
- “…it is useful to think of your services as cattle, not pets.”
- Traditional servers were treated like pets – individually named and configured.
- “The cloud way of thinking about servers is as cattle…No single server should be essential to the operation of the service.”
- Selection – AWS provides 175+ services to choose from, so one can match the service to the need.
- DM: Humorously, this is also one of the more frustrating aspects of AWS – the multiplicity of options.
- AWS Main Service Categories:
- Compute – “service that will process your data”
- Storage – “static storage of data”
- Database – “organized storage of data”
- Network – “how your data moves around”
- Type of Service
- Compute
- Virtual Machine (VM)
- Container
- Serverless
- Storage
- Block Storage – EBS
- File System – EFS
- Object Store – S3
- Archival Storage – S3 Glacier
- Database
- Relational (ACID)
- Non-Relational
- Data Warehouse
- Data Indexing/Searching
- Degree of Management
- “The primary difference between various AWS services of the same type can be found in their degree of management.”
- Control and management generally trade off against each other: the more control one has over a service, the less of it is managed, and the more managed a service is, the less control one has.
- Compute
- High Control / Low Managed – EC2
- Medium Control / Medium Managed – Elastic Beanstalk
- Low Control / High Managed – Lightsail
- Storage
- The decision is generally based on type of service, there aren’t as many options within a single type.
- Database
- Higher Control / Lower Managed – RDS
- Lower Control / Higher Managed – Aurora
- “At the end of the day, the choice of a specific service depends largely on your familiarity with the underlying technology and your preference for a more or less managed experience.”
- Configuration
- “The configuration depends on the specific performance characteristics you wish to achieve which differs for each service category.”
- Compute:
- VM – CPU and memory determined by size of instance and instance family.
- Container – CPU and memory can be set individually.
- Serverless – Can only set memory, “CPU increases linearly to the amount of memory available”.
- Additional Potential Constraints: Network Capacity, Instance Storage Resource Availability.
- Storage:
- Block Storage
- Latency affected by volume type (e.g. SSD vs HDD)
- “Throughput is proportional to volume size for most volume types”
- “IOPS capacity is proportional to volume size for most volume types”
- File System Service
- “Latency and IOPS are affected by your choice of performance modes”
- “Throughput is affected by your choice of using provisioned throughput”
- Object Store
- Latency affected by geographic distance to bucket endpoint
- “Throughput is affected by the use of throughput optimized APIs such as multipart upload”
- Archival Store
- Latency affected by geographic distance to bucket endpoint, retrieval method
- “Throughput is affected by the use of throughput optimized APIs such as multipart upload”
- “IOPS is not configurable”
- Database:
- Relational – “determined by your choice of EC2 instance”
- Non-Relational (e.g. DynamoDB) – “determined by configuration options”
- Data Warehouse (e.g. Redshift) – “determined by your choice of underlying EC2 instance”
- Indexing Solution (e.g. Elasticsearch Service) – “determined by your choice of EC2 instance”
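As one concrete instance of the block storage note that IOPS capacity is proportional to volume size, the sketch below models baseline IOPS for an EBS General Purpose SSD (gp2) volume. The figures used (3 IOPS per GiB, a floor of 100, a ceiling of 16,000) are recalled from the EBS documentation rather than taken from the course, so treat them as assumptions and check the current volume-type tables before relying on them.

```python
# Assumed gp2 figures; verify against the current EBS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS grows linearly with size, clamped to a floor and a ceiling."""
    if size_gib <= 0:
        raise ValueError("volume size must be positive")
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


if __name__ == "__main__":
    for size in (10, 1_000, 10_000):
        print(f"{size} GiB -> {gp2_baseline_iops(size)} baseline IOPS")
```

Under these assumptions, growing a small volume is one way to buy more IOPS, up to the point where the ceiling makes further growth pointless for performance.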
Scaling
- Types of Scaling
- Vertical
- Horizontal
- Vertical Scaling – Upgrade smaller instance to larger instance.
- Advantages:
- Simplicity
- No Clustering
- Disadvantages:
- Lower Upper Limit
- Single Point of Failure
- Horizontal Scaling – Increase number of instances.
- Advantages:
- Higher Upper Limit
- No Single Point of Failure
- Disadvantages:
- Complexity
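Horizontal scaling comes down to deciding how many instances to run. A minimal sketch of one common heuristic follows: scale the fleet in proportion to how far current utilization is from a target. This is an illustrative rule of thumb, not the algorithm AWS Auto Scaling uses, and the function name and parameters are hypothetical.

```python
def desired_instance_count(
    current_instances: int,
    current_cpu_pct: int,
    target_cpu_pct: int,
    max_instances: int,
) -> int:
    """Proportional rule: keep total work constant while moving CPU to target."""
    if current_instances < 1 or max_instances < 1:
        raise ValueError("instance counts must be at least 1")
    if not 0 < target_cpu_pct <= 100 or current_cpu_pct < 0:
        raise ValueError("utilization percentages out of range")
    # Integer ceiling division avoids floating-point rounding surprises.
    total_load = current_instances * current_cpu_pct
    needed = -(-total_load // target_cpu_pct)
    # Never drop below one instance; cap at the configured upper limit.
    return max(1, min(max_instances, needed))


if __name__ == "__main__":
    # 4 instances at 90% CPU, aiming for 60%: the rule suggests growing the fleet.
    print(desired_instance_count(4, 90, 60, max_instances=10))
```

The `max_instances` cap is where the "higher upper limit" of horizontal scaling shows up: the ceiling is a configuration choice rather than the size of the largest available machine.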
Cost Optimization
- “achieve business outcomes while minimizing costs.”
- “think of cloud spend in terms of OpEx instead of CapEx.”
- OpEx – “ongoing pay-as-you-go model”
- CapEx – “one-time purchase model”
- Changes to Cost Optimization Process:
- 1. Pay For Use
- 2. Cost Optimization Lifecycle
- Pay For Use – “you only pay for the capacity that you use.”
- Common Ways to Optimize Pay For Use Cloud Spend:
- 1. Right Sizing
- 2. Serverless
- 3. Reservations
- 4. Spot Instances
- Right Sizing – use what you need, not more.
- Tooling: AWS Compute Optimizer (for EC2 sizing)
- Serverless – only charged when service is actually running.
- Reservations – commit to pay for x capacity to receive y discount.
- Spot Instances – When unused EC2 capacity is available, utilize it at a sharp discount.
- “The tradeoff when using a spot instance is that EC2 can reclaim the capacity at any moment…two-minute termination notice before this happens.”
- Cost-Optimization Lifecycle – “the continuous process of improving cloud spend over time.”
- Three-Step Workflow:
- 1. Review.
- 2. Track.
- 3. Optimize.
- Review
- Analyze where your costs are coming from.
- Tooling: AWS Cost Explorer, AWS Cost & Usage Report.
- Track
- Group costs along important “dimensions”.
- Common Tag Categories (groupings):
- Application ID
- Automation Opt-In/Opt-Out – “whether a resource should be included in an automated activity such as starting, stopping, or resizing instances.”
- Cost Center/Business Unit – “typically for cost allocation and tracking.”
- Owner – “Identifies who is responsible for the resource…typically the technical owner.”
- Tooling: Cost Allocation Tags, AWS Budgets
- Optimize
- Use Pay For Use techniques to optimize spending.
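The Track step, grouping costs along tag dimensions, can be sketched with plain dictionaries. The line-item shape used here (`resource`, `cost`, `tags`) is an assumption made for illustration and is not the AWS Cost & Usage Report schema. The tag key `CostCenter` mirrors the Cost Center/Business Unit category above.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key: str, untagged: str = "(untagged)") -> dict:
    """Sum costs per value of tag_key; items lacking the tag share one bucket."""
    totals = defaultdict(float)
    for item in line_items:
        tag_value = item.get("tags", {}).get(tag_key, untagged)
        totals[tag_value] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical monthly line items in dollars.
    items = [
        {"resource": "web-1", "cost": 12.5, "tags": {"CostCenter": "marketing"}},
        {"resource": "web-2", "cost": 7.5, "tags": {"CostCenter": "marketing"}},
        {"resource": "db-1", "cost": 30.0, "tags": {"CostCenter": "platform"}},
        {"resource": "tmp-1", "cost": 3.25, "tags": {}},
    ]
    for center, total in sorted(cost_by_tag(items, "CostCenter").items()):
        print(f"{center}: ${total:.2f}")
```

Keeping an explicit untagged bucket makes missing tags visible, which is usually the first thing a cost review needs to fix.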
AWS GovCloud
- Workloads that “must meet FedRAMP, DoD SRG, ITAR, CJIS or other strict compliance requirements, or they contain data classified as Controlled Unclassified Information (CUI).”
- “AWS GovCloud (US) Regions are isolated AWS regions operated by employees who are U.S. citizens on U.S. soil.”