AWS Database: RDS, Aurora, ElastiCache, DynamoDB, S3, Athena, Redshift, AWS Glue, AWS Neptune, ElasticSearch

Databases

Choosing the Right Database

  • Read-heavy, write-heavy, or balanced workload? Throughput needs? Will it change, and does it need to scale or fluctuate during the day?
  • How much data to store, and for how long? Will it grow? Average object size? How is the data accessed?
  • Data durability? Source of truth for the data?
  • Latency requirements? Concurrent users?
  • Data model? How will you query the data? Joins? Structured? Semi-Structured?
  • Strong schema? More flexibility? Reporting? Search? RDBMS? NoSQL?
  • License costs? Switch to Cloud Native DB such as Aurora?

    Database Types

  • RDBMS (= SQL / OLTP): RDS, Aurora – great for joins
  • NoSQL database: DynamoDB (~JSON), ElastiCache (key / value pairs), Neptune (graphs) – no joins, no SQL
  • Object Store: S3 (for big objects) / Glacier (for backups / archives)
  • Data Warehouse (= SQL Analytics / BI): Redshift (OLAP), Athena
  • Search: ElasticSearch (JSON) – free text, unstructured searches
  • Graphs: Neptune – displays relationships between data

    RDS Overview

  • Managed PostgreSQL / MySQL / Oracle / SQL Server
  • Must provision an EC2 instance & EBS volume type and size
  • Support for Read Replicas and Multi AZ
  • Security through IAM, Security Groups, KMS, SSL in transit
  • Backup / Snapshot / Point in time restore feature
  • Managed and Scheduled maintenance
  • Monitoring through CloudWatch
  • Use case: store relational datasets (RDBMS / OLTP), perform SQL queries, transactional inserts / updates / deletes are available
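
To make "transactional inserts / updates / deletes" concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for an RDS engine, and the `accounts` table and `transfer` helper are hypothetical names used only for illustration.

```python
import sqlite3

def make_db():
    """Create an in-memory database; SQLite stands in for an RDS engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Credit dst and debit src atomically; returns False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft

def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())

conn = make_db()
print(transfer(conn, 1, 2, 30), balances(conn))
print(transfer(conn, 1, 2, 500), balances(conn))
```

The second transfer would overdraw account 1, so the credit already applied to account 2 is rolled back with it. That all-or-nothing behavior is what the OLTP use case relies on.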

    RDS for Solutions Architect

  • Operations: small downtime when failover or maintenance happens; scaling read replicas / EC2 instances / restoring EBS implies manual intervention and application changes
  • Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, authorizing users in DB, using SSL
  • Reliability: Multi AZ feature, failover in case of failures
  • Performance: depends on EC2 instance type, EBS volume type, ability to add Read Replicas. Storage auto-scaling & manual scaling of instances
  • Cost: Pay per hour based on provisioned EC2 and EBS

    Aurora Overview

  • Compatible API for PostgreSQL / MySQL
  • Data is held in 6 replicas, across 3 AZ
  • Auto healing capability
  • Multi AZ, Auto Scaling Read Replicas
  • Read Replicas can be Global
  • Aurora database can be global for DR or latency purposes
  • Auto scaling of storage from 10 GB to 128 TB
  • Define EC2 instance type for Aurora instances
  • Same security / monitoring / maintenance features as RDS
  • Aurora Serverless - for unpredictable / intermittent workloads
  • Aurora Multi-Master - for continuous writes failover
  • Use case: same as RDS, but with less maintenance / more flexibility / more performance
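
The "6 replicas across 3 AZ" layout can be reasoned about with a small quorum sketch. It assumes Aurora's documented quorum model, which these notes do not state: 4 of 6 copies are needed for writes and 3 of 6 for reads. The function names are hypothetical.

```python
COPIES_PER_AZ = 2   # 6 copies spread over 3 AZs
AZ_COUNT = 3
WRITE_QUORUM = 4    # assumption: copies required to accept a write
READ_QUORUM = 3     # assumption: copies required to serve a read

def healthy_copies(failed_azs=0, extra_failed_copies=0):
    """Number of storage copies still available after the given failures."""
    total = COPIES_PER_AZ * AZ_COUNT
    return max(0, total - failed_azs * COPIES_PER_AZ - extra_failed_copies)

def can_write(healthy):
    return healthy >= WRITE_QUORUM

def can_read(healthy):
    return healthy >= READ_QUORUM

# Losing a whole AZ leaves 4 copies; losing one more copy leaves 3.
for label, healthy in [
    ("all healthy", healthy_copies()),
    ("one AZ down", healthy_copies(failed_azs=1)),
    ("one AZ plus one copy down", healthy_copies(failed_azs=1, extra_failed_copies=1)),
]:
    print(label, healthy, can_write(healthy), can_read(healthy))
```

Under these assumptions, a full AZ outage still leaves enough copies for writes, and one additional lost copy still leaves enough for reads.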

    Aurora for Solutions Architect

  • Operations: less operations, auto scaling storage
  • Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, authorizing users in DB, using SSL
  • Reliability: Multi AZ, highly available, possibly more than RDS, Aurora Serverless option, Aurora Multi-Master option
  • Performance: 5x performance (according to AWS) due to architectural optimizations. Up to 15 read replicas (only 5 for RDS)
  • Cost: Pay per hour based on EC2 and storage usage. Possibly lower costs compared to Enterprise grade databases such as Oracle

    ElastiCache Overview

  • Managed Redis / Memcached (similar offering as RDS, but for caches)
  • In-memory data store, sub-millisecond latency
  • Must provision an EC2 instance type
  • Support for Clustering (Redis) and Multi AZ, Read Replicas (sharding)
  • Security through IAM, Security Groups, KMS, Redis Auth
  • Backup / Snapshot / Point in time restore feature
  • Managed and Scheduled maintenance
  • Monitoring through CloudWatch
  • Use case: Key/Value store, frequent reads, fewer writes, cache results for DB queries, store session data for websites, cannot use SQL
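
The "cache results for DB queries" use case is usually implemented as a cache-aside pattern. Below is a minimal sketch, assuming a plain dictionary as a stand-in for Redis / Memcached. The `CacheAside` class, its `loader` callback and the TTL value are hypothetical.

```python
import time

class CacheAside:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader    # function that runs the (slow) DB query
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}         # key -> (expires_at, value); stands in for the cache

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        value = self._loader(key)                # cache miss or expired entry
        self._store[key] = (now + self._ttl, value)
        return value

def slow_query(user_id):
    """Hypothetical database lookup."""
    print("querying database for", user_id)
    return {"user_id": user_id, "name": "user-" + str(user_id)}

cache = CacheAside(slow_query, ttl_seconds=30)
cache.get(42)   # goes to the database
cache.get(42)   # served from the cache until the TTL expires
```

The TTL keeps stale entries from living forever, which matters because the cache does not see writes made directly to the database.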

    ElastiCache for Solutions Architect

  • Operations: Same as RDS
  • Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, users (Redis Auth), using SSL
  • Reliability: Clustering, Multi AZ
  • Performance: Sub-millisecond performance, in memory, read replicas for sharding, very popular cache option
  • Cost: Pay per hour based on EC2 and storage usage

    DynamoDB Overview

  • AWS proprietary technology, managed NoSQL database
  • Serverless, provisioned capacity, auto scaling, on demand capacity (Nov 2018)
  • Can replace ElastiCache as a key/value store (storing session data for example)
  • Highly Available, Multi AZ by default, Reads and Writes are decoupled, DAX for read cache
  • Reads can be eventually consistent or strongly consistent
  • Security, authentication and authorization is done through IAM
  • DynamoDB streams to integrate with AWS Lambda
  • Backup / Restore feature, GlobalTable feature
  • Monitoring through CloudWatch
  • Can only query on primary key, sort key or indexes
  • Use case: Serverless applications development (small documents, 100s of KB), distributed serverless cache, doesn’t have SQL query language available, has transactions capability from Nov 2018
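
Because queries are limited to the primary key, sort key or indexes, a DynamoDB query is expressed as a key condition rather than free-form SQL. The sketch below only builds the request parameters in the shape of the low-level Query API; it does not call AWS. The table name `Sessions` and the attribute names are hypothetical.

```python
def build_query(table, pk_name, pk_value, sk_name=None, sk_prefix=None,
                consistent_read=False):
    """Build Query parameters for a partition key and optional sort key prefix."""
    names = {"#pk": pk_name}
    values = {":pk": {"S": pk_value}}      # "S" marks a string attribute
    condition = "#pk = :pk"
    if sk_name is not None and sk_prefix is not None:
        names["#sk"] = sk_name
        values[":sk"] = {"S": sk_prefix}
        condition += " AND begins_with(#sk, :sk)"
    return {
        "TableName": table,
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
        # False = eventually consistent read, True = strongly consistent read
        "ConsistentRead": consistent_read,
    }

params = build_query("Sessions", "user_id", "u-1",
                     sk_name="created_at", sk_prefix="2024-",
                     consistent_read=True)
print(params["KeyConditionExpression"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed to a DynamoDB client as `client.query(**params)`. Filtering on an attribute that is not a key or an index would need a scan or a secondary index instead.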

    DynamoDB for Solutions Architect

  • Operations: no operations needed, auto scaling capability, serverless
  • Security: full security through IAM policies
  • Reliability: Multi AZ, Backups
  • Performance: single-digit millisecond performance, DAX for caching reads, performance doesn’t degrade if your application scales
  • Cost: Pay per provisioned capacity and storage usage (no need to guess in advance any capacity - can use auto scaling)
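
For the provisioned capacity mode, cost follows from read and write capacity units. The sketch below assumes the standard unit definitions, which are not spelled out in these notes: one read unit covers one strongly consistent read per second of up to 4 KB (an eventually consistent read costs half), and one write unit covers one write per second of up to 1 KB.

```python
import math

def read_capacity_units(reads_per_second, item_size_kb, strongly_consistent=True):
    """Estimate RCUs; item size is rounded up to the next 4 KB block."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * units_per_read
    return total if strongly_consistent else math.ceil(total / 2)

def write_capacity_units(writes_per_second, item_size_kb):
    """Estimate WCUs; item size is rounded up to the next 1 KB block."""
    return writes_per_second * math.ceil(item_size_kb)

print(read_capacity_units(10, 6))                             # strongly consistent
print(read_capacity_units(10, 6, strongly_consistent=False))  # eventually consistent
print(write_capacity_units(5, 2.5))
```

Auto scaling adjusts the provisioned numbers over time, so an estimate like this is a starting point rather than a fixed commitment.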

    S3 Overview

  • S3 is a… key / value store for objects
  • Great for big objects, not so great for small objects
  • Serverless, scales infinitely, max object size is 5TB
  • Strong consistency
  • Tiers: S3 Standard, S3 IA, S3 One Zone IA, Glacier for backups
  • Features: Versioning, Encryption, Cross Region Replication, etc…
  • Security: IAM, Bucket Policies, ACL
  • Encryption: SSE-S3, SSE-KMS, SSE-C, client side encryption, SSL in transit
  • Use case: static files, key value store for big files, website hosting

    S3 for Solutions Architect

  • Operations: no operations needed
  • Security: IAM, Bucket Policies, ACL, Encryption (Server/Client), SSL
  • Reliability: 99.999999999% durability (11 nines) / 99.99% availability, Multi AZ, CRR
  • Performance: scales to thousands of read / writes per second, transfer acceleration / multi-part for big files
  • Cost: pay per storage usage, network cost, requests number
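
For the "multi-part for big files" point, the part size has to respect S3's multipart limits. This sketch assumes the documented limits of at most 10,000 parts and a part size between 5 MiB and 5 GiB. The `plan_parts` helper and its 8 MiB default are hypothetical.

```python
import math

MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB          # assumed minimum part size (except the last part)
MAX_PART_SIZE = 5 * 1024 * MIB   # assumed maximum part size (5 GiB)
MAX_PARTS = 10_000               # assumed maximum number of parts per upload

def plan_parts(object_size, preferred_part_size=8 * MIB):
    """Return (part_size, part_count) in bytes for a multipart upload."""
    if 0 >= object_size:
        raise ValueError("object_size must be positive")
    # Grow the part size when the preferred size would need too many parts.
    part_size = max(preferred_part_size, MIN_PART_SIZE,
                    math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("object too large for a single multipart upload")
    return part_size, math.ceil(object_size / part_size)

print(plan_parts(100 * MIB))      # small file: preferred part size is kept
print(plan_parts(1024 ** 4))      # 1 TiB: part size grows to stay within 10,000 parts
```

Parts can be uploaded in parallel and retried individually, which is why multipart helps with large objects.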

    Athena Overview

  • Fully Serverless database with SQL capabilities
  • Used to query data in S3
  • Pay per query
  • Output results back to S3
  • Secured through IAM
  • Use case: one-time SQL queries, serverless queries on S3, log analytics

    Athena for Solutions Architect

  • Operations: no operations needed, serverless
  • Security: IAM + S3 security
  • Reliability: managed service, uses Presto engine, highly available
  • Performance: queries scale based on data size
  • Cost: pay per query / per TB of data scanned, serverless
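
Since Athena bills by data scanned, a rough estimate is just bytes scanned times a per-TB price. The price below is a placeholder assumption, as is treating 1 TB as 1024^4 bytes; check current pricing before relying on the numbers.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte

def estimate_query_cost(bytes_scanned, price_per_tb=5.0):
    """Estimated cost of one query; price_per_tb is a placeholder value."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

full_scan = estimate_query_cost(2 * BYTES_PER_TB)
# Scanning only a tenth of the data (for example via partitioning or a
# columnar format) scales the estimate down proportionally.
pruned_scan = estimate_query_cost(2 * BYTES_PER_TB / 10)
print(full_scan, pruned_scan)
```

The practical takeaway is that reducing the bytes scanned is the main cost lever.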

    Redshift Overview

  • Redshift is based on PostgreSQL, but it’s not used for OLTP
  • It’s OLAP - online analytical processing (analytics and data warehousing)
  • 10x better performance than other data warehouses, scale to PBs of data
  • Columnar storage of data (instead of row based)
  • Massively Parallel Query Execution (MPP)
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as AWS Quicksight or Tableau integrate with it
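
To illustrate columnar versus row-based storage, the sketch below pivots a few rows into per-column lists. The `orders` data and the `to_columns` helper are hypothetical, and a real warehouse also compresses and distributes each column.

```python
# Row-based layout: each record keeps all of its fields together.
orders = [
    {"order_id": 1, "region": "eu", "amount": 120},
    {"order_id": 2, "region": "us", "amount": 80},
    {"order_id": 3, "region": "eu", "amount": 200},
]

def to_columns(rows):
    """Pivot rows into a columnar layout: one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(orders)
# An aggregate over one column only has to read that column's values.
print(sum(columns["amount"]))
```

Analytical queries typically touch a few columns across many rows, which is the access pattern this layout favors.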

    Redshift Continued…

  • Data is loaded from S3, DynamoDB, DMS, other DBs…
  • From 1 node to 128 nodes, up to 28TB of space per node
  • Leader node: for query planning, results aggregation
  • Compute node: for performing the queries, send results to leader
  • Redshift Spectrum: perform queries directly against S3 (no need to load)
  • Backup & Restore, Security VPC / IAM / KMS, Monitoring
  • Redshift Enhanced VPC routing: COPY / UNLOAD goes through VPC
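
The leader / compute node split can be sketched as a scatter-gather aggregation. This is a toy model, not Redshift's actual execution engine: each "compute node" is a thread working on its own slice of hypothetical `(region, amount)` rows, and the "leader" merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_node_partial(slice_rows, region):
    """Compute node: scan the local slice and return a partial (sum, count)."""
    amounts = [amount for row_region, amount in slice_rows if row_region == region]
    return sum(amounts), len(amounts)

def leader_average(slices, region):
    """Leader node: fan the query out to every slice, then merge the partials."""
    if not slices:
        return None
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda s: compute_node_partial(s, region), slices))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None

slices = [
    [("eu", 10), ("us", 20)],
    [("eu", 30)],
    [("us", 5)],
]
print(leader_average(slices, "eu"))
```

Returning sums and counts rather than per-slice averages is what lets the leader combine the partial results correctly.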

    Redshift – Snapshots & DR

  • Redshift has no “Multi-AZ” mode
  • Snapshots are point-in-time backups of a cluster stored internally in S3
  • Snapshots are incremental (only what has changed is saved)
  • You can restore a snapshot into a new cluster
  • Automated: every 8 hours, every 5 GB, or on a schedule. Set retention
  • Manual: snapshot is retained until you delete it
  • You can configure Amazon Redshift to automatically copy snapshots (automated or manual) of a cluster to another AWS Region
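
The snapshot rules above can be written down as two small checks. This is an illustrative model only: the thresholds come from the list above, while the function names and the `(taken_at, kind)` tuple shape are hypothetical.

```python
from datetime import datetime, timedelta

def snapshot_due(hours_since_last, gb_changed_since_last, every_hours=8, every_gb=5):
    """Automated snapshots trigger on elapsed time or on volume of changes."""
    return hours_since_last >= every_hours or gb_changed_since_last >= every_gb

def expired_snapshots(snapshots, now, retention_days):
    """Return automated snapshots older than the retention period.

    Manual snapshots are never returned: they are kept until deleted by hand.
    """
    cutoff = now - timedelta(days=retention_days)
    return [
        (taken_at, kind)
        for taken_at, kind in snapshots
        if kind == "automated" and cutoff > taken_at
    ]

now = datetime(2024, 1, 10)
snapshots = [
    (datetime(2024, 1, 1), "automated"),
    (datetime(2024, 1, 1), "manual"),
    (datetime(2024, 1, 9), "automated"),
]
print(snapshot_due(2, 6))
print(expired_snapshots(snapshots, now, retention_days=7))
```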

    Loading data into Redshift

  • Amazon Kinesis Data Firehose
  • S3 using COPY command
  • EC2 Instance JDBC driver

    copy customer
    from 's3://mybucket/mydata'
    iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
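
When the COPY statement is issued from application code (for example over a JDBC or other SQL connection), it can be assembled from parameters. The helper below is a hypothetical sketch that only builds the SQL text; it does not connect to a cluster, and its validation is deliberately minimal.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement as text."""
    if not table.isidentifier():
        raise ValueError("table must be a simple identifier")
    for value in (s3_uri, iam_role_arn):
        if "'" in value:
            raise ValueError("quotes are not allowed in COPY parameters")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return (
        f"copy {table}\n"
        f"from '{s3_uri}'\n"
        f"iam_role '{iam_role_arn}';"
    )

print(build_copy_statement(
    "customer",
    "s3://mybucket/mydata",
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
))
```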
    

    Redshift Spectrum

  • Query data that is already in S3 without loading it
  • Must have a Redshift cluster available to start the query
  • The query is then submitted to thousands of Redshift Spectrum nodes

    Redshift for Solutions Architect

  • Operations: like RDS
  • Security: IAM, VPC, KMS, SSL (like RDS)
  • Reliability: auto healing features, cross-region snapshot copy
  • Performance: 10x performance vs other data warehousing, compression
  • Cost: pay per node provisioned, 1/10th of the cost vs other warehouses
  • vs Athena: faster queries / joins / aggregations thanks to indexes
  • Remember: Redshift = Analytics / BI / Data Warehouse

    AWS Glue

  • Managed extract, transform, and load (ETL) service
  • Useful to prepare and transform data for analytics
  • Fully serverless service

    Glue Data Catalog

  • Glue Data Catalog: catalog of datasets

    Neptune

  • Fully managed graph database
  • When do we use Graphs?
    • High relationship data
    • Social networking: users are friends with other users, reply to comments on another user's post, and like other comments
    • Knowledge graphs (Wikipedia)
  • Highly available across 3 AZ, with up to 15 read replicas
  • Point-in-time recovery, continuous backup to Amazon S3
  • Support for KMS encryption at rest + HTTPS
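
The "users friends with users" case is a graph traversal. The sketch below uses a plain adjacency dictionary and a breadth-first search in place of a graph database; the `FRIENDS` data and `within_hops` function are hypothetical, and Neptune itself would be queried through a graph query language rather than code like this.

```python
from collections import deque

# Hypothetical friendship graph: user -> set of direct friends.
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable user to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance

print(within_hops(FRIENDS, "alice", 2))
```

In a relational schema, each extra hop tends to mean another self-join, which is the kind of query a graph store is meant to simplify.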

    Neptune for Solutions Architect

  • Operations: similar to RDS
  • Security: IAM, VPC, KMS, SSL (similar to RDS) + IAM Authentication
  • Reliability: Multi-AZ, clustering
  • Performance: best suited for graphs, clustering to improve performance
  • Cost: pay per node provisioned (similar to RDS)
  • Remember: Neptune = Graphs

    ElasticSearch

  • Example: In DynamoDB, you can only find by primary key or indexes.
  • With ElasticSearch, you can search any field, even partial matches
  • It’s common to use ElasticSearch as a complement to another database
  • ElasticSearch also has some usage for Big Data applications
  • You can provision a cluster of instances
  • Built-in integrations: Amazon Kinesis Data Firehose, AWS IoT, and Amazon CloudWatch Logs for data ingestion
  • Security through Cognito & IAM, KMS encryption, SSL & VPC
  • Comes with Kibana (visualization) & Logstash (log ingestion) – ELK stack
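
The contrast with key-only lookups can be shown with a toy inverted index that covers every field and allows partial matches. This is a simplified illustration, not how ElasticSearch is implemented; the product documents and function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map every lowercased token of every field to the ids of matching documents."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for value in doc.values():
            for token in str(value).lower().split():
                index[token].add(doc_id)
    return index

def search(index, term):
    """Return ids of documents with any token containing the term."""
    term = term.lower()
    hits = set()
    for token, doc_ids in index.items():
        if term in token:   # substring check stands in for partial matching
            hits |= doc_ids
    return sorted(hits)

products = {
    1: {"title": "Wireless keyboard", "brand": "Acme"},
    2: {"title": "Mechanical keyboard", "brand": "Keys Inc"},
    3: {"title": "USB cable", "brand": "Acme"},
}
index = build_index(products)
print(search(index, "key"))
print(search(index, "acme"))
```

A common arrangement is to keep the source of truth in another database and feed a search index like this from it, which matches the "complement to another database" point above.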

    ElasticSearch for Solutions Architect

  • Operations: similar to RDS
  • Security: Cognito, IAM, VPC, KMS, SSL
  • Reliability: Multi-AZ, clustering
  • Performance: based on ElasticSearch project (open source), petabyte scale
  • Cost: pay per node provisioned (similar to RDS)
  • Remember: ElasticSearch = Search / Indexing
This post is licensed under CC BY 4.0 by the author.