Shawn Albert


Modern Serverless Data Platform

A modern data lakehouse architecture leveraging Apache Iceberg and AWS services to create a scalable, serverless data platform with automated deployment and governance.

Apache Iceberg · Data Lakehouse · AWS Lake Formation · AWS Glue · AWS Athena · dbt · GitHub Actions · Docker · AWS ECR · AWS ECS · Infrastructure as Code (IaC) · AWS CDK · AWS RDS

[Architecture diagram: "Apache Iceberg Data Lakehouse." AWS RDS sources (MySQL and PostgreSQL) feed an AWS Glue ETL job that insert-overwrites raw data into the Bronze layer; incremental updates, run as dbt SparkSQL on AWS Athena Spark workgroups, promote data through the Silver (staging and intermediate) and Gold (facts and data marts) layers. Each layer stores Iceberg metadata in the AWS Glue Data Catalog and is governed by AWS Lake Formation LF-tags. GitHub Actions builds the dbt Docker image and pushes it to the AWS ECR container registry, from which it runs on an AWS ECS cluster with AWS Fargate. The AWS Cloud Development Kit deploys the stack through AWS CloudFormation (cdk deploy). Data scientists, business users, and data engineers consume the platform.]

Overview

Engineered a modern data platform on Apache Iceberg and AWS services: a scalable, serverless data lakehouse. The platform implements a Bronze-Silver-Gold (medallion) architecture with automated deployment through GitHub Actions and comprehensive data governance using AWS Lake Formation.

Architecture Components

  1. Data Ingestion Layer
    • AWS RDS (MySQL and PostgreSQL) as source systems
    • AWS Glue ETL jobs for data extraction and loading
    • AWS Cloud Development Kit (CDK) for infrastructure deployment (see the sketch after this list)
  2. Data Processing Layer
    • Apache Iceberg for table format management
    • AWS Lake Formation for data access control
    • AWS Glue Data Catalog for metadata management
    • AWS Athena with Spark workgroups for data processing
  3. CI/CD Pipeline
    • GitHub Actions for automated deployment
    • Docker for containerization
    • AWS ECR for container registry
    • AWS ECS for container orchestration
  4. Data Transformation
    • dbt for data transformations
    • SparkSQL for data processing
    • AWS Fargate for serverless compute
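
The CDK piece referenced in item 1 might look like the following Python sketch. The stack layout, resource names, and the subset of resources shown are illustrative assumptions, not the project's actual definitions:

```python
# Hypothetical CDK app: one stack provisioning the lake bucket, the ECR
# repository for the dbt image, and a Glue database for the Bronze layer.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_ecr as ecr
from aws_cdk import aws_glue as glue
from aws_cdk import aws_s3 as s3
from constructs import Construct

class LakehouseStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # S3 bucket holding Iceberg data and metadata files.
        lake_bucket = s3.Bucket(
            self, "LakeBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,
        )

        # Registry for the dbt image that GitHub Actions builds and pushes.
        ecr.Repository(self, "DbtRepo", repository_name="dbt-lakehouse")

        # Glue Data Catalog database backing the Bronze layer.
        glue.CfnDatabase(
            self, "BronzeDb",
            catalog_id=self.account,
            database_input=glue.CfnDatabase.DatabaseInputProperty(
                name="bronze",
                location_uri=f"s3://{lake_bucket.bucket_name}/bronze/",
            ),
        )

app = App()
LakehouseStack(app, "LakehouseStack")
app.synth()
```

Running cdk deploy synthesizes this into CloudFormation, matching the deployment path shown in the diagram.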

Technical Implementation

Data Layer Architecture

The platform implements a multi-layer data architecture that aligns with dbt best practices:

Each layer below pairs the data lake tier with its dbt counterpart, a description, and implementation details.

Bronze (dbt layer: Raw)
Initial ingestion layer where raw data from source systems is loaded with minimal or no transformation.

• Raw data loaded from source systems (RDS MySQL/PostgreSQL)
• Original schema and data preserved
• Insert/overwrite patterns via Glue ETL
• Source system metadata capture
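
A minimal sketch of what such a Bronze load might look like in a Glue job, assuming a Spark session preconfigured with an Iceberg catalog named glue_catalog (as Glue and Athena Spark provide when Iceberg support is enabled). The endpoint, credentials, and table names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

# In a Glue job or Athena Spark notebook this session already exists with
# the Iceberg + Glue Data Catalog extensions configured.
spark = SparkSession.builder.getOrCreate()

# Extract a source table from RDS over JDBC (credentials would come from
# AWS Secrets Manager in practice; values here are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-endpoint:5432/appdb")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Preserve the original schema; only append source-system metadata columns.
orders = (
    orders
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_system", F.lit("rds_postgresql"))
)

# Idempotent full refresh of the Bronze table via Iceberg insert-overwrite.
orders.createOrReplaceTempView("orders_src")
spark.sql("INSERT OVERWRITE glue_catalog.bronze.orders SELECT * FROM orders_src")
```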

Silver (dbt layer: Staging)
Cleaned and standardized data layer, focused on removing duplicates, standardizing formats, and generally improving data quality.

• Automated data cleaning pipelines
• Duplicate record removal
• Format standardization and type casting
• Basic data quality validation rules
• Schema enforcement
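
A sketch of a staging transform under the same assumptions (hypothetical table and column names), deduplicating on the business key and casting columns to enforce the staging schema:

```python
# Keep only the most recently ingested copy of each order and standardize
# formats and types on the way into Silver.
spark.sql("""
    INSERT OVERWRITE glue_catalog.silver.stg_orders
    SELECT
        CAST(order_id AS BIGINT)       AS order_id,
        LOWER(TRIM(customer_email))    AS customer_email,
        CAST(order_ts AS TIMESTAMP)    AS order_ts,
        CAST(amount AS DECIMAL(12, 2)) AS amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY _ingested_at DESC
               ) AS rn
        FROM glue_catalog.bronze.orders
    ) AS deduped
    WHERE rn = 1  -- drop duplicate records
""")
```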

Silver (dbt layer: Intermediate)
Enriched data layer with additional transformations that create relationships between entities, often used for further downstream aggregations.

• Entity relationship mapping and joins
• Data enrichment transformations
• Preparation for aggregation layers
• Complex business logic implementation
• Derived field calculations
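
An intermediate model might then join staged entities and compute derived fields ahead of aggregation; the tables and columns below are again hypothetical:

```python
# Enrich orders with customer attributes to prepare for fact-level rollups.
spark.sql("""
    INSERT OVERWRITE glue_catalog.silver.int_orders_enriched
    SELECT
        o.order_id,
        o.order_ts,
        o.amount,
        c.customer_id,
        c.segment,
        o.amount * (1 - c.discount_rate) AS net_amount  -- derived field
    FROM glue_catalog.silver.stg_orders o
    JOIN glue_catalog.silver.stg_customers c
      ON o.customer_email = c.email
""")
```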

Gold (dbt layer: Facts)
Highly curated layer representing analytics-ready data in a fact table format, typically containing metrics or aggregations for specific use cases.

• Core metric calculations and aggregations
• Fact table generation and modeling
• Performance-optimized table structures
• Incremental processing logic
• Data validation rules
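
For the incremental processing logic, Iceberg's MERGE INTO support in SparkSQL lets a fact load rewrite only new or changed rows instead of rebuilding the table; the lookback window and names below are illustrative:

```python
# Upsert recent daily revenue into the fact table.
spark.sql("""
    MERGE INTO glue_catalog.gold.fct_daily_revenue t
    USING (
        SELECT
            DATE(order_ts)  AS order_date,
            SUM(net_amount) AS revenue,
            COUNT(*)        AS order_count
        FROM glue_catalog.silver.int_orders_enriched
        WHERE order_ts >= DATE_SUB(CURRENT_DATE(), 3)  -- short reprocessing window
        GROUP BY DATE(order_ts)
    ) s
    ON t.order_date = s.order_date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```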

Gold (dbt layer: Marts)
Final layer, designed for specific business domains and reporting, tailored to business requirements or specific use cases.

• Domain-specific data models
• Reporting-ready dimensional tables
• Business logic implementation
• Self-service analytics views
• Documentation and data dictionaries
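
A mart can then be materialized as its own Iceberg table scoped to one business domain, for example (hypothetical names and metrics):

```python
# Domain-specific mart built on top of the fact table.
spark.sql("""
    CREATE OR REPLACE TABLE glue_catalog.gold.mart_finance_revenue
    USING iceberg AS
    SELECT
        order_date,
        revenue,
        order_count,
        revenue / order_count AS avg_order_value
    FROM glue_catalog.gold.fct_daily_revenue
""")
```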

Key Features

The modern data platform provides a robust foundation for data operations through:

  • Serverless Architecture

    • Pay-per-use compute resources
    • Automatic scaling
    • Minimal operational overhead
  • Data Quality and Governance

    • Automated data validation with dbt tests
    • Fine-grained access controls and column-level security via LF-tags (see the sketch after this list)
    • Data discovery and cataloging
    • Automated schema evolution
  • Performance Optimization

    • Query optimization with Apache Iceberg
    • Partition management
    • Compute resource optimization
  • Automated Deployments

    • GitHub Actions for CI/CD with automated testing, validation, and Docker image deployment
    • AWS CDK for Infrastructure as Code, ensuring consistent, version-controlled infrastructure
    • Environment provisioning with automated resource management
  • Increased Development Efficiency

    • Standardized data models across projects
    • DataOps-driven CI/CD pipeline
    • Auto-generated documentation via dbt models
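
As noted under governance, Lake Formation access control is driven by LF-tags. A sketch of how tag-based, column-level control might be set up with boto3; the tag keys, role ARN, and table names are illustrative assumptions:

```python
import boto3

lf = boto3.client("lakeformation")

# Define a governance tag and its allowed values.
lf.create_lf_tag(TagKey="sensitivity", TagValues=["public", "pii"])

# Attach the tag at column level so Lake Formation can enforce
# column-level security on a Silver table.
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "silver",
            "Name": "stg_orders",
            "ColumnNames": ["customer_email"],
        }
    },
    LFTags=[{"TagKey": "sensitivity", "TagValues": ["pii"]}],
)

# Grant an analyst role SELECT only on data tagged "public".
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "sensitivity", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```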

Technical Outcomes

  • Implemented end-to-end data pipeline automation aligned with dbt best practices
  • Established data quality frameworks across transformation layers
  • Created maintainable and version-controlled data transformations
  • Built modular infrastructure supporting iterative development