GCP Professional Data Engineer Prep.
Preparing for the GCP Professional Data Engineer exam? Don’t know where to start? This post is the GCP Professional Data Engineer Certification Study Guide (with links to each objective in the exam domain).
I have curated a detailed list of articles from the Google documentation and other blogs for each objective of the Google Cloud Platform certified Professional Data Engineer exam. Please share this post within your circles so it can help others prepare for the exam.
GCP Professional Data Engineer Course Material
Pluralsight (Free trial) | GCP Professional Data Engineer Prep |
Coursera (Professional Cert.) | Prep. for Professional Data Engineer |
Udemy | Professional Data Engineer 2022 |
GCP Professional Data Engineer Practice Test
Whizlabs Exam Questions | Professional data engineer (220 Qs) |
Udemy Practice Tests | Practice Test for Data Engineer (150 Qs) |
GCP Professional Data Engineer Other Materials
LinkedIn Learning | Become a cloud data engineer |
Amazon e-book (PDF) | GCP Data Engineer study guide |
Check out all the other GCP certificate study guides
Full Disclosure: Some of the links in this post are affiliate links. I receive a commission when you purchase through them.
Section 1. Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Data modeling
Introduction to data models in Cloud Datastore
Trade-offs involving latency, throughput, transactions
The trade-off between high throughput and low latency
Distributed systems
Distributed systems in Google Cloud
Schema design
Schema design for time-series data
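The schema-design link above centers on Bigtable row keys for time-series data. As a minimal sketch (the entity names and key layout are my own assumptions, not taken from the linked doc), a row key that leads with a stable identifier and appends a coarse time bucket spreads writes across tablets while still supporting prefix scans over a time range:

```python
from datetime import datetime, timezone

def make_row_key(device_id: str, ts: datetime) -> str:
    """Build a Bigtable-style row key: <device>#<YYYYMMDDHHMM>.

    Leading with the device ID avoids hotspotting on sequential
    timestamps; the coarse time bucket keeps one device's rows
    contiguous so a prefix scan returns a time range.
    """
    bucket = ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M")
    return f"{device_id}#{bucket}"

print(make_row_key("sensor-042", datetime.now(timezone.utc)))
```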
1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Visualizing BigQuery data in a Jupyter notebook
Visualize BigQuery data using Data Studio
Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
Create a batch processing job on GCP Dataflow
Building Batch Data Pipelines on GCP
Coding a batch processing pipeline with Dataflow & Apache Beam
Run an Apache Spark batch workload
Build a Dataflow pipeline: PubSub to Cloud Storage
Streaming pipelines with Scala and Kafka on GCP
Online (interactive) vs. batch predictions
Online versus batch prediction
Job automation and orchestration (e.g., Cloud Composer)
Automating infrastructure with Cloud Composer
Choose Cloud Composer for service orchestration
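Objective 1.2 leans heavily on Dataflow and Apache Beam. If you have never written a Beam pipeline, a minimal word-count-style batch job is a quick way to internalize the PCollection/PTransform model. This is only a sketch: the bucket paths are placeholders, and it runs locally on the DirectRunner unless you pass `--runner=DataflowRunner` plus project and region options.

```python
# Minimal Apache Beam batch pipeline (pip install apache-beam).
# Paths are placeholders; swap in your own GCS bucket to run on Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(input_path="gs://YOUR_BUCKET/input/*.txt",
        output_path="gs://YOUR_BUCKET/output/wordcount"):
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, --region for Dataflow
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
            | "Write" >> beam.io.WriteToText(output_path)
        )

if __name__ == "__main__":
    run()
```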
1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
System availability and fault tolerance
Breaking down Cloud SQL’s 3 fault tolerance mechanisms
Compute Engine Service Level Agreement (SLA)
Use of distributed systems
Capacity planning
Capacity management with load balancing
Hybrid cloud and edge computing
Announcing Google Distributed Cloud Edge and Hosted | Google Cloud Blog
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
Building batch data pipelines on GCP
Serverless computing solutions
At least once, in-order, and exactly once, etc., event processing
Exactly-once processing in Google Cloud Dataflow
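Message brokers and delivery semantics come up repeatedly in this objective. As a minimal sketch (the project and topic IDs are placeholders), publishing to Pub/Sub with the Python client looks like the snippet below; because Pub/Sub delivers at least once, downstream consumers should be idempotent or deduplicate on a message attribute.

```python
# Minimal Pub/Sub publisher sketch (pip install google-cloud-pubsub).
# PROJECT_ID and TOPIC_ID are placeholders.
from google.cloud import pubsub_v1

PROJECT_ID = "your-project-id"
TOPIC_ID = "your-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Attach an ID attribute so subscribers can deduplicate
# under Pub/Sub's at-least-once delivery semantics.
future = publisher.publish(
    topic_path,
    data=b"order created",
    event_id="order-12345",
)
print(f"Published message ID: {future.result()}")
```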
1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
The four phases of a data center migration to the cloud
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Transfer service for on-premises data overview
Overview of on-premises to GCP migration
Validating a migration
Section 2. Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
What Cloud Bigtable is good for
Storage costs and performance
Life cycle management of data
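Lifecycle management is easier to reason about once you have set a rule yourself. Here is a minimal sketch with the Cloud Storage Python client (the bucket name is a placeholder) that ages objects into Coldline and eventually deletes them:

```python
# Object lifecycle rules with the Cloud Storage client
# (pip install google-cloud-storage). Bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("your-bucket-name")

# Move objects to Coldline after 90 days, delete after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()

print(list(bucket.lifecycle_rules))
```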
2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Google Cloud Dataprep: Prepare data of any size
Batch and streaming
Transformation
Creating a data transformation pipeline with Cloud Dataprep
Data acquisition and import
Real-time CDC replication into BigQuery
Best practices for importing and exporting data
Integrating with new data sources
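For data acquisition and import, it helps to have run at least one BigQuery load job by hand. A minimal sketch (the URI and table ID are placeholders) that loads a CSV from Cloud Storage with schema autodetection:

```python
# Loading a CSV from GCS into BigQuery (pip install google-cloud-bigquery).
# Table ID and URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/path/data.csv", table_id, job_config=job_config
)
load_job.result()  # block until the load completes

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```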
2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Monitoring pipelines
Using monitoring for Dataflow pipelines
Using the Dataflow monitoring interface
Adjusting pipelines
Testing and quality control
Testing Dataflow pipelines with Cloud Spanner Emulator
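Testing and quality control for pipelines is easier to picture with Beam's built-in test utilities. A minimal unit-test sketch for a transform running on the local DirectRunner (the transform under test is a made-up example):

```python
# Unit-testing a Beam transform with the built-in test utilities
# (pip install apache-beam). The transform under test is illustrative.
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_uppercase_transform():
    with TestPipeline() as p:
        output = (
            p
            | beam.Create(["alpha", "beta"])
            | beam.Map(str.upper)
        )
        assert_that(output, equal_to(["ALPHA", "BETA"]))

if __name__ == "__main__":
    test_uppercase_transform()
    print("pipeline test passed")
```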
Section 3. Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., Vision API, Speech API)
Detect labels in an image by using client libraries
Transcribe speech to text by using the Cloud Console
Customizing ML APIs (e.g., AutoML Vision, Auto ML text)
Label images by using AutoML Vision
AutoML natural language API tutorial
Conversational experiences (e.g., Dialogflow)
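The pre-built ML APIs are worth calling at least once before the exam. A minimal label-detection sketch with the Vision API client (the image URI is a placeholder):

```python
# Label detection with the Cloud Vision API
# (pip install google-cloud-vision). The image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://your-bucket/photos/cat.jpg"

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2f}")
```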
3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
Automated Model retraining with Kubeflow pipelines
Use Dataproc, BigQuery, and Apache Spark ML
Continuous evaluation
Continuous evaluation overview
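BigQuery ML belongs in the retraining conversation because a model can be re-created on a schedule with plain SQL. A minimal sketch (dataset, table, and column names are my own placeholders) that trains a logistic regression model from the Python client:

```python
# Training a BigQuery ML model with a CREATE MODEL statement
# (pip install google-cloud-bigquery). All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT churned, tenure_months, monthly_spend
FROM `your_dataset.customers`
"""

client.query(query).result()  # re-running this statement retrains the model
print("model trained")
```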
3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Distributed training structure
Use of edge compute
Bringing intelligence to the edge with Cloud IoT
Hardware accelerators (e.g., GPU, TPU)
Using GPUs for training models in the cloud
Using TPUs to train your model
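For the distributed vs. single-machine and accelerator questions, remember that the frameworks hide most of the distribution mechanics. A toy TensorFlow sketch (model and data are placeholders) where MirroredStrategy replicates training across whatever GPUs are attached to the VM, falling back to CPU if there are none:

```python
# Synchronous multi-GPU training with tf.distribute.MirroredStrategy
# (pip install tensorflow). Model and data are toy placeholders.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all local GPUs, or CPU if none
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(256, 4).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(x, y, epochs=1, batch_size=32)
```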
3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Impact of dependencies of machine learning models
Common sources of error (e.g., assumptions about data)
Assumptions of common Machine Learning models
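Evaluation metrics are a reliable exam topic, so it is worth being able to compute precision and recall by hand for a small confusion matrix. A quick sketch with scikit-learn on made-up labels:

```python
# Precision, recall, and a confusion matrix for a toy classification result
# (pip install scikit-learn). Labels are made up for illustration.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
```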
Section 4. Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Identity and Access Management
Data security (encryption, key management)
Encryption at rest in Google Cloud
Encryption in Transit in Google Cloud
Cloud Key Management Service deep dive
Ensuring privacy (e.g., Data Loss Prevention API)
Cloud Data Loss Prevention (DLP) API client library
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
HIPAA compliance on Google Cloud Platform
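The DLP API is the go-to answer for privacy questions about detecting or redacting PII. A minimal inspection sketch (project ID and sample text are placeholders) that looks for email addresses in free text:

```python
# Inspecting text for PII with the Cloud DLP API
# (pip install google-cloud-dlp). Project ID and text are placeholders.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"

response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Contact me at jane.doe@example.com"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```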
4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Using Cloud Build as a test runner
Pipeline monitoring (e.g., Cloud Monitoring)
Monitoring your Dataflow pipelines
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Troubleshooting service infrastructure
Resizing and autoscaling resources
Autoscaling groups of instances
4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Dataprep)
A peek into data preparation using Google Cloud Dataprep
Improve data quality for ML and analytics with Cloud Dataprep
Verification and monitoring
Validating data at scale for machine learning
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Disaster recovery scenarios for data
Breaking down Cloud SQL’s 3 fault tolerance mechanisms
Choosing between ACID, idempotent, eventually consistent requirements
Balancing Strong and Eventual Consistency with Datastore
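Data quality control usually boils down to a handful of systematic checks applied before a load. A toy sketch with pandas (column names and rules are my own assumptions) of the kind of null, duplicate, and range checks that Dataprep or a pipeline step would automate at scale:

```python
# Basic data quality checks before loading (pip install pandas).
# Column names and thresholds are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [19.99, -5.00, 42.50, None],
})

issues = {
    "null_amounts": int(df["amount"].isna().sum()),
    "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
}

print(issues)
if any(issues.values()):
    raise ValueError(f"data quality checks failed: {issues}")
```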
4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Best practices for enterprise organizations
Designing for data and application portability (e.g., multicloud, data residency requirements)
Meet data residency requirements with Google Cloud
Hybrid and multi-cloud patterns and practices
Data staging, cataloging, and discovery
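Data Catalog is the usual answer for cataloging and discovery questions. A minimal search sketch (the project ID and query string are placeholders) that lists BigQuery tables visible to the caller:

```python
# Searching Data Catalog for BigQuery tables
# (pip install google-cloud-datacatalog). Project ID is a placeholder.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

results = client.search_catalog(
    request={
        "scope": {"include_project_ids": ["your-project-id"]},
        "query": "system=bigquery type=table",
    }
)

for result in results:
    print(result.relative_resource_name)
```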
This brings us to the end of the GCP Professional Data Engineer Study Guide.
What do you think? Let me know in the comments section if I have missed anything. Also, I’d love to hear how your preparation is going!
In case you are preparing for other GCP certification exams, check out the GCP study guide for those exams.
Follow Me to Receive Updates on GCP Exams
Want to be notified as soon as I post? Subscribe to the RSS feed or leave your email address in the subscribe section. Share the article on your social networks using the links below so it can benefit others.