GCP Professional Data Engineer Prep.
Preparing for the GCP Professional Data Engineer exam? Don’t know where to start? This post is the GCP Professional Data Engineer Certification Study Guide (with links to each objective in the exam domain).
I have curated a detailed list of articles from the Google documentation and other blogs for each objective of the Google Cloud Platform certified Professional Data Engineer exam. Please share this post within your circles so it can help others prepare for the exam.
GCP Professional Data Engineer Course Material
Pluralsight (Free trial) | GCP Professional Data Engineer Prep |
Coursera (Professional Cert.) | Prep. for Professional Data Engineer |
Udemy | Professional Data Engineer 2022 |
GCP Professional Data Engineer Practice Test
Whizlabs Exam Questions | Professional data engineer (220 Qs) |
Udemy Practice Tests | Practice Test for Data Engineer (150 Qs) |
GCP Professional Data Engineer Other Materials
LinkedIn Learning | Become a cloud data engineer |
Amazon e-book (PDF) | GCP Data Engineer study guide |
Check out all the other GCP certificate study guides
Full Disclosure: Some of the links in this post are affiliate links. I receive a commission when you purchase through them.
Section 1. Designing data processing systems
1.1 Selecting the appropriate storage technologies. Considerations include:
Mapping storage systems to business requirements
Data modeling
Introduction to data models in Cloud Datastore
Trade-offs involving latency, throughput, transactions
The trade-off between high throughput and low latency
Distributed systems
Distributed systems in Google Cloud
Schema design
Schema design for time-series data
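The schema-design link above centers on Bigtable row keys for time-series data. As a minimal sketch (the entity names and key layout are my own assumptions, not taken from the linked doc), a row key that leads with a stable identifier and appends a coarse time bucket spreads writes across tablets while still supporting prefix scans over a time range:

```python
from datetime import datetime, timezone

def make_row_key(device_id: str, ts: datetime) -> str:
    """Build a Bigtable-style row key: <device>#<YYYYMMDDHHMM>.

    Leading with the device ID avoids hotspotting on sequential
    timestamps; the coarse time bucket keeps one device's rows
    contiguous so a prefix scan returns a time range.
    """
    bucket = ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M")
    return f"{device_id}#{bucket}"

print(make_row_key("sensor-042", datetime.now(timezone.utc)))
```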
1.2 Designing data pipelines. Considerations include:
Data publishing and visualization (e.g., BigQuery)
Visualizing BigQuery data in a Jupyter notebook
Visualize BigQuery data using Data Studio
Batch and streaming data (e.g., Dataflow, Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Pub/Sub, Apache Kafka)
Create a batch processing job on GCP Dataflow
Building Batch Data Pipelines on GCP
Coding a batch processing pipeline with Dataflow & Apache Beam
Run an Apache Spark batch workload
Build a Dataflow pipeline: PubSub to Cloud Storage
Streaming pipelines with Scala and Kafka on GCP
Online (interactive) vs. batch predictions
Online versus batch prediction
Job automation and orchestration (e.g., Cloud Composer)
Automating infrastructure with Cloud Composer
Choose Cloud Composer for service orchestration
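Objective 1.2 leans heavily on Dataflow and Apache Beam. If you have never written a Beam pipeline, a minimal word-count-style batch job is a quick way to internalize the PCollection/PTransform model. This is only a sketch: the bucket paths are placeholders, and it runs locally on the DirectRunner unless you pass `--runner=DataflowRunner` plus project and region options.

```python
# Minimal Apache Beam batch pipeline (pip install apache-beam).
# Paths are placeholders; swap in your own GCS bucket to run on Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(input_path="gs://YOUR_BUCKET/input/*.txt",
        output_path="gs://YOUR_BUCKET/output/wordcount"):
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, --region for Dataflow
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, count: f"{word},{count}")
            | "Write" >> beam.io.WriteToText(output_path)
        )

if __name__ == "__main__":
    run()
```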
1.3 Designing a data processing solution. Considerations include:
Choice of infrastructure
System availability and fault tolerance
Breaking down Cloud SQL’s 3 fault tolerance mechanisms
Compute Engine Service Level Agreement (SLA)
Use of distributed systems
Capacity planning
Capacity management with load balancing
Hybrid cloud and edge computing
Announcing Google Distributed Cloud Edge and Hosted | Google Cloud Blog
Architecture options (e.g., message brokers, message queues, middleware, service-oriented architecture, serverless functions)
Building batch data pipelines on GCP
Serverless computing solutions
At least once, in-order, and exactly once, etc., event processing
Exactly-once processing in Google Cloud Dataflow
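Message brokers and delivery semantics come up repeatedly in this objective. As a minimal sketch (the project and topic IDs are placeholders), publishing to Pub/Sub with the Python client looks like the snippet below; because Pub/Sub delivers at least once, downstream consumers should be idempotent or deduplicate on a message attribute.

```python
# Minimal Pub/Sub publisher sketch (pip install google-cloud-pubsub).
# PROJECT_ID and TOPIC_ID are placeholders.
from google.cloud import pubsub_v1

PROJECT_ID = "your-project-id"
TOPIC_ID = "your-topic"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Attach an ID attribute so subscribers can deduplicate
# under Pub/Sub's at-least-once delivery semantics.
future = publisher.publish(
    topic_path,
    data=b"order created",
    event_id="order-12345",
)
print(f"Published message ID: {future.result()}")
```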
1.4 Migrating data warehousing and data processing. Considerations include:
Awareness of current state and how to migrate a design to a future state
The four phases of a data center migration to the cloud
Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
Transfer service for on-premises data overview
Overview of on-premises to GCP migration
Validating a migration
Section 2. Building and operationalizing data processing systems
2.1 Building and operationalizing storage systems. Considerations include:
Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
What Cloud Bigtable is good for
Storage costs and performance
Life cycle management of data
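Lifecycle management is easier to reason about once you have set a rule yourself. Here is a minimal sketch with the Cloud Storage Python client (the bucket name is a placeholder) that ages objects into Coldline and eventually deletes them:

```python
# Object lifecycle rules with the Cloud Storage client
# (pip install google-cloud-storage). Bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("your-bucket-name")

# Move objects to Coldline after 90 days, delete after 365 days.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()

print(list(bucket.lifecycle_rules))
```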
2.2 Building and operationalizing pipelines. Considerations include:
Data cleansing
Google Cloud Dataprep: Prepare data of any size
Batch and streaming
Transformation
Creating a data transformation pipeline with Cloud Dataprep
Data acquisition and import
Real-time CDC replication into BigQuery
Best practices for importing and exporting data
Integrating with new data sources
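For data acquisition and import, it helps to have run at least one BigQuery load job by hand. A minimal sketch (the URI and table ID are placeholders) that loads a CSV from Cloud Storage with schema autodetection:

```python
# Loading a CSV from GCS into BigQuery (pip install google-cloud-bigquery).
# Table ID and URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/path/data.csv", table_id, job_config=job_config
)
load_job.result()  # block until the load completes

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```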
2.3 Building and operationalizing processing infrastructure. Considerations include:
Provisioning resources
Monitoring pipelines
Using monitoring for Dataflow pipelines
Using the Dataflow monitoring interface
Adjusting pipelines
Testing and quality control
Testing Dataflow pipelines with Cloud Spanner Emulator
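Testing and quality control for pipelines is easier to picture with Beam's built-in test utilities. A minimal unit-test sketch for a transform running on the local DirectRunner (the transform under test is a made-up example):

```python
# Unit-testing a Beam transform with the built-in test utilities
# (pip install apache-beam). The transform under test is illustrative.
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_uppercase_transform():
    with TestPipeline() as p:
        output = (
            p
            | beam.Create(["alpha", "beta"])
            | beam.Map(str.upper)
        )
        assert_that(output, equal_to(["ALPHA", "BETA"]))

if __name__ == "__main__":
    test_uppercase_transform()
    print("pipeline test passed")
```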
Section 3. Operationalizing machine learning models
3.1 Leveraging pre-built ML models as a service. Considerations include:
ML APIs (e.g., Vision API, Speech API)
Detect labels in an image by using client libraries
Transcribe speech to text by using the Cloud Console
Customizing ML APIs (e.g., AutoML Vision, Auto ML text)
Label images by using AutoML Vision
AutoML natural language API tutorial
Conversational experiences (e.g., Dialogflow)
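The pre-built ML APIs are worth calling at least once before the exam. A minimal label-detection sketch with the Vision API client (the image URI is a placeholder):

```python
# Label detection with the Cloud Vision API
# (pip install google-cloud-vision). The image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://your-bucket/photos/cat.jpg"

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2f}")
```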
3.2 Deploying an ML pipeline. Considerations include:
Ingesting appropriate data
Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
Automated Model retraining with Kubeflow pipelines
Use Dataproc, BigQuery, and Apache Spark ML
Continuous evaluation
Continuous evaluation overview
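BigQuery ML belongs in the retraining conversation because a model can be re-created on a schedule with plain SQL. A minimal sketch (dataset, table, and column names are my own placeholders) that trains a logistic regression model from the Python client:

```python
# Training a BigQuery ML model with a CREATE MODEL statement
# (pip install google-cloud-bigquery). All names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT churned, tenure_months, monthly_spend
FROM `your_dataset.customers`
"""

client.query(query).result()  # re-running this statement retrains the model
print("model trained")
```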
3.3 Choosing the appropriate training and serving infrastructure. Considerations include:
Distributed vs. single machine
Distributed training structure
Use of edge compute
Bringing intelligence to the edge with Cloud IoT
Hardware accelerators (e.g., GPU, TPU)
Using GPUs for training models in the cloud
Using TPUs to train your model
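For the distributed vs. single-machine and accelerator questions, remember that the frameworks hide most of the distribution mechanics. A toy TensorFlow sketch (model and data are placeholders) where MirroredStrategy replicates training across whatever GPUs are attached to the VM, falling back to CPU if there are none:

```python
# Synchronous multi-GPU training with tf.distribute.MirroredStrategy
# (pip install tensorflow). Model and data are toy placeholders.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all local GPUs, or CPU if none
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(256, 4).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(x, y, epochs=1, batch_size=32)
```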
3.4 Measuring, monitoring, and troubleshooting machine learning models. Considerations include:
Machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics)
Impact of dependencies of machine learning models
Common sources of error (e.g., assumptions about data)
Assumptions of common Machine Learning models
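Evaluation metrics are a reliable exam topic, so it is worth being able to compute precision and recall by hand for a small confusion matrix. A quick sketch with scikit-learn on made-up labels:

```python
# Precision, recall, and a confusion matrix for a toy classification result
# (pip install scikit-learn). Labels are made up for illustration.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
```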
Section 4. Ensuring solution quality
4.1 Designing for security and compliance. Considerations include:
Identity and access management (e.g., Cloud IAM)
Identity and Access Management
Data security (encryption, key management)
Encryption at rest in Google Cloud
Encryption in Transit in Google Cloud
Cloud Key Management Service deep dive
Ensuring privacy (e.g., Data Loss Prevention API)
Cloud Data Loss Prevention (DLP) API client library
Legal compliance (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), FedRAMP, General Data Protection Regulation (GDPR))
HIPAA compliance on Google Cloud Platform
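The DLP API is the go-to answer for privacy questions about detecting or redacting PII. A minimal inspection sketch (project ID and sample text are placeholders) that looks for email addresses in free text:

```python
# Inspecting text for PII with the Cloud DLP API
# (pip install google-cloud-dlp). Project ID and text are placeholders.
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"

response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "Contact me at jane.doe@example.com"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```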
4.2 Ensuring scalability and efficiency. Considerations include:
Building and running test suites
Using Cloud Build as a test runner
Pipeline monitoring (e.g., Cloud Monitoring)
Monitoring your Dataflow pipelines
Assessing, troubleshooting, and improving data representations and data processing infrastructure
Troubleshooting service infrastructure
Resizing and autoscaling resources
Autoscaling groups of instances
4.3 Ensuring reliability and fidelity. Considerations include:
Performing data preparation and quality control (e.g., Dataprep)
A peek into data preparation using Google Cloud Dataprep
Improve data quality for ML and analytics with Cloud Dataprep
Verification and monitoring
Validating data at scale for machine learning
Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
Disaster recovery scenarios for data
Breaking down Cloud SQL’s 3 fault tolerance mechanisms
Choosing between ACID, idempotent, eventually consistent requirements
Balancing Strong and Eventual Consistency with Datastore
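Data quality control usually boils down to a handful of systematic checks applied before a load. A toy sketch with pandas (column names and rules are my own assumptions) of the kind of null, duplicate, and range checks that Dataprep or a pipeline step would automate at scale:

```python
# Basic data quality checks before loading (pip install pandas).
# Column names and thresholds are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [19.99, -5.00, 42.50, None],
})

issues = {
    "null_amounts": int(df["amount"].isna().sum()),
    "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
}

print(issues)
if any(issues.values()):
    raise ValueError(f"data quality checks failed: {issues}")
```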
4.4 Ensuring flexibility and portability. Considerations include:
Mapping to current and future business requirements
Best practices for enterprise organizations
Designing for data and application portability (e.g., multicloud, data residency requirements)
Meet data residency requirements with Google Cloud
Hybrid and multi-cloud patterns and practices
Data staging, cataloging, and discovery
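Data Catalog is the usual answer for cataloging and discovery questions. A minimal search sketch (the project ID and query string are placeholders) that lists BigQuery tables visible to the caller:

```python
# Searching Data Catalog for BigQuery tables
# (pip install google-cloud-datacatalog). Project ID is a placeholder.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

results = client.search_catalog(
    request={
        "scope": {"include_project_ids": ["your-project-id"]},
        "query": "system=bigquery type=table",
    }
)

for result in results:
    print(result.relative_resource_name)
```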
This brings us to the end of the GCP Professional Data Engineer Study Guide.
What do you think? Let me know in the comments section if I have missed anything. Also, I’d love to hear how your preparation is going!
In case you are preparing for other GCP certification exams, check out the GCP study guide for those exams.
Follow Me to Receive Updates on GCP Exams
Want to be notified as soon as I post? Subscribe to the RSS feed or leave your email address in the subscribe section. Share the article on your social networks using the links below so it can benefit others.