Common Technical Questions
Describe a data pipeline project you have worked on, including technologies and challenges faced.

How do you handle skewed data in Spark?

What’s the difference between cache() and persist() in Spark—when would you not use them?

How do you tune Spark jobs (shuffle partitions, broadcast joins, etc.)?

Explain how you’d implement Change Data Capture (CDC) in a PySpark pipeline.

Discuss your experience with Microsoft Azure for data engineering projects (e.g., Azure Data Factory, Azure Databricks).

Write a recursive CTE to flatten a parent-child hierarchy in SQL.
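
One possible shape for an answer, as a minimal sketch rather than a reference solution: the anchor member selects the root rows, and the recursive member joins children to the rows found so far while tracking depth and a readable path. The table `node`, its columns, and the helper `flatten_hierarchy` are hypothetical names chosen for illustration. The SQL runs here through Python's built-in `sqlite3` so it can be executed as-is. In T-SQL the `RECURSIVE` keyword is omitted and string concatenation uses `+` instead of `||`.

```python
import sqlite3


def flatten_hierarchy(rows):
    """Flatten (id, parent_id, name) rows into (id, name, depth, path) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical adjacency-list table: each row points at its parent.
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
    )
    conn.executemany("INSERT INTO node VALUES (?, ?, ?)", rows)
    sql = """
    WITH RECURSIVE tree (id, name, depth, path) AS (
        -- Anchor member: roots have no parent.
        SELECT id, name, 0, name
        FROM node
        WHERE parent_id IS NULL
        UNION ALL
        -- Recursive member: attach each child to its already-visited parent.
        SELECT c.id, c.name, t.depth + 1, t.path || ' > ' || c.name
        FROM node AS c
        JOIN tree AS t ON c.parent_id = t.id
    )
    SELECT id, name, depth, path FROM tree ORDER BY path
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


sample = [
    (1, None, "CEO"),
    (2, 1, "VP Eng"),
    (3, 1, "VP Sales"),
    (4, 2, "Engineer"),
]
flat = flatten_hierarchy(sample)
for row in flat:
    print(row)
```

A follow-up worth raising in an interview: a cycle in the data makes the recursion run until the engine's limit, so real hierarchies usually need a depth cap (in SQL Server, the `MAXRECURSION` query hint) or a cycle check on the path.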

How do you manage performance bottlenecks with millions of rows—indexing, query plans, etc.?

How do you ensure data security and compliance, particularly on cloud platforms like Azure (discuss encryption, RBAC, Key Vault)?

How do you handle NULL values in columns using PySpark (e.g., replace with preceding non-NULL value)?

T-SQL logic to implement slowly changing dimension type 2 (SCD2) without using MERGE.
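
A common way to approach this without MERGE is two statements in one transaction: an UPDATE that expires the current row for every key whose tracked attribute changed, followed by an INSERT that adds a new current row for every staging key that no longer has one (brand-new keys plus the keys just expired). The sketch below shows that logic under stated assumptions: the tables `dim_customer` and `stg_customer`, the tracked column `city`, and the helper `apply_scd2` are hypothetical, and `city` is assumed NOT NULL so a plain `<>` comparison is enough. It runs through Python's built-in `sqlite3` so it can be executed as-is. The same UPDATE and INSERT pattern carries over to T-SQL with minor syntax changes.

```python
import sqlite3


def apply_scd2(conn, load_date):
    """Apply one SCD2 load from stg_customer into dim_customer (no MERGE)."""
    # Step 1: expire the current version where the tracked attribute changed.
    conn.execute(
        """
        UPDATE dim_customer
        SET valid_to = ?, is_current = 0
        WHERE is_current = 1
          AND EXISTS (
              SELECT 1 FROM stg_customer AS s
              WHERE s.customer_id = dim_customer.customer_id
                AND s.city <> dim_customer.city
          )
        """,
        (load_date,),
    )
    # Step 2: insert a new current version for every staging key that has
    # no current row (new keys and the keys expired in step 1).
    conn.execute(
        """
        INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
        SELECT s.customer_id, s.city, ?, NULL, 1
        FROM stg_customer AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer AS d
            WHERE d.customer_id = s.customer_id AND d.is_current = 1
        )
        """,
        (load_date,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER, city TEXT NOT NULL,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    );
    CREATE TABLE stg_customer (customer_id INTEGER, city TEXT NOT NULL);
    INSERT INTO dim_customer VALUES
        (1, 'Austin', '2024-01-01', NULL, 1),
        (2, 'Boston', '2024-01-01', NULL, 1);
    INSERT INTO stg_customer VALUES (1, 'Dallas'), (2, 'Boston'), (3, 'Chicago');
    """
)
apply_scd2(conn, "2024-06-01")
history = conn.execute(
    "SELECT customer_id, city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_id, valid_from"
).fetchall()
for row in history:
    print(row)
```

The order of the two steps matters: running the INSERT first would skip changed keys, because their old version would still be flagged as current.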

Design a schema for tracking insurance claims lifecycle and handling versioning.

Trade-offs between Data Flows and Stored Procedures in ADF (Azure Data Factory).

Approaches to historical data archiving in large-scale data warehouses.

How do you optimize costs when running data workloads on cloud platforms (Azure)?

Discuss partitioning strategies in Synapse and how they can fail.

Experience integrating healthcare-sensitive data with security and compliance controls.

Optum