
Databricks Data Engineer Associate

The Databricks Data Engineer Associate certification validates the ability to build and maintain data pipelines using the Databricks Lakehouse Platform. It focuses on using Spark, Delta Lake, and SQL to ingest, transform, and manage data for downstream analytics. Earning the DTB_DEA credential demonstrates a professional's proficiency in implementing reliable, scalable data engineering solutions on Databricks.



---------- Question 1
In Unity Catalog, a data engineer needs to manage access to data. What is the difference between a Managed Table and an External Table regarding data deletion when the table is dropped?
  1. For both tables, the data is permanently deleted from the cloud storage when the DROP TABLE command is executed.
  2. Dropping a Managed Table deletes both the metadata and the underlying data files; dropping an External Table only deletes the metadata.
  3. Dropping an External Table deletes the underlying data files; dropping a Managed Table only removes the entry from the catalog.
  4. Unity Catalog does not support dropping tables; users must manually delete files from the storage bucket first.
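The behavior in option 2 can be sketched with SQL issued through PySpark. This is a non-runnable sketch that assumes a Unity Catalog-enabled workspace with an existing `spark` session; the catalog, schema, table names, and storage path are all hypothetical.

```python
# Sketch only: assumes a Unity Catalog-enabled workspace with an existing
# `spark` session; catalog, schema, and the storage path are hypothetical.

# Managed table: Unity Catalog owns both the metadata and the data files.
spark.sql("CREATE TABLE main.demo.sales_managed (id INT, amount DOUBLE)")

# External table: the catalog entry points at files managed outside UC.
spark.sql("""
    CREATE TABLE main.demo.sales_external (id INT, amount DOUBLE)
    LOCATION 'abfss://container@account.dfs.core.windows.net/sales/'
""")

# DROP TABLE removes the metadata in both cases, but only the managed
# table's underlying data files are deleted (option 2).
spark.sql("DROP TABLE main.demo.sales_managed")   # metadata + data files
spark.sql("DROP TABLE main.demo.sales_external")  # metadata only; files remain
```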

---------- Question 2
You need to perform a complex aggregation using PySpark to calculate the average transaction value per customer over a rolling 7-day window. The dataset is large and requires efficient windowing. Which PySpark module and function should be used to define this rolling window and calculate the metric?
  1. pyspark.sql.Window with the rowsBetween or rangeBetween functions
  2. pyspark.sql.functions.avg combined with a simple group_by clause
  3. pyspark.ml.feature.Bucketizer for grouping time intervals
  4. pyspark.sql.DataFrame.cube to generate all possible combinations

---------- Question 3
A data engineer is configuring Auto Loader to ingest a massive volume of files from a cloud landing zone into a Delta table. The source system generates millions of small JSON files daily. Which Auto Loader configuration is most appropriate to efficiently track and ingest these files without incurring high cloud costs or performance degradation caused by listing millions of objects in the directory?
  1. Enable the cloudFiles.useNotifications option to use file discovery via cloud events
  2. Set the spark.sql.shuffle.partitions to exactly 2000 for the stream
  3. Use the standard Directory Listing mode with an hourly trigger interval
  4. Manually partition the landing zone by hour to speed up the listing process
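Option 1 corresponds to a `readStream` configuration like the sketch below. It cannot run outside a Databricks workspace with Auto Loader; `spark` is the workspace-provided session, and the source path, checkpoint location, and target table are hypothetical.

```python
# Sketch only: requires a Databricks runtime with Auto Loader; `spark`
# is the workspace session, and all paths/table names are hypothetical.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # File-notification mode: new files are discovered through cloud
    # events instead of repeatedly listing millions of objects.
    .option("cloudFiles.useNotifications", "true")
    .load("s3://landing-zone/events/")
)

(stream.writeStream
       .option("checkpointLocation", "s3://checkpoints/events/")
       .trigger(availableNow=True)
       .toTable("bronze.events"))
```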

---------- Question 4
When configuring a Databricks Job to run a production pipeline, an engineer wants to minimize the management overhead of the underlying infrastructure and ensure that the cluster scales automatically based on the workload. Which compute option should they select for the job tasks?
  1. A fixed-size All-Purpose cluster.
  2. A Serverless compute for Jobs.
  3. A Single Node cluster with manual scaling enabled.
  4. A shared High Concurrency cluster.

---------- Question 5
A data engineering team is implementing a complex pipeline and wants to use Lakeflow Spark Declarative Pipelines (Delta Live Tables). What is a primary advantage of using this declarative approach compared to writing traditional Spark code for data pipelines, especially regarding data quality and dependency management?
  1. It requires manual management of checkpoint locations for all streams.
  2. It automatically handles task orchestration and data lineage while allowing for integrated data quality checks (expectations).
  3. It removes the need for Delta Lake and uses standard Parquet files for all stages.
  4. It only supports batch processing and does not allow for streaming data ingestion.
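A minimal sketch of what option 2 looks like in practice, assuming the code runs inside a Lakeflow/Delta Live Tables pipeline (the `dlt` module and the `spark` session are provided by that runtime; the paths, table names, and the expectation rule are hypothetical):

```python
# Sketch only: the `dlt` module exists only inside a DLT pipeline run.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from the landing zone")
def orders_bronze():
    return spark.read.format("json").load("/mnt/landing/orders/")

@dlt.table(comment="Validated orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # data quality expectation
def orders_silver():
    # The framework infers the dependency on orders_bronze from dlt.read(),
    # so orchestration order and lineage come for free.
    return dlt.read("orders_bronze").withColumn(
        "order_date", F.to_date("order_ts")
    )
```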

---------- Question 6
You are writing a PySpark transformation to extract nested fields from a complex JSON structure. The column 'customer_info' contains a nested field 'address' which further contains 'zip_code'. What is the correct PySpark DataFrame syntax to create a new column named 'zip' by extracting this nested value?
  1. df.withColumn('zip', col('customer_info.address.zip_code'))
  2. df.select('customer_info').getItem('address').getItem('zip_code')
  3. df.withColumn('zip', df['customer_info']['address']['zip_code'])
  4. Both options 1 and 3 are correct
  5. None of the above

---------- Question 7
In a Medallion Architecture, a data engineer is responsible for designing the Silver layer. A downstream consumer complains that the data in the Silver layer contains duplicate records and inconsistent date formats. Which statement best describes the purpose of the Silver layer and how the engineer should address these quality issues?
  1. The Silver layer is for raw data storage; duplicates should be handled in the Gold layer instead.
  2. The Silver layer should provide cleaned, filtered, and augmented data; the engineer should apply deduplication and standardization logic.
  3. The Silver layer is a temporary staging area that should be deleted after the Gold layer is processed.
  4. The Silver layer should only contain aggregated data for BI reports; raw records must remain in Bronze.

---------- Question 8
An engineer is analyzing a slow-running Spark query in the Spark UI. They notice a significant discrepancy between the Max and Min task execution times in a single stage, with one task taking much longer than the others. Which performance issue does this pattern most likely indicate?
  1. Insufficient cluster memory across all executors
  2. Data Skew in the join or group-by keys
  3. The Spark Driver being undersized for the workload
  4. Too many small files in the source directory

---------- Question 9
A production Databricks Workflow consists of four sequential tasks. The second task fails due to a transient cloud connectivity issue. After resolving the underlying connectivity problem, the data engineer wants to complete the job run without re-executing the first task, which was successful. Which feature of Databricks Workflows should be used?
  1. Manually delete the first task from the job definition and run the entire job again.
  2. Use the Repair and Rerun feature to resume the workflow from the failed task.
  3. Clone the entire job and set the first task to always succeed regardless of its logic.
  4. Trigger a new Run Now for the job and wait for it to skip the successful tasks automatically.
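Repair and Rerun is also exposed through the Jobs REST API. The sketch below builds (but does not send) such a request, assuming the Jobs API 2.1 `runs/repair` endpoint; the host, token, run id, and task key are placeholders.

```python
import json
from urllib import request

def build_repair_request(host, token, run_id, failed_tasks):
    """Build a POST to the Jobs API 2.1 repair endpoint. Only the tasks
    listed in rerun_tasks are re-executed; successful tasks are kept."""
    payload = {"run_id": run_id, "rerun_tasks": failed_tasks}
    return request.Request(
        f"{host}/api/2.1/jobs/runs/repair",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical values: workspace URL, token, run id, and failed task key.
req = build_repair_request("https://example.cloud.databricks.com",
                           "dapiXXXX", 123456, ["ingest_task"])
```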

---------- Question 10
Under the Unity Catalog governance model, what is the critical operational difference between a Managed Table and an External Table when a data engineer executes a DROP TABLE command on a specific dataset?
  1. A dropped managed table only removes the metadata from the catalog
  2. A dropped external table removes both the metadata and the data files
  3. A dropped managed table removes both the metadata and the physical data
  4. There is no difference; both retain physical files for disaster recovery


Are they useful?
Click here to get 270 more questions to pass this certification on the first try! An explanation for each answer is included!

Follow the LinkedIn channel below to stay updated on 89+ exams!
