The Databricks Data Engineer Associate certification validates the ability to build and maintain data pipelines using the Databricks Lakehouse Platform. It focuses on using Spark, Delta Lake, and SQL to ingest, transform, and manage data for downstream analytics. Holding the DTB_DEA credential indicates proficiency in implementing reliable, scalable data engineering solutions on Databricks.
---------- Question 1
In Unity Catalog, a data engineer needs to manage access to data. What is the difference between a Managed Table and an External Table regarding data deletion when the table is dropped?
- For both tables, the data is permanently deleted from the cloud storage when the DROP TABLE command is executed.
- Dropping a Managed Table deletes both the metadata and the underlying data files; dropping an External Table only deletes the metadata.
- Dropping an External Table deletes the underlying data files; dropping a Managed Table only removes the entry from the catalog.
- Unity Catalog does not support dropping tables; users must manually delete files from the storage bucket first.
---------- Question 2
You need to perform a complex aggregation using PySpark to calculate the average transaction value per customer over a rolling 7-day window. The dataset is large and requires efficient windowing. Which PySpark module and function should be used to define this rolling window and calculate the metric?
- pyspark.sql.Window with the rowsBetween or rangeBetween functions
- pyspark.sql.functions.avg combined with a simple groupBy clause
- pyspark.ml.feature.Bucketizer for grouping time intervals
- pyspark.sql.DataFrame.cube to generate all possible combinations
---------- Question 3
A data engineer is configuring Auto Loader to ingest a massive volume of files from a cloud landing zone into a Delta table. The source system generates millions of small JSON files daily. Which Auto Loader configuration is most appropriate to efficiently track and ingest these files without incurring high cloud costs or performance degradation caused by listing millions of objects in the directory?
- Enable the cloudFiles.useNotifications option to use file discovery via cloud events
- Set the spark.sql.shuffle.partitions to exactly 2000 for the stream
- Use the standard Directory Listing mode with an hourly trigger interval
- Manually partition the landing zone by hour to speed up the listing process
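A sketch of the first option, enabling file-notification mode in an Auto Loader stream (the paths, table name, and bucket are hypothetical; this only runs on Databricks, and notification mode additionally requires permissions to create the cloud queue/event subscription):

```python
# Auto Loader in file-notification mode: new files are discovered via cloud
# events instead of repeatedly listing millions of objects in the directory.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")             # event-based discovery
    .option("cloudFiles.schemaLocation", "/checkpoints/orders/_schema")
    .load("s3://landing-zone/orders/")                          # hypothetical bucket
)

(
    df.writeStream
    .option("checkpointLocation", "/checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")
)
```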
---------- Question 4
When configuring a Databricks Job to run a production pipeline, an engineer wants to minimize the management overhead of the underlying infrastructure and ensure that the cluster scales automatically based on the workload. Which compute option should they select for the job tasks?
- A fixed-size All-Purpose cluster.
- Serverless compute for Jobs.
- A Single Node cluster with manual scaling enabled.
- A shared High Concurrency cluster.
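In workspaces where serverless compute for jobs is enabled, a task runs on serverless simply by omitting any cluster specification. A hedged sketch of a Jobs API payload under that assumption (the job name, task key, and notebook path are invented):

```json
{
  "name": "nightly-pipeline",
  "tasks": [
    {
      "task_key": "transform",
      "notebook_task": { "notebook_path": "/Pipelines/transform" }
    }
  ]
}
```

With no `new_cluster`, `existing_cluster_id`, or `job_cluster_key` on the task, Databricks provisions and autoscales the compute itself, which is exactly the reduced management overhead the question describes.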
---------- Question 5
A data engineering team is implementing a complex pipeline and wants to use Lakeflow Spark Declarative Pipelines (Delta Live Tables). What is a primary advantage of using this declarative approach compared to writing traditional Spark code for data pipelines, especially regarding data quality and dependency management?
- It requires manual management of checkpoint locations for all streams.
- It automatically handles task orchestration and data lineage while allowing for integrated data quality checks (expectations).
- It removes the need for Delta Lake and uses standard Parquet files for all stages.
- It only supports batch processing and does not allow for streaming data ingestion.
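The declarative advantage (second option) can be sketched as a pipeline definition; this only runs inside a Lakeflow/DLT pipeline on Databricks, and the table names are invented for illustration:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned orders for the Silver layer")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # integrated data-quality expectation
def silver_orders():
    # The dependency on bronze_orders is declared, not orchestrated by hand:
    # DLT builds the execution graph, lineage, and checkpointing automatically.
    return dlt.read_stream("bronze_orders").withColumn(
        "ingested_at", F.current_timestamp()
    )
```

In hand-written Spark code, the equivalent behavior would require explicit orchestration, checkpoint management, and custom validation logic.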
---------- Question 6
You are writing a PySpark transformation to extract nested fields from a complex JSON structure. The column 'customer_info' contains a nested field 'address' which further contains 'zip_code'. What is the correct PySpark DataFrame syntax to create a new column named 'zip' by extracting this nested value?
- df.withColumn('zip', col('customer_info.address.zip_code'))
- df.select('customer_info').getItem('address').getItem('zip_code')
- df.withColumn('zip', df['customer_info']['address']['zip_code'])
- Both A and C are correct
- None of the above
---------- Question 7
In a Medallion Architecture, a data engineer is responsible for designing the Silver layer. A downstream consumer complains that the data in the Silver layer contains duplicate records and inconsistent date formats. Which statement best describes the purpose of the Silver layer and how the engineer should address these quality issues?
- The Silver layer is for raw data storage; duplicates should be handled in the Gold layer instead.
- The Silver layer should provide cleaned, filtered, and augmented data; the engineer should apply deduplication and standardization logic.
- The Silver layer is a temporary staging area that should be deleted after the Gold layer is processed.
- The Silver layer should only contain aggregated data for BI reports; raw records must remain in Bronze.
---------- Question 8
An engineer is analyzing a slow-running Spark query in the Spark UI. They notice a significant discrepancy between the Max and Min task execution times in a single stage, with one task taking much longer than the others. Which performance issue does this pattern most likely indicate?
- Insufficient cluster memory across all executors
- Data Skew in the join or group-by keys
- The Spark Driver being undersized for the workload
- Too many small files in the source directory
---------- Question 9
A production Databricks Workflow consists of four sequential tasks. The second task fails due to a transient cloud connectivity issue. After resolving the underlying connectivity problem, the data engineer wants to complete the job run without re-executing the first task, which was successful. Which feature of Databricks Workflows should be used?
- Manually delete the first task from the job definition and run the entire job again.
- Use the Repair and Rerun feature to resume the workflow from the failed task.
- Clone the entire job and set the first task to always succeed regardless of its logic.
- Trigger a new Run Now for the job and wait for it to skip the successful tasks automatically.
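Repair and Rerun is also exposed through the Jobs API. A hedged sketch of the request, assuming the Jobs API 2.1 `runs/repair` endpoint; the run ID and task keys below are placeholders:

```json
POST /api/2.1/jobs/runs/repair
{
  "run_id": 455644833,
  "rerun_tasks": ["task_2", "task_3", "task_4"]
}
```

Only the listed tasks are re-executed; the successful first task keeps its original result within the same job run.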
---------- Question 10
Under the Unity Catalog governance model, what is the critical operational difference between a Managed Table and an External Table when a data engineer executes a DROP TABLE command on a specific dataset?
- A dropped managed table only removes the metadata from the catalog
- A dropped external table removes both the metadata and the data files
- A dropped managed table removes both the metadata and the physical data
- There is no difference; both retain physical files for disaster recovery