
Databricks Certified Machine Learning Associate

The Databricks Certified Machine Learning Associate validates the skills required to implement machine learning workflows using Databricks and MLflow. It covers data preparation, model training, and experiment tracking to ensure reproducible ML outcomes. Professionals holding the DTB_MLA credential are recognized for their ability to build and manage basic machine learning pipelines in a collaborative environment.



---------- Question 1
A data scientist is analyzing a dataset containing financial transaction amounts for a fraud detection model. The distribution of transaction amounts is highly skewed, with a long tail of very large values, indicating the presence of significant outliers. A direct application of linear models or models sensitive to feature scales might be negatively impacted by these characteristics. To mitigate the influence of these extreme values and normalize the distribution for better model performance, what are the most appropriate data transformation techniques to consider?
  1. Removing all transactions above the 95th percentile and then applying min-max scaling to the remaining data.
  2. Applying a log transformation to the transaction amounts and then removing outliers based on the interquartile range (IQR).
  3. Using a standard scaler for the transaction amounts and then imputing any missing values with the mean.
  4. Applying a log transformation to the transaction amounts and then assessing the transformed data for further outlier treatment if necessary.
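The transform-then-reassess workflow that several options describe can be sketched with the standard library alone. The transaction amounts below and the 1.5×IQR fence are illustrative assumptions, not values from the question:

```python
import math
import statistics

# Hypothetical right-skewed transaction amounts with one extreme value.
amounts = [20, 25, 30, 40, 55, 70, 90, 120, 5000]

# Step 1: log-transform to compress the long right tail.
logged = [math.log(a) for a in amounts]

# Step 2: assess the *transformed* data with the 1.5 x IQR rule.
q1, _, q3 = statistics.quantiles(logged, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Map flagged log-values back to the original amounts.
outliers = [a for a, v in zip(amounts, logged) if v < lower or v > upper]
```

After the transformation, most amounts cluster tightly on the log scale, so only the genuinely extreme transaction remains outside the fences.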

---------- Question 2
A medical research team is developing a machine learning model to detect a rare disease using patient data. The dataset is severely imbalanced, with only 1% of patients actually having the disease (positive class). The primary objective of this model is to maximize the identification of true positive cases (patients with the disease) to ensure early intervention, while keeping the number of false positives (healthy patients incorrectly diagnosed) at a manageable level to avoid unnecessary anxiety and costly follow-up procedures. Which data balancing technique and evaluation metric would be most appropriate for this specific scenario?
  1. Use undersampling of the majority class, and prioritize accuracy as the main evaluation metric.
  2. Use oversampling techniques like SMOTE on the minority class, and prioritize the F1-score as the main evaluation metric.
  3. Adjust class weights in the model training algorithm, and prioritize recall (sensitivity) while monitoring precision.
  4. Remove all samples from the majority class to balance the dataset, and prioritize Log Loss as the main evaluation metric.
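Why accuracy is the wrong headline metric at 1% prevalence is easy to demonstrate from first principles. The labels below are synthetic, purely to illustrate the arithmetic:

```python
# Confusion-matrix metrics from scratch.
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1% positive class: a model that never predicts the disease still looks
# excellent on accuracy, but recall exposes that it finds no sick patients.
y_true = [1] + [0] * 99
always_healthy = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_healthy)) / len(y_true)
precision, recall = precision_recall(y_true, always_healthy)
```

This is the trap behind the accuracy-focused options: 99% accuracy, 0% recall. In scikit-learn, class weighting is typically applied via the `class_weight` parameter (e.g. `class_weight='balanced'`) at training time.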

---------- Question 3
A data science team has successfully deployed a champion model for customer churn prediction in a production environment. Subsequently, a new challenger model has been developed which demonstrates a marginal improvement in AUC on offline validation data. Before completely replacing the current champion, the team intends to conduct A/B testing in production to validate the challenger's real-world performance. Both the champion and challenger models have been meticulously registered in the Unity Catalog Model Registry. To facilitate agile switching, version control, and streamlined management during the A/B test and the eventual promotion of the challenger, which MLflow feature within Unity Catalog should the team leverage?
  1. Manually update the model serving endpoint configuration to point to the new challenger model artifact path.
  2. Archive the current champion model version and promote the challenger model to production stage.
  3. Utilize MLflow model aliases to assign a champion alias to the current production model and a challenger alias to the new model, enabling easy programmatic switching.
  4. Create a new model entry in the Unity Catalog Model Registry for the challenger model and manage its serving independently.
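Aliases behave like mutable named pointers to model versions. In MLflow with Unity Catalog the real call is `MlflowClient().set_registered_model_alias("catalog.schema.churn_model", "champion", 4)` (the model name here is hypothetical). The pointer mechanics can be sketched in plain Python:

```python
# Toy alias table standing in for Unity Catalog model aliases.
aliases = {"champion": 3, "challenger": 4}

def resolve(alias):
    """Return the model version an alias currently points to."""
    return aliases[alias]

def promote(alias_map, winner="challenger"):
    """Repoint 'champion' at the winning version; serving code never changes."""
    alias_map["champion"] = alias_map[winner]

before = resolve("champion")   # version served during the A/B test
promote(aliases)
after = resolve("champion")    # the challenger now serves as champion
```

Because consumers load `models:/...@champion` rather than a hard-coded version number, promotion is a single alias reassignment rather than a redeployment.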

---------- Question 4
An e-commerce platform wants to deploy a new product recommendation model. They need to provide immediate, personalized recommendations to users browsing the website, which requires very low latency and high availability. Furthermore, they want to test the performance of a new challenger model against the existing champion model on a small, controlled segment of live user traffic before fully rolling it out to all users. This approach enables data-driven decision making for model updates and ensures minimal disruption. Which model deployment strategy and Databricks serving capability should be employed to achieve both real-time personalized recommendations and A/B testing of model versions effectively?
  1. Deploy both models for batch inference overnight and update recommendations periodically for all users.
  2. Deploy both models as streaming inference jobs using Delta Live Tables and route user requests through this continuous pipeline.
  3. Deploy the champion model to a Databricks Model Serving endpoint for real-time inference, and configure a separate endpoint for the challenger model, directing a small percentage of live traffic to it for A/B testing.
  4. Integrate the challenger model directly into the application backend and manually switch between models based on user identification numbers.
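Databricks Model Serving endpoints can split live traffic across served model versions by percentage; the routing idea itself is a deterministic hash split, sketched below with illustrative user IDs and a 10% challenger share:

```python
import hashlib

# Deterministic hash-based split: the same user always hits the same model,
# so recommendations stay consistent for the life of the experiment.
def route(user_id: str, challenger_fraction: float = 0.10) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_fraction * 100 else "champion"

# Over many users the observed share converges to the configured fraction.
share = sum(route(f"user-{i}") == "challenger" for i in range(10_000)) / 10_000
```

Hash-based bucketing (rather than random assignment per request) is what makes per-user A/B results attributable to one model variant.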

---------- Question 5
A data science team has developed a new high-performing churn prediction model and wants to integrate it into a production system that requires robust versioning, staged deployment, and clear lineage tracking across different environments. They need to promote the model from a development stage to a staging environment for testing and then eventually to production. Furthermore, they anticipate needing to run A/B tests between current and challenger models. Which Databricks MLflow capability, when combined with Unity Catalog, provides the most comprehensive and governed approach for managing this model lifecycle, including promotion and A/B testing with aliases?
  1. Using the workspace-level MLflow Model Registry to register model versions and manually move them between workspace-specific stages.
  2. Employing the MLflow Client API to log models as artifacts in an MLflow Run and then manually deploying them to production endpoints.
  3. Leveraging the Unity Catalog Model Registry to register model versions, manage aliases, and transition models across environments.
  4. Packaging the model as a custom Docker image and deploying it directly to a Kubernetes cluster for versioning and staging.

---------- Question 6
A data science team is implementing a robust MLOps strategy for their predictive maintenance models. They require a system where every model training run is fully reproducible and traceable, capturing all parameters, metrics, and the model artifact itself. This information must be easily queryable for auditing, comparison, and deployment purposes. Which MLflow Client API function calls are essential to manually log the learning rate, the achieved RMSE metric, and the final trained scikit-learn model artifact within a single MLflow Run, adhering to these MLOps best practices?
  1. mlflow.start_run(), mlflow.log_metric(key='rmse', value=model_rmse), mlflow.log_param(key='learning_rate', value=0.01), mlflow.sklearn.save_model(sk_model=trained_model, path='model')
  2. mlflow.active_run(), mlflow.log_params({'learning_rate': 0.01}), mlflow.log_metrics({'rmse': model_rmse}), mlflow.log_artifact(local_path='trained_model.pkl')
  3. mlflow.start_run(), mlflow.log_param('learning_rate', 0.01), mlflow.log_metric('rmse', model_rmse), mlflow.sklearn.log_model(sk_model=trained_model, artifact_path='model')
  4. mlflow.run_id(), mlflow.set_tag('learning_rate', 0.01), mlflow.set_tag('rmse', model_rmse), mlflow.upload_file(local_path='trained_model.pkl', artifact_path='model')

---------- Question 7
A manufacturing company uses a vast network of IoT sensors to monitor the health of critical machinery. They want to build a system that continuously processes the incoming sensor data stream, predicts potential equipment failures in near real-time, and makes these predictions immediately available for operational dashboards and alerting systems. The solution needs to be robust, scalable, and capable of handling data quality issues and schema evolution automatically. How can Databricks, and specifically Delta Live Tables (DLT), facilitate this streaming inference pipeline?
  1. Deploy a real-time model serving endpoint that directly consumes the raw sensor data stream and performs inference, then writes results to a batch table.
  2. Use Delta Live Tables (DLT) to define a streaming pipeline that ingests raw sensor data, performs necessary feature engineering and data quality checks, applies a pre-trained machine learning model for inference, and continuously updates a Delta table with predictions for consumption by dashboards.
  3. Perform daily batch inference jobs on the accumulated sensor data in a Delta table, and then manually refresh the dashboards with the updated predictions.
  4. Utilize a traditional Spark Streaming job with custom Python scripts for data processing and model inference, storing results in a non-Delta table.
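A pipeline of this shape is declared rather than orchestrated by hand. The sketch below runs only inside a Databricks DLT pipeline (where `dlt` and `spark` are provided by the runtime); the paths, column names, and model name are all illustrative:

```python
import dlt
import mlflow
from pyspark.sql import functions as F

# Load a pre-trained model as a Spark UDF (model URI is hypothetical).
predict = mlflow.pyfunc.spark_udf(
    spark, "models:/catalog.schema.failure_model@champion"
)

@dlt.table(comment="Raw sensor readings ingested as a stream via Auto Loader")
@dlt.expect_or_drop("valid_reading", "temperature IS NOT NULL")
def sensor_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/iot/raw")  # illustrative landing path
    )

@dlt.table(comment="Near real-time failure predictions for dashboards")
def failure_predictions():
    df = dlt.read_stream("sensor_bronze")
    return df.withColumn("failure_prob", predict(F.struct(*df.columns)))
```

Expectations (`expect_or_drop`) handle the data quality requirement declaratively, and Auto Loader's schema inference addresses schema evolution, which is why DLT fits the scenario better than a hand-rolled streaming job.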

---------- Question 8
A bank is building a machine learning model to detect fraudulent transactions. The dataset is highly imbalanced, with a tiny fraction of transactions (less than 1%) being fraudulent. The business goal is critical: to minimize false negatives (i.e., failing to detect actual fraud) even if it means a slight increase in false positives, because missing fraud is far more costly than false alarms. Which approach would be most effective for addressing the data imbalance and selecting an appropriate evaluation metric for this specific problem?
  1. Use SMOTE to oversample the minority class and evaluate the model primarily using accuracy.
  2. Undersample the majority class and focus on the R-squared metric for model evaluation.
  3. Employ techniques like class weighting during model training or oversampling/undersampling, and prioritize evaluation metrics such as Recall, F1-score, or Area Under the Receiver Operating Characteristic Curve (ROC/AUC).
  4. Ignore the imbalance, train a standard algorithm like Linear Regression, and evaluate using Mean Absolute Error (MAE).
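Among the metrics in the options, ROC/AUC is threshold-free, which makes it robust to imbalance. It equals the probability that a randomly chosen fraudulent transaction scores higher than a randomly chosen legitimate one, computable directly from that definition (scores below are made up for illustration):

```python
# AUC as a rank statistic: P(random positive scores above random negative),
# with ties counted as half a win.
def roc_auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores for 3 fraudulent and 4 legitimate transactions:
# 11 of the 12 positive/negative pairs are ranked correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

In production one would use a library implementation (e.g. scikit-learn's `roc_auc_score`), but the rank-statistic view explains why AUC, unlike accuracy, is not dominated by the 99% majority class.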

---------- Question 9
A global e-commerce company manages various machine learning projects across different geographical regions and Databricks workspaces. They are developing a unified fraud detection system that requires consistent features like customer purchase history and device fingerprints, accessible by multiple data science teams. They plan to use Databricks AutoML for rapid model development. Which strategy for creating and managing feature store tables would best support this organizational structure and why?
  1. Create separate feature store tables within each regional workspace, as this provides workspace-specific isolation and easier management for individual teams.
  2. Utilize Unity Catalog to create and manage feature store tables at the account level, enabling centralized feature discovery, sharing, and governance across all workspaces.
  3. Store features directly within Delta Lake tables in each project's workspace, relying on manual data synchronization between workspaces for feature consistency.
  4. Use an external feature store solution entirely separate from Databricks and integrate it via custom APIs, sacrificing the native integration benefits with AutoML and Unity Catalog.
  5. Create feature store tables as local CSV files within individual notebooks to allow for quick iteration, then manually merge them for model training.

---------- Question 10
A data scientist is analyzing a large dataset of customer transaction amounts for an e-commerce platform. The dataset exhibits a highly right-skewed distribution with a long tail, indicating the presence of a few extremely high transaction values that could significantly impact downstream model training for customer segmentation. The goal is to remove these extreme outliers to prevent them from unduly influencing the model, without discarding too much valid data from the main distribution. Which method for outlier detection and removal would generally be more appropriate in this specific scenario, and why?
  1. Using the standard deviation method to remove data points falling beyond 3 standard deviations from the mean, as it is simple and widely understood.
  2. Applying the Interquartile Range (IQR) method to remove data points outside 1.5 times the IQR from the first and third quartiles, as it is more robust to skewed distributions.
  3. Directly applying a log transformation to the transaction amounts without removing outliers, as this transformation naturally reduces the impact of large values.
  4. Removing the top 1% of transaction values directly, assuming these are always outliers and will not impact the data distribution in any other way.


