Databricks Certified Data Engineer Professional (Data-Engineer-Professional)

Get full access to the updated question bank and confidently prepare for your exam.

Vendor

Databricks

Certification

Data Engineer

Content

333 Qs

Status

Verified

Updated

1 day ago

Best Value Bundle

Premium Bundle

Complete Success Suite

$49 (regular price $83)

Save $34 Instantly

  • Full PDF + Interactive Engine: everything you need to pass
  • All Advanced Question Types: drag & drop, hotspots, case studies
  • Priority 24/7 Expert Support: direct line to certification leads
  • 90 Days Free Priority Updates: stay current as exams change

Success Metric

98.4% Pass Rate

Verified by 15k+ Students
Popular

Standard Simulation

Practice Engine

$44

One-Time Payment

  • Web-Based (Zero Install)
  • Real Testing Environment: virtual & practice modes
  • Interactive Engine: drag & drop, hotspots
  • 60 Days Free Updates

Compatible with All Devices

Chrome

Basic Tier

PDF Study Guide

$39

Digital Access

  • Exam Questions (PDF)
  • Mobile Friendly
  • 60 Days Updates

Exam Overview

The Databricks Certified Data Engineer Professional certification validates an individual's advanced expertise in designing, building, and managing robust, scalable data pipelines on the Databricks Lakehouse Platform. This credential signifies a deep understanding of production-grade data engineering practices, including complex data ingestion, transformation, and optimization strategies using Delta Lake, Apache Spark, and Databricks tools like Delta Live Tables. Earning this certification demonstrates the ability to implement secure, governed, and high-performance data solutions crucial for enterprise-level analytics and machine learning workloads. It's a testament to your capability in driving data initiatives, enhancing data quality, and contributing significantly to an organization's data-driven success, positioning you as a leader in the modern data landscape.

Questions

60

Passing Score

700/1000

Duration

120 Minutes

Difficulty

Expert

Level

Professional

Skills Measured

Designing and Building Robust, Scalable Data Pipelines
Implementing Advanced Data Transformations with Delta Lake and Spark
Optimizing Performance and Troubleshooting Complex Data Workloads
Ensuring Data Governance, Security, and Compliance within the Lakehouse
Architecting and Deploying Scalable Lakehouse Solutions with Best Practices

Career Path

Target Roles

Senior Data Engineer, Cloud Data Architect, ETL Developer (Advanced)

Common Questions

Is the material up to date?

Yes. We update our question bank weekly to match the latest Databricks standards. You get free updates for 90 days.

What format do I get?

You get instant access to both the **PDF** (for reading) and our **Premium Test Engine** (for exam simulation).

Is there a guarantee?

Absolutely. If you fail the Data-Engineer-Professional exam using our materials, we offer a full money-back guarantee.

When do I get the download?

Instantly. The download link is available in your dashboard immediately after payment is confirmed.

Free Study Guide Samples

Previewing updated Data-Engineer-Professional bank (67 Questions).

QUESTION 1

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

A
Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.
B
Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake’s upsert functionality.
C
Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
D
All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

Correct Option: D

The correct answer is D. When migrating a workload from a relational database system to the Databricks Lakehouse, several fundamental differences must be considered.

  • Delta Lake transactions are ACID compliant against a single table: A core feature of Delta Lake is its transactional capabilities, but these guarantees (Atomicity, Consistency, Isolation, Durability) apply to operations within a single Delta table. Unlike traditional relational databases that might support distributed transactions across multiple tables, Databricks requires different strategies for ensuring consistency across multiple tables (e.g., orchestrating writes, idempotent operations, or using a medallion architecture). This directly impacts how 'multi-table inserts' from the source system are handled.
  • Databricks does not enforce foreign key constraints: While Unity Catalog allows defining PRIMARY KEY and FOREIGN KEY constraints for informational purposes (e.g., for documentation, optimization hints for query engines, or integration with BI tools), these constraints are currently not enforced during data writes. This means that the data validation logic based on foreign keys in the source system must be explicitly reimplemented in the Lakehouse (e.g., using data quality checks, Delta Live Tables expectations, or explicit validation queries) to prevent referential integrity violations.

Therefore, understanding that ACID guarantees are per-table and that foreign key constraints are not enforced for validation are critical considerations for the migration.
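Because foreign keys are informational only, the referential check has to be run explicitly. The following is a minimal plain-Python sketch of such a check, assuming small in-memory lists of rows; the function name `find_orphans` and the `orders`/`customers` sample data are hypothetical and only illustrate the idea (in a Lakehouse pipeline the same logic would typically be an anti-join or a data quality expectation).

```python
def find_orphans(fact_rows, dim_rows, fk, pk):
    """Return fact rows whose foreign key has no matching dimension key."""
    known_keys = {row[pk] for row in dim_rows}
    return [row for row in fact_rows if row[fk] not in known_keys]


# Hypothetical sample data: order 11 points at a customer that does not exist.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

orphans = find_orphans(orders, customers, fk="customer_id", pk="customer_id")
print(orphans)
```

A pipeline could fail, quarantine, or log the rows in `orphans` before committing the write, which is the validation step the source database performed implicitly.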



Reference: https://docs.delta.io/latest/delta-transactions.html, https://docs.databricks.com/en/unity-catalog/manage-tables.html#constraints-on-delta-lake-tables
QUESTION 2

A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.

The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.

Which piece of information is critical to this decision?

A
Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update.
B
Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning.
C
Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place.
D
Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution.

Correct Option: D

✅ Option D (Correct)

Delta Lake's time travel feature, while powerful for accessing recent history and providing data recovery, is not optimally designed for indefinite, long-term versioning of all data changes, especially for fine-grained auditing purposes on frequently changing attributes like street addresses. Retaining an extremely long history of all data files to support time travel across many years can significantly increase storage costs (as old data files are retained for extended periods) and query latency (as reconstructing historical states from a very large transaction log and numerous small data files becomes more complex and resource-intensive). For explicit, long-term auditing requirements of historical attribute values, a Type 2 Slowly Changing Dimension (SCD) approach is often more performant and scalable, as it explicitly models the history within the table itself (e.g., using start/end dates and flags), making historical queries more direct and efficient.

❌ Why the other choices are incorrect:

  • Option A is incorrect: Delta Lake is ACID compliant. Transactions either complete fully or are rolled back, preventing data corruption from partially completed updates, regardless of whether it's a Type 1 or Type 2 operation.
  • Option B is incorrect: Shallow clones create a copy of a table's metadata pointing to the same data files. While useful for creating instant snapshots, they do not inherently accelerate historic queries for long-term versioning. Time travel itself is the mechanism for historic queries, and its performance for long-term history is the concern.
  • Option C is incorrect: This statement is factually inaccurate. Delta Lake time travel is specifically designed to query previous versions of tables, even when changes logically 'modify' data in place (in Delta Lake, updates typically involve writing new files and marking old ones for deletion after a retention period). The transaction log tracks all changes, allowing any past version within the retention period to be queried.
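To make the Type 2 idea concrete, here is a small plain-Python sketch of how history is kept inside the table itself. The column names (`start_date`, `end_date`, `is_current`) and both helper functions are assumptions chosen for illustration; the question does not prescribe a schema.

```python
from datetime import date


def apply_address_change(history, customer_id, new_address, effective):
    """SCD Type 2: close the current row, then append a new current row."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return history  # no change, nothing to record
            row["end_date"] = effective
            row["is_current"] = False
    history.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return history


def address_as_of(history, customer_id, when):
    """Look up the address that was valid on a given date."""
    for row in history:
        if row["customer_id"] != customer_id:
            continue
        ended = row["end_date"]
        if row["start_date"] <= when and (ended is None or when < ended):
            return row["address"]
    return None


history = []
apply_address_change(history, 1, "1 Main St", date(2023, 1, 1))
apply_address_change(history, 1, "9 Oak Ave", date(2024, 6, 1))
print(address_as_of(history, 1, date(2023, 12, 31)))
```

The audit query reads ordinary rows from the current table version, so it does not depend on old data files being retained for time travel.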


Reference: https://docs.delta.io/latest/delta-time-travel.html
QUESTION 3

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.

In the code below, Impressions is a streaming DataFrame with a watermark ("event_time", "10 minutes")



The data engineer notices the query slowing down significantly.

Which solution would improve the performance?

A
Joining on event time constraint: clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour
B
Joining on event time constraint: clickTime + 3 hours < impressionTime - 2 hours
C
Joining on event time constraint: clickTime == impressionTime using a leftOuter join
D
Joining on event time constraint: clickTime >= impressionTime - interval 3 hours and removing watermarks

Correct Option: A

A common cause of performance degradation in Spark Structured Streaming stream-stream joins is unbounded state growth. Without a bound, Spark must buffer all past rows from both streams to find potential matches, which leads to memory pressure and slow queries. Spark can clean up join state only when two conditions hold:
  1. Both input streams have watermarks defined on their event-time columns. The question states that Impressions has a watermark; we assume Clicks is watermarked as well.
  2. The join condition includes an event-time constraint, typically a time range (e.g., eventTime1 BETWEEN eventTime2 AND eventTime2 + INTERVAL). This lets Spark use the watermarks to drop old state that can no longer match incoming data, which reduces memory usage and improves performance.
Option A adds such a constraint: clickTime >= impressionTime AND clickTime <= impressionTime + interval 1 hour. It defines a bounded window (up to 1 hour after an impression) in which a click counts as correlated, so Spark can manage the join state efficiently.
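The effect of the time bound on state can be sketched in plain Python. This is a simplified model and not Spark's actual state store: `MAX_CLICK_DELAY`, `matches`, and `prune_impressions` are hypothetical names, and the watermark is treated as a single timestamp below which no more clicks are expected.

```python
from datetime import datetime, timedelta

MAX_CLICK_DELAY = timedelta(hours=1)  # the bound from the join condition


def matches(impression_time, click_time):
    """Time-range join condition: click within 1 hour after the impression."""
    return impression_time <= click_time <= impression_time + MAX_CLICK_DELAY


def prune_impressions(buffered, click_watermark):
    """Keep only impressions that a future click could still match.

    An impression at time t matches clicks up to t + MAX_CLICK_DELAY. Once
    the click watermark has passed that point, the impression can be dropped.
    """
    return [t for t in buffered if t + MAX_CLICK_DELAY >= click_watermark]


buffered = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 13, 30),
    datetime(2024, 1, 1, 14, 10),
]
remaining = prune_impressions(buffered, click_watermark=datetime(2024, 1, 1, 14, 0))
print(remaining)
```

Without the upper bound in `matches`, no impression could ever be proven unmatchable, so `buffered` would only grow; that is the slowdown described in the question.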

Reference: https://docs.databricks.com/en/structured-streaming/stream-stream-joins.html
QUESTION 4

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

A
Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.
B
The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.
C
Only workspace administrators can grant "Owner" privileges to a group.
D
A user can only transfer job ownership to a group if they are also a member of that group.

Correct Option: A

Why Option A is Correct

In Databricks, the IS_OWNER permission is unique. A job can only have exactly one owner, and that owner must be an individual workspace user or a service principal. You cannot assign a group to be the owner of a job.

This restriction exists primarily for execution context. Jobs frequently use the "Run As" feature to execute under the identity and permissions of the job owner. Because a group is a collection of users rather than a single identity with its own specific credentials, a job cannot "run as" a group. To give a group full administrative control over a job, you must grant them the CAN_MANAGE permission instead.

QUESTION 5

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:


An analyst who is not a member of the auditing group executes the following query:

SELECT * FROM user_ltv_no_minors

Which statement describes the results returned by this query?

A
All columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.
B
All age values less than 18 will be returned as null values, all other columns will be returned with the values in user_ltv.
C
All values for the age column will be returned as null values, all other columns will be returned with the values in user_ltv.
D
All columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

Correct Option: A

✅ Option A (Correct)

Reasoning: The view definition includes a WHERE clause with a CASE expression: CASE WHEN is_member("auditing") THEN TRUE ELSE age >= 18 END. The question states that the analyst is not a member of the 'auditing' group, so is_member("auditing") evaluates to FALSE and the ELSE branch, age >= 18, is applied. The query SELECT * FROM user_ltv_no_minors therefore returns all columns (email, age, ltv) for records where age is 18 or greater. Records with age < 18 are omitted.

Option A correctly states that all columns will be displayed normally for records with an age greater than 17 (which is equivalent to age >= 18), and that records not meeting this condition will be omitted.

❌ Why the other choices are incorrect:

  • Option B is incorrect: The CASE statement is in the WHERE clause, which filters rows based on a condition, rather than modifying column values. It does not set age values to null.
  • Option C is incorrect: Similar to option B, the WHERE clause filters rows. It does not nullify all values in the age column. The SELECT statement explicitly selects age without modification.
  • Option D is incorrect: The condition derived from the WHERE clause for non-auditing members is age >= 18. Option D states "age greater than 18" (age > 18), which would exclude records where age is exactly 18. Option A's "age greater than 17" correctly includes age = 18.
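The row-filtering behavior can be mimicked in a few lines of plain Python. This is only a sketch of the predicate quoted in the explanation; the function name mirrors the view name, and the sample rows and group sets are made up for illustration.

```python
def user_ltv_no_minors(rows, user_groups):
    """Mimic: WHERE CASE WHEN is_member('auditing') THEN TRUE ELSE age >= 18 END."""
    is_auditor = "auditing" in user_groups
    return [row for row in rows if is_auditor or row["age"] >= 18]


# Hypothetical sample rows with the user_ltv schema (email, age, ltv).
user_ltv = [
    {"email": "a@example.com", "age": 17, "ltv": 100},
    {"email": "b@example.com", "age": 18, "ltv": 200},
    {"email": "c@example.com", "age": 40, "ltv": 300},
]

analyst_view = user_ltv_no_minors(user_ltv, user_groups={"analysts"})
auditor_view = user_ltv_no_minors(user_ltv, user_groups={"auditing"})
print([row["age"] for row in analyst_view])
```

The filter drops whole rows and never changes a column value, which is why the options about null ages do not apply, and the row with age exactly 18 is kept.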



Reference: https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#is_member
QUESTION 6

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.

Which solution meets the requirements?

A
All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.
B
Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
C
Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
D
Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Correct Option: C

Partitioning the Delta Lake table by the 'topic' field is the most effective solution. This creates distinct directories in the underlying storage for each topic (e.g., '/delta_table/topic=registration/'). This isolation allows for several benefits:

  • Access Restriction: Access Control Lists (ACLs) can be applied to the specific 'topic=registration' directory at the cloud storage level (e.g., S3, ADLS Gen2), or through Databricks table ACLs, to restrict who can read or modify PII data.
  • Retention Policy: Delete statements can efficiently target only the PII data. For example, a DELETE statement like DELETE FROM my_table WHERE topic = 'registration' AND timestamp < current_timestamp() - INTERVAL '14' DAYS would leverage partition pruning, quickly removing old PII records without affecting other topics. Non-PII topics would simply not have this delete operation applied, thus retaining them indefinitely.
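A plain-Python sketch of the retention rule shows why partition boundaries help: only the PII partition is ever scanned or modified. The dictionary layout, the `ingested_at` field, and the function name are assumptions made for this illustration and are not part of the table schema in the question.

```python
from datetime import datetime, timedelta

PII_TOPICS = {"registration"}
PII_RETENTION = timedelta(days=14)


def purge_expired_pii(partitions, now):
    """Drop PII records older than the retention window.

    `partitions` maps a topic name to its list of records. Non-PII topics
    are never touched, so they are retained indefinitely.
    """
    cutoff = now - PII_RETENTION
    for topic in PII_TOPICS:
        if topic in partitions:
            partitions[topic] = [
                r for r in partitions[topic] if r["ingested_at"] >= cutoff
            ]
    return partitions


partitions = {
    "registration": [
        {"offset": 1, "ingested_at": datetime(2024, 2, 10)},  # 20 days old
        {"offset": 2, "ingested_at": datetime(2024, 2, 25)},  # 5 days old
    ],
    "clicks": [{"offset": 1, "ingested_at": datetime(2023, 1, 1)}],
}
purge_expired_pii(partitions, now=datetime(2024, 3, 1))
print(partitions["registration"])
```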



Reference: https://docs.delta.io/latest/delta-partitioning.html
QUESTION 7

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.



Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?

A
No; the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.
B
No; files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.
C
No; the change data feed only tracks inserts and updates, not deleted records.
D
Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

Correct Option: B

When records are deleted from a Delta Lake table, the DELETE command removes them logically: the transaction log marks the affected data files as no longer part of the current table version, and any rows that must be kept are rewritten into new files. The old files are not physically removed from storage right away, so Delta Lake time travel can still read the 'deleted' records through earlier table versions. To remove the physical files and make those records unreachable via time travel, VACUUM must be run explicitly, and it only removes files older than the retention threshold (7 days by default). This matters for compliance requirements like GDPR, where data must be truly 'forgotten'.
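The difference between a logical delete and a physical purge can be illustrated with a toy model. `ToyDeltaTable` is a hypothetical class written for this sketch, not a Delta Lake API, and its `vacuum` ignores the retention threshold that the real command enforces.

```python
class ToyDeltaTable:
    """Toy model: each version is a list of file names; files hold the rows."""

    def __init__(self):
        self.files = {}      # file name -> rows physically stored
        self.versions = []   # version number -> file names in that snapshot
        self._next_file_id = 0

    def _new_file(self, rows):
        name = f"part-{self._next_file_id:05d}.parquet"
        self._next_file_id += 1
        self.files[name] = list(rows)
        return name

    def append(self, rows):
        current = self.versions[-1] if self.versions else []
        self.versions.append(current + [self._new_file(rows)])

    def delete_where(self, predicate):
        # Rewrite affected files; old files stay on storage untouched.
        snapshot = []
        for name in self.versions[-1]:
            rows = self.files[name]
            kept = [r for r in rows if not predicate(r)]
            snapshot.append(name if len(kept) == len(rows) else self._new_file(kept))
        self.versions.append(snapshot)

    def read(self, version=-1):
        return [r for name in self.versions[version] for r in self.files[name]]

    def vacuum(self):
        # Simplification: drop every file the latest version does not reference.
        live = set(self.versions[-1])
        self.files = {n: rows for n, rows in self.files.items() if n in live}


table = ToyDeltaTable()
table.append([{"user_id": 1}, {"user_id": 2}])    # version 0
table.delete_where(lambda r: r["user_id"] == 2)   # version 1
before_vacuum = table.read(version=0)             # deleted row is still readable
table.vacuum()
print(before_vacuum, table.read())
```

After `vacuum()`, reading version 0 fails in this model because its only data file is gone, which is the state a GDPR purge needs to reach.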



Reference: https://docs.databricks.com/delta/time-travel.html#delete-files-with-vacuum, https://docs.databricks.com/delta/data-retention.html
QUESTION 8

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:



After the database was successfully created and permissions configured, a member of the finance team runs the following code:



If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A
A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.
B
An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.
C
A managed table will be created in the DBFS root storage container.
D
A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

Correct Option: D

The initial CREATE DATABASE finance_eda_db LOCATION '/mnt/finance_eda_bucket'; command establishes that all managed tables created within finance_eda_db will store their data in subdirectories under the specified location, which is /mnt/finance_eda_bucket. The subsequent CREATE TABLE finance_eda_db.tx_sales AS SELECT * FROM sales WHERE state = "TX"; command creates a table without an explicit LOCATION clause for the table itself. This makes tx_sales a managed table. Consequently, its data will be stored within the database's designated location.

Therefore, a managed table named tx_sales will be created, and its data will reside in the storage container mounted to /mnt/finance_eda_bucket.
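The rule in the explanation can be written down as a small decision function. This is a sketch only: `resolve_table` is a hypothetical helper, and the exact subdirectory name used for a managed table under the database location is an assumption made for illustration.

```python
def resolve_table(db_location, table_name, table_location=None):
    """Decide table type and storage path from the CREATE statements.

    A table created with its own LOCATION is external; otherwise it is
    managed and stored under the database's location.
    """
    if table_location is not None:
        return {"type": "EXTERNAL", "path": table_location}
    # Assumed layout: one subdirectory per table under the database location.
    path = f"{db_location.rstrip('/')}/{table_name.lower()}"
    return {"type": "MANAGED", "path": path}


tx_sales = resolve_table("/mnt/finance_eda_bucket", "tx_sales")
print(tx_sales)
```

Passing a `table_location` flips the result to an external table, which is the distinction between options B and D.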

Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-database.html
QUESTION 9

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?

A
No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
B
"Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
C
"Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
D
"Manage" permissions should be set on a secret scope containing only those credentials that will be used by a given team.

Correct Option: C

Option C (Correct)
Reasoning: Databricks secret access control is managed at the secret scope level, not individual secret keys. To grant minimum necessary access, a secret scope should be created containing only the credentials relevant to a specific team. Setting "Read" permissions on this scope for the corresponding Databricks group allows team members to retrieve and use the credentials without granting excessive privileges like managing the scope or modifying secrets. This adheres to the principle of least privilege.

Why the other choices are incorrect:

  • Option A is incorrect: Granting all users administrator access is a significant security risk and does not represent the minimum necessary access. Administrators have full control over the workspace, including all secrets.
  • Option B is incorrect: Databricks secret permissions are applied to secret scopes, not to individual secret keys. You cannot set permissions on a single key; permissions are inherited from the scope.
  • Option D is incorrect: "Manage" permissions are too broad. They allow users to read, write, and manage permissions on the secret scope itself. The requirement is only to use (read) the credentials, not to administer the secret scope.
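The scope-level permission model can be sketched as a toy class. `SecretScope` here is hypothetical and is not the Databricks API; it only models the idea that READ, WRITE, and MANAGE are granted per scope and that every key in the scope inherits them. The scope name, key, and secret value are made up.

```python
LEVELS = {"READ": 1, "WRITE": 2, "MANAGE": 3}


class SecretScope:
    """Toy model: permissions attach to the scope, never to a single key."""

    def __init__(self, name):
        self.name = name
        self._secrets = {}
        self._acls = {}  # group name -> permission level

    def put(self, key, value):
        self._secrets[key] = value

    def grant(self, group, level):
        if level not in LEVELS:
            raise ValueError(f"unknown permission level: {level}")
        self._acls[group] = level

    def _level_for(self, user_groups):
        granted = [LEVELS[self._acls[g]] for g in user_groups if g in self._acls]
        return max(granted, default=0)

    def get(self, key, user_groups):
        if self._level_for(user_groups) < LEVELS["READ"]:
            raise PermissionError(f"no READ access to scope {self.name}")
        return self._secrets[key]

    def can_manage(self, user_groups):
        return self._level_for(user_groups) >= LEVELS["MANAGE"]


# One scope per team, holding only that team's credentials.
finance_scope = SecretScope("finance-db")
finance_scope.put("password", "example-password")
finance_scope.grant("finance", "READ")

print(finance_scope.get("password", {"finance"}))
```

A user outside the `finance` group gets a `PermissionError`, and `finance` members can read but not manage the scope, which matches the least-privilege setup in option C.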



Reference: https://docs.databricks.com/en/security/secrets/index.html#secret-access-control
QUESTION 10

What is the retention of job run history?

A
It is retained until you export or delete job run logs
B
It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
C
It is retained for 60 days, during which you can export notebook run results to HTML
D
It is retained for 60 days, after which logs are archived

Correct Option: C

Option C (Correct)

Reasoning: Databricks retains job run history, including run outputs and logs, for a default period of 60 days. During this retention period, users can access the job run details in the UI and programmatically, which includes the ability to view and export notebook run results. For instance, notebook outputs can be downloaded or viewed directly. After 60 days, the run history is automatically deleted.

Why the other choices are incorrect:

  • Option A is incorrect: Job run history is not retained indefinitely until explicitly exported or deleted. There is a default retention period of 60 days.
  • Option B is incorrect: The retention period is 60 days, not 30 days. While job run logs can be delivered to external storage like DBFS or S3 through log delivery configurations, the default internal retention is 60 days.
  • Option D is incorrect: While the 60-day retention period is correct, the statement that logs are "archived" afterward is inaccurate. After 60 days, the run history and logs are deleted by default, not automatically archived within Databricks in a readily accessible form. Users must configure external log delivery for archiving beyond this period.


Reference: https://docs.databricks.com/en/workflows/jobs/view-job-details.html
QUESTION 11

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

A
Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.
B
Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.
C
Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.
D
Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

Premium Solution Locked

Unlock all 333 answers & explanations

QUESTION 12

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

A
Workspace audit logs
B
Driver's log file
C
Ganglia
D
Cluster Event Log


QUESTION 13

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

A
The five Minute Load Average remains consistent/flat
B
CPU Utilization is around 75%
C
Network I/O never spikes
D
Total Disk Space remains constant


QUESTION 14

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

A
On Heap Memory Usage is within 75% of Off Heap Memory Usage
B
The RDD Block Name includes the “*” annotation signaling a failure to cache
C
Size on Disk is > 0
D
The number of Cached Partitions > the number of Spark Partitions


QUESTION 15

Review the following error traceback:



Which statement describes the error being raised?

A
There is a syntax error because the heartrate column is not correctly identified as a column.
B
There is no column in the table named heartrateheartrateheartrate
C
There is a type error because a column object cannot be multiplied.
D
There is a type error because a DataFrame object cannot be multiplied.


QUESTION 16

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

A
Run source env/bin/activate in a notebook setup script
B
Install libraries from PyPI using the cluster UI
C
Use %pip install in a notebook cell
D
Use %sh pip install in a notebook cell


QUESTION 17

What is the first line of a Databricks Python notebook when viewed in a text editor?

A
%python
B
// Databricks notebook source
C
# Databricks notebook source
D
-- Databricks notebook source


QUESTION 18

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which benefit offsets this additional effort?

A
Improves the quality of your data
B
Validates a complete use case of your application
C
Troubleshooting is easier since all steps are isolated and tested individually
D
Ensures that all steps interact correctly to achieve the desired end result


QUESTION 19

What describes integration testing?

A
It validates an application use case.
B
It validates behavior of individual elements of an application.
C
It requires an automated testing framework.
D
It validates interactions between subsystems of your application.


QUESTION 20

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

A
The job_id and number of times the job has been run are concatenated and returned.
B
The globally unique ID of the newly triggered run.
C
The number of times the job definition has been run in this workspace.
D
The job_id is returned in this field.


QUESTION 21

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.
If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

A
All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
B
Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
C
Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
D
All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.


QUESTION 22

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

A
Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
B
Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
C
Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
D
Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1

QUESTION 23

A Delta Lake table was created with the below query:



Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?

A
The table reference in the metastore is updated.
B
All related files and metadata are dropped and recreated in a single ACID transaction.
C
The table name change is recorded in the Delta transaction log.
D
A new Delta transaction log is created for the renamed table.

QUESTION 24

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:



The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

A
The total average temperature across all sensors exceeded 120 on three consecutive executions of the query.
B
The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query.
C
The source query failed to update properly for three consecutive minutes and then restarted.
D
The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query.

QUESTION 25

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

A
Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.
B
Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
C
Use Repos to check out the dev-2.3.9 branch and auto-resolve conflicts with the current branch.
D
Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository.

QUESTION 26

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

A
DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
B
By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
C
The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
D
The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

QUESTION 27

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

A
date = spark.conf.get("date")
B
import sys
date = sys.argv[1]
C
date = dbutils.notebooks.getParam("date")
D
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
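
As a side note on the load path in the question, here is a minimal sketch of how a Python f-string would interpolate the date variable, assuming the path is meant to end in the value of that variable. The helper name source_path and the sample date are hypothetical; the snippet only builds a string and does not call Spark or dbutils.

```python
def source_path(date: str) -> str:
    """Build the input directory for one batch (hypothetical helper)."""
    # Inside an f-string, {date} is replaced by the value of the variable.
    return f"/mnt/source/{date}"


# Made-up date value, used only for illustration.
example_path = source_path("2024-01-15")
print(example_path)
```

However the date variable is populated in the notebook, it has to hold a plain string by the time the path is built.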

QUESTION 28

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

A
"Can Manage" privileges on the required cluster
B
Cluster creation allowed, "Can Restart" privileges on the required cluster
C
Cluster creation allowed, "Can Attach To" privileges on the required cluster
D
"Can Restart" privileges on the required cluster

QUESTION 29

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

A
preds.write.mode("append").saveAsTable("churn_preds")
B
preds.write.format("delta").save("/preds/churn_preds")
C
Option C
D
Option D

QUESTION 30

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

A
%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
B
Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
C
%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
D
%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

QUESTION 31

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

A
All records are cached to an operational database and then the filter is applied.
B
The Parquet file footers are scanned for min and max statistics for the latitude column.
C
The Hive metastore is scanned for min and max statistics for the latitude column.
D
The Delta log is scanned for min and max statistics for the latitude column.
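
The general idea of skipping files based on per-file min and max statistics can be sketched in plain Python. This is a simplified model under stated assumptions, not Delta Lake's implementation: FileStats, files_to_scan, the file names, and the latitude values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file column statistics (hypothetical structure)."""
    path: str
    min_latitude: float
    max_latitude: float


def files_to_scan(files, threshold):
    """Return paths of files that could contain rows with latitude > threshold."""
    # A file can be skipped when even its largest latitude fails the filter.
    return [f.path for f in files if f.max_latitude > threshold]


# Made-up statistics for three data files.
files = [
    FileStats("part-000.parquet", -10.0, 45.2),
    FileStats("part-001.parquet", 30.1, 71.9),
    FileStats("part-002.parquet", 66.5, 89.0),
]

selected = files_to_scan(files, 66.3)
print(selected)
```

In this toy data set only the first file is skipped, because its maximum latitude is below the filter threshold. Partitioning by date does not help here, since the filter is on latitude.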

QUESTION 32

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before.

Which statement describes why the cloned tables are no longer working?

A
Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.
B
Running VACUUM automatically invalidates any shallow clones of a table; DEEP CLONE should always be used when a cloned table will be repeatedly queried.
C
The data files compacted by VACUUM are not tracked by the cloned metadata; running REFRESH on the cloned table will pull in recent changes.
D
The metadata created by the CLONE operation is referencing data files that were purged as invalid by the VACUUM command.

QUESTION 33

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.



Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

A
The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
B
The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
C
Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
D
One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

QUESTION 34

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?

A
Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
B
Parse the Delta Lake transaction log to identify all newly written data files.
C
Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
D
Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

QUESTION 35

A view is registered with the following code:



Both users and orders are Delta Lake tables.

Which statement describes the results of querying recent_orders?

A
The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.
B
All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.
C
All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
D
All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

QUESTION 36

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

A
userLookup.join(streamingDF, ["user_id"], how="right")
B
streamingDF.join(userLookup, ["user_id"], how="inner")
C
userLookup.join(streamingDF, ["user_id"], how="inner")
D
userLookup.join(streamingDF, ["user_id"], how="left")

QUESTION 37

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:



Choose the response that correctly fills in the blank within the code block to complete this task.

A
withWatermark("event_time", "10 minutes")
B
awaitArrival("event_time", "10 minutes")
C
await("event_time + '10 minutes'")
D
slidingWindow("event_time", "10 minutes")
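
The "non-overlapping five-minute interval" in the question is a tumbling window. The following plain-Python sketch shows how events map to such windows and how a per-window average is formed. It is not Spark code and does not model late-data handling; window_start, average_by_window, and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(event_time: datetime) -> datetime:
    """Floor a timestamp to the start of its five-minute tumbling window."""
    return event_time - ((event_time - EPOCH) % WINDOW)


def average_by_window(events):
    """Average the readings that fall into each tumbling window."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [total, count]
    for event_time, temp in events:
        bucket = sums[window_start(event_time)]
        bucket[0] += temp
        bucket[1] += 1
    return {start: total / count for start, (total, count) in sums.items()}


# Made-up (event_time, temp) readings.
events = [
    (datetime(2024, 1, 1, 12, 1), 20.0),
    (datetime(2024, 1, 1, 12, 4), 22.0),
    (datetime(2024, 1, 1, 12, 6), 30.0),
]

averages = average_by_window(events)
print(averages)
```

Each event belongs to exactly one window, which is what distinguishes a tumbling window from a sliding one. In a streaming engine, the state for each window must also be kept for some period so that late-arriving events can still update it.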

QUESTION 38

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:



Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

A
No; Delta Lake manages streaming checkpoints in the transaction log.
B
Yes; both of the streams can share a single checkpoint directory.
C
No; only one stream can write to a Delta Lake table.
D
No; each of the streams needs to have its own checkpoint directory.

QUESTION 39

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

A
Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.
B
Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
C
The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
D
Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
E
Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

QUESTION 40

Which statement describes the default execution mode for Databricks Auto Loader?

A
Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
B
New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
C
Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
D
New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

QUESTION 41

Which statement describes the correct use of pyspark.sql.functions.broadcast?

A
It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
B
It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
C
It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
D
It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

QUESTION 42

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

A
Stage’s detail screen and query’s detail screen
B
Stage’s detail screen and Executor’s log files
C
Driver’s and Executor’s log files
D
Executor’s detail screen and Executor’s log files

QUESTION 43

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:



Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

A
Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B
Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C
Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D
Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
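
The difference between deduplicating within a batch and deduplicating against a target table can be sketched in plain Python. The code block referenced in the question is not shown in this text, so the sketch does not reproduce it; it assumes a hypothetical pipeline that removes duplicates from the incoming batch only and then appends. The function dedupe_batch and all sample records are made up for illustration.

```python
def dedupe_batch(batch, key_fields=("customer_id", "order_id")):
    """Keep the first record seen for each composite key within one batch."""
    seen = set()
    unique = []
    for record in batch:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up target table contents and incoming batch.
target = [{"customer_id": 1, "order_id": 10, "amount": 5.0}]
batch = [
    {"customer_id": 1, "order_id": 10, "amount": 5.0},  # key already in target
    {"customer_id": 2, "order_id": 20, "amount": 7.5},
    {"customer_id": 2, "order_id": 20, "amount": 7.5},  # duplicate within batch
]

unique_batch = dedupe_batch(batch)
target_after_append = target + unique_batch  # append without checking the target
print(len(unique_batch), len(target_after_append))
```

In this model the batch itself ends up with one row per key, but the key (1, 10) appears twice in the target after the append, because the existing rows were never consulted.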

QUESTION 44

A junior data engineer on your team has implemented the following code block.



The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

A
They are merged.
B
They are ignored.
C
They are updated.
D
They are inserted.

QUESTION 45

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A
The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
B
Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
C
Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
D
Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

QUESTION 46

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A
The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
B
A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
C
The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
D
An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

QUESTION 47

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

A
Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
B
Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
C
Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
D
Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

QUESTION 48

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

A
When a database is being created, make sure that the LOCATION keyword is used.
B
When the workspace is being configured, make sure that external cloud object storage has been mounted.
C
When data is saved to a table, make sure that a full file path is specified alongside the USING DELTA clause.
D
When tables are created, make sure that the UNMANAGED keyword is used in the CREATE TABLE statement.

Premium Solution Locked

Unlock all 333 answers & explanations

QUESTION 49

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

A
Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.
B
Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.
C
Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.
D
Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

QUESTION 50

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

A
post_time
B
date
C
post_id
D
user_id

QUESTION 51

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.

A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:



A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.

Which statement explains the cause of this failure?

A
The current table schema does not contain the field valid_coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
B
The activity_details table already exists; CHECK constraints can only be added during initial table creation.
C
The activity_details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.
D
The activity_details table already contains records; CHECK constraints can only be added prior to inserting values into a table.

QUESTION 52

What is true for Delta Lake?

A
Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
B
Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
C
Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.
D
Z-order can only be applied to numeric values stored in Delta Lake tables.

QUESTION 53

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

A
The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.
B
The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.
C
The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
D
The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.
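
The slowly changing dimension patterns named in the options can be contrasted with a small plain-Python sketch. The logic referenced in the question is not shown in this text, so nothing here reproduces it; apply_type1, apply_type2, the row layout, and the sample values are hypothetical, and the Type 2 function assumes every incoming update is a real change.

```python
def apply_type1(table, updates):
    """Type 1: overwrite the row for each key; no history is kept."""
    result = dict(table)
    result.update(updates)
    return result


def apply_type2(rows, updates, effective_date):
    """Type 2: mark the current row for a key as closed and append a new current row."""
    result = [dict(row) for row in rows]
    for key, value in updates.items():
        for row in result:
            if row["key"] == key and row["current"]:
                row["current"] = False
                row["end_date"] = effective_date
        result.append({"key": key, "value": value, "current": True, "end_date": None})
    return result


# Made-up customer data: key 1 moves from Austin to Boston.
type1_result = apply_type1({1: "Austin"}, {1: "Boston", 2: "Denver"})

history = [{"key": 1, "value": "Austin", "current": True, "end_date": None}]
type2_result = apply_type2(history, {1: "Boston"}, "2024-01-15")

print(type1_result)
print(type2_result)
```

After the Type 1 update the old value is gone. After the Type 2 update both values remain, and only the newest row is flagged as current.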

QUESTION 54

A team of data engineers are adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

A
Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.
B
Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.
C
Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can import as a library.
D
Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter.

QUESTION 55

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A
Can manage
B
Can edit
C
Can run
D
Can read

QUESTION 56

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

A
Three columns will be returned, but one column will be named "REDACTED" and contain only null values.
B
Only the email and ltv columns will be returned; the email column will contain all null values.
C
The email and ltv columns will be returned with the values in user_ltv.
D
Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

QUESTION 57

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:



Which command allows manual confirmation that these three requirements have been met?

A
DESCRIBE EXTENDED dev.pii_test
B
DESCRIBE DETAIL dev.pii_test
C
SHOW TBLPROPERTIES dev.pii_test
D
DESCRIBE HISTORY dev.pii_test

QUESTION 58
A
Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.
B
No; files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.
C
Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.
D
No; the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

QUESTION 59

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

A
Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.
B
Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.
C
Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.
D
Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

QUESTION 60

A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.



Which approach would allow using DLT expectations to validate all expected records are present in this table?

A
Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null
B
Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table
C
Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table
D
Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table

QUESTION 61

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A
The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.
B
The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.
C
Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.
D
Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

QUESTION 62

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

A
In the Executor’s log file, by grepping for "predicate push-down"
B
In the Stage’s Detail screen, in the Completed Stages table, by noting the size of data read from the Input column
C
In the Query Detail screen, by interpreting the Physical Plan
D
In the Delta Lake transaction log, by noting the column statistics

QUESTION 63

A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.

Which command should the data engineer enter in a web terminal configured with the Databricks CLI?

A
Use list pipelines to get the specs for all pipelines; get the pipeline spec from the returned results; parse and use this to create a pipeline
B
Stop the existing pipeline; use the returned settings in a reset command
C
Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command
D
Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git
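
The "capture, edit, create" workflow that one of the options describes can be sketched in Python as below. This is a minimal sketch under stated assumptions: the dictionary stands in for JSON captured from an existing pipeline, its field names and values are illustrative only, and prepare_new_pipeline_spec is a hypothetical helper, not part of the Databricks CLI.

```python
import copy
import json


def prepare_new_pipeline_spec(existing_spec: dict, new_name: str) -> dict:
    """Return an edited copy of captured settings (hypothetical helper)."""
    spec = copy.deepcopy(existing_spec)  # leave the captured settings untouched
    spec.pop("pipeline_id", None)        # the new pipeline must get its own ID
    spec["name"] = new_name              # avoid clashing with the original name
    return spec


# Hypothetical settings, standing in for what was captured from the workspace.
captured = {
    "pipeline_id": "0000-example",
    "name": "orders_pipeline",
    "libraries": [{"notebook": {"path": "/Repos/example/orders"}}],
}

new_spec = prepare_new_pipeline_spec(captured, "orders_pipeline_v2")

# Stable, sorted JSON text is convenient to commit to version control.
new_spec_json = json.dumps(new_spec, indent=2, sort_keys=True)
```

Working on a deep copy keeps the captured settings available for comparison, and sorting the keys keeps later diffs of the versioned file small.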

QUESTION 64

Which Python variable contains a list of directories to be searched when trying to locate required modules?

A
importlib.resource_path
B
sys.path
C
os.path
D
pypi.path
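
For context, Python's module search path can be inspected and extended at runtime. The sketch below assumes nothing beyond the standard library; the directory name is hypothetical and does not need to exist.

```python
import sys

# sys.path is an ordinary list of strings (mostly directories) that the
# import system searches in order.
search_dirs = list(sys.path)

# Hypothetical directory, used only for illustration.
extra_dir = "/tmp/example_modules"
sys.path.append(extra_dir)    # later imports would also look here

added = extra_dir in sys.path
sys.path.remove(extra_dir)    # restore the original search path
```

Because the list is searched in order, appending gives the new directory the lowest priority, while inserting at the front would let it shadow modules found later in the list.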

QUESTION 65

You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

assert(myIntegrate(lambda x: x*x, 0, 3) [0] == 9)

Which kind of test would the above line exemplify?

A
Unit
B
Manual
C
Functional
D
Integration
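
The question does not show how myIntegrate is implemented. As a minimal sketch, assuming a composite Simpson's rule and a return value of (estimate, interval count), a self-contained version of such a test might look like the following. The name my_integrate and its signature are hypothetical, and math.isclose replaces the exact == comparison because floating-point results are rarely exact.

```python
import math


def my_integrate(f, a, b, n=100):
    """Composite Simpson's rule; returns (estimate, n). Hypothetical stand-in."""
    if n % 2:
        n += 1                       # Simpson's rule needs an even interval count
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weight 4 (odd) and weight 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3, n


def test_area_under_parabola():
    # The integral of x^2 from 0 to 3 is 3^3 / 3 = 9.
    estimate, _ = my_integrate(lambda x: x * x, 0, 3)
    assert math.isclose(estimate, 9.0, rel_tol=1e-9)


test_area_under_parabola()
```

The test exercises one function in isolation with a known analytic result, and it needs no external systems or data.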

QUESTION 66

What is a key benefit of an end-to-end test?

A
It makes it easier to automate your test suite
B
It pinpoints errors in the building blocks of your application
C
It provides testing coverage for all code paths and branches
D
It closely simulates real world usage of your application

QUESTION 67

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

A
/jobs/runs/list
B
/jobs/list
C
/jobs/runs/get
D
/jobs/get
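
Once a job definition has been retrieved as JSON, the notebooks behind its tasks can be listed with a few lines of Python. This is a sketch under assumptions: the sample dictionary is hypothetical and is assumed to follow the Jobs API 2.1 layout, with tasks under settings.tasks and each notebook task carrying a notebook_task.notebook_path field. No network call is made, and notebook_paths_by_task is an illustrative helper name.

```python
def notebook_paths_by_task(job: dict) -> dict:
    """Map task_key -> notebook_path for each notebook task in a job definition."""
    tasks = job.get("settings", {}).get("tasks", [])
    return {
        task["task_key"]: task["notebook_task"]["notebook_path"]
        for task in tasks
        if "notebook_task" in task   # skip non-notebook task types
    }


# Hypothetical job definition, assumed to mirror a Jobs API 2.1 response.
sample_job = {
    "job_id": 123,
    "settings": {
        "name": "nightly_etl",
        "tasks": [
            {"task_key": "ingest",
             "notebook_task": {"notebook_path": "/Repos/example/ingest"}},
            {"task_key": "transform",
             "notebook_task": {"notebook_path": "/Repos/example/transform"}},
            {"task_key": "package",
             "spark_jar_task": {"main_class_name": "com.example.Main"}},
        ],
    },
}

paths = notebook_paths_by_task(sample_job)
```

Using .get with defaults lets the helper return an empty mapping for a job that has no tasks instead of raising an error.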
