Databricks Certified Data Engineer Professional (Data-Engineer-Professional)
Vendor: Databricks
Certification: Data Engineer
Content: 333 Qs
Status: Verified
Updated: 1 day ago
Exam Overview
The Databricks Certified Data Engineer Professional certification validates an individual's advanced expertise in designing, building, and managing robust, scalable data pipelines on the Databricks Lakehouse Platform. This credential signifies a deep understanding of production-grade data engineering practices, including complex data ingestion, transformation, and optimization strategies using Delta Lake, Apache Spark, and Databricks tools like Delta Live Tables. Earning this certification demonstrates the ability to implement secure, governed, and high-performance data solutions crucial for enterprise-level analytics and machine learning workloads. It's a testament to your capability in driving data initiatives, enhancing data quality, and contributing significantly to an organization's data-driven success, positioning you as a leader in the modern data landscape.
Questions: 60
Passing Score: 700/1000
Duration: 120 Minutes
Difficulty: Expert
Level: Professional
Free Study Guide Samples
Previewing updated Data-Engineer-Professional bank (67 Questions).
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
Correct Option: D
The correct answer is D. When migrating a workload from a relational database system to the Databricks Lakehouse, several fundamental differences must be considered.
- Delta Lake transactions are ACID compliant against a single table: A core feature of Delta Lake is its transactional capabilities, but these guarantees (Atomicity, Consistency, Isolation, Durability) apply to operations within a single Delta table. Unlike traditional relational databases that might support distributed transactions across multiple tables, Databricks requires different strategies for ensuring consistency across multiple tables (e.g., orchestrating writes, idempotent operations, or using a medallion architecture). This directly impacts how 'multi-table inserts' from the source system are handled.
- Databricks does not enforce foreign key constraints: While Unity Catalog allows defining `PRIMARY KEY` and `FOREIGN KEY` constraints for informational purposes (e.g., for documentation, optimization hints for query engines, or integration with BI tools), these constraints are currently not enforced during data writes. This means that the data validation logic based on foreign keys in the source system must be explicitly reimplemented in the Lakehouse (e.g., using data quality checks, Delta Live Tables expectations, or explicit validation queries) to prevent referential integrity violations.
Therefore, understanding that ACID guarantees are per-table and that foreign key constraints are not enforced for validation are critical considerations for the migration.
Reference: https://docs.delta.io/latest/delta-transactions.html, https://docs.databricks.com/en/unity-catalog/manage-tables.html#constraints-on-delta-lake-tables
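Because the foreign keys are informational only, the referential check has to be run explicitly. The following is a minimal sketch of such a validation query, using Python's built-in `sqlite3` purely as a stand-in SQL engine. The table names `dim_customer` and `fact_orders` and their contents are hypothetical, not part of the question.

```python
import sqlite3

# In-memory SQLite stands in for the Lakehouse here; this is an illustration only.
# No foreign key is declared, mirroring a platform that does not enforce one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, name TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO dim_customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO fact_orders VALUES (100, 1), (101, 2), (102, 3);
""")

# Anti-join: fact rows whose customer_id has no matching dimension row.
orphans = conn.execute("""
    SELECT f.order_id, f.customer_id
    FROM fact_orders AS f
    LEFT JOIN dim_customer AS d ON f.customer_id = d.customer_id
    WHERE d.customer_id IS NULL
""").fetchall()

print(orphans)  # order 102 references customer 3, which does not exist
```

The same anti-join pattern could back a data quality check that fails or quarantines a batch before it is written.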
A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table.
The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability.
Which piece of information is critical to this decision?
Correct Option: D
✅ Option D (Correct)
Delta Lake's time travel feature, while powerful for accessing recent history and providing data recovery, is not optimally designed for indefinite, long-term versioning of all data changes, especially for fine-grained auditing purposes on frequently changing attributes like street addresses. Retaining an extremely long history of all data files to support time travel across many years can significantly increase storage costs (as old data files are retained for extended periods) and query latency (as reconstructing historical states from a very large transaction log and numerous small data files becomes more complex and resource-intensive). For explicit, long-term auditing requirements of historical attribute values, a Type 2 Slowly Changing Dimension (SCD) approach is often more performant and scalable, as it explicitly models the history within the table itself (e.g., using start/end dates and flags), making historical queries more direct and efficient.
❌ Why the other choices are incorrect:
- Option A is incorrect: Delta Lake is ACID compliant. Transactions either complete fully or are rolled back, preventing data corruption from partially completed updates, regardless of whether it's a Type 1 or Type 2 operation.
- Option B is incorrect: Shallow clones create a copy of a table's metadata pointing to the same data files. While useful for creating instant snapshots, they do not inherently accelerate historic queries for long-term versioning. Time travel itself is the mechanism for historic queries, and its performance for long-term history is the concern.
- Option C is incorrect: This statement is factually inaccurate. Delta Lake time travel is specifically designed to query previous versions of tables, even when changes logically 'modify' data in place (in Delta Lake, updates typically involve writing new files and marking old ones for deletion after a retention period). The transaction log tracks all changes, allowing any past version within the retention period to be queried.
Reference: https://docs.delta.io/latest/delta-time-travel.html
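To make the Type 2 idea concrete, here is a minimal pure-Python sketch of how history is modeled inside the table itself. The column names (`start_date`, `end_date`, `is_current`) and the sample addresses are assumptions for illustration; a real implementation would typically use a `MERGE` against a Delta table.

```python
from datetime import date

def apply_scd2(rows, customer_id, new_address, effective):
    """Close the current row for customer_id and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["address"] == new_address:
                return rows  # no change, nothing to record
            row["end_date"] = effective
            row["is_current"] = False
    rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return rows

def address_as_of(rows, customer_id, when):
    """Return the address that was valid on the given date, or None."""
    for row in rows:
        if (row["customer_id"] == customer_id
                and row["start_date"] <= when
                and (row["end_date"] is None or when < row["end_date"])):
            return row["address"]
    return None

history = []
apply_scd2(history, 1, "1 Old Road", date(2020, 1, 1))
apply_scd2(history, 1, "2 New Street", date(2023, 6, 1))
print(address_as_of(history, 1, date(2021, 1, 1)))
```

An audit query then becomes an ordinary filter on the validity columns and does not depend on how long old table versions are retained.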
A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.
In the code below, Impressions is a streaming DataFrame with a watermark ("event_time", "10 minutes").
The data engineer notices the query slowing down significantly.
Which solution would improve the performance?
Correct Option: A
- Both input streams must have watermarks defined on their event time columns. The question states 'Impressions' has a watermark. We assume 'Clicks' would also be watermarked.
- A time constraint must be specified in the join condition, typically in the form of a time window (e.g., `eventTime1 BETWEEN eventTime2 AND eventTime2 + INTERVAL`). This constraint allows Spark to leverage the watermarks to prune old state that will no longer match any incoming data, significantly reducing memory usage and improving performance.
The condition `clickTime >= impressionTime AND clickTime <= impressionTime + interval 3 hours` accurately defines a reasonable window (within 3 hours after an impression) during which a click is considered correlated, enabling Spark to efficiently manage the join state.
Reference: https://docs.databricks.com/en/structured-streaming/stream-stream-joins.html
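The effect of the time constraint on join state can be illustrated with a toy model in plain Python. This is a deliberately simplified sketch, not Spark's actual state store logic: the function names are hypothetical, and it assumes a single watermark value on the click stream.

```python
from datetime import datetime, timedelta

MAX_CLICK_DELAY = timedelta(hours=3)

def matches(impression_time, click_time):
    """Join condition: the click falls within 3 hours after the impression."""
    return impression_time <= click_time <= impression_time + MAX_CLICK_DELAY

def prune_state(buffered_impressions, click_watermark):
    """Keep only impressions that a future click could still match.

    Once the click watermark has passed impression_time + 3 hours, no
    later click can satisfy the join condition, so the impression can go.
    """
    return [t for t in buffered_impressions
            if t + MAX_CLICK_DELAY >= click_watermark]

t0 = datetime(2024, 1, 1, 9, 0)
state = [t0, t0 + timedelta(hours=5)]
state = prune_state(state, click_watermark=t0 + timedelta(hours=4))
print(state)
```

Without an upper bound on the click time there is no point at which a buffered impression becomes safe to drop, so the state grows without limit and the query slows down.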
A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.
Which statement explains what is preventing this privilege transfer?
Correct Option: A
Why Option A is Correct
In Databricks, the IS_OWNER permission is unique. A job can only have exactly one owner, and that owner must be an individual workspace user or a service principal. You cannot assign a group to be the owner of a job.
This restriction exists primarily for execution context. Jobs frequently use the "Run As" feature to execute under the identity and permissions of the job owner. Because a group is a collection of users rather than a single identity with its own specific credentials, a job cannot "run as" a group. To give a group full administrative control over a job, you must grant them the CAN_MANAGE permission instead.
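As a rough sketch, the resulting permission set could be expressed as a request body like the one built below. The field names are assumed to follow the Databricks Permissions API request shape and should be checked against the current API reference; the helper `build_job_acl` and the user name are hypothetical.

```python
import json

def build_job_acl(owner_user, manager_groups):
    """Build a permissions payload: one individual owner, groups get CAN_MANAGE."""
    # Exactly one owner, and it is an individual identity, never a group.
    acl = [{"user_name": owner_user, "permission_level": "IS_OWNER"}]
    for group in manager_groups:
        # Groups receive CAN_MANAGE instead of ownership.
        acl.append({"group_name": group, "permission_level": "CAN_MANAGE"})
    return {"access_control_list": acl}

payload = build_job_acl("junior.engineer@example.com", ["DevOps"])
print(json.dumps(payload, indent=2))
```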
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the auditing group executes the following query:
SELECT * FROM user_ltv_no_minors
Which statement describes the results returned by this query?
Correct Option: A
✅ Option A (Correct)
Reasoning: The view definition includes a `WHERE` clause with a `CASE` expression: `CASE WHEN is_member("auditing") THEN TRUE ELSE age >= 18 END`. The question states that the analyst is not a member of the 'auditing' group, so `is_member("auditing")` evaluates to `FALSE` and the `ELSE age >= 18` branch applies. The query `SELECT * FROM user_ltv_no_minors` therefore returns all columns (`email`, `age`, `ltv`) for records where `age` is 18 or greater, and omits records where `age < 18`.
Option A correctly states that all columns will be displayed normally for records with an age greater than 17 (which is equivalent to `age >= 18`), and that records not meeting this condition will be omitted.
❌ Why the other choices are incorrect:
- Option B is incorrect: The `CASE` expression is in the `WHERE` clause, which filters rows based on a condition rather than modifying column values. It does not set `age` values to `null`.
- Option C is incorrect: As with Option B, the `WHERE` clause filters rows. It does not nullify all values in the `age` column. The `SELECT` statement explicitly selects `age` without modification.
- Option D is incorrect: The condition derived from the `WHERE` clause for non-auditing members is `age >= 18`. Option D states "age greater than 18" (`age > 18`), which would exclude records where `age` is exactly 18. Option A's "age greater than 17" correctly includes `age = 18`.
Reference: https://docs.databricks.com/sql/language-manual/sql-ref-functions-builtin.html#is_member
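The row-filtering behavior can be reproduced locally with a small sketch. It uses Python's built-in `sqlite3` as a stand-in engine and registers a hypothetical `is_member` function to imitate the Databricks built-in. The query text is an assumed reconstruction of the view body based on the explanation above, run directly rather than as a view, and the sample rows are invented.

```python
import sqlite3

# Assumed body of the view, reconstructed from the explanation above.
VIEW_BODY = """
    SELECT email, age, ltv
    FROM user_ltv
    WHERE CASE WHEN is_member('auditing') THEN 1 ELSE age >= 18 END
    ORDER BY age
"""

def query_view(user_groups):
    """Run the view body as a user who belongs to the given groups."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for is_member(): returns 1 if the user is in the group.
    conn.create_function("is_member", 1, lambda group: int(group in user_groups))
    conn.executescript("""
        CREATE TABLE user_ltv (email TEXT, age INTEGER, ltv INTEGER);
        INSERT INTO user_ltv VALUES
            ('a@example.com', 17, 10),
            ('b@example.com', 18, 20),
            ('c@example.com', 40, 30);
    """)
    return conn.execute(VIEW_BODY).fetchall()

print(query_view(set()))          # analyst outside the auditing group
print(query_view({"auditing"}))   # member of the auditing group
```

The non-member sees every column but only the rows with `age >= 18`, including the row where `age` is exactly 18; the auditing member sees all rows.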
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which solution meets the requirements?
Correct Option: C
Partitioning the Delta Lake table by the 'topic' field is the most effective solution. This creates distinct directories in the underlying storage for each topic (e.g., '/delta_table/topic=registration/'). This isolation allows for several benefits:
- Access Restriction: Access Control Lists (ACLs) can be applied to the specific 'topic=registration' directory at the cloud storage level (e.g., S3, ADLS Gen2), or through Databricks table ACLs, to restrict who can read or modify PII data.
- Retention Policy: Delete statements can efficiently target only the PII data. For example, a statement like `DELETE FROM my_table WHERE topic = 'registration' AND timestamp < current_timestamp() - INTERVAL '14' DAYS` would leverage partition pruning, quickly removing old PII records without affecting other topics. Non-PII topics would simply not have this delete operation applied, thus retaining them indefinitely.
Reference: https://docs.delta.io/latest/delta-partitioning.html
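The retention rule can be sketched as follows. Since the schema declares `timestamp` as `LONG`, this example assumes the column holds epoch milliseconds and computes the 14-day cutoff in the same unit. It uses `sqlite3` as a stand-in engine; the table name `kafka_bronze`, the `orders` topic, and the fixed "now" value are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def to_ms(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)

# Fixed reference time so the example is reproducible.
now = datetime(2024, 3, 15, tzinfo=timezone.utc)
cutoff_ms = to_ms(now - timedelta(days=14))
old_ms = to_ms(now - timedelta(days=30))
recent_ms = to_ms(now - timedelta(days=1))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kafka_bronze (topic TEXT, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO kafka_bronze VALUES (?, ?)",
    [("registration", old_ms), ("registration", recent_ms), ("orders", old_ms)],
)

# Only PII rows older than the cutoff are removed; other topics are untouched.
conn.execute(
    "DELETE FROM kafka_bronze WHERE topic = 'registration' AND timestamp < ?",
    (cutoff_ms,),
)
remaining = conn.execute(
    "SELECT topic, timestamp FROM kafka_bronze ORDER BY topic, timestamp"
).fetchall()
print(remaining)
```

On a table partitioned by `topic`, the equality filter on the partition column is what lets the engine skip every non-PII partition.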
The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.
Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?
Correct Option: B
When records are deleted from a Delta Lake table, the DELETE command logically removes them by marking the underlying data files as no longer part of the table's current version. However, these files are not immediately physically removed from storage. Delta Lake's time travel feature allows users to query previous versions of the table, meaning the 'deleted' records can still be accessed through older table versions. To permanently remove the physical data files and make them inaccessible even via time travel, the VACUUM command must be explicitly executed. This is crucial for compliance requirements like GDPR where data must be truly 'forgotten'.
Reference: https://docs.databricks.com/delta/time-travel.html#delete-files-with-vacuum, https://docs.databricks.com/delta/data-retention.html
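The difference between a logical delete and physical removal can be shown with a toy model. This is a conceptual sketch only, not Delta Lake's implementation: the class `ToyDeltaTable` is hypothetical, and its `vacuum` ignores the retention threshold that the real `VACUUM` applies (7 days by default).

```python
class ToyDeltaTable:
    """Toy model: each version is a set of file names; files hold rows."""

    def __init__(self, files):
        self.storage = dict(files)      # file name -> rows physically stored
        self.versions = [set(files)]    # version number -> files in that snapshot

    def read(self, version=-1):
        """Read a snapshot; older versions model time travel."""
        return sorted(r for f in self.versions[version] for r in self.storage[f])

    def delete(self, predicate):
        """Logical delete: rewrite affected files and commit a new version."""
        new_snapshot = set()
        for name in self.versions[-1]:
            rows = self.storage[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                new_snapshot.add(name)          # file untouched, reuse it
            elif kept:
                new_name = f"{name}-v{len(self.versions)}"
                self.storage[new_name] = kept   # old file stays in storage
                new_snapshot.add(new_name)
        self.versions.append(new_snapshot)

    def vacuum(self):
        """Physically drop files the latest version no longer references."""
        live = self.versions[-1]
        self.storage = {f: r for f, r in self.storage.items() if f in live}

table = ToyDeltaTable({"part-0": [1, 2], "part-1": [3]})
table.delete(lambda user_id: user_id == 2)
print(table.read())      # current version no longer contains user 2
print(table.read(0))     # the earlier version still does
table.vacuum()           # after this, reading version 0 fails
```

After `vacuum`, the file that held the deleted record is gone from storage, so the earlier version can no longer be read.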
An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:
After the database was successfully created and permissions configured, a member of the finance team runs the following code:
If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?
Correct Option: D
The `CREATE DATABASE finance_eda_db LOCATION '/mnt/finance_eda_bucket';` command establishes that all managed tables created within `finance_eda_db` will store their data in subdirectories under the specified location, `/mnt/finance_eda_bucket`. The subsequent `CREATE TABLE finance_eda_db.tx_sales AS SELECT * FROM sales WHERE state = "TX";` command creates a table without an explicit `LOCATION` clause, which makes `tx_sales` a managed table. Consequently, its data will be stored within the database's designated location.
Therefore, a managed table named `tx_sales` will be created, and its data will reside in the storage container mounted to `/mnt/finance_eda_bucket`.
Reference: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-database.html
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to using these credentials?
Correct Option: C
✅ Option C (Correct)
Reasoning: Databricks secret access control is managed at the secret scope level, not individual secret keys. To grant minimum necessary access, a secret scope should be created containing only the credentials relevant to a specific team. Setting "Read" permissions on this scope for the corresponding Databricks group allows team members to retrieve and use the credentials without granting excessive privileges like managing the scope or modifying secrets. This adheres to the principle of least privilege.
❌ Why the other choices are incorrect:
- Option A is incorrect: Granting all users administrator access is a significant security risk and does not represent the minimum necessary access. Administrators have full control over the workspace, including all secrets.
- Option B is incorrect: Databricks secret permissions are applied to secret scopes, not to individual secret keys. You cannot set permissions on a single key; permissions are inherited from the scope.
- Option D is incorrect: "Manage" permissions are too broad. They allow users to read, write, and manage permissions on the secret scope itself. The requirement is only to use (read) the credentials, not to administer the secret scope.
Reference: https://docs.databricks.com/en/security/secrets/index.html#secret-access-control
What is the retention of job run history?
Correct Option: C
✅ Option C (Correct)
Reasoning: Databricks retains job run history, including run outputs and logs, for a default period of 60 days. During this retention period, users can access the job run details in the UI and programmatically, which includes the ability to view and export notebook run results. For instance, notebook outputs can be downloaded or viewed directly. After 60 days, the run history is automatically deleted.
❌ Why the other choices are incorrect:
- Option A is incorrect: Job run history is not retained indefinitely until explicitly exported or deleted. There is a default retention period of 60 days.
- Option B is incorrect: The retention period is 60 days, not 30 days. While job run logs can be delivered to external storage like DBFS or S3 through log delivery configurations, the default internal retention is 60 days.
- Option D is incorrect: While the 60-day retention period is correct, the statement that logs are "archived" afterward is inaccurate. After 60 days, the run history and logs are deleted by default, not automatically archived within Databricks in a readily accessible form. Users must configure external log delivery for archiving beyond this period.
Reference: https://docs.databricks.com/en/workflows/jobs/view-job-details.html
A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
Which statement describes the contents of the workspace audit logs concerning these events?
Premium Solution Locked
Unlock all 333 answers & explanations
A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.
In which location can one review the timeline for cluster resizing events?
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?
The data engineer is using Spark's MEMORY_ONLY storage level.
Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?
Review the following error traceback:
Which statement describes the error being raised?
What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
What is the first line of a Databricks Python notebook when viewed in a text editor?
Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.
Which benefit offsets this additional effort?
What describes integration testing?
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response that the job run request has been submitted successfully includes a field run_id.
Which statement describes what the number alongside this field represents?
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.
If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
A Delta Lake table was created with the below query:
Realizing that the original query had a typographical error, the below code was executed:
ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store
Which result will occur after running the second command?
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
Which of the following statements is correct?
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.
Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?
The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both DEEP and SHALLOW CLONE, development tables are created using SHALLOW CLONE.
A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that VACUUM was run the day before.
Which statement describes why the cloned tables are no longer working?
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
A view is registered with the following code:
Both users and orders are Delta Lake tables.
Which statement describes the results of querying recent_orders?
A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.
Which code block attempts to perform an invalid stream-static join?
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:
Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
Which statement describes the default execution mode for Databricks Auto Loader?
Which statement describes the correct use of pyspark.sql.functions.broadcast?
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
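The job's code is not reproduced in this preview, so the following is not that code. It is a plain-Python sketch, with hypothetical function names and sample rows, of the difference between de-duplicating on the composite key within one batch and also checking the key against rows already in the target.

```python
KEY_FIELDS = ("customer_id", "order_id")  # composite key from the scenario


def composite_key(row):
    """Build the (customer_id, order_id) tuple that identifies an order."""
    return tuple(row[field] for field in KEY_FIELDS)


def dedupe_within_batch(batch):
    """Keep the first row seen for each composite key in a single batch."""
    seen = set()
    unique = []
    for row in batch:
        key = composite_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def append_batch(target, batch):
    """Append a de-duplicated batch without looking at the target."""
    return target + dedupe_within_batch(batch)


def insert_only_new(target, batch):
    """Append only rows whose key is not already present in the target."""
    existing = {composite_key(row) for row in target}
    fresh = [r for r in dedupe_within_batch(batch) if composite_key(r) not in existing]
    return target + fresh


# Hypothetical data: order (1, 10) is emitted again in the next day's batch.
day_one = [{"customer_id": 1, "order_id": 10}, {"customer_id": 1, "order_id": 10}]
day_two = [{"customer_id": 1, "order_id": 10}, {"customer_id": 2, "order_id": 20}]

appended = append_batch(append_batch([], day_one), day_two)
checked = insert_only_new(insert_only_new([], day_one), day_two)
print(len(appended), len(checked))
```

In this sketch the append-only path keeps a duplicate that spans two batches, while the path that checks existing keys does not.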
A junior data engineer on your team has implemented the following code block.
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
The data engineering team maintains the following code:
Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.
The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.
Which statement exemplifies best practices for implementing this system?
The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.
Which approach will ensure that this requirement is met?
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.
Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.
A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:
A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.
Which statement explains the cause of this failure?
What is true for Delta Lake?
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.
The following logic is used to process these records.
Which statement describes this implementation?
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
What approach would allow them to do this?
The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.
The following SQL DDL statement is executed to create a new table:
Which command allows manual confirmation that these three requirements have been met?
The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.
The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.
GRANT USAGE ON DATABASE prod TO eng;
GRANT SELECT ON DATABASE prod TO eng;
Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?
A user wants to use DLT expectations to validate that a derived table report contains all records from the source, included in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
Which approach would allow using DLT expectations to validate all expected records are present in this table?
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
A data engineer needs to capture pipeline settings from an existing pipeline in the workspace, and use them to create and version a JSON file to create a new pipeline.
Which command should the data engineer enter in a web terminal configured with the Databricks CLI?
Which Python variable contains a list of directories to be searched when trying to locate required modules?
You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.
assert myIntegrate(lambda x: x*x, 0, 3)[0] == 9
Which kind of test would the above line exemplify?
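To make the assertion concrete, here is a minimal sketch of a function it could be exercising. The source does not show myIntegrate, so `my_integrate` below is a hypothetical stand-in: it uses composite Simpson's rule and is assumed to return a tuple whose first element is the estimate, which is why the result is indexed with `[0]`.

```python
import math


def my_integrate(func, lower, upper, steps=1000):
    """Approximate the area under func between lower and upper.

    Hypothetical stand-in for the function in the question. Uses composite
    Simpson's rule and returns (estimate, steps), so element 0 is the area.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of sub-intervals
    width = (upper - lower) / steps
    total = func(lower) + func(upper)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * func(lower + i * width)
    return total * width / 3, steps


def test_area_under_parabola():
    """Check one function in isolation against a known closed-form value."""
    estimate, _ = my_integrate(lambda x: x * x, 0, 3)
    # The exact integral of x^2 over [0, 3] is 9; compare with a tolerance
    # because floating-point arithmetic rarely lands on the exact value.
    assert math.isclose(estimate, 9.0, rel_tol=1e-9)


test_area_under_parabola()
```

The sketch compares with a tolerance rather than `==`, since a numerical integrator may not return exactly 9.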
What is a key benefit of an end-to-end test?
Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?
Customer Reviews
Global Community Feedback
David M.
"The practice engine is incredible. It feels exactly like the real testing environment and helped me build so much confidence."
Sarah J.
"The PDF is very well organized and the explanations for the answers are actually helpful, not just random text."
Michael C.
"I was skeptical, but the content is high quality and definitely worth the price. I passed on my first try!"