The modern world is steadily going digital, and the further it goes, the more data it produces. The volume grows daily, making data harder to store, analyze, and process. One practical tool for the job is Databricks, a platform based on the open-source Apache Spark framework that combines data, artificial intelligence, and analytics.
Both the Lake and the Warehouse
Azure Databricks is an environment for storing and processing data, built for the work of data scientists, engineers, and analysts. The platform combines two data storage approaches, the warehouse and the lake. This hybrid approach allows structured and unstructured big data to be processed on the same cloud service.
The environment provides access to all resources, with objects organized in directories. Data objects and compute resources can be accessed from the Databricks workspace. The folders contain the following entities:
- Databricks Notebook — a document interface for runnable code, text descriptions, and visualizations
- Databricks Dashboard — provides organized access to visualizations
- Databricks Library — a code package for a notebook or cluster job; you can add your own to existing libraries
- Databricks Repository — a directory that is synchronized with a remote Git repository
An experiment in Databricks is a collection of runs for machine learning models.
You can use the Databricks job scheduling tool in the Machine Learning workspace to create, edit, run, and monitor workflows.
In addition, Databricks consists of layers:
- Databricks Delta Lake — reliable storage on top of data lakes; it works with cloud object stores such as AWS S3 and Google Cloud Storage
- Databricks Delta Engine — query processing system for data stored in Delta Lake
Databricks also has built-in tools to support data science, Business Intelligence reports, and machine learning practices in production.
How can you lose money in Databricks?
To understand where you gain and lose on Databricks, you must understand how the use of system resources is measured. Databricks pricing is based on a consumption model, charging for the resources used to run workloads: compute, memory, and I/O. Consumption is expressed in Databricks Units (DBUs), a per-second measure of the processing resources used, and DBUs are what you pay for.
Difficulties in producing a financial account and tracking units in Databricks tend to arise in a few places:
- Manual work — without an automated system, integrating billing data takes a lot of manual effort, and the resulting errors drive costs up.
- No cost alerts — Databricks does not warn you by default, so a "virtual account" effect appears: spending feels small, but without a limit it quietly adds up.
- Double billing complicates the overall assessment — payment is split into two parts, licensing the platform and renting the underlying infrastructure (for example, Amazon EC2), which muddles the calculation of a process's total cost.
- Difficulty tracking cost details — different platform features cost differently, even when the work happens at the junction of two tools, and it can be hard to determine which business units drive Databricks costs.
To simplify the calculation, the platform offers a DBU calculator. You can estimate the load your workflow needs by selecting the cluster size, instance type, and subscription plan. This is where Databricks cluster management begins.
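The arithmetic behind that calculator is straightforward. A minimal sketch, assuming illustrative per-node DBU rates and per-DBU prices (real rates depend on the cloud, instance type, and plan):

```python
def estimate_cost(num_nodes, dbu_per_node_hour, hours, dbu_rate_usd):
    """Total bill = DBUs consumed times the plan's per-DBU rate (figures illustrative)."""
    total_dbus = num_nodes * dbu_per_node_hour * hours
    return total_dbus * dbu_rate_usd

# Example: 8 workers at 0.75 DBU/hour each, running 10 hours at $0.40 per DBU
print(f"${estimate_cost(8, 0.75, 10, 0.40):.2f}")  # → $24.00
```

The same formula also makes clear why idle clusters are expensive: `hours` keeps growing even when no work is being done.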
A common starting point for optimizing a workflow and its cost on Databricks is the right choice of instance type. AWS provides different instance families to suit different workloads, each with its own CPU, memory, and storage configuration. An over-provisioned instance creates unnecessary overhead, while an under-provisioned one degrades performance and increases processing time.
Databricks ships with a default managed cluster policy called Personal Compute. It makes it easy to create single-machine compute resources for individual use, so users can start running workloads immediately with minimal compute-management overhead.
Personal Compute clusters have the following characteristics:
- Single-node clusters running Spark locally
- Single-user access mode, compatible with the unified governance solution for all data and AI assets
- The runtime is the latest Databricks Runtime for Machine Learning
- Both standard instances and GPU-enabled instances are available
- The auto-termination period is 72 hours
A Personal Compute policy cluster is created from the page of the same name or from a notebook.
Storage cost management in Databricks
Databricks does not store data on its own. The data lives in the user's cloud storage: S3 on AWS, Data Lake Storage Gen2 on Azure, and Google Cloud Storage on Google Cloud. Databricks reads from and writes to that storage, so the user retains full control of the data and can access it without Databricks at any time. The platform only offers cloud cost optimization, factoring storage costs in alongside other workflow parameters.
Databricks Lakehouse works effectively on all the popular clouds through tight integration with their security, compute, storage, analytics, and AI services, which the cloud providers offer out of the box. This gives you the flexibility to work with the cloud provider of your choice.
- Azure Databricks, a Databricks data and AI service available through Microsoft Azure, stores all data in open storage and connects analytics and AI workloads
- Databricks on AWS stores and manages data on a simple Lakehouse platform that unites analytics and AI workloads
- Databricks on Google Cloud is an open Lakehouse platform that integrates with the open cloud to bring together data engineers, data scientists, and business analysts; it also opens opportunities for Databricks cost optimization on Google Cloud
- Alibaba Cloud — Databricks integration with the large e-commerce cloud platform
Pricing starts at $0.07 across six Databricks pricing plans.
Databricks can create a custom tariff plan at the request of a specific customer.
Admin Tools to Reduce Costs
The Databricks Lakehouse Platform brings the flexibility of a cloud service: quick access to fast, easily scalable compute. But the ease of creating resources without limit has a downside, namely rising cost. The administrator must strike the balance of infrastructure cost optimization without complicating users' work or reducing productivity. Since Databricks costs are driven by compute clusters, workspaces can be managed using cluster policies.
Determining the Cluster Size
The Databricks cluster policy gives the administrator control over new cluster configurations, and policies can be assigned to individual users or groups. The "allow unrestricted cluster creation" entitlement is enabled by default in the workspace, but leaving creation unrestricted outside the assigned policies can lead to uncontrolled costs.
- Limiting the number of nodes — a policy can enable Databricks cluster autoscaling with specified minimum and maximum numbers of worker nodes; a single-node policy is helpful for beginners, for data science teams using non-distributed ML libraries, and for light exploratory analysis.
- Databricks auto-termination — shuts the cluster down after a specified period of inactivity. Note that creating an SSH connection to the cluster and executing bash commands do not count as activity. The auto-termination period is often set to one hour.
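As a sketch of how such restrictions look in practice, here is a hypothetical cluster policy built as a Python dict. The attribute names follow the Databricks cluster-policy JSON format; the specific limits are examples, not recommendations:

```python
import json

# Illustrative cluster policy: pin the autoscaling minimum, cap the maximum
# at 8 workers, and force auto-termination after 60 minutes of inactivity.
policy = {
    "autoscale.min_workers": {"type": "fixed", "value": 1},
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
}
print(json.dumps(policy, indent=2))
```

The `hidden` flag keeps the auto-termination setting out of the UI entirely, so users cannot switch it off.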
The most common compute cost problem in Databricks is underused or idle clusters.
Cloud and Spot Instances
When creating a cluster, you can select the virtual machine instance type for the driver node and worker nodes separately. Available instances have a different DBU rate in the Databricks pricing for each respective cloud.
As a rule, data-sparse workloads call for instances with less memory, while deep learning models call for GPU clusters that consume more DBUs. Constraining instance types is a balancing act: when a team runs a heavy workload that needs more resources than the policy allows, the job takes longer, which can itself raise costs.
When configuring a Databricks cluster, two scaling methods are used:
- vertical uses more powerful instances
- horizontal adds the number of nodes
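A back-of-the-envelope comparison shows why the choice matters: what counts is total DBUs consumed per job, not the hourly rate alone. The node counts, DBU rates, and runtimes below are invented for illustration:

```python
def job_cost_dbus(num_nodes, dbu_per_node_hour, runtime_hours):
    """DBUs consumed by one job run; the cheaper scaling option is whichever
    configuration finishes the job with fewer total DBUs (numbers illustrative)."""
    return num_nodes * dbu_per_node_hour * runtime_hours

# Vertical: 2 large nodes take 3 hours; horizontal: 8 small nodes take 1 hour.
vertical = job_cost_dbus(2, 3.0, 3.0)     # 18.0 DBUs
horizontal = job_cost_dbus(8, 0.75, 1.0)  # 6.0 DBUs
```

If the workload parallelizes well, the horizontal configuration here burns a third of the DBUs despite using four times as many nodes; for poorly parallelizable jobs the comparison can flip.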
Spot instances are another way to save on resources: spare virtual machine capacity that the cloud provider auctions off on the open market. Discounts sometimes exceed 90% of the instance's compute cost. The trade-off is that the provider can reclaim the instances at any time with short notice (2 minutes on AWS, 30 seconds on Azure and GCP). This is a distinctive cloud provider pricing optimization available in Databricks.
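A rough sketch of the expected savings when part of a fleet runs on spot capacity. The discount and spot share are illustrative; real spot prices fluctuate with the provider's market:

```python
def blended_hourly_cost(on_demand_price, spot_discount, spot_fraction):
    """Expected hourly cost per instance when part of the fleet runs on spot.
    Both the discount and the spot share are illustrative assumptions."""
    spot_price = on_demand_price * (1 - spot_discount)
    return spot_fraction * spot_price + (1 - spot_fraction) * on_demand_price

# 80% of the fleet on spot at a 70% discount, vs $0.50/hour on demand
cost = blended_hourly_cost(0.50, spot_discount=0.70, spot_fraction=0.80)
print(f"${cost:.2f}/hour per instance instead of $0.50")
```

Keeping the driver node on on-demand capacity (i.e., a spot fraction below 1.0) is a common hedge against reclaimed instances killing the whole job.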
Cloud storage costs
The DBU concept from Databricks assumes the transfer of all costs to computing resources. But networking and storage costs in the underlying cloud don't go away.
The advantage of Databricks is that it works with your existing cloud storage, which is especially convenient and beneficial with the Delta Lake format: data management happens at the storage level, improving Databricks performance. Lifecycle management should not be neglected here, since up to two-thirds of stored data may be old versions; old-data retention can be aligned with the Delta VACUUM cycle.
If the storage lifecycle expires objects before Delta VACUUM removes them, tables can break. Test data lifecycle policies on non-production data before applying them widely.
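A simple pre-flight check captures the rule: the lifecycle expiry must be longer than the VACUUM retention window, so Delta deletes stale files before the storage lifecycle does. A sketch, assuming Delta's default 7-day (168-hour) retention:

```python
def lifecycle_is_safe(vacuum_retention_hours, lifecycle_expiry_days):
    """True if the storage lifecycle rule expires objects only AFTER Delta's
    VACUUM retention window, so VACUUM removes stale files first instead of
    the lifecycle rule deleting files a table still references."""
    return lifecycle_expiry_days * 24 > vacuum_retention_hours

# Delta's default VACUUM retention is 7 days (168 hours)
assert lifecycle_is_safe(vacuum_retention_hours=168, lifecycle_expiry_days=30)
assert not lifecycle_is_safe(vacuum_retention_hours=168, lifecycle_expiry_days=5)
```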
Monitoring by cluster tags
Clusters and pools can be tagged, making it easier to control Databricks costs and to allocate resources accurately across user teams. Tags apply to DBU usage reports, AWS EC2 instances, and AWS EBS volumes, which simplifies cost analysis in Databricks.
You cannot assign a custom tag with the Name key to a cluster. Databricks sets this default tag itself, and if its value changes, the system stops tracking the cluster; work that finishes but is no longer tracked properly shows up as additional cost.
Databricks recommends making a separate policy for each tag. It can be cumbersome, but it will be easier to keep track of the state of the clusters.
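A minimal sketch of a tag pre-flight check that enforces the reserved-Name rule before a cluster spec is submitted (the helper and the tag keys are hypothetical, not a Databricks API):

```python
def validate_cluster_tags(tags):
    """Pre-flight check (hypothetical helper): reject custom tags that collide
    with the reserved Name key, which Databricks sets itself and relies on
    for usage tracking."""
    if "Name" in tags:
        raise ValueError("'Name' is reserved by Databricks; choose another key")
    return {"custom_tags": tags}

spec = validate_cluster_tags({"team": "data-science", "cost-center": "cc-42"})
```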
Data Quality Check
Data quality declines as task complexity grows and code dependencies and third-party sources are added. Databricks customers can use Anomalo's platform to monitor data quality in tables, including cost anomaly monitoring in Databricks.
After connecting Anomalo to Databricks, table monitoring is set up, and the platform keeps tracking the key parameters:
- freshness — checking data for lateness
- volume — check the amount of data
- missing data — segment may have been discarded
- anomalies in tables — duplication, changes in table schema, or category values
If the Databricks data is not monitored or is out of policy, Anomalo will alert you in real-time.
What Other Tools Can Be Used to Reduce Costs?
In addition to the above ways to reduce the workflow cost in Databricks, several more tools exist.
Spark cost optimization means the SQL module uses a cost-based optimizer to improve query plans. It matters most for queries with multiple joins. For it to work correctly, table and column statistics in Databricks must be monitored and kept up to date.
- For the cost-based optimizer to be as efficient as possible, it is vital to collect statistics on columns and tables using the ANALYZE TABLE command
- Checking query plans — SQL commands such as EXPLAIN are used to verify that a plan makes use of the statistics
- You can use Spark SQL UI to view statistics accuracy and plan execution
The Databricks cost-based optimizer is enabled by default; it can be disabled by changing the spark.sql.cbo.enabled flag.
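A small helper can generate the ANALYZE TABLE statements for a list of tables; in a notebook, each statement would then be executed with spark.sql. The table and column names here are examples:

```python
def analyze_statements(tables, column_map=None):
    """Build the ANALYZE TABLE commands that keep the cost-based optimizer's
    statistics fresh (table and column names are examples)."""
    column_map = column_map or {}
    stmts = []
    for table in tables:
        if table in column_map:
            cols = ", ".join(column_map[table])
            stmts.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {cols}")
        else:
            stmts.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS")
    return stmts

# In a notebook, each statement would be run with spark.sql(stmt)
for stmt in analyze_statements(["sales", "customers"], {"sales": ["region", "amount"]}):
    print(stmt)
```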
In the Databricks Enterprise 2.0 architecture, the account console lets the administrator see resource usage in DBUs or dollars. The chart shows usage at a high level, grouped by workspace or by stock-keeping unit (SKU).
Azure Databricks is a first-party Azure service, so it plugs into the Azure Cost Management tool. Unlike the account console used for Databricks deployments on AWS and GCP, Azure's monitoring provides data down to the granularity of tags.
One tried and tested practice is a scheduled Databricks pipeline that receives usage data daily and stores it in a Delta table. The data helps analyze usage and trigger alerts when consumption reaches a policy threshold, which is data pipeline cost optimization in action.
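The alerting step of such a pipeline can be sketched in a few lines. Here the daily DBU figures are a plain dict for illustration; in practice they would be read from the usage Delta table that the scheduled job maintains:

```python
def usage_alerts(daily_dbus, threshold):
    """Return the days whose DBU consumption crossed a policy threshold.
    daily_dbus is a plain dict here for illustration; in a real pipeline it
    would come from the scheduled job's usage Delta table."""
    return [day for day, dbus in daily_dbus.items() if dbus > threshold]

usage = {"2024-05-01": 420.0, "2024-05-02": 1310.5, "2024-05-03": 980.0}
print(usage_alerts(usage, threshold=1000))  # → ['2024-05-02']
```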
Databricks Cost Optimization by Examples
Finding off-the-shelf recipes for AWS Databricks cost reduction is problematic when every task has to be solved individually: instead of firing at whole grid squares, you need a sniper shot. In that sense, looking at particular real-life solutions is far more instructive.
The first company provides airport services. Its customers are airlines, which place orders through a business web portal. The portal is a key service for the customer, processing 4 million public requests and 26.4 million messages, or 3.42 TB of data, per day. Data processing cost £1m per year for this solution alone, and the customer wanted it below £60,000.
The contractor advised moving to its own platform, building a solution based on Event Hub and Databricks. Costs dropped to £700,000 a year.
Here is what turned out to be most effective. The Apache Spark streaming engine ships with a Hadoop Distributed File System state backend by default, which sometimes pauses for garbage collection under memory pressure. In Databricks, you can use the RocksDB database for stateful streaming instead; at large volumes it processes data more efficiently. After the state management change, the maintenance cost came down to five figures.
Unused resources accounted for 80% of expenses
Before moving to Delta Lake optimization in Databricks, Relogix (which provides sensors as a service to measure workplace comfort) was running a single instance of SQL Server. Over time, scaling it up to meet the demands of a modern data company has become too expensive. The company chose Delta Lake and Databricks to create a secure and intuitive analytics environment while limiting costs.
Data preparation and ad hoc analysis moved to Databricks, where data was centrally accessed, and clusters turned on and off as needed. In addition, the solution gave the opportunity for company-wide experimentation with ML and AI. Implementing a modern data architecture allowed the customer to reduce the cost of unused computing resources by 80% and expand the capabilities of the data team.
Eternal Choice: Either Convenient or Cheap
The cloud-based Databricks Lakehouse Platform covers many usage scenarios and subject areas. The service tries to give the administrator accurate weights for a pharmacy scale: one pan holds money, the other comfort.
The main Databricks tools for maintaining user balance are:
- Cluster policies to control users and scale of clusters
- Proper organization of the working environment — reducing non-DBU costs generated by Databricks spaces: storage and networking
- Constant monitoring — shows whether the figures you get match the ones you expected; practical methods must then be applied to approach the desired indicators
Generally speaking, optimize cluster usage by autoscaling and auto-shutdown so that Databricks clusters run workloads only when needed and shut down when they're done.
Compare what you want with reality
Before choosing the Databricks optimization and big data cost management tool, it is essential to remember that these are the possibilities:
- Control costs for Databricks by any type, account, service, or region
- Optimizing cluster usage through autoscaling and auto shutdown
- Taking advantage of spot instances to save costs
- Optimizing Databricks storage by using compressed formats and keeping infrequently accessed data in low-cost tiers
- Sizing clusters via the number of worker nodes and the instance type suited to the workload
- Tagging clusters for efficient cost allocation
If you don't have time to dig into the details of Azure Databricks cost optimization, contact DATAFOREST specialists.
Another true life story
The DATAFOREST team has optimized an American IT company's cloud infrastructure to reduce costs and increase productivity. Their AWS Cloud infrastructure cost reached $57k/month (a 265% increase in 9 months). During the study, the team looked into the company's cloud infrastructure, including compute, storage, and internal traffic, to find potential areas for cost reduction. Here is what the experts found and did:
- Computing resources — optimizing instance capacity, cleaning up unnecessary resources, stopping non-production resources during off-hours, optimizing continuous integration and deployment, auditing security
- Storage — resizing unused storage, backing up data to S3 and deleting unused volumes, optimizing storage type, reducing I/O
- Instances — upgrading to the current version, typing by load, redundancy for stable parts of the infrastructure, saving plans
- Dockerization — working with microservices and running chains of services on a single instance
DATAFOREST's work resulted in savings of $23.5k/month: the cost per instance was reduced by 67.5%, 916 TB of unused storage was removed, and working speed increased by 8%. As you can see, the savings can be impressive when you work with professionals.
What is Databricks cost optimization, and why is it important?
It is the application of Databricks tools and practices to reduce overall workflow costs, which are tied to a per-second unit of data processing. Cost optimization lets you pay only for the workloads you use, and only while you use them. It is vital for saving project funds overall and freeing them up for data analysis.
What are the factors that affect Databricks' cost optimization?
The administrator must balance the cost of work against the comfort of users. The balance is influenced by factors such as:
- Databricks clusters policy — their sizes and scaling
- Databricks designing a working environment with minimal data storage and networking costs
- Databricks monitoring — tracking results and adjusting them through the introduction of new effective methods of saving
The scheduler translates Spark job priorities into resource allocation in Databricks, releasing resources either preemptively or non-preemptively.
What are the best practices for Databricks cost optimization?
The common themes in Databricks cost reduction are: optimizing cluster utilization with autoscaling and auto-termination, taking advantage of spot instances, reducing data storage costs with compressed formats, and keeping infrequently used data in cheaper tiers.
Are there any tools and features in Databricks that can help with cost optimization?
Databricks provides an autoscaling feature that helps you save money by allocating resources based on workload requirements. With this feature, you only pay for the resources you need — you won't over-allocate or under-allocate the resources you need to complete your workload.
How can I reduce computing costs in Databricks?
To determine how much Databricks will cost, multiply the number of DBUs you use by the dollar rate. Several factors determine the number of DBUs consumed by a particular workload:
- amount of Databricks data processed
- amount of Databricks memory used
- Databricks vCPU power consumption
- type of services used
The cloud service platform is critical in how much you spend on Databricks.
What are some strategies for managing storage costs in Databricks?
Depending on the project's needs, the following strategies can be used:
- DBU Calculator — Databricks cost estimator and identifier of workloads
- Choosing the right instance type — each AWS instance has its own CPU, memory, and storage resource configuration
- Autoscaling in Databricks delivers cost savings by allocating resources based on workload requirements
- Using spot instances — virtual machines that cloud providers offer at auction; you can save up to 90% on computing costs with spot instances
- Databricks cluster tagging — helps to allocate costs and resource usage across teams efficiently. Before that, the admin must set policies to ensure that cluster tags are applied
- Dividing datasets into small, manageable chunks that can be processed in parallel. This reduces processing time and, with it, the cost of the Databricks workload.
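The chunking idea can be illustrated in plain Python (in Spark itself, repartitioning plays this role at cluster scale):

```python
def chunk(records, chunk_size):
    """Split a dataset into fixed-size chunks for parallel processing
    (Spark's repartitioning does the same job at cluster scale)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

parts = chunk(list(range(10)), chunk_size=4)
print([len(p) for p in parts])  # → [4, 4, 2]
```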
How can I monitor and control costs in Databricks?
To control costs and accurately allocate Databricks resources across your organization (for example, for cost recovery), you can tag workspaces and resource groups, clusters, and pools. The tags flow into the detailed cost-analysis reports available in the Azure portal.