Azure Databricks – The Storage of Data
Your Azure Databricks workspace includes an associated Azure storage account and blob container. Two lifecycle management policies that delete temporary and log files are created by default. You can view the rule details via the Code View tab on the Lifecycle Management blade in the Azure portal. Temporary files are removed after 7 days without modification, and log files after 30 days.
Azure Databricks and Delta Lake provide some helpful features for data archiving. The previously described approach of removing data based on a datetime stamp is valid in this context as well. For example, the following SQL syntax removes brain wave readings from a table named BRAINWAVES that are more than 1,825 days (5 years) old. You could also consider creating history tables, as described earlier.
%sql
DELETE FROM BRAINWAVES WHERE DATEDIFF(current_date(), READING_DATETIME) > 1825
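If you opt for the history table approach instead of deleting outright, you can copy the aging rows into an archive table before running the DELETE shown previously. The following is a minimal sketch; the BRAINWAVES_HISTORY table name is hypothetical and is assumed to already exist with the same schema as BRAINWAVES.
%sql
-- Copy rows older than 5 years into a (hypothetical) history table before deleting them
INSERT INTO BRAINWAVES_HISTORY
SELECT * FROM BRAINWAVES WHERE DATEDIFF(current_date(), READING_DATETIME) > 1825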
Another approach is to use a Delta Lake method called vacuum. By default, this method removes data files that are no longer referenced by the Delta table and that are older than the 7-day retention threshold. Use the following code snippet to remove unnecessary files:
%python
from delta.tables import *
# Load the Delta table by name, then remove data files it no longer references
dt = DeltaTable.forName(spark, 'BRAINWAVES')
dt.vacuum()
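If you want unreferenced files to remain available longer than the 7-day default, for example to preserve more time travel history, you can pass an explicit retention period. The following sketch keeps 30 days of files; the 720-hour value is only an illustration, not a recommendation.
%sql
-- Retain unreferenced data files for 30 days (720 hours) instead of the 7-day default
VACUUM BRAINWAVES RETAIN 720 HOURS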
It is common for duplicate files to be generated when files are overwritten using the .mode('overwrite') method. Running the vacuum method removes those duplicates. In all cases of data removal and/or archiving, you should determine whether the data warrants longer-term storage in a history archive.
Azure Synapse Analytics Develop Hub
In the previous chapter you learned about the Manage, Data, and Integrate hubs that exist in the Azure Synapse Analytics workspace. Now it is time to learn about the Develop hub. The last remaining hub, Monitor, is covered in Chapter 9, "Monitoring Azure Data Storage and Processing."
The Develop hub is the location in Azure Synapse Analytics Studio where you build and test queries and code on data that has been ingested into your data lake. Once the output is what you require, use your findings to build integration datasets, data flows, and other activities to be included in a transformation pipeline.
SQL Scripts
When your data analytics needs to run T-SQL queries using either a serverless or dedicated SQL pool, start with SQL scripts. The Develop hub is also the location where you can access tables and execute stored procedures and user-defined functions (UDFs) residing on your dedicated SQL pool. Complete Exercise 4.6, where you will create and execute some syntax using the SQL scripts feature.
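As a simple illustration of what a SQL script might contain, the following serverless SQL pool query reads Parquet files directly from a data lake path. This is only a sketch; the storage account, container, and folder names are placeholders you would replace with your own.
-- Exploratory query over Parquet files in the data lake (account, container, and path are placeholders)
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/brainwaves/*.parquet',
    FORMAT = 'PARQUET'
) AS brainwaves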